Inferring the molecular evolutionary history with multiple-character indels: from theory to practice (JATI3)
Insertions and deletions (indels) are the second most important source of natural genomic variation. Accurate inference of evolutionary indel histories can be used to track new variants during epidemics, to study effects on protein structure and function, to classify tumours, to predict drug targets, and more generally to identify biomarkers.
Description
However, indel variation in genomes remains understudied, and the evolutionary signal from indels has not been explored because tools that can infer indel dynamics are lacking. Proper indel handling requires comparative sequence analyses and realistic models of indel evolution to be included in phylogenetic analyses.
Multiple sequence alignments (MSAs), trees and ancestral sequences are routinely inferred from genomic sequences, but independently, one step after another, and typically without modelling indel evolution. Yet these objects are interconnected, and decisions made at each step, such as model choice and simplifications, lead to a build-up of systematic biases. Ideally, MSAs, trees, and character and indel histories should be inferred jointly.
A recent breakthrough in our team enables such joint inferences using a likelihood-based method built on the Poisson indel process (PIP). PIP is a modification of the classical TKF91 model that allows us to compute marginal likelihoods in linear time, but both models allow only single-site indel events. Even though MSAs inferred with PIP display realistic indel patterns and lengths, they tend to be more fragmented.
Building upon this work, we propose to develop a new maximum likelihood method for scalable joint inference of the complete evolutionary history (MSA, tree and ancestors) under a more realistic model that allows insertions and deletions of multiple residues (also known as long indels). For this to be feasible under long indel models, we will not marginalise over ancestral states during the heuristic search, but will instead use likelihoods computed for fixed ancestors, so that the likelihood computation remains linear.
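The difference between a marginal likelihood and a likelihood computed for fixed ancestors can be illustrated on a toy example. The sketch below is not the project's method: it ignores indels entirely and assumes a simple JC69 substitution model, a hypothetical four-branch tree, and made-up branch lengths. It only shows that, once every ancestral state is fixed, the likelihood of a site is one factor per branch, whereas marginalising sums over all ancestral assignments (done here by brute force, which is only viable for a toy tree).

```python
import itertools
import math

ALPHABET = "ACGT"


def jc69_prob(parent, child, t):
    """JC69 transition probability along a branch of length t
    (t in expected substitutions per site). Assumed model, for illustration."""
    decay = math.exp(-4.0 * t / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay


def fixed_ancestor_loglik(branches, states):
    """Site log-likelihood when every node state, ancestors included, is fixed.

    branches: list of (parent, child, branch_length)
    states:   dict mapping every node name to a character
    Cost: one factor per branch, so linear in the size of the tree.
    """
    loglik = math.log(0.25)  # uniform JC69 prior on the root state
    for parent, child, t in branches:
        loglik += math.log(jc69_prob(states[parent], states[child], t))
    return loglik


def _assignments(leaf_states, internal_nodes):
    """Yield every complete state assignment (toy-sized trees only)."""
    for combo in itertools.product(ALPHABET, repeat=len(internal_nodes)):
        states = dict(leaf_states)
        states.update(zip(internal_nodes, combo))
        yield dict(zip(internal_nodes, combo)), states


def marginal_loglik(branches, leaf_states, internal_nodes):
    """Brute-force marginal: sum the fixed-ancestor likelihood over all ancestors."""
    total = sum(
        math.exp(fixed_ancestor_loglik(branches, states))
        for _, states in _assignments(leaf_states, internal_nodes)
    )
    return math.log(total)


def best_fixed_ancestors(branches, leaf_states, internal_nodes):
    """Return the ancestral assignment with the highest fixed-ancestor log-likelihood."""
    scored = (
        (internal, fixed_ancestor_loglik(branches, states))
        for internal, states in _assignments(leaf_states, internal_nodes)
    )
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy tree: ((a, b)inner, c)root with made-up branch lengths.
BRANCHES = [("root", "inner", 0.1), ("inner", "a", 0.2),
            ("inner", "b", 0.3), ("root", "c", 0.4)]
LEAVES = {"a": "A", "b": "A", "c": "G"}
INTERNAL = ["root", "inner"]

if __name__ == "__main__":
    ancestors, ll_fixed = best_fixed_ancestors(BRANCHES, LEAVES, INTERNAL)
    print("best ancestors:", ancestors)
    print("fixed-ancestor logL:", ll_fixed)
    print("marginal logL:", marginal_loglik(BRANCHES, LEAVES, INTERNAL))
```

The fixed-ancestor value is always a lower bound on the marginal one, because it is a single term of the sum. The trade-off in this sketch is that the score depends on the chosen ancestors, which is why a heuristic search has to propose and compare ancestral reconstructions explicitly.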
To further speed up the inference, we will use parallelization, rapid homology pre-detection with the fast Fourier transform, parsimony-based likelihood prediction, and prediction of promising moves in the joint MSA-tree space with novel supervised machine-learning approaches. We will provide open-source code and multi-platform binaries.
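As an illustration of how a fast Fourier transform can speed up homology pre-detection, the sketch below counts matching residues between two nucleotide sequences at every relative shift in a single pass. The description above does not specify the encoding or the scoring, so everything here is an assumption: sequences are encoded as one 0/1 indicator vector per nucleotide, the score is the plain match count, and the helper names (`fft`, `convolve`, `match_counts`, `best_shift`) are hypothetical. A small radix-2 FFT is included so that the example runs with the standard library only.

```python
import cmath

ALPHABET = "ACGT"


def fft(x, invert=False):
    """Recursive radix-2 FFT; len(x) must be a power of two.
    With invert=True the result still has to be divided by len(x)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def convolve(u, v):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    length = len(u) + len(v) - 1
    size = 1
    while size < length:
        size *= 2
    fu = fft(list(u) + [0.0] * (size - len(u)))
    fv = fft(list(v) + [0.0] * (size - len(v)))
    product = fft([p * q for p, q in zip(fu, fv)], invert=True)
    return [value.real / size for value in product[:length]]


def match_counts(a, b):
    """Map each shift s to the number of positions i with a[i] == b[i + s]."""
    totals = [0.0] * (len(a) + len(b) - 1)
    for ch in ALPHABET:
        # Reversing a turns the convolution into a cross-correlation.
        ind_a = [1.0 if c == ch else 0.0 for c in reversed(a)]
        ind_b = [1.0 if c == ch else 0.0 for c in b]
        for m, value in enumerate(convolve(ind_a, ind_b)):
            totals[m] += value
    # Convolution index m corresponds to shift s = m - (len(a) - 1).
    return {m - (len(a) - 1): int(round(t)) for m, t in enumerate(totals)}


def best_shift(a, b):
    """Shift with the largest number of matching residues."""
    counts = match_counts(a, b)
    return max(counts, key=counts.get)


if __name__ == "__main__":
    query = "ACGTACGTTTGACC"
    target = "GGG" + query  # the query reappears three positions downstream
    print(best_shift(query, target))
```

A shift with an unusually high match count marks a candidate homologous region that a more expensive alignment step could then refine. The point of the FFT in this sketch is cost: all shifts are scored in O(n log n) rather than the O(n^2) of comparing every offset directly.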
Our method will help scientists who require MSA, tree and ancestral state inferences for their research, with applications across different domains, from biomedicine and protein engineering to forensics and ecology. In particular, we will deploy the ensemble of our methods for joint inferences with indels (both short and long) on real data: SARS-CoV-2, the Swiss HIV Cohort and visual proteins. This will make it possible, for example, to monitor new indel variants in SARS-CoV-2, to detect new drug-resistant HIV variants and to inform antibody design.
This project will showcase the power of our approaches in applications to real data and allow us to build close collaborations with biomedical researchers. Our method can be embedded in bioinformatic pipelines used in genomic centres, as well as within other sophisticated methods, such as gene-species tree reconciliation or structure prediction by AlphaFold.
Key data
Project lead
Co-project lead
Project team
Project status
ongoing, started 05/2023
Institute/Centre
Institute of Computational Life Sciences (ICLS)
Funding partner
SNF Projektförderung
Project budget
828'868 CHF