Frequentist estimation of the evolutionary history of sequences with substitutions and indels
At a glance
- Project leader : Dr. Maria Anisimova
- Co-project leader : Dr. Manuel Gil
- Project team : Massimo Maiolo, Dr. Julija Pecerska
- Project budget : CHF 910'000
- Project status : ongoing
- Funding partner : SNSF (SNF-Projektförderung / Projekt Nr. 176316)
- Contact person : Maria Anisimova
Description
High throughput sequencing technologies have permitted a wide
range of scientists to observe an astonishing molecular diversity
across all domains of life. Since all observed molecular sequences
are the result of a long evolutionary history, the most informative
inferences can be made only when analysing genomic sequences from
an evolutionary perspective. Molecular sequences are routinely
aligned to define character homology based on common ancestry.
These alignments are used to infer molecular phylogenetic trees,
which are in turn used for testing various biological hypotheses,
for example, with respect to functional divergence or natural
selection. We aim to develop new computational methods for
reconstructing the past of ancient molecules. Such inferences will
be valuable in diverse molecular studies of functional properties,
with applications from biomedicine and protein engineering to
forensics and ecology.
Existing molecular evolutionary models can support inferences of
multiple sequence alignment (MSA), phylogeny, and ancestral history
of mutations and character insertions and deletions (indels). All
these inferences are typically performed as independent steps. Yet,
these objects are tightly interconnected, and decisions such as model
choice and simplifications made at each step affect the accuracy of
estimation. Therefore, MSAs, trees, ancestral states and indels
should be inferred jointly for any given set of homologous
sequences. Some Bayesian implementations of joint inferences exist
but are currently suitable only for small datasets. This is due to
the high computational complexity of explicit evolutionary models
with indels.
Recently, we have developed a new fast method for simultaneous
alignment and tree inference.
Our approach uses an explicit evolutionary model of indels
described as a Poisson process with linear-time likelihood computation.
This is the first fast frequentist aligner with a rigorous
mathematical formulation of indel evolution.
Here we will advance our approach to also infer ancestral sequences
simultaneously with MSAs and trees. The indel model will be adapted to
reflect the natural variability of indel rates. Since positive
selection leaves a strong imprint on genomic sequences, we will
additionally couple our method with codon models that enable the
estimation of selection on the protein. This will alleviate the
problems stemming from approaches that infer MSA, tree and
ancestors in independent sequential steps. Considering that codon
models describe protein-coding genes more realistically by
explicitly including the structure of the genetic code and
selection, these models have the potential to substantially improve
the accuracy of the joint estimation of the comprehensive molecular
history.
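To illustrate what "including the structure of the genetic code and
selection" can look like in practice, the following is a minimal sketch
of a widely used style of codon substitution rate matrix. It is not
taken from the project itself: the parameter names kappa
(transition/transversion ratio) and omega (nonsynonymous/synonymous rate
ratio), the equal codon frequencies, and the helper names rate and
build_rate_matrix are all assumptions made for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODE.items() if aa != "*"]  # stop codons excluded
PURINES = {"A", "G"}


def rate(ci, cj, kappa, omega, pi_j):
    """Instantaneous rate from codon ci to codon cj (illustrative only)."""
    diffs = [(a, b) for a, b in zip(ci, cj) if a != b]
    if len(diffs) != 1:
        return 0.0  # only single-nucleotide changes are allowed
    a, b = diffs[0]
    r = pi_j
    if (a in PURINES) == (b in PURINES):
        r *= kappa  # transition (purine<->purine or pyrimidine<->pyrimidine)
    if CODE[ci] != CODE[cj]:
        r *= omega  # nonsynonymous change, scaled by the selection parameter
    return r


def build_rate_matrix(kappa, omega):
    """Rate matrix over sense codons, assuming equal codon frequencies."""
    pi = 1.0 / len(SENSE)
    q = [[rate(ci, cj, kappa, omega, pi) for cj in SENSE] for ci in SENSE]
    for i, row in enumerate(q):
        row[i] = -sum(row)  # each row of a rate matrix sums to zero
    return q


Q = build_rate_matrix(kappa=2.0, omega=0.5)
```

In this sketch, omega below 1 down-weights amino-acid-changing
substitutions (purifying selection) and omega above 1 up-weights them
(positive selection), which is the quantity such models are used to
estimate.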
Our own collaborations with industry show that our new method will
be in high demand not only in academic projects but also in the
pharmaceutical and biotech industries. Reconstruction of molecular
history with substitutions and indels is of interest to a wide
variety of researchers from different domains – from evolution and
ecology to applications in biomedicine, forensics, and protein
engineering.