Frequentist estimation of the evolutionary history of sequences with substitutions and indels
At a glance
High-throughput sequencing technologies have permitted a wide
range of scientists to observe an astonishing molecular diversity
across all domains of life. Since all observed molecular sequences
are the result of a long evolutionary history, the most informative
inferences can be made only when analysing genomic sequences from
an evolutionary perspective. Molecular sequences are routinely
aligned to define character homology based on common ancestry.
These alignments are used to infer molecular phylogenetic trees,
which are in turn used for testing various biological hypotheses,
for example, with respect to functional divergence or natural
selection. We aim to develop new computational methods for
reconstructing the past of ancient molecules. Such inferences will
be valuable in diverse molecular studies of functional properties,
with applications from biomedicine and protein engineering to
forensics and ecology.
Existing molecular evolutionary models can support inference of multiple sequence alignments (MSAs), phylogenies, and the ancestral history of mutations and of character insertions and deletions (indels). These inferences are typically performed as independent steps. Yet these objects are tightly interconnected, and decisions made at each step, such as model choice and simplifications, affect the accuracy of estimation. Therefore, MSAs, trees, ancestral states and indels should be inferred jointly for any given set of homologous sequences. Some Bayesian implementations of joint inference exist but are currently suitable only for small datasets because of the high computational complexity of explicit evolutionary models with indels.
Recently, we developed a new fast method for simultaneous alignment and tree inference. Our approach uses an explicit evolutionary model of indels, described as a Poisson process, with linear-time likelihood computation. It is the first fast frequentist aligner with a rigorous mathematical formulation of indel evolution.
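To make the Poisson-process description of indels more concrete, the following is a minimal sketch under assumptions that are not taken from the project itself: insertions arrive along a lineage as a Poisson process with rate `lam`, and each existing character is deleted independently at rate `mu`. The function names and parameter values are hypothetical, and the code illustrates only the expected behaviour of such a process on a single lineage, not the likelihood computation of the method.

```python
import math


def survival_probability(mu: float, t: float) -> float:
    """Probability that a character present at time 0 is still present at time t,
    assuming each character is deleted independently at rate mu."""
    return math.exp(-mu * t)


def expected_length(n0: float, lam: float, mu: float, t: float) -> float:
    """Expected sequence length after time t, starting from n0 characters.

    Assumes insertions arrive at Poisson rate lam and each character is deleted
    at rate mu, so the mean length E(t) satisfies dE/dt = lam - mu * E.
    """
    stationary = lam / mu  # long-run mean length under these assumptions
    return stationary + (n0 - stationary) * math.exp(-mu * t)


if __name__ == "__main__":
    lam, mu = 5.0, 0.05  # hypothetical insertion and deletion rates
    for t in (0.0, 0.5, 2.0, 50.0):
        print(
            f"t={t:5.1f}  survival={survival_probability(mu, t):.4f}  "
            f"expected length={expected_length(80, lam, mu, t):.2f}"
        )
```

Under these assumptions the expected length drifts from its starting value towards `lam / mu`, which is why the ratio of the two rates, rather than either rate alone, controls typical sequence length.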
Here we will advance our approach to also infer ancestral sequences simultaneously with MSAs and trees. The indel model will be adapted to reflect the natural variability of indel rates. Joint inference will alleviate the problems that arise when the MSA, tree and ancestors are inferred in independent sequential steps. Since positive selection leaves a strong imprint on genomic sequences, we will additionally couple our method with codon models, which enable the estimation of selection acting on the protein. Because codon models describe protein-coding genes more realistically by explicitly including the structure of the genetic code and selection, they have the potential to substantially improve the accuracy of the joint estimation of the comprehensive molecular history.
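As an illustration of how a codon model can encode the structure of the genetic code and selection, the sketch below assigns a relative rate to a change between two codons. It is a deliberately simplified form in the spirit of commonly used codon models, not the model of this project: codon frequencies are omitted, and the parameter names `kappa` (transition/transversion rate ratio) and `omega` (nonsynonymous/synonymous rate ratio) are assumptions introduced here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

PURINES = {"A", "G"}


def is_transition(x: str, y: str) -> bool:
    """True for purine<->purine or pyrimidine<->pyrimidine changes (x != y)."""
    return (x in PURINES) == (y in PURINES)


def relative_rate(codon_i: str, codon_j: str, kappa: float, omega: float) -> float:
    """Relative rate of change from codon_i to codon_j (simplified sketch).

    Changes involving stop codons, or differing at more than one position
    (or at none), get rate 0. Transitions are scaled by kappa and
    nonsynonymous changes by omega.
    """
    if GENETIC_CODE[codon_i] == "*" or GENETIC_CODE[codon_j] == "*":
        return 0.0
    diffs = [(a, b) for a, b in zip(codon_i, codon_j) if a != b]
    if len(diffs) != 1:
        return 0.0
    a, b = diffs[0]
    rate = kappa if is_transition(a, b) else 1.0
    if GENETIC_CODE[codon_i] != GENETIC_CODE[codon_j]:
        rate *= omega  # amino acid changes are scaled by the selection parameter
    return rate


if __name__ == "__main__":
    for pair in (("TTT", "TTC"), ("TTT", "TTA"), ("TTT", "CTT"), ("TTT", "CCT")):
        print(pair, relative_rate(*pair, kappa=2.0, omega=0.5))
```

In this parameterisation, `omega` below 1 slows amino acid changes (purifying selection), while `omega` above 1 speeds them up, which is the signal usually interpreted as positive selection.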
Our own collaborations with industry indicate that our new method will be in high demand not only in academic projects but also in the pharmaceutical and biotech industries. Reconstruction of molecular history with substitutions and indels is of interest to researchers across many domains, from evolution and ecology to applications in biomedicine, forensics, and protein engineering.
Frontiers in Bioinformatics.
Available from: https://doi.org/10.3389/fbinf.2021.691865