Why sequence alignment is important




















Post a Comment. Previous Post Next Post. Learn more Ok. Contact form. LinkList ul li ul'. A benchmarking procedure relies on existing collections of reference alignments considered as gold standards. These reference MSAs are routinely used as predictors for the accuracy of a given aligner on a given type of data sets and have had a major influence on methodological developments.

Existing protein benchmark collections were recently extensively and critically reviewed in [ ] and [ ] where the authors propose to group benchmarks in four categories: simulation based, consistency based, structure based and phylogeny based. The latter three categories meet the criterion of reference data sets, in that they can be pre-compiled and used to quantify the relative merits of one aligner over another. The simulation-based benchmarks, however, define an objective function rather than a benchmark procedure and cannot be considered a benchmark measure in the same sense as the others.

Main benchmark methods and their most relevant properties. On the heatmap, orange entries indicate a property describing a given method. Both properties and benchmarks were clustered by similarity. The growing need for large-scale aligners has resulted in the development of a new benchmark generation able to estimate alignment accuracy when assembling large data sets.

The main issue when doing so is the scarcity of structural information. To accommodate this limitation, reference data sets were built by embedding sequences with a known structure within larger data sets made of sequences with unknown structure. This approach already used in PREFAB [ 12 ]—with two sequences of known structure embedded within a data set of 50 sequences—has been extended in HomFam [ 14 , ], so as to define much larger data sets of up to sequences in which an average of 10 sequences with known structures are embedded.

When doing so, accuracy is estimated by first aligning the large data sets. The projections of sequences with known structures are then extracted and accuracy is quantified by comparing these projections with the reference. In this procedure, the main caveat lies in the assumption that the seed sequence accuracy reflects well the global data set.

This assumption is, however, only correct if the sequences with known structures are evenly distributed within the considered data set. Structure-based benchmarking does not necessarily depend on a reference alignment, and alternative methods have also been designed that rely on structural superposition rather than structural superposition-induced alignments. These developments were mostly the consequence of work by Lackner [ ], who reported on situations where the structure-based superposition is ambiguous enough to support equally well several alternative sequence alignments.

When this occurs, the reference alignment becomes the arbitrary prioritization of one reference over another, thus biasing the benchmark process.

Most reference benchmarks deal with this problem by specifying core regions in which the reference alignment is expected to be less ambiguous, but this procedure remains dependent on the way in which core regions are defined.

A more general alternative exists that involves comparing intra-molecular distances between pairs of aligned residue pairs. This measure, named iRMSD [ ], makes it possible to quantify the structural fit implied by an alignment without having to rely on a reference. Structural benchmarks have also been developed for RNA alignment evaluation Figure 2.

Three such benchmarks exist. It makes it possible to evaluate the accuracy of a multiple aligner on RNA sequences by considering the modeling capacity of the evaluated aligner with respect to some reference secondary structure.

This dependence on sequence on which the secondary structure estimation is based slightly limits its scope, as it implies common dependencies between the reference compilation and the evaluation procedure. BraliDart [ 76 ], a newer data set, that is only based on structural information and contains sets of homologous RNA families with known experimental structures, has been recently reported.

This data set is limited by the relative scarceness of experimental RNA 3D structures. Another specificity of BraliDart is its non-reliance on a reference structural alignment but rather on the structural fit implied by the sequence alignment using a distance RMSD measure, as defined by the iRMSD method. They have not been assembled for benchmarking purposes, but rather as a consequence of the importance of accurate ribosomal RNA rRNA alignments when estimating the tree of life.

These alignments have been done manually while taking into account highly conserved rRNA secondary structures that play critical roles in the ribosome functional capacities. At the time we write this review, no reference data set has yet been published to validate the MSAs of long non-coding RNAs, a recently described population of transcripts.

Although empirical data benchmarks are the most commonly used strategies to evaluate alignment methods, they remain limited by their dependence on structural data and the lack of such data for the evaluation of certain kinds of alignments—such as non-transcribed DNA.

Furthermore, it remains to be established to which extent structure-based alignments can be expected to be evolutionarily correct. This question is especially critical considering that phylogenetic modeling is one of the main applications of MSA modeling. A major issue of the most popular aligner methods is their systematic reliance, and possible tuning on structurally correct sequence alignments. These methods are, however, often used to carry out phylogenic reconstruction.

This inconsistency has long been pointed out by the evolutionary community, which routinely relies on simulated data sets rather than empirical ones [ ]. Simulated data sets rely on models mimicking evolution to generate sequences whose diversity is expected to represent a true evolutionary process.

The main strength of this approach is to provide a perfectly traceable model, in which the relationship between nucleotides or amino acids is explicitly known. Their most obvious drawback is to rely on evolutionary models assumed to be correct, while the true extent to which they represent biologically realistic scenarios remains unknown. In any case, these approaches are useful when estimating the impact of extreme conditions on modeling capacity, for instance accelerated evolution, long-branch attraction and similar effects that can confound standard analysis.

It is worth noting that whenever simulated and structure-based reference data sets have been used to validate similar algorithms for alignment accuracy, the rankings were found to differ significantly between these two groups of benchmarks, a clear indication that different alignment characteristics are being evaluated [ 4 , ].

All phylogeny-aware aligners are currently evaluated using these simulated data sets. When doing so, the evaluation is often done on tree modeling capacity rather than on the MSA itself. Such algorithms include [ , — ]. Resolving the apparent discrepancies between structure-based and simulated reference data sets will probably require a better understanding of the complex relation between alignment accuracy and trustworthy phylogenetic reconstruction.

Moving one step in this direction, Dessimoz and Gil recently introduced tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference [ ]. In an unrelated work [ 35 ], Chang and coauthors proposed the use of empirical data sets obtained by enriching collections of orthologous genes in families likely to support the Tree of Life.

When using such data sets, the discrepancy between phylogenetic and structural correctness appears to be less marked. MSA quality indexes and their features. Features with zero are not used by the specific quality index. With increasingly available structural data, the systematic use of 3D information for the monitoring of MSA accuracy is slowly becoming a realistic prospect. The first such methods [ 41 , ] were designed using the structural accuracy measured on all possible pairs of sequences with a known 3D structure as a proxy for global accuracy.

Recent efforts were therefore focused toward the use of single structures to estimate MSA accuracy. The CAO contact substitution matrix [ ] is one of the earliest work in this direction. The principle is to embed a sequence with a known structure in the MSA. Unfortunately, the estimation of this matrix is limited by the lack of available data. This problem was addressed by the STRIKE algorithm [ ], in which the contact substitution matrix is replaced with a contact potential metrics that considers the score of all potential contacts, as obtained from structural data.

When using this matrix to evaluate an MSA, column contacts—as implied by at least one embedded structure—are evaluated by summing the contact score found in the contact log-odd matrix. This approach was shown to be significantly superior to CAO as a mean to discriminate between alternative alignments.

Sequence conservation is one of the most straightforward ways of estimating MSA accuracy. A large number of tools have been developed for this purpose that roughly fall in two main categories: structural i.

The evolutionary indexes aim at identifying within an MSA all positions likely to hamper phylogenetic reconstruction. These indexes are usually focused on the removal of diverse columns or indel-enriched regions. The most commonly used tools are Gblocks [ , ] and trimAl [ ], a re-implementation of Gblocks using an automated parameterization procedure to adjust the filtering level. While these tools are extremely popular and form part of many large-scale phylogenetic pipelines, the actual value of column filtering remains a point of discussion.

Two recent reports suggest that filtering could decrease MSA phylogenetic modeling potential [ 28 , 35 ]. Similar tools have been developed to estimate the structural correctness of protein MSAs. The simplest ones like AL2CO [ ] merely measure conservation according to various physicochemical criterions. Columns and residues eventually get assigned an index value that can be used when doing modeling.

The most widely used MSA packages rely on a combination between the progressive algorithm and more or less sophisticated dynamic programming implementations, allowing pairwise alignments of sequences or profiles. These dependencies make these algorithms inherently unstable. Over the past few years, the development of methods able to quantify this instability to estimate local reliability has become a fast growing trend.

The idea of using robustness as an indicator of biological accuracy is not new and had already been used as early as [ ] in a procedure that involved removing in turn every pair of amino acid in a pair of sequences before realigning them, so as to assess local alignment stability. Later on, the T-Coffee objective function [ ] was used to show the predictive power of consistency. In general, any procedure that may be used to perturbate an alignment lends itself to the definition of a robustness index.

Such indexes can then be evaluated for their correlation with structural or phylogenetic modeling potential. The Head or Tail HoT procedure [ ] is a good example of a simple method sequences are simply inverted , yielding useful information at the cost of a moderate computational overhead. Other similar procedure albeit more costly have been described.

PSAR is one of them [ 97 ]. It is a method that involves generating several alternative MSAs while removing each sequence in turn. The main issue with these two approaches is their relatively high computational cost.

These methods are, however, much more informative than their sequence conservation alternatives. This review is an attempt to put in context and cover the developments that have taken place in the field of MSAs over the past decade or so. The unprecedented pace of development makes it difficult to be truly exhaustive.

We have nonetheless tried to provide the reader with an overview of the main aspects, and how they connect to one another. As shown in Figure 1 , the progressive alignment framework aligning the sequences following a tree-order is the main algorithmic heuristic that has been adopted by almost all existing alignment methods.

It is also worth noting that the current inflation in the number of available methods merely reflects the growing pace of data accumulation. MSA modeling is one of the most powerful ways to make sense of biological sequences. MSAMs, by their approximate nature, are doomed to follow a red-queen evolutionary strategy and will need to keep evolving, faster and faster, to keep up with the processing of standard biological data.

This review provides an overview on the development of Multiple Sequence Alignment MSA methods and their main applications. MSA method is one of the most powerful and widely used modeling methods in biology, and a series of algorithmic solutions has been proposed over the years for the alignment of evolutionarily related sequences, while taking into account evolutionary events such as mutations, insertions, deletions and rearrangement under certain conditions.

The main challenges for multiple sequence aligners will be to keep up with growing data set sizes and effectively deal with nucleic acid alignments. This work was supported by the Spanish Ministry of Economy and Competitiveness grant no. Currently, she is working at the Centre for Genomic Regulation, in Barcelona, Spain, conducting her doctoral studies in the field of Comparative Bioinformatics, with Dr Cedric Notredame as her supervisor. Her main research is about designing and deploying tools and methods that will facilitate the analysis of Big Biomedical Data, allow for biological discoveries and promote personalized medicine.

His research activities focuses on developing and evaluating bioinformatics tools for sequencing data and comparative genomics. Ionas Erb has a PhD in mathematics and a background in statistical physics. His work in the Center for Genomic Regulation CRG in Barcelona, Spain, focuses on multivariate statistical methods and their applications to the analysis of biological sequences, gene expression and behavioral data.

The top papers. Nature ; : — 3. Google Scholar. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

Nucleic Acids Res ; 22 : — A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One ; 6 : e Kemena C Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics ; 25 : — Edgar RC Batzoglou S. Multiple sequence alignment. Curr Opin Struct Biol ; 16 : — Notredame C Higgins DG. SAGA: sequence alignment by genetic algorithm.

Nucleic Acids Res ; 24 : — Hogeweg P Hesper B. The alignment of sets of sequences and the construction of phylogenetic trees: an integrated method. J Mol Evol ; 20 : — A general method applicable to the search for similarities in the amino acid sequence of two proteins.

J Mol Biol ; 48 : — T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol ; : — ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res ; 15 : — Evaluation of iterative alignment algorithms for multiple alignment.

Bioinformatics ; 21 : — Edgar RC. Nucleic Acids Res ; 32 : — 7. Nucleic Acids Res ; 30 : — Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol ; 7 : Saitou N Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees.

Mol Biol Evol ; 4 : — Murtagh F. Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q ; 1 : — Multiple alignment by aligning alignments. Bioinformatics ; 23 : i — Bioinformatics ; 14 : — 4. Kececioglu JD.

The maximum weight trace problem in multiple sequence alignment. Lect Notes Comput Sci ; : — A polyhedral approach to sequence alignment problems. Discret Appl Math ; : — Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment. Probabilistic models of proteins and nucleic acids. Biol Seq Anal ; 14 : — MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities.

Bioinformatics ; 26 : — Cloud-Coffee: implementation of a parallel consistency-based multiple alignment algorithm in the T-Coffee package and its benchmarking on the Amazon Elastic-Cloud. Bioinformatics ; 26 : — 4.

Segment-based multiple sequence alignment. Bioinformatics ; 24 : i — Since no real testset with large alignments exists, we were forced to use simulations to create this dataset. We used Rose version 1. Given a specified number of sequences and a target average evolutionary distance between them, Rose constructs a random phylogenetic tree, a random ancestor sequence at the root, and simulates evolution by applying substitutions, insertions, and deletions to create the sequences along the edges.

As all the events in the history of the sequences are known, the true alignment is known and is recorded. Obviously, alignments and sequences cannot be simulated perfectly by an evolutionary model. However, it is undeniable that also simulated sequences will become more and more difficult to align with increasing PAM distances.

Alignment programs will align distant sequences differently, and based on this a meaningful and informative comparison between the programs can be made. A large test set such as this offers the opportunity to analyze the running times of alignment methods in depth.

The alignment quality of each method was determined by calculating the sum-of-pairs score SP according to Thompson et al.

This score reflects the percentage of correctly aligned residues determined by comparison to a reference alignment, and has little in common with the likewise called sum-of-pairs score often used as an objective function. We do not use the column score CS [ 14 ] as we feel this score is inadequate and does not reflect the biological correctness of alignments. In practice, the CS score gives lower accuracies than the SP score, but the ranking of the methods remains the same results not shown.

We compared our method Kalign with the default parameters to Mafft version 3. A comprehensive review of the individual programs and their options is beyond this paper.

Unless otherwise stated, we used the programs tested here in their highest accuracy mode. Our experience in using Mafft revealed only small differences in quality and speed between the scripts using FFT or not results not shown. The average sum-of-pairs score was calculated for each category in Balibase and the results are shown in Table 1. Importantly, only conserved blocks in the Balibase alignments were used for evaluation.

On this test set, Kalign performs slightly worse on average than Mafft and Muscle. We do not believe that these differences are meaningful due to the small size of alignments on average 10 sequences in Balibase and due to the fact that there are so few of them Unfortunately, some of the methods tested here were developed using Balibase making comparisons to other methods problematic.

For example, both ClustalW and Muscle were trained on Balibase [ 36 ]. Given that we did not make an attempt to train on Balibase we believe Kalign performs considerably well in general, and especially well in category 2 and 3. The results Table 2 agree with the results obtained from Balibase, with Kalign being approximately as accurate as Muscle and Mafft.

It is worth noting that in out of the cases Kalign produced better alignments than Muscle and in cases better alignments than Mafft. Based on this we conclude that Muscle, Mafft and Kalign are equally accurate on Prefab, and that all three methods are more accurate than ClustalW. Kalign was between 4 and 7 times faster than the other methods in completing all alignments. To further evaluate the quality and speed of multiple alignment methods, we used our own large testset with up to sequences per alignment, and up to an evolutionary distance of PAM.

T-Coffee and Dialign were excluded from these tests due to computational limitations. To investigate the effect of the Wu-Manber algorithm, we generated a series of alignments with Rose, each containing 50 sequences with an average evolutionary distance ranging from 0 — Figure 2.

We tested the Kalign algorithm with the Wu-Manber string matching enabled Kalign-default and disabled Kalign-ktuple. To make a fair comparison with other methods, we included versions of Mafft and Muscle with the iterative refinement disabled FFTNSI and Muscle with the '-maxiters 1' option and ClustalW run with the '-quicktree' option.

In accordance with the results obtained on the Balibase and Prefab test set, ClustalW-quicktree is least accurate method. The alignment procedures used in Kalign-ktuple, Muscle-fast and Mafft-fast are comparable and as expected these methods are equally accurate.

The default Kalign algorithm using the Wu-Manber algorithm becomes more accurate than the k-tuple based methods at high evolutionary distances. This confirms that the Wu-Manber method is superior to the k-tuple method traditionally used to estimate sequence distances and that both Mafft and Muscle could potentially benefit from using it.

In order to examine the effects of evolutionary distance and number of sequences, we generated a test set containing alignments. The evolutionary distance was varied in steps of 20 up to and the number of sequences was gradually increased from 20 to sequences. For each individual alignment in this test the winner, i. Kalign shown in red generally wins in difficult cases of high evolutionary distance and in cases with many sequences.

The cases where Kalign does not win are alignments with few and highly similar sequences. Since these are relatively easy alignments the differences to Kalign are insignificant. To further demonstrate the differences in accuracy and to analyze the running time in detail, we focused on one row and one column in Figure 3. The row was taken at sequences, varying the evolutionary distance Figure 4 , and the column at an evolutionary distance of , varying the number of sequences Figure 5.

In contrast to Figure 3 , the tests were repeated three times and the average scores are plotted with error bars. Figure 4a reveals that there is little difference in terms of quality between the methods at low evolutionary distances. However, at high evolutionary distances, Kalign is superior to Mafft and Muscle.

In comparison to the other methods, ClustalW performs very poorly in this test. The analysis of the running times Figure 4b , reveals a strong tendency of Mafft and Muscle to run increasingly slower with increasing evolutionary distances.

In contrast, Kalign's complexity is not affected by evolutionary distance and is consistently faster than the other methods tested. ClustalW on the other hand, which starts off as the slowest method, becomes slightly faster with increased evolutionary distance and is in fact faster than both Mafft and Muscle at PAM.

The number of input sequences has a big effect on the running time of each method as the complexity of all alignment algorithms depend on it. Conversely, the more sequences that are used in an alignment, the better an alignment algorithm should perform.

To our surprise, the quality of all methods except for Kalign decreased when the number of input sequences was increased Figure 5a. The analysis of running time versus increased number of sequences reveals a clear advantage of the Kalign algorithm over other methods Figure 5b. Again Muscle is the slowest program; above sequences it is on average four times slower than the second slowest program, ClustalW.

Kalign takes 5 minutes to align sequences while the same alignment takes 90 minutes using Muscle. Although we consider alignment quality to be the more important than speed, it is clearly advantageous in practice to have a fast alignment method.

Analysis of the contribution to alignment accuracy made by different algorithmic variants. Kalign-default uses Wu-Manber approximate string matching, while Kalign-ktuple, Mafft-fast, Muscle-fast, and ClustalW-quicktree use exact k-tuple matching. The default Kalign Wu-Manber based algorithm becomes more accurate than other methods at high evolutionary distances.

The alignments consisted of 50 simulated sequences. A 2D plot indicating in which situations different methods perform better on the large testset. The accuracy of the most accurate versions of Kalign, Muscle, and Mafft was measured for each combination of average evolutionary distance in PAM units and number of sequences.

The cells were colored according to the most accurate program as: Kalign:red; Muscle:blue; Mafft:yellow. If there was a tie between two or more methods the cell is black. Up to PAM no program stands out as a clear winner while above this distance Kalign dominates.

The number of sequences and the average sequence length residues are kept constant. The evolutionary distance PAM and the average sequence length residues are kept constant. In this paper we present Kalign, a novel multiple sequence alignment algorithm based on Wu-Manber approximate pattern matching that combines high quality with high speed.

Compared to existing programs, Kalign performed much more robustly when aligning large amounts of sequences or distant sequences in a large-scale benchmark of generated alignments.

In terms of computational efficiency, Kalign is superior to the other methods, and readily aligns hundreds of sequences in minutes on a normal desktop computer. Coupled with the fact that Kalign gives very accurate alignments, this makes Kalign a very attractive overall method. The high accuracy of Kalign is due to the innovative use of the approximate Wu-Manber string-matching algorithm. BMC Bioinformatics. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Mol Biol Evol. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. ProbCons: Probabilistic consistency-based multiple sequence alignment. T-coffee: a novel method for fast and accurate multiple sequence alignment.

Zhou H, Zhou Y. SPEM: improving multiple sequence alignment with sequence profiles and predicted secondary structures. Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-coffee.

Align-m—a new algorithm for multiple alignment of highly divergent sequences. Lassmann T, Sonnhammer EL. Kalign—an accurate and fast multiple sequence alignment algorithm. Multiple sequence alignment using partial order graphs. Multiple alignment of protein sequences with repeats and rearrangements.

Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. Alignment uncertainty and genomic analysis. A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One. Comparative analysis of multiple protein-sequence alignment methods.

A new protein linear motif benchmark for multiple sequence alignment software. Protein Sci. Announcing the worldwide protein data Bank. Nat Struct Biol. Data growth and its impact on the SCOP database: new developments.

OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. SABmark—a benchmark for sequence alignment that covers the entire known fold space. Algorithms Mol Biol. Analysis and comparison of benchmarks for multiple sequence alignment.

In Silico Biol. Pei J. Multiple protein sequence alignment. Curr Opin Struct Biol. Edgar RC, Batzoglou S. Multiple sequence alignment. Quality measures for protein alignment benchmarks. Issues in bioinformatics benchmarking: the case study of multiple sequence alignment.

A comprehensive comparison of multiple sequence alignment programs. Quality assessment of multiple alignment programs. The accuracy of several multiple sequence alignment programs for proteins. Assessing the efficiency of multiple sequence alignment programs. A large-scale benchmark study of existing algorithms for taxonomy-independent microbial community analysis.

Brief Bioinform. Advanced computational algorithms for microbial community analysis using massive 16S rRNA sequence data. Cai Y, Sun Y. On the comparison of relative clustering validity criteria. SIAM; Dunn J. Well separated clusters and optimal fuzzy partitions.

J Cybern. A cluster separation measure. Quality scheme assessment in the clustering process. Lect Notes Comput Sci. Halkidi M, Vazirgiannis M. Clustering validity assessment: finding the optimal partitioning of a data set. Google Scholar. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis.

Comput Appl Math. Subhash S. Applied multivariate techniques. Wiley New York; Download references. You can also search for this author in PubMed Google Scholar. YYW performed the analyses and wrote the manuscript.

HYW analyzed the results and revised the manuscript. YPC conceived of the study and helped to draft and revise the manuscript. All authors read and approved the final manuscript. Correspondence to Hongyan Wu or Yunpeng Cai.



0コメント

  • 1000 / 1000