Changes

From Opengenome.net

Sequence alignment

12,019 bytes removed, 14:59, 18 October 2010
no edit summary
<div class="center">
<div class="thumb tnone">
<div style="WIDTH: 721px" class="thumbinner"><a class="image" href="/wiki/File:Zinc-finger-seq-alignment2.png"><img class="thumbimage" alt="" width="719" height="135" src="http://upload.wikimedia.org/wikipedia/commons/8/86/Zinc-finger-seq-alignment2.png" width="719" height="135" /></a>
<div class="thumbcaption">A sequence alignment, produced by <a class="mw-redirect" title="ClustalW" href="/wiki/ClustalW"><font color="#0645ad">ClustalW</font></a>, of two <a title="Human" href="/wiki/Human"><font color="#0645ad">human</font></a> <a title="Zinc finger" href="/wiki/Zinc_finger"><font color="#0645ad">zinc finger</font></a> proteins, identified on the left by <a title="GenBank" href="/wiki/GenBank"><font color="#0645ad">GenBank</font></a> accession number.<br />
Key: Single letters: <a title="Amino acid" href="/wiki/Amino_acid#Table_of_standard_amino_acid_abbreviations_and_side_chain_properties"><font color="#0645ad">amino acids</font></a>. Red: small, hydrophobic, aromatic, not Y. Blue: acidic. Magenta: basic. Green: hydroxyl, amine, amide, basic. Gray: others. &quot;*&quot;: identical. &quot;:&quot;: conserved substitutions (same colour group). &quot;.&quot;: semi-conserved substitution (similar shapes).<sup id="cite_ref-1" class="reference"><a href="#cite_note-1"><font size="1"><font color="#0645ad"><span>[</span>2<span>]</span></font></font></a></sup></div>
</div>
<p>Sequence alignments are also used for non-biological sequences, such as those present in <a title="Natural language" href="/wiki/Natural_language"><font color="#0645ad">natural language</font></a> or in financial data.</p>
<table id="toc" class="toc"> <tbody> <tr> <td> <div id="toctitle"> <h2>Contents</h2> <span class="toctoggle"><font size="2">[</font><a id="togglelink" class="internal" href="#"><font color="#0645ad" size="2">hide</font></a><font size="2">]</font></span></div> <ul> <li class="toclevel-1 tocsection-1"><a href="#Interpretation"><font color="#0645ad"><span class="tocnumber">1</span> <span class="toctext">Interpretation</span></font></a></li> <li class="toclevel-1 tocsection-2"><a href="#Alignment_methods"><font color="#0645ad"><span class="tocnumber">2</span> <span class="toctext">Alignment methods</span></font></a></li> <li class="toclevel-1 tocsection-3"><a href="#Representations"><font color="#0645ad"><span class="tocnumber">3</span> <span class="toctext">Representations</span></font></a></li> <li class="toclevel-1 tocsection-4"><a href="#Global_and_local_alignments"><font color="#0645ad"><span class="tocnumber">4</span> <span class="toctext">Global and local alignments</span></font></a></li> <li class="toclevel-1 tocsection-5"><a href="#Pairwise_alignment"><font color="#0645ad"><span class="tocnumber">5</span> <span class="toctext">Pairwise alignment</span></font></a> <ul> <li class="toclevel-2 tocsection-6"><a href="#Dot-matrix_methods"><font color="#0645ad"><span class="tocnumber">5.1</span> <span class="toctext">Dot-matrix methods</span></font></a></li> <li class="toclevel-2 tocsection-7"><a href="#Dynamic_programming"><font color="#0645ad"><span class="tocnumber">5.2</span> <span class="toctext">Dynamic programming</span></font></a></li> <li class="toclevel-2 tocsection-8"><a href="#Word_methods"><font color="#0645ad"><span class="tocnumber">5.3</span> <span class="toctext">Word methods</span></font></a></li> </ul> </li> <li class="toclevel-1 tocsection-9"><a href="#Multiple_sequence_alignment"><font color="#0645ad"><span class="tocnumber">6</span> <span class="toctext">Multiple sequence alignment</span></font></a> <ul> <li class="toclevel-2 tocsection-10"><a href="#Dynamic_programming_2"><font color="#0645ad"><span class="tocnumber">6.1</span> <span class="toctext">Dynamic programming</span></font></a></li> <li class="toclevel-2 tocsection-11"><a href="#Progressive_methods"><font color="#0645ad"><span class="tocnumber">6.2</span> <span class="toctext">Progressive methods</span></font></a></li> <li class="toclevel-2 tocsection-12"><a href="#Iterative_methods"><font color="#0645ad"><span class="tocnumber">6.3</span> <span class="toctext">Iterative methods</span></font></a></li> <li class="toclevel-2 tocsection-13"><a href="#Motif_finding"><font color="#0645ad"><span class="tocnumber">6.4</span> <span class="toctext">Motif finding</span></font></a></li> <li class="toclevel-2 tocsection-14"><a href="#Techniques_inspired_by_computer_science"><font color="#0645ad"><span class="tocnumber">6.5</span> <span class="toctext">Techniques inspired by computer science</span></font></a></li> </ul> </li> <li class="toclevel-1 tocsection-15"><a href="#Structural_alignment"><font color="#0645ad"><span class="tocnumber">7</span> <span class="toctext">Structural alignment</span></font></a> <ul> <li class="toclevel-2 tocsection-16"><a href="#DALI"><font color="#0645ad"><span class="tocnumber">7.1</span> <span class="toctext">DALI</span></font></a></li> <li class="toclevel-2 tocsection-17"><a href="#SSAP"><font color="#0645ad"><span class="tocnumber">7.2</span> <span class="toctext">SSAP</span></font></a></li> <li class="toclevel-2 tocsection-18"><a href="#Combinatorial_extension"><font color="#0645ad"><span class="tocnumber">7.3</span> <span class="toctext">Combinatorial extension</span></font></a></li> </ul> </li> <li class="toclevel-1 tocsection-19"><a href="#Phylogenetic_analysis"><font color="#0645ad"><span class="tocnumber">8</span> <span class="toctext">Phylogenetic analysis</span></font></a> <ul> <li class="toclevel-2 tocsection-20"><a href="#Assessment_of_significance"><font color="#0645ad"><span class="tocnumber">8.1</span> <span class="toctext">Assessment of significance</span></font></a></li> <li class="toclevel-2 tocsection-21"><a href="#Assessment_of_credibility"><font color="#0645ad"><span class="tocnumber">8.2</span> <span class="toctext">Assessment of credibility</span></font></a></li> <li class="toclevel-2 tocsection-22"><a href="#Scoring_functions"><font color="#0645ad"><span class="tocnumber">8.3</span> <span class="toctext">Scoring functions</span></font></a></li> </ul> </li> <li class="toclevel-1 tocsection-23"><a href="#Other_biological_uses"><font color="#0645ad"><span class="tocnumber">9</span> <span class="toctext">Other biological uses</span></font></a></li> <li class="toclevel-1 tocsection-24"><a href="#Non-biological_uses"><font color="#0645ad"><span class="tocnumber">10</span> <span class="toctext">Non-biological uses</span></font></a></li> <li class="toclevel-1 tocsection-25"><a href="#Software"><font color="#0645ad"><span class="tocnumber">11</span> <span class="toctext">Software</span></font></a></li> <li class="toclevel-1 tocsection-26"><a href="#References"><font color="#0645ad"><span class="tocnumber">12</span> <span class="toctext">References</span></font></a></li> </ul> </td> </tr> </tbody></table><script type="text/javascript">//<![CDATA[if (window.showTocToggle) { var tocShowText = "show"; var tocHideText = "hide"; showTocToggle(); } //]]></script><h2><span class="editsection">[<a title="Edit section: Interpretation" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=1"><font color="#0645ad">edit</font></a>]</span> <span id="Interpretation" class="mw-headline">Interpretation</span></h2>
<p>If two sequences in an alignment share a common ancestor, mismatches can be interpreted as <a title="Point mutation" href="/wiki/Point_mutation"><font color="#0645ad">point mutations</font></a> and gaps as <a title="Indel" href="/wiki/Indel"><font color="#0645ad">indels</font></a> (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In sequence alignments of proteins, the degree of similarity between <a title="Amino acid" href="/wiki/Amino_acid"><font color="#0645ad">amino acids</font></a> occupying a particular position in the sequence can be interpreted as a rough measure of how <a class="mw-redirect" title="Conservation (genetics)" href="/wiki/Conservation_(genetics)"><font color="#0645ad">conserved</font></a> a particular region or <a title="Sequence motif" href="/wiki/Sequence_motif"><font color="#0645ad">sequence motif</font></a> is among lineages. The absence of substitutions, or the presence of only very conservative substitutions (that is, the substitution of amino acids whose <a title="Side chain" href="/wiki/Side_chain"><font color="#0645ad">side chains</font></a> have similar biochemical properties) in a particular region of the sequence, suggest <sup id="cite_ref-predict_2-0" class="reference"><a href="#cite_note-predict-2"><font size="2"><font color="#0645ad"><span>[</span>3<span>]</span></font></font></a></sup> that this region has structural or functional importance. Although DNA and RNA <a title="Nucleotide" href="/wiki/Nucleotide"><font color="#0645ad">nucleotide</font></a> bases are more similar to each other than are amino acids, the conservation of base pairs can indicate a similar functional or structural role.</p>
<h2><span class="editsection">[<a title="Edit section: Alignment methods" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=2"><font color="#0645ad">edit</font></a>]</span> <span id="Alignment_methods" class="mw-headline">Alignment methods</span></h2>
<p>Very short or very similar sequences can be aligned by hand. However, most interesting problems require the alignment of lengthy, highly variable or extremely numerous sequences that cannot be aligned solely by human effort. Instead, human knowledge is applied in constructing algorithms to produce high-quality sequence alignments, and occasionally in adjusting the final results to reflect patterns that are difficult to represent algorithmically (especially in the case of nucleotide sequences). Computational approaches to sequence alignment generally fall into two categories: <em>global alignments</em> and <em>local alignments</em>. Calculating a global alignment is a form of <a title="Global optimization" href="/wiki/Global_optimization"><font color="#0645ad">global optimization</font></a> that &quot;forces&quot; the alignment to span the entire length of all query sequences. By contrast, local alignments identify regions of similarity within long sequences that are often widely divergent overall. Local alignments are often preferable, but can be more difficult to calculate because of the additional challenge of identifying the regions of similarity. A variety of computational algorithms have been applied to the sequence alignment problem, including slow but formally optimizing methods like <a title="Dynamic programming" href="/wiki/Dynamic_programming"><font color="#0645ad">dynamic programming</font></a>, and efficient, but not as thorough <a class="mw-redirect" title="Heuristic algorithm" href="/wiki/Heuristic_algorithm"><font color="#0645ad">heuristic algorithms</font></a> or <a title="Probability" href="/wiki/Probability"><font color="#0645ad">probabilistic</font></a> methods designed for large-scale database search.</p>
<h2><span class="editsection">[<a title="Edit section: Representations" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=3"><font color="#0645ad">edit</font></a>]</span> <span id="Representations" class="mw-headline">Representations</span></h2>
<p>Alignments are commonly represented both graphically and in text format. In almost all sequence alignment representations, sequences are written in rows arranged so that aligned residues appear in successive columns. In text formats, aligned columns containing identical or similar characters are indicated with a system of conservation symbols. As in the image above, an asterisk or pipe symbol is used to show identity between two columns; other less common symbols include a colon for conservative substitutions and a period for semiconservative substitutions. Many sequence visualization programs also use color to display information about the properties of the individual sequence elements; in DNA and RNA sequences, this equates to assigning each nucleotide its own color. In protein alignments, such as the one in the image above, color is often used to indicate amino acid properties to aid in judging the <a class="mw-redirect" title="Conservation (genetics)" href="/wiki/Conservation_(genetics)"><font color="#0645ad">conservation</font></a> of a given amino acid substitution. For multiple sequences the last row in each column is often the <a title="Consensus sequence" href="/wiki/Consensus_sequence"><font color="#0645ad">consensus sequence</font></a> determined by the alignment; the consensus sequence is also often represented in graphical format with a <a title="Sequence logo" href="/wiki/Sequence_logo"><font color="#0645ad">sequence logo</font></a> in which the size of each nucleotide or amino acid letter corresponds to its degree of conservation.<sup id="cite_ref-Schneider_3-0" class="reference"><a href="#cite_note-Schneider-3"><font size="2"><font color="#0645ad"><span>[</span>4<span>]</span></font></font></a></sup></p>
<p>Sequence alignments can be stored in a wide variety of text-based file formats, many of which were originally developed in conjunction with a specific alignment program or implementation. Most web-based tools allow a limited number of input and output formats, such as <a title="FASTA format" href="/wiki/FASTA_format"><font color="#0645ad">FASTA format</font></a> and <a title="GenBank" href="/wiki/GenBank"><font color="#0645ad">GenBank</font></a> format and the output is not easily editable. Several conversion programs are available,<sup class="noprint Inline-Template"><span style="WHITE-SPACE: nowrap" title=" since August 2009"><font size="2">[<em><a title="Wikipedia:Link rot" href="/wiki/Wikipedia:Link_rot"><font color="#0645ad">dead link</font></a></em>]</font></span></sup> <a class="external text" rel="nofollow" href="http://bioweb.pasteur.fr/seqanal/interfaces/readseq.html" rel="nofollow"><font color="#3366bb">READSEQ</font></a> or <a title="EMBOSS" href="/wiki/EMBOSS"><font color="#0645ad">EMBOSS</font></a> having a graphical interfaces or command line interfaces, while several programming packages like <a title="BioPerl" href="/wiki/BioPerl"><font color="#0645ad">BioPerl</font></a>, <a title="BioRuby" href="/wiki/BioRuby"><font color="#0645ad">BioRuby</font></a> provide functions to do this.</p><h2><span class="editsection">[<a title="Edit section: Global and local alignments" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=4"><font color="#0645ad">edit</font></a>]</span> <span id="Global_and_local_alignments" class="mw-headline">Global and local alignments</span></h2>
<div class="thumb tright">
<div style="WIDTH: 245px" class="thumbinner"><a class="image" href="/wiki/File:Global-local-alignment.png"><img class="thumbimage" alt="" width="243" height="109" src="http://upload.wikimedia.org/wikipedia/commons/4/4b/Global-local-alignment.png" width="243" height="109" /></a>
<div class="thumbcaption">Illustration of global and local alignments demonstrating the 'gappy' quality of global alignments that can occur if sequences are insufficiently similar</div>
</div>
<p>Global alignments, which attempt to align every residue in every sequence, are most useful when the sequences in the query set are similar and of roughly equal size. (This does not mean global alignments cannot end in gaps.) A general global alignment technique is the <a class="mw-redirect" title="Needleman-Wunsch algorithm" href="/wiki/Needleman-Wunsch_algorithm"><font color="#0645ad">Needleman-Wunsch algorithm</font></a>, which is based on dynamic programming. Local alignments are more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The <a class="mw-redirect" title="Smith-Waterman algorithm" href="/wiki/Smith-Waterman_algorithm"><font color="#0645ad">Smith-Waterman algorithm</font></a> is a general local alignment method also based on dynamic programming. With sufficiently similar sequences, there is no difference between local and global alignments.</p>
<p>Hybrid methods, known as semiglobal or &quot;glocal&quot; methods, attempt to find the best possible alignment that includes the start and end of one or the other sequence. This can be especially useful when the downstream part of one sequence overlaps with the upstream part of the other sequence. In this case, neither global nor local alignment is entirely appropriate: a global alignment would attempt to force the alignment to extend beyond the region of overlap, while a local alignment might not fully cover the region of overlap.<sup id="cite_ref-brudno_4-0" class="reference"><a href="#cite_note-brudno-4"><font size="2"><font color="#0645ad"><span>[</span>5<span>]</span></font></font></a></sup></p>
<h2><span class="editsection">[<a title="Edit section: Pairwise alignment" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=5"><font color="#0645ad">edit</font></a>]</span> <span id="Pairwise_alignment" class="mw-headline">Pairwise alignment</span></h2>
<p>Pairwise sequence alignment methods are used to find the best-matching piecewise (local) or global alignments of two query sequences. Pairwise alignments can only be used between two sequences at a time, but they are efficient to calculate and are often used for methods that do not require extreme precision (such as searching a database for sequences with high similarity to a query). The three primary methods of producing pairwise alignments are dot-matrix methods, dynamic programming, and word methods;<sup id="cite_ref-mount_0-1" class="reference"><a href="#cite_note-mount-0"><font size="2"><font color="#0645ad"><span>[</span>1<span>]</span></font></font></a></sup> however, multiple sequence alignment techniques can also align pairs of sequences. Although each method has its individual strengths and weaknesses, all three pairwise methods have difficulty with highly repetitive sequences of low <a class="mw-redirect" title="Information content" href="/wiki/Information_content"><font color="#0645ad">information content</font></a> - especially where the number of repetitions differ in the two sequences to be aligned. One way of quantifying the utility of a given pairwise alignment is the 'maximum unique match', or the longest subsequence that occurs in both query sequence. Longer MUM sequences typically reflect closer relatedness.</p>
<h3><span class="editsection">[<a title="Edit section: Dot-matrix methods" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=6"><font color="#0645ad">edit</font></a>]</span> <span id="Dot-matrix_methods" class="mw-headline">Dot-matrix methods</span></h3><table style="BORDER-BOTTOM: #aaa 1px solid; BORDER-LEFT: #aaa 1px solid; BACKGROUND-COLOR: #f9f9f9; BORDER-TOP: #aaa 1px solid; BORDER-RIGHT: #aaa 1px solid" class="metadata mbox-small plainlinks"> <tbody> <tr> <td class="mbox-image"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Wikiversity-logo.svg/40px-Wikiversity-logo.svg.png" width="40" height="32" /></td> <td class="mbox-text">Wikiversity has learning materials about <em><strong><a class="extiw" title="v:Dot-matrix methods" href="http://en.wikiversity.org/wiki/Dot-matrix_methods"><font color="#3366bb">Dot-matrix methods</font></a></strong></em></td> </tr> </tbody></table>
<div class="thumb tright">
<div style="WIDTH: 202px" class="thumbinner"><a class="image" href="/wiki/File:Zinc-finger-dot-plot.png"><font color="#3366bb"><img class="thumbimage" alt="" width="200" height="200" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Zinc-finger-dot-plot.png/200px-Zinc-finger-dot-plot.png" width="200" height="200" /></font></a>
<div class="thumbcaption">
<div class="magnify"><a class="internal" title="Enlarge" href="/wiki/File:Zinc-finger-dot-plot.png"><img alt="" width="15" height="11" src="http://bits.wikimedia.org/skins-1.5/common/images/magnify-clip.png" width="15" height="11" /></a></div>
A DNA dot plot of a <a title="Human" href="/wiki/Human"><font color="#0645ad">human</font></a> <a title="Zinc finger" href="/wiki/Zinc_finger"><font color="#0645ad">zinc finger</font></a> <a title="Transcription factor" href="/wiki/Transcription_factor"><font color="#0645ad">transcription factor</font></a> (GenBank ID NM_002383), showing regional <a title="Self-similarity" href="/wiki/Self-similarity"><font color="#0645ad">self-similarity</font></a>. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence. This is a typical example of a <a title="Recurrence plot" href="/wiki/Recurrence_plot"><font color="#0645ad">recurrence plot</font></a>.</div>
</div>
<p>Problems with dot plots as an information display technique include: noise, lack of clarity, non-intuitiveness, difficulty extracting match summary statistics and match positions on the two sequences. There is also much wasted space where the match data is inherently duplicated across the diagonal and most of the actual area of the plot is taken up by either empty space or noise, and, finally, dot-plots are limited to two sequences. None of these limitations apply to Miropeats alignment diagrams but they have their own particular flaws.</p>
<p>Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can be plotted against itself and regions that share significant similarities will appear as lines off the main diagonal. This effect can occur when a protein consists of multiple similar <a class="mw-redirect" title="Structural domain" href="/wiki/Structural_domain"><font color="#0645ad">structural domains</font></a>.</p>
<h3><span class="editsection">[<a title="Edit section: Dynamic programming" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=7"><font color="#0645ad">edit</font></a>]</span> <span id="Dynamic_programming" class="mw-headline">Dynamic programming</span></h3>
<p>The technique of <a title="Dynamic programming" href="/wiki/Dynamic_programming"><font color="#0645ad">dynamic programming</font></a> can be applied to produce global alignments via the <a class="mw-redirect" title="Needleman-Wunsch algorithm" href="/wiki/Needleman-Wunsch_algorithm"><font color="#0645ad">Needleman-Wunsch algorithm</font></a>, and local alignments via the <a class="mw-redirect" title="Smith-Waterman algorithm" href="/wiki/Smith-Waterman_algorithm"><font color="#0645ad">Smith-Waterman algorithm</font></a>. In typical usage, protein alignments use a <a title="Substitution matrix" href="/wiki/Substitution_matrix"><font color="#0645ad">substitution matrix</font></a> to assign scores to amino-acid matches or mismatches, and a <a title="Gap penalty" href="/wiki/Gap_penalty"><font color="#0645ad">gap penalty</font></a> for matching an amino acid in one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix, but in practice often simply assign a positive match score, a negative mismatch score, and a negative gap penalty. (In standard dynamic programming, the score of each amino acid position is independent of the identity of its neighbors, and therefore <a class="mw-redirect" title="Base stacking" href="/wiki/Base_stacking"><font color="#0645ad">base stacking</font></a> effects are not taken into account. However, it is possible to account for such effects by modifying the algorithm.) A common extension to standard linear gap costs, is the usage of two different gap penalties for opening a gap and for extending a gap. Typically the former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. Thus, the number of gaps in an alignment is usually reduced and residues and gaps are kept together, which typically makes more biological sense. The Gotoh algorithm implements affine gap costs by using three matrices.</p>
<p>Dynamic programming can be useful in aligning nucleotide to protein sequences, a task complicated by the need to take into account <a class="mw-redirect" title="Frameshift" href="/wiki/Frameshift"><font color="#0645ad">frameshift</font></a> mutations (usually insertions or deletions). The framesearch method produces a series of global or local pairwise alignments between a query nucleotide sequence and a search set of protein sequences, or vice versa. Although the method is very slow, its ability to evaluate frameshifts offset by an arbitrary number of nucleotides makes the method useful for sequences containing large numbers of indels, which can be very difficult to align with more efficient heuristic methods. In practice, the method requires large amounts of computing power or a system whose architecture is specialized for dynamic programming. The <a title="BLAST" href="/wiki/BLAST"><font color="#0645ad">BLAST</font></a> and <a title="EMBOSS" href="/wiki/EMBOSS"><font color="#0645ad">EMBOSS</font></a> suites provide basic tools for creating translated alignments (though some of these approaches take advantage of side-effects of sequence searching capabilities of the tools). More general methods are available from both commercial sources, such as <em>FrameSearch</em>, distributed as part of the <a title="Accelrys" href="/wiki/Accelrys"><font color="#0645ad">Accelrys</font></a> <a class="mw-redirect" title="GCG (software)" href="/wiki/GCG_(software)"><font color="#0645ad">GCG package</font></a>, and <a class="mw-redirect" title="Open Source" href="/wiki/Open_Source"><font color="#0645ad">Open Source</font></a> software such as <a class="external text" rel="nofollow" href="http://www.ebi.ac.uk/Wise2" rel="nofollow"><font color="#3366bb">Genewise</font></a>.</p>
<p>The dynamic programming method is guaranteed to find an optimal alignment given a particular scoring function; however, identifying a good scoring function is often an empirical rather than a theoretical matter. Although dynamic programming is extensible to more than two sequences, it is prohibitively slow for large numbers of or extremely long sequences.</p>
<h3><span class="editsection">[<a title="Edit section: Word methods" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=8"><font color="#0645ad">edit</font></a>]</span> <span id="Word_methods" class="mw-headline">Word methods</span></h3>
<p>Word methods, also known as <em>k</em>-tuple methods, are <a title="Heuristic" href="/wiki/Heuristic"><font color="#0645ad">heuristic</font></a> methods that are not guaranteed to find an optimal alignment solution, but are significantly more efficient than dynamic programming. These methods are especially useful in large-scale database searches where it is understood that a large proportion of the candidate sequences will have essentially no significant match with the query sequence. Word methods are best known for their implementation in the database search tools <a title="FASTA" href="/wiki/FASTA"><font color="#0645ad">FASTA</font></a> and the <a title="BLAST" href="/wiki/BLAST"><font color="#0645ad">BLAST</font></a> family.<sup id="cite_ref-mount_0-2" class="reference"><a href="#cite_note-mount-0"><font size="2"><font color="#0645ad"><span>[</span>1<span>]</span></font></font></a></sup> Word methods identify a series of short, nonoverlapping subsequences (&quot;words&quot;) in the query sequence that are then matched to candidate database sequences. The relative positions of the word in the two sequences being compared are subtracted to obtain an offset; this will indicate a region of alignment if multiple distinct words produce the same offset. Only if this region is detected do these methods apply more sensitive alignment criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity are eliminated.</p>
<p>In the FASTA method, the user defines a value <em>k</em> to use as the word length with which to search the database. The method is slower but more sensitive at lower values of <em>k</em>, which are also preferred for searches involving a very short query sequence. The BLAST family of search methods provides a number of algorithms optimized for particular types of queries, such as searching for distantly related sequence matches. BLAST was developed to provide a faster alternative to FASTA without sacrificing much accuracy; like FASTA, BLAST uses a word search of length <em>k</em>, but evaluates only the most significant word matches, rather than every word match as does FASTA. Most BLAST implementations use a fixed default word length that is optimized for the query and database type, and that is changed only under special circumstances, such as when searching with repetitive or very short query sequences. Implementations can be found via a number of web portals, such as <a class="external text" rel="nofollow" href="http://www.ebi.ac.uk/fasta33/" rel="nofollow"><font color="#3366bb">EMBL FASTA</font></a> and <a class="external text" rel="nofollow" href="http://www.ncbi.nlm.nih.gov/BLAST/" rel="nofollow"><font color="#3366bb">NCBI BLAST</font></a>.</p><h2><span class="editsection">[<a title="Edit section: Multiple sequence alignment" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=9"><font color="#0645ad">edit</font></a>]</span> <span id="Multiple_sequence_alignment" class="mw-headline">Multiple sequence alignment</span></h2>
<div class="rellink relarticle mainarticle">Main article: <a title="Multiple sequence alignment" href="/wiki/Multiple_sequence_alignment"><font color="#0645ad">Multiple sequence alignment</font></a></div>
<div class="thumb tright">
<div style="WIDTH: 302px" class="thumbinner"><a class="image" href="/wiki/File:Hemagglutinin-alignments.png"><font color="#0645ad"><img class="thumbimage" alt="" width="300" height="322" src="http://upload.wikimedia.org/wikipedia/commons/thumb/b/b4/Hemagglutinin-alignments.png/300px-Hemagglutinin-alignments.png" width="300" height="322" /></font></a>
<div class="thumbcaption">
<div class="magnify"><a class="internal" title="Enlarge" href="/wiki/File:Hemagglutinin-alignments.png"><img alt="" width="15" height="11" src="http://bits.wikimedia.org/skins-1.5/common/images/magnify-clip.png" width="15" height="11" /></a></div>
Alignment of 27 <a title="Avian influenza" href="/wiki/Avian_influenza"><font color="#0645ad">avian influenza</font></a> <a title="Hemagglutinin" href="/wiki/Hemagglutinin"><font color="#0645ad">hemagglutinin</font></a> protein sequences colored by residue conservation (top) and residue properties (bottom)</div>
</div>
</div>
<p><a title="Multiple sequence alignment" href="/wiki/Multiple_sequence_alignment"><font color="#0645ad">Multiple sequence alignment</font></a> is an extension of pairwise alignment to incorporate more than two sequences at a time. Multiple alignment methods try to align all of the sequences in a given query set. Multiple alignments are often used in identifying <a class="mw-redirect" title="Conservation (genetics)" href="/wiki/Conservation_(genetics)"><font color="#0645ad">conserved</font></a> sequence regions across a group of sequences hypothesized to be evolutionarily related. Such conserved sequence motifs can be used in conjunction with structural and <a title="Reaction mechanism" href="/wiki/Reaction_mechanism"><font color="#0645ad">mechanistic</font></a> information to locate the catalytic <a title="Active site" href="/wiki/Active_site"><font color="#0645ad">active sites</font></a> of <a title="Enzyme" href="/wiki/Enzyme"><font color="#0645ad">enzymes</font></a>. Alignments are also used to aid in establishing evolutionary relationships by constructing <a title="Phylogenetic tree" href="/wiki/Phylogenetic_tree"><font color="#0645ad">phylogenetic trees</font></a>. Multiple sequence alignments are computationally difficult to produce and most formulations of the problem lead to <a title="NP-complete" href="/wiki/NP-complete"><font color="#0645ad">NP-complete</font></a> combinatorial optimization problems.<sup id="cite_ref-wang_5-0" class="reference"><a href="#cite_note-wang-5"><font size="2"><font color="#0645ad"><span>[</span>6<span>]</span></font></font></a></sup><sup id="cite_ref-elias_6-0" class="reference"><a href="#cite_note-elias-6"><font size="2"><font color="#0645ad"><span>[</span>7<span>]</span></font></font></a></sup> Nevertheless, the utility of these alignments in bioinformatics has led to the development of a variety of methods suitable for aligning three or more sequences.</p>
<h3><span class="editsection">[<a title="Edit section: Dynamic programming" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=10"><font color="#0645ad">edit</font></a>]</span> <span id="Dynamic_programming_2" class="mw-headline">Dynamic programming</span></h3><p>The technique of dynamic programming is theoretically applicable to any number of sequences; however, because it is computationally expensive in both time and <a title="Computer memory" href="/wiki/Computer_memory"><font color="#0645ad">memory</font></a>, it is rarely used for more than three or four sequences in its most basic form. This method requires constructing the <em>n</em>-dimensional equivalent of the sequence matrix formed from two sequences, where <em>n</em> is the number of sequences in the query. Standard dynamic programming is first used on all pairs of query sequences and then the &quot;alignment space&quot; is filled in by considering possible matches or gaps at intermediate positions, eventually constructing an alignment essentially between each two-sequence alignment. Although this technique is computationally expensive, its guarantee of a global optimum solution is useful in cases where only a few sequences need to be aligned accurately. One method for reducing the computational demands of dynamic programming, which relies on the &quot;sum of pairs&quot; <a class="mw-redirect" title="Objective function" href="/wiki/Objective_function"><font color="#0645ad">objective function</font></a>, has been implemented in the <a class="external text" rel="nofollow" href="http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html" rel="nofollow"><font color="#3366bb">MSA</font></a> software package.<sup id="cite_ref-lipman_7-0" class="reference"><a href="#cite_note-lipman-7"><font size="2"><font color="#0645ad"><span>[</span>8<span>]</span></font></font></a></sup></p><h3><span class="editsection">[<a title="Edit section: Progressive methods" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=11"><font color="#0645ad">edit</font></a>]</span> <span id="Progressive_methods" class="mw-headline">Progressive methods</span></h3>
<p>Progressive, hierarchical, or tree methods generate a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons that may include heuristic pairwise alignment methods similar to <a title="FASTA" href="/wiki/FASTA"><font color="#0645ad">FASTA</font></a>. Progressive alignment results are dependent on the choice of &quot;most related&quot; sequences and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most progressive multiple sequence alignment methods additionally weight the sequences in the query set according to their relatedness, which reduces the likelihood of making a poor choice of initial sequences and thus improves alignment accuracy.</p>
<p>Many variations of the <a title="Clustal" href="/wiki/Clustal"><font color="#0645ad">Clustal</font></a> progressive implementation<sup id="cite_ref-higgins_8-0" class="reference"><a href="#cite_note-higgins-8"><font size="2"><font color="#0645ad"><span>[</span>9<span>]</span></font></font></a></sup><sup id="cite_ref-thompson_9-0" class="reference"><a href="#cite_note-thompson-9"><font size="2"><font color="#0645ad"><span>[</span>10<span>]</span></font></font></a></sup><sup id="cite_ref-chenna_10-0" class="reference"><a href="#cite_note-chenna-10"><font size="2"><font color="#0645ad"><span>[</span>11<span>]</span></font></font></a></sup> are used for multiple sequence alignment, phylogenetic tree construction, and as input for <a title="Protein structure prediction" href="/wiki/Protein_structure_prediction"><font color="#0645ad">protein structure prediction</font></a>. A slower but more accurate variant of the progressive method is known as <a title="T-Coffee" href="/wiki/T-Coffee"><font color="#0645ad">T-Coffee</font></a><sup id="cite_ref-notredame_11-0" class="reference"><a href="#cite_note-notredame-11"><font size="2"><font color="#0645ad"><span>[</span>12<span>]</span></font></font></a></sup>.</p>
<h3><span class="editsection">[<a title="Edit section: Iterative methods" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=12"><font color="#0645ad">edit</font></a>]</span> <span id="Iterative_methods" class="mw-headline">Iterative methods</span></h3>
<p>Iterative methods attempt to improve on the weak point of the progressive methods, the heavy dependence on the accuracy of the initial pairwise alignments. Iterative methods optimize an <a class="mw-redirect" title="Objective function" href="/wiki/Objective_function"><font color="#0645ad">objective function</font></a> based on a selected alignment scoring method by assigning an initial global alignment and then realigning sequence subsets. The realigned subsets are then themselves aligned to produce the next iteration's multiple sequence alignment. Various ways of selecting the sequence subgroups and objective function are reviewed in.<sup id="cite_ref-hirosawa_12-0" class="reference"><a href="#cite_note-hirosawa-12"><font size="2"><font color="#0645ad"><span>[</span>13<span>]</span></font></font></a></sup></p>
<h3><span class="editsection">[<a title="Edit section: Motif finding" href="/w/index.php?title=Sequence_alignment&ampnbsp;action=edit&amp;section=13"><font color="#0645ad">edit</font></a>]</span> <span id="Motif_finding" class="mw-headline">Motif finding</span></h3>
<p>Motif finding, also known as profile analysis, constructs global multiple sequence alignments that attempt to align short conserved <a title="Sequence motif" href="/wiki/Sequence_motif"><font color="#0645ad">sequence motifs</font></a> among the sequences in the query set. This is usually done by first constructing a general global multiple sequence alignment, after which the highly <a class="mw-redirect" title="Conservation (genetics)" href="/wiki/Conservation_(genetics)"><font color="#0645ad">conserved</font></a> regions are isolated and used to construct a set of profile matrices. The profile matrix for each conserved region is arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide at each position are derived from the conserved region's character distribution rather than from a more general empirical distribution. The profile matrices are then used to search other sequences for occurrences of the motif they characterize. In cases where the original <a title="Data set" href="/wiki/Data_set"><font color="#0645ad">data set</font></a> contained a small number of sequences, or only highly related sequences, <a title="Pseudocount" href="/wiki/Pseudocount"><font color="#0645ad">pseudocounts</font></a> are added to normalize the character distributions represented in the motif.</p>
<h3><span class="editsection">[<a title="Edit section: Techniques inspired by computer science" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=14"><font color="#0645ad">edit</font></a>]</span> <span id="Techniques_inspired_by_computer_science" class="mw-headline">Techniques inspired by computer science</span></h3>
<p>A variety of general <a title="Optimization (mathematics)" href="/wiki/Optimization_(mathematics)"><font color="#0645ad">optimization</font></a> algorithms commonly used in computer science have also been applied to the multiple sequence alignment problem. <a title="Hidden Markov model" href="/wiki/Hidden_Markov_model"><font color="#0645ad">Hidden Markov models</font></a> have been used to produce probability scores for a family of possible multiple sequence alignments for a given query set; although early HMM-based methods produced underwhelming performance, later applications have found them especially effective in detecting remotely related sequences because they are less susceptible to noise created by conservative or semiconservative substitutions.<sup id="cite_ref-karplus_13-0" class="reference"><a href="#cite_note-karplus-13"><font size="2"><font color="#0645ad"><span>[</span>14<span>]</span></font></font></a></sup> <a title="Genetic algorithm" href="/wiki/Genetic_algorithm"><font color="#0645ad">Genetic algorithms</font></a> and <a title="Simulated annealing" href="/wiki/Simulated_annealing"><font color="#0645ad">simulated annealing</font></a> have also been used in optimizing multiple sequence alignment scores as judged by a scoring function like the sum-of-pairs method. More complete details and software packages can be found in the main article <a title="Multiple sequence alignment" href="/wiki/Multiple_sequence_alignment"><font color="#0645ad">multiple sequence alignment</font></a>.</p>
<h2><span class="editsection">[<a title="Edit section: Structural alignment" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=15"><font color="#0645ad">edit</font></a>]</span> <span id="Structural_alignment" class="mw-headline">Structural alignment</span></h2><div class="rellink relarticle mainarticle">Main article: <a title="Structural alignment" href="/wiki/Structural_alignment"><font color="#0645ad">Structural alignment</font></a></div>
<p>Structural alignments, which are usually specific to protein and sometimes RNA sequences, use information about the <a class="mw-redirect" title="Secondary structure" href="/wiki/Secondary_structure"><font color="#0645ad">secondary</font></a> and <a class="mw-redirect" title="Tertiary structure" href="/wiki/Tertiary_structure"><font color="#0645ad">tertiary structure</font></a> of the protein or RNA molecule to aid in aligning the sequences. These methods can be used for two or more sequences and typically produce local alignments; however, because they depend on the availability of structural information, they can only be used for sequences whose corresponding structures are known (usually through <a title="X-ray crystallography" href="/wiki/X-ray_crystallography"><font color="#0645ad">X-ray crystallography</font></a> or <a title="NMR spectroscopy" href="/wiki/NMR_spectroscopy"><font color="#0645ad">NMR spectroscopy</font></a>). Because both protein and RNA structure is more evolutionarily conserved than sequence,<sup id="cite_ref-chothia_14-0" class="reference"><a href="#cite_note-chothia-14"><font size="2"><font color="#0645ad"><span>[</span>15<span>]</span></font></font></a></sup> structural alignments can be more reliable between sequences that are very distantly related and that have diverged so extensively that sequence comparison cannot reliably detect their similarity.</p>
<p>Structural alignments are used as the &quot;gold standard&quot; in evaluating alignments for homology-based <a title="Protein structure prediction" href="/wiki/Protein_structure_prediction"><font color="#0645ad">protein structure prediction</font></a><sup id="cite_ref-skolnick_15-0" class="reference"><a href="#cite_note-skolnick-15"><font size="2"><font color="#0645ad"><span>[</span>16<span>]</span></font></font></a></sup> because they explicitly align regions of the protein sequence that are structurally similar rather than relying exclusively on sequence information. However, clearly structural alignments cannot be used in structure prediction because at least one sequence in the query set is the target to be modeled, for which the structure is not known. It has been shown that, given the structural alignment between a target and a template sequence, highly accurate models of the target protein sequence can be produced; a major stumbling block in homology-based structure prediction is the production of structurally accurate alignments given only sequence information.<sup id="cite_ref-skolnick_15-1" class="reference"><a href="#cite_note-skolnick-15"><font size="2"><font color="#0645ad"><span>[</span>16<span>]</span></font></font></a></sup></p>
<h3><span class="editsection">[<a title="Edit section: DALI" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=16"><font color="#0645ad">edit</font></a>]</span> <span id="DALI" class="mw-headline">DALI</span></h3><p>The DALI method, or <a title="Distance matrix" href="/wiki/Distance_matrix"><font color="#0645ad">distance matrix</font></a> alignment, is a fragment-based method for constructing structural alignments based on contact similarity patterns between successive hexapeptides in the query sequences.<sup id="cite_ref-holm_16-0" class="reference"><a href="#cite_note-holm-16"><font size="2"><font color="#0645ad"><span>[</span>17<span>]</span></font></font></a></sup> It can generate pairwise or multiple alignments and identify a query sequence's structural neighbors in the <a title="Protein Data Bank" href="/wiki/Protein_Data_Bank"><font color="#0645ad">Protein Data Bank</font></a> (PDB). It has been used to construct the <a title="Families of structurally similar proteins" href="/wiki/Families_of_structurally_similar_proteins"><font color="#0645ad">FSSP</font></a> structural alignment database (Fold classification based on Structure-Structure alignment of Proteins, or Families of Structurally Similar Proteins). A DALI webserver can be accessed at <a class="external text" rel="nofollow" href="http://www.ebi.ac.uk/dali/" rel="nofollow"><font color="#3366bb">EBI DALI</font></a> and the FSSP is located at <a class="external text" rel="nofollow" href="http://ekhidna.biocenter.helsinki.fi/dali/start" rel="nofollow"><font color="#3366bb">The Dali Database</font></a>.</p><h3><span class="editsection">[<a title="Edit section: SSAP" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=17"><font color="#0645ad">edit</font></a>]</span> <span id="SSAP" class="mw-headline">SSAP</span></h3><p>SSAP (sequential structure alignment program) is a dynamic programming-based method of structural alignment that uses atom-to-atom vectors in structure space as comparison points. It has been extended since its original description to include multiple as well as pairwise alignments,<sup id="cite_ref-taylor_17-0" class="reference"><a href="#cite_note-taylor-17"><font size="2"><font color="#0645ad"><span>[</span>18<span>]</span></font></font></a></sup> and has been used in the construction of the <a title="CATH" href="/wiki/CATH"><font color="#0645ad">CATH</font></a> (Class, Architecture, Topology, Homology) hierarchical database classification of protein folds.<sup id="cite_ref-orengo_18-0" class="reference"><a href="#cite_note-orengo-18"><font size="2"><font color="#0645ad"><span>[</span>19<span>]</span></font></font></a></sup> The CATH database can be accessed at <a class="external text" rel="nofollow" href="http://www.cathdb.info/" rel="nofollow"><font color="#3366bb">CATH Protein Structure Classification</font></a>.</p><h3><span class="editsection">[<a title="Edit section: Combinatorial extension" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=18"><font color="#0645ad">edit</font></a>]</span> <span id="Combinatorial_extension" class="mw-headline">Combinatorial extension</span></h3><p>The combinatorial extension method of structural alignment generates a pairwise structural alignment by using local geometry to align short fragments of the two proteins being analyzed and then assembles these fragments into a larger alignment.<sup id="cite_ref-shindyalov_19-0" class="reference"><a href="#cite_note-shindyalov-19"><font size="2"><font color="#0645ad"><span>[</span>20<span>]</span></font></font></a></sup> Based on measures such as rigid-body <a title="Root mean square deviation (bioinformatics)" href="/wiki/Root_mean_square_deviation_(bioinformatics)"><font color="#0645ad">root mean square distance</font></a>, residue distances, local secondary structure, and surrounding environmental features such as residue neighbor <a class="mw-redirect" title="Hydrophobic" href="/wiki/Hydrophobic"><font color="#0645ad">hydrophobicity</font></a>, local alignments called &quot;aligned fragment pairs&quot; are generated and used to build a similarity matrix representing all possible structural alignments within predefined cutoff criteria. A path from one protein structure state to the other is then traced through the matrix by extending the growing alignment one fragment at a time. The optimal such path defines the combinatorial-extension alignment. A web-based server implementing the method and providing a database of pairwise alignments of structures in the Protein Data Bank is located at the <a class="external text" rel="nofollow" href="http://web.archive.org/web/cl.sdsc.edu/" rel="nofollow"><font color="#3366bb">Combinatorial Extension</font></a> website.</p><h2><span class="editsection">[<a title="Edit section: Phylogenetic analysis" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=19"><font color="#0645ad">edit</font></a>]</span> <span id="Phylogenetic_analysis" class="mw-headline">Phylogenetic analysis</span></h2>
<div class="rellink relarticle mainarticle">Main article: <a title="Computational phylogenetics" href="/wiki/Computational_phylogenetics"><font color="#0645ad">Computational phylogenetics</font></a></div>
<p>Phylogenetics and sequence alignment are closely related fields due to the shared necessity of evaluating sequence relatedness. The field of <a title="Phylogenetics" href="/wiki/Phylogenetics"><font color="#0645ad">phylogenetics</font></a> makes extensive use of sequence alignments in the construction and interpretation of <a title="Phylogenetic tree" href="/wiki/Phylogenetic_tree"><font color="#0645ad">phylogenetic trees</font></a>, which are used to classify the evolutionary relationships between homologous <a title="Gene" href="/wiki/Gene"><font color="#0645ad">genes</font></a> represented in the <a title="Genome" href="/wiki/Genome"><font color="#0645ad">genomes</font></a> of divergent species. The degree to which sequences in a query set differ is qualitatively related to the sequences' evolutionary distance from one another. Roughly speaking, high sequence identity suggests that the sequences in question have a comparatively young <a title="Most recent common ancestor" href="/wiki/Most_recent_common_ancestor"><font color="#0645ad">most recent common ancestor</font></a>, while low identity suggests that the divergence is more ancient. This approximation, which reflects the &quot;<a title="Molecular clock" href="/wiki/Molecular_clock"><font color="#0645ad">molecular clock</font></a>&quot; hypothesis that a roughly constant rate of evolutionary change can be used to extrapolate the elapsed time since two genes first diverged (that is, the <a class="mw-redirect" title="Coalescence (genetics)" href="/wiki/Coalescence_(genetics)"><font color="#0645ad">coalescence</font></a> time), assumes that the effects of mutation and <a title="Natural selection" href="/wiki/Natural_selection"><font color="#0645ad">selection</font></a> are constant across sequence lineages. Therefore it does not account for possible difference among organisms or species in the rates of <a title="DNA repair" href="/wiki/DNA_repair"><font color="#0645ad">DNA repair</font></a> or the possible functional conservation of specific regions in a sequence. (In the case of nucleotide sequences, the molecular clock hypothesis in its most basic form also discounts the difference in acceptance rates between <a title="Silent mutation" href="/wiki/Silent_mutation"><font color="#0645ad">silent mutations</font></a> that do not alter the meaning of a given <a class="mw-redirect" title="Codon" href="/wiki/Codon"><font color="#0645ad">codon</font></a> and other mutations that result in a different <a title="Amino acid" href="/wiki/Amino_acid"><font color="#0645ad">amino acid</font></a> being incorporated into the protein.) More statistically accurate methods allow the evolutionary rate on each branch of the phylogenetic tree to vary, thus producing better estimates of coalescence times for genes.</p>
<p>Progressive multiple alignment techniques produce a phylogenetic tree by necessity because they incorporate sequences into the growing alignment in order of relatedness. Other techniques that assemble multiple sequence alignments and phylogenetic trees score and sort trees first and calculate a multiple sequence alignment from the highest-scoring tree. Commonly used methods of phylogenetic tree construction are mainly <a title="Heuristic" href="/wiki/Heuristic"><font color="#0645ad">heuristic</font></a> because the problem of selecting the optimal tree, like the problem of selecting the optimal multiple sequence alignment, is <a title="NP-hard" href="/wiki/NP-hard"><font color="#0645ad">NP-hard</font></a>.<sup id="cite_ref-felsenstein_20-0" class="reference"><a href="#cite_note-felsenstein-20"><font size="2"><font color="#0645ad"><span>[</span>21<span>]</span></font></font></a></sup></p>
<h3><span class="editsection">[<a title="Edit section: Assessment of significance" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=20"><font color="#0645ad">edit</font></a>]</span> <span id="Assessment_of_significance" class="mw-headline">Assessment of significance</span></h3>
<p>Sequence alignments are useful in bioinformatics for identifying sequence similarity, producing phylogenetic trees, and developing homology models of protein structures. However, the biological relevance of sequence alignments is not always clear. Alignments are often assumed to reflect a degree of evolutionary change between sequences descended from a common ancestor; however, it is formally possible that <a title="Convergent evolution" href="/wiki/Convergent_evolution"><font color="#0645ad">convergent evolution</font></a> can occur to produce apparent similarity between proteins that are evolutionarily unrelated but perform similar functions and have similar structures.</p>
<p>In database searches such as BLAST, statistical methods can determine the likelihood of a particular alignment between sequences or sequence regions arising by chance given the size and composition of the database being searched. These values can vary significantly depending on the search space. In particular, the likelihood of finding a given alignment by chance increases if the database consists only of sequences from the same organism as the query sequence. Repetitive sequences in the database or query can also distort both the search results and the assessment of statistical significance; BLAST automatically filters such repetitive sequences in the query to avoid apparent hits that are statistical artifacts.</p>
<p>Methods of statistical significance estimation for gapped sequence alignments are available in the literature.<sup id="cite_ref-newberg_21-0" class="reference"><a href="#cite_note-newberg-21"><font size="2"><font color="#0645ad"><span>[</span>22<span>]</span></font></font></a></sup><sup id="cite_ref-eddy_22-0" class="reference"><a href="#cite_note-eddy-22"><font size="2"><font color="#0645ad"><span>[</span>23<span>]</span></font></font></a></sup></p>
<h3><span class="editsection">[<a title="Edit section: Assessment of credibility" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=21"><font color="#0645ad">edit</font></a>]</span> <span id="Assessment_of_credibility" class="mw-headline">Assessment of credibility</span></h3>
<p>Statistical significance indicates the probability that an alignment of a given quality could arise by chance, but does not indicate how much superior a given alignment is to alternative alignments of the same sequences. Measures of alignment credibility indicate the extent to which the best scoring alignments for a given pair of sequences are substantially similar. Methods of alignment credibility estimation for gapped sequence alignments are available in the literature.<sup id="cite_ref-NewbergLawrence2009_23-0" class="reference"><a href="#cite_note-NewbergLawrence2009-23"><font size="2"><font color="#0645ad"><span>[</span>24<span>]</span></font></font></a></sup></p>
<h3><span class="editsection">[<a title="Edit section: Scoring functions" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=22"><font color="#0645ad">edit</font></a>]</span> <span id="Scoring_functions" class="mw-headline">Scoring functions</span></h3>
<p>The choice of a scoring function that reflects biological or statistical observations about known sequences is important to producing good alignments. Protein sequences are frequently aligned using <a title="Substitution matrix" href="/wiki/Substitution_matrix"><font color="#0645ad">substitution matrices</font></a> that reflect the probabilities of given character-to-character substitutions. A series of matrices called <a title="Point accepted mutation" href="/wiki/Point_accepted_mutation"><font color="#0645ad">PAM matrices</font></a> (Point Accepted Mutation matrices, originally defined by <a class="mw-redirect" title="Margaret Dayhoff" href="/wiki/Margaret_Dayhoff"><font color="#0645ad">Margaret Dayhoff</font></a> and sometimes referred to as &quot;Dayhoff matrices&quot;) explicitly encode evolutionary approximations regarding the rates and probabilities of particular amino acid mutations. Another common series of scoring matrices, known as <a title="BLOSUM" href="/wiki/BLOSUM"><font color="#0645ad">BLOSUM</font></a> (Blocks Substitution Matrix), encodes empirically derived substitution probabilities. Variants of both types of matrices are used to detect sequences with differing levels of divergence, thus allowing users of BLAST or FASTA to restrict searches to more closely related matches or expand to detect more divergent sequences. <a title="Gap penalty" href="/wiki/Gap_penalty"><font color="#0645ad">Gap penalties</font></a> account for the introduction of a gap - on the evolutionary model, an insertion or deletion mutation - in both nucleotide and protein sequences, and therefore the penalty values should be proportional to the expected rate of such mutations. The quality of the alignments produced therefore depends on the quality of the scoring function.</p>
<p>It can be very useful and instructive to try the same alignment several times with different choices for scoring matrix and/or gap penalty values and compare the results. Regions where the solution is weak or non-unique can often be identified by observing which regions of the alignment are robust to variations in alignment parameters.</p>
<h2><span class="editsection">[<a title="Edit section: Other biological uses" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=23"><font color="#0645ad">edit</font></a>]</span> <span id="Other_biological_uses" class="mw-headline">Other biological uses</span></h2>
<p>Sequenced RNA, such as <a class="mw-redirect" title="Expressed sequence tags" href="/wiki/Expressed_sequence_tags"><font color="#0645ad">expressed sequence tags</font></a> and full-length mRNAs, can be aligned to a sequenced genome to find where there are genes and get information about <a title="Alternative splicing" href="/wiki/Alternative_splicing"><font color="#0645ad">alternative splicing</font></a><sup id="cite_ref-24" class="reference"><a href="#cite_note-24"><font size="2"><font color="#0645ad"><span>[</span>25<span>]</span></font></font></a></sup> and <a title="RNA editing" href="/wiki/RNA_editing"><font color="#0645ad">RNA editing</font></a>.<sup id="cite_ref-25" class="reference"><a href="#cite_note-25"><font size="2"><font color="#0645ad"><span>[</span>26<span>]</span></font></font></a></sup> Sequence alignment is also a part of <a class="mw-redirect" title="Genome assembly" href="/wiki/Genome_assembly"><font color="#0645ad">genome assembly</font></a>, where sequences are aligned to find overlap so that <em><a title="Contig" href="/wiki/Contig"><font color="#0645ad">contigs</font></a></em> (long stretches of sequence) can be formed.<sup id="cite_ref-26" class="reference"><a href="#cite_note-26"><font size="2"><font color="#0645ad"><span>[</span>27<span>]</span></font></font></a></sup> Another use is <a class="mw-redirect" title="Single nucleotide polymorphism" href="/wiki/Single_nucleotide_polymorphism"><font color="#0645ad">SNP</font></a> analysis, where sequences from different individuals are aligned to find single basepairs that are often different in a population.<sup id="cite_ref-27" class="reference"><a href="#cite_note-27"><font size="2"><font color="#0645ad"><span>[</span>28<span>]</span></font></font></a></sup></p>
<h2><span class="editsection">[<a title="Edit section: Non-biological uses" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=24"><font color="#0645ad">edit</font></a>]</span> <span id="Non-biological_uses" class="mw-headline">Non-biological uses</span></h2>
<p>The methods used for biological sequence alignment have also found applications in other fields, most notably in <a title="Natural language processing" href="/wiki/Natural_language_processing"><font color="#0645ad">natural language processing</font></a> and in social sciences.<sup id="cite_ref-28" class="reference"><a href="#cite_note-28"><font size="2"><font color="#0645ad"><span>[</span>29<span>]</span></font></font></a></sup> Techniques that generate the set of elements from which words will be selected in natural-language generation algorithms have borrowed multiple sequence alignment techniques from bioinformatics to produce linguistic versions of computer-generated mathematical proofs.<sup id="cite_ref-Barzilay_29-0" class="reference"><a href="#cite_note-Barzilay-29"><font size="2"><font color="#0645ad"><span>[</span>30<span>]</span></font></font></a></sup> In the field of historical and comparative <a title="Linguistics" href="/wiki/Linguistics"><font color="#0645ad">linguistics</font></a>, sequence alignment has been used to partially automate the <a title="Comparative method" href="/wiki/Comparative_method"><font color="#0645ad">comparative method</font></a> by which linguists traditionally reconstruct languages.<sup id="cite_ref-30" class="reference"><a href="#cite_note-30"><font size="2"><font color="#0645ad"><span>[</span>31<span>]</span></font></font></a></sup> Business and marketing research has also applied multiple sequence alignment techniques in analyzing series of purchases over time.<sup id="cite_ref-prinzie_31-0" class="reference"><a href="#cite_note-prinzie-31"><font size="2"><font color="#0645ad"><span>[</span>32<span>]</span></font></font></a></sup></p>
<h2><span class="editsection">[<a title="Edit section: Software" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=25"><font color="#0645ad">edit</font></a>]</span> <span id="Software" class="mw-headline">Software</span></h2>
<div class="rellink relarticle mainarticle">Main article: <a class="mw-redirect" title="Sequence alignment software" href="/wiki/Sequence_alignment_software"><font color="#0645ad">Sequence alignment software</font></a></div>
<p>A more complete list of available software categorized by algorithm and alignment type is available at <a class="mw-redirect" title="Sequence alignment software" href="/wiki/Sequence_alignment_software"><font color="#0645ad">sequence alignment software</font></a>, but common software tools used for general sequence alignment tasks include <a class="external text" rel="nofollow" href="http://www2.ebi.ac.uk/clustalw/" rel="nofollow"><font color="#3366bb">ClustalW</font></a> and <a class="external text" rel="nofollow" href="http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi" rel="nofollow"><font color="#3366bb">T-coffee</font></a> for alignment, and <a class="external text" rel="nofollow" href="http://ncbi.nih.gov/BLAST/" rel="nofollow"><font color="#3366bb">BLAST</font></a> and <a class="external text" rel="nofollow" href="http://fasta.bioch.virginia.edu/fasta_www2/fasta_list2.shtml" rel="nofollow"><font color="#3366bb">FASTA3x</font></a> for database searching.</p><p>Alignment algorithms and software can be directly compared to one another using a standardized set of <a title="Benchmark (computing)" href="/wiki/Benchmark_(computing)"><font color="#0645ad">benchmark</font></a> reference multiple sequence alignments known as BAliBASE.<sup id="cite_ref-thompson2_32-0" class="reference"><a href="#cite_note-thompson2-32"><font size="2"><font color="#0645ad"><span>[</span>33<span>]</span></font></font></a></sup> The data set consists of structural alignments, which can be considered a standard against which purely sequence-based methods are compared. The relative performance of many common alignment methods on frequently encountered alignment problems has been tabulated and selected results published online at <a class="external text" rel="nofollow" href="http://bips.u-strasbg.fr/fr/Products/Databases/BAliBASE/prog_scores.html" rel="nofollow"><font color="#3366bb">BAliBASE</font></a>.<sup id="cite_ref-thompson3_33-0" class="reference"><a href="#cite_note-thompson3-33"><font size="2"><font color="#0645ad"><span>[</span>34<span>]</span></font></font></a></sup> A comprehensive list of BAliBASE scores for many (currently 12) different alignment tools can be computed within the protein workbench <a class="external text" rel="nofollow" href="http://3d-alignment.eu/" rel="nofollow"><font color="#3366bb">STRAP</font></a>.</p><h2><span class="editsection">[<a title="Edit section: References" href="/w/index.php?title=Sequence_alignment&amp;action=edit&amp;section=26"><font color="#0645ad">edit</font></a>]</span> <span id="References" class="mw-headline">References</span></h2>
<div style="column-count: 2; -moz-column-count: 2; -webkit-column-count: 2" class="references-small references-column-count references-column-count-2">
<ol class="references">
<li id="cite_note-mount-0">^ <a href="#cite_ref-mount_0-0"><sup><em>< <li id="cite_note-mount-0">^ <a href="#cite_ref-mount_0-0"><sup><em><
</ol>
</div>

Navigation menu