This utility locates regions in a given alignment that have good column agreement. The columns are examined individually to determine whether or not they meet a user-specified threshold for letter agreement, and runs of columns passing this test are reported.
There are two criteria by which letter agreement in a column can be evaluated: percentage or exclusion. By the percentage criterion, a column is accepted if some letter occurs in at least the user-provided percentage of the rows. By the exclusion criterion, it is accepted if all but a (user-provided) number of letters in the column are the same. This majority character is the consensus character for the column.
Ambiguity codes (e.g., W representing A or T) are permitted in columns; however, they are treated as a separate category that does not match any non-ambiguity symbol. Moreover, a column whose consensus character is an ambiguity code fails the agreement criterion, and thus will not be part of any reported region.
The utility has two ways of dealing with gaps: general and restricted. In the restricted case, a column containing a gap symbol will be rejected, and thus no reported region can contain a gap. In the general case a gap is treated like any other character.
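The column test and run reporting described above can be sketched as follows. This is a minimal Python illustration, not the utility's actual implementation: the function names `column_passes` and `agreeing_runs`, the `mode` strings, the `-` gap symbol, and the use of the IUPAC nucleotide ambiguity letters are all assumptions made for the example.

```python
from collections import Counter

AMBIGUITY = set("RYSWKMBDHVN")  # assumed: IUPAC nucleotide ambiguity letters
GAP = "-"                       # assumed gap symbol


def column_passes(column, mode, threshold, allow_gaps=True):
    """Return True if one column meets the agreement criterion."""
    if not allow_gaps and GAP in column:
        return False  # restricted gap handling: any gap rejects the column
    # Most frequent symbol; ties are broken by first occurrence in the column.
    letter, count = Counter(column).most_common(1)[0]
    if letter in AMBIGUITY:
        return False  # an ambiguous consensus always fails
    if mode == "percentage":
        return 100.0 * count / len(column) >= threshold
    if mode == "exclusion":
        return len(column) - count <= threshold
    raise ValueError("mode must be 'percentage' or 'exclusion'")


def agreeing_runs(rows, mode, threshold, allow_gaps=True):
    """Return half-open (start, end) column ranges of consecutive passing columns."""
    runs, start = [], None
    for i, column in enumerate(zip(*rows)):
        if column_passes(column, mode, threshold, allow_gaps):
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(rows[0])))
    return runs


rows = ["ACGTA", "ACGTT", "ACCTG", "A-GTC"]
print(agreeing_runs(rows, "percentage", 75))                    # [(0, 4)]
print(agreeing_runs(rows, "percentage", 75, allow_gaps=False))  # [(0, 1), (2, 4)]
print(agreeing_runs(rows, "exclusion", 0))                      # [(0, 1), (3, 4)]
```

In the general gap mode the gap in the second column simply counts as a fourth, disagreeing symbol, so the column still reaches 75% agreement; in the restricted mode the same column splits the region in two.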
When searching for conserved regions in alignments, one frequently wants to locate those whose length suggests that there is indeed some functionality preserved across the species. However, as the conservation need not be perfect, we need a systematic way to bridge positions which might otherwise break a relatively well-conserved region into fragments too short to be selected. Two of our tools, infocon and phylogen, use a technique that assigns a numerical score to each column, and then looks for consecutive runs of columns whose cumulative score (obtained by adding together the individual column scores) is maximized. Specifically, these full runs have the property that none of their sub-runs has a higher score, and they are maximal in the sense that they are not contained in any longer run also having this property.
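To make the definition of a full run concrete, the following Python sketch applies it literally to a list of column scores. It is a brute-force illustration only, and the tools themselves presumably use a more efficient method. The function name `full_runs` is hypothetical, and the restriction to runs with a positive total score is an assumption added so that isolated negative columns are not reported.

```python
def full_runs(scores):
    """Return (start, end, score) for every full run, with end exclusive.

    A run qualifies if no sub-run scores higher than the run itself, and
    it is kept only if no longer qualifying run contains it.
    """
    n = len(scores)
    candidates = []
    for i in range(n):
        total = 0            # score of scores[i:j+1]
        ending_here = 0      # best score of a sub-run ending exactly at j
        best = None          # best score of any sub-run inside scores[i:j+1]
        for j in range(i, n):
            total += scores[j]
            ending_here = max(ending_here + scores[j], scores[j])
            best = ending_here if best is None else max(best, ending_here)
            # Assumption: only runs with a positive score are of interest.
            if total > 0 and total >= best:
                candidates.append((i, j + 1, total))
    # Keep only the runs that are not contained in another candidate.
    maximal = []
    for a in candidates:
        contained = any(
            b != a and b[0] <= a[0] and a[1] <= b[1] for b in candidates
        )
        if not contained:
            maximal.append(a)
    return maximal


print(full_runs([2, -1, 3, -5, 4]))  # [(0, 3, 4), (4, 5, 4)]
```

In the example, the column scoring -1 is bridged because the run on either side of it gains more than it loses, whereas the column scoring -5 is too costly to bridge and separates the two reported runs.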
The infocon tool finds full runs of columns with high information content in the given alignment. That is, each column's score is based on the frequency of its letters both within the column and within the alignment as a whole. The raw information content of an alignment column is never negative, and is thus unsuitable for scoring on its own (because extending a region would never decrease the score). Accordingly, the score is adjusted by subtracting the average per-column information content of the alignment and/or a user-specified constant.
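The adjustment can be illustrated with a short Python sketch. The exact information measure used by infocon is not specified here, so the sketch assumes the relative entropy of the column's letter frequencies with respect to the alignment-wide frequencies, which is one common choice and is never negative. The names `column_information` and `adjusted_scores` and the `constant` parameter are hypothetical.

```python
import math
from collections import Counter


def column_information(column, background):
    """Relative entropy (in bits) of a column against background frequencies."""
    n = len(column)
    return sum(
        (c / n) * math.log2((c / n) / background[x])
        for x, c in Counter(column).items()
    )


def adjusted_scores(rows, constant=0.0):
    """Raw per-column information minus the alignment average and a constant."""
    letters = Counter("".join(rows))
    size = sum(letters.values())
    background = {x: c / size for x, c in letters.items()}
    raw = [column_information(col, background) for col in zip(*rows)]
    mean = sum(raw) / len(raw)
    return [r - mean - constant for r in raw]


rows = ["ACGT", "ACGA", "ACTC", "AGAG"]
for score in adjusted_scores(rows):
    print(round(score, 3))
```

With these rows, the fully conserved first column receives a positive adjusted score and the fully mixed last column a negative one, which is what allows a run-finding step to stop extending a region.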
Columns containing ambiguity codes are allowed in the alignment, but they cannot appear in the selected regions. Once a column is selected, its consensus character is the letter that occurs most frequently; in the case of a tie, an ambiguity code is used.
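The tie-breaking rule can be sketched in Python as follows. The sketch assumes the standard IUPAC nucleotide ambiguity codes and a column that contains only the letters A, C, G and T (no gaps); the function name `consensus` is hypothetical.

```python
from collections import Counter

# Standard IUPAC codes for sets of two or more nucleotides.
IUPAC_CODE = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}


def consensus(column):
    """Most frequent letter, or the ambiguity code covering all tied letters."""
    counts = Counter(column)
    top = max(counts.values())
    tied = frozenset(x for x, c in counts.items() if c == top)
    if len(tied) == 1:
        return next(iter(tied))
    return IUPAC_CODE[tied]  # assumes only A, C, G, T occur in the column


print(consensus("AAGT"))  # A
print(consensus("AATT"))  # W
```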
Like the agree program, infocon allows for flexible treatment of gaps. In addition to options for treating gaps as ordinary characters or prohibiting them in the selected regions, it supports an option for ignoring them during the calculation of the information content.
The phylogen program finds all full runs of columns, scored by the minimum number of changes needed to account for the evolutionary relationships among the sequences of the given alignment, with respect to a supplied phylogenetic tree. This scoring scheme draws on the approaches outlined in Fitch (1971) and in Sankoff and Rousseau (1975). The phylogenetic tree has a leaf node for each species, and each internal node represents a putative common ancestor of the species in its subtree. For each column, we label a copy of the tree as follows: each leaf node receives the letter from the alignment row of the corresponding species, and the internal nodes are labelled in such a manner as to minimize the total number of changes in the tree. The character assigned to the root of the tree is called the ancestral character.
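As an illustration of the per-column count, here is a minimal Python sketch of the bottom-up pass of Fitch's method on a binary tree. It is not phylogen's actual code: the nested-tuple tree representation, the species names, and the function name `fitch` are assumptions, and the sketch returns the set of candidate root letters without choosing a single ancestral character.

```python
def fitch(tree, letters):
    """Return (candidate_letters, changes) for a binary tree.

    `tree` is either a species name (leaf) or a pair of subtrees;
    `letters` maps each species name to its letter in the current column.
    """
    if isinstance(tree, str):
        return {letters[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, letters)
    right_set, right_changes = fitch(right, letters)
    common = left_set & right_set
    if common:
        # The subtrees can agree on a letter, so no extra change is needed.
        return common, left_changes + right_changes
    # Otherwise one change is required on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


tree = ((("human", "chimp"), "mouse"), ("dog", "cow"))
column = {"human": "A", "chimp": "A", "mouse": "G", "dog": "G", "cow": "G"}
print(fitch(tree, column))  # ({'G'}, 1)
```

In the example a single change on the branch leading to human and chimp explains the column, and G is the only candidate for the ancestral character.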
Since well-conserved columns will have low scores, but the selection algorithm is geared toward maximization, the column scores are adjusted by subtracting them from a suitable anchor value. However, as with the infocon program, it is essential that both positive and negative scores occur, so the anchor value must be chosen carefully. It can be calculated by the program, as either the current number of active rows for a column or the current number of active rows not containing a gap, or it can be set to an arbitrary non-negative number by the user. Combinations of these approaches are also possible.
The kkno program scans the alignment to determine, starting at each position, the longest region in which no row differs from a specified, known center sequence in more than k positions. The parameter k denoting the number of permitted mismatches is user-selectable, and the known center can be an existing alignment sequence or specified separately.
Ambiguity codes are supported. An ambiguous center symbol will be satisfied by any of its representative symbols in the alignment. Conversely, an ambiguous alignment symbol will not satisfy an exact symbol in the center sequence.
Gaps can be treated as ordinary characters or prohibited in the selected regions, either just in the alignment or in both the alignment and the center sequence.
The kunk utility locates all maximal alignment regions in which each row differs in at most k positions from an unknown center sequence. For each column in the alignment, it recursively examines all possible center sequences starting at that position to determine how far each can be extended, backtracking when further extension becomes impossible.
The center sequence can be thought of as belonging to a common ancestor of the species represented in the alignment or as the potential binding site for known or yet unidentified proteins, hence it is itself informative. Thus, the computational problem is two-fold: apart from determining the conserved regions according to the above criterion, the utility must produce a best center sequence for each region. The quality measure in assessing a potential candidate sequence is the sum of the squares of its Hamming distances from the alignment sequences within the region, where a lower value indicates a better candidate. The Hamming distance between two sequences is the number of mismatches between them, taking into account the semantics of ambiguity codes. Only characters that occur in a column can be used at that position of the center sequence. When selected in the center sequence, the semantics of an ambiguous character dictates that it matches any character it encodes and mismatches any other character. For instance, `W' in the center sequence will match `W', `A' or `T' in the alignment, but not `G' or `C'.
As before, the user can choose whether gaps are excluded from both the alignment and the center sequences in the selected regions, allowed to appear in the alignment but not in the center sequence, or selected in the center whenever they occur in a column.
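The search for an unknown center within a fixed region can be sketched in Python as below. This is a simplified illustration, not the kunk implementation: it handles one given region instead of discovering all maximal regions, it treats gaps as ordinary characters, and the names `matches` and `best_center` are hypothetical. The ambiguity table is the standard IUPAC nucleotide set.

```python
# Standard IUPAC nucleotide ambiguity codes and the letters they represent.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(center_symbol, alignment_symbol):
    """True if the alignment symbol satisfies the center symbol."""
    return (alignment_symbol == center_symbol
            or alignment_symbol in IUPAC.get(center_symbol, ""))


def best_center(rows, start, end, k):
    """Best center for columns start..end-1, or None if no center keeps
    every row within k mismatches. Returns (center, sum_of_squares)."""
    best = None

    def extend(pos, center, mismatches):
        nonlocal best
        if pos == end:
            cost = sum(m * m for m in mismatches)
            if best is None or cost < best[1]:
                best = ("".join(center), cost)
            return
        # Only symbols that occur in this column may be placed in the center.
        for symbol in sorted({row[pos] for row in rows}):
            updated = [
                m + (not matches(symbol, row[pos]))
                for m, row in zip(mismatches, rows)
            ]
            if max(updated) <= k:
                extend(pos + 1, center + [symbol], updated)
            # Otherwise this branch is abandoned (backtracking).

    extend(start, [], [0] * len(rows))
    return best


rows = ["ACGT", "ACCT", "TCGT"]
print(best_center(rows, 0, 4, 1))  # ('ACGT', 2)
print(best_center(rows, 0, 4, 0))  # None
```

With k = 1, only the center ACGT keeps every row within one mismatch, at distances 0, 1 and 1, giving a sum of squares of 2. Ties between equally good centers are broken here by search order, which is an arbitrary choice of the sketch.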