We conducted a series of tests on the multiple alignment of the beta-globin gene cluster (Hardison et al., 1997) using five utilities for finding conserved sequences (agree, infocon, phylogen, kkno, kunk) and varying the values of the relevant parameters for each method. The goal was to determine the sets of parameter values that would optimize (minimize) a chosen cost function. Specifically, the cost function considered in assessing the accuracy of the results was the total count of false positives and false negatives with respect to a set of landmarks. A false positive is a position in the human sequence which does not belong to any of the known functional sites, but was reported by the program under examination. A false negative is a position in the human sequence that belongs to a known functional site but was not reported by the program.
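The cost function described above can be sketched in a few lines. This is only an illustration, not the code used in the study: the helper names `positions` and `cost`, the inclusive interval convention, and the coordinates are all assumptions made for the example.

```python
def positions(intervals):
    """Expand inclusive (start, end) intervals into a set of sequence positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def cost(reported, functional):
    """Total count of false positives plus false negatives.

    reported   -- positions in the human sequence reported by a program
    functional -- positions belonging to known functional sites (landmarks)
    """
    false_positives = reported - functional   # reported, but not in a known site
    false_negatives = functional - reported   # in a known site, but not reported
    return len(false_positives) + len(false_negatives)


# Hypothetical coordinates, chosen only to illustrate the counting.
landmarks = positions([(10, 19), (40, 44)])
predicted = positions([(12, 22), (40, 42)])
total = cost(predicted, landmarks)
print(total)
```

Here positions 20 to 22 are false positives and positions 10, 11, 43 and 44 are false negatives, so the cost is 7.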
Two types of assessments were performed: per-region assessments, targeted at the HS2, HS3 and HBB promoter regions individually, and overall assessments. In the latter case, the goal was to find the set of parameter values that would produce the best (lowest) aggregate total count over the regions considered. We describe our methods next.
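The difference between the two types of assessment can be illustrated with a small sketch. The table of costs, the parameter pairs (l, a) and the function names below are hypothetical, invented for this example, and the aggregate is assumed to be a simple sum over regions.

```python
# Hypothetical costs: (l, a) parameter setting -> cost in each region.
costs = {
    (50, 0.5): {"HS2": 120, "HS3": 90, "HBB promoter": 60},
    (50, 1.0): {"HS2": 100, "HS3": 110, "HBB promoter": 40},
    (100, 1.0): {"HS2": 102, "HS3": 92, "HBB promoter": 45},
}


def best_per_region(costs):
    """For each region, the parameter setting with the lowest cost in that region."""
    regions = next(iter(costs.values())).keys()
    return {r: min(costs, key=lambda p: costs[p][r]) for r in regions}


def best_overall(costs):
    """The parameter setting with the lowest cost summed over all regions."""
    return min(costs, key=lambda p: sum(costs[p].values()))


per_region = best_per_region(costs)
overall = best_overall(costs)
print(per_region)
print(overall)
```

With these invented numbers the overall optimum, (100, 1.0), is not the best setting for any single region, which is why the two assessments are reported separately.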
[kkno, kunk]
Parameters: k=1 fixed, l (required minimum region length).
Method:
[infocon]*
Parameters: l (required minimum region length), a (anchor value, or
score adjustment factor).
Method:
[phylogen]*
Parameters: l (required minimum region length), a (fixed anchor),
mode (flexible anchor, if any).
Method:
[agree]*
Parameters: l (required minimum region length), p (percent similarity
threshold), mode (gap inclusive or not).
Method:
* NOTE
For all three of these programs (infocon, phylogen and
agree), there is an efficient way to partition the interval of
parameter values. Since the number of false positives and the number of
false negatives vary monotonically with the value of the second
parameter (a or p), the continuum of possible values can
be explored using a binary-search-type procedure. This partitions the
range of parameter values into intervals over which the program's
results are invariant. For completeness, the explored ranges were
chosen to extend beyond the effective ranges, which were
[0.0, 1.651064] for infocon, where 1.651064 is the maximum information
content for the alignment tested, and [0.0, 4.0] for phylogen (mode=N),
since the maximum phylogenetic distance for a five-sequence alignment
is 4. Intervals at the extremes of the parameter ranges gave identical
costs, so few runs were needed there and they were covered at low
computational cost. Intervals within the effective range were instead
explored in smaller increments.
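
A binary-search-type partition of this kind might look like the following minimal sketch. It is not the procedure actually used. It assumes a hypothetical function `run(a)` that returns the (monotone) result of a program for parameter value a, and a tolerance `tol` for locating each change point. The toy `run` below stands in for a real program invocation.

```python
def partition(run, lo, hi, tol=1e-6):
    """Locate, to within tol, the values in [lo, hi] at which run(a) changes.

    Relies on monotonicity: if run gives the same result at both ends of
    an interval, the result is assumed constant throughout that interval,
    so no further runs are made inside it.
    """
    breakpoints = []

    def explore(lo, r_lo, hi, r_hi):
        if r_lo == r_hi:
            return                      # invariant interval, nothing to refine
        if hi - lo <= tol:
            breakpoints.append((lo + hi) / 2)
            return
        mid = (lo + hi) / 2
        r_mid = run(mid)
        explore(lo, r_lo, mid, r_mid)   # left half first keeps output sorted
        explore(mid, r_mid, hi, r_hi)

    explore(lo, run(lo), hi, run(hi))
    return breakpoints


# Toy stand-in for a program: counts hypothetical scores at or above a.
scores = [0.4, 1.2, 1.2, 3.1]

def toy_run(a):
    return sum(s >= a for s in scores)

found = partition(toy_run, 0.0, 5.0)
print(found)
```

In this toy case the result changes only at 0.4, 1.2 and 3.1, so wide stretches of the range, such as everything above 3.1, are dismissed after a handful of runs, while the neighbourhoods of the change points are bisected down to the tolerance.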
Calibration curve, showing the costs of results returned by phylogen while varying the anchor value. The anchor value was varied over the range from 0 to 5 in increments of 0.001, holding the minimum length l constant at the best value for each region. Each line has 5000 data points.