TABLE OF CONTENTS
| genbank2exons | create an exons file from a GenBank entry |
| blest2exons | create an exons file from blEST output |
| genscan2exons | create an exons file from Genscan output |
| genscan2underlays | create a color underlay file from Genscan output |
| exons2underlays | convert an exons file to underlay format |
| exons2mrna | use an exons file to extract mRNA from genomic sequence |
| genbank2repeats | create a repeats file from a GenBank entry |
| rmask2repeats | simplify RepeatMasker output |
| strong-hits | find gap-free matches in an alignment that meet specified constraints for length and strength |
| where-hit | given a position in one sequence of an alignment, find matching positions in the other sequence |
| shift-pos | shift a file's position coords by a given offset, or convert them to the reverse sequence |
| transform-pos | use an alignment to convert a file's position coords from one sequence to another |
| sort-exons | sort an exons file for use with PipMaker |
| find-cpg | make repeats-style entries for CpG islands |
| reverse-comp | compute the reverse complement of a sequence |
| mask-seq | mask out portions of a sequence |
The syntax specifications given below for each program use the following conventions. Words that appear in plain monospaced font are keywords that you should type literally. Words that appear in italic monospaced font are metasyntactic variables, i.e., you should replace them with the actual filename, number, or other value that you want to use. Items enclosed in [square brackets] are optional; you can include them or not, but never type the brackets themselves. A vertical bar | means "or", indicating alternatives; it should not be typed either. Three dots in brackets [...] mean that the previous item can be repeated as many times as you like, but don't type the brackets or the dots. You should, however, type the hyphens - at the beginning of tag keywords, and also the > symbol, which tells your operating system to write the output to the specified file instead of to the screen.
The order of the arguments is quite flexible, within reason. However, if there are two items that are not identified by preceding tag keywords (e.g., two filenames), then they must be supplied in the specified order so the program can tell them apart. Note that in this document, long lines have sometimes been wrapped for easier reading, but you should always type each command as a single line.
If you are using a system like Windows that does not support the "#!" syntax for invoking the Perl interpreter automatically, you will need to add perl -S at the beginning of each command. (The -S flag tells perl to use your system path to find the tool you're trying to run.) For example, the syntax for genbank2exons would become:
perl -S genbank2exons genbank_file > output_file
You can get a quick reminder of the syntax for any of these programs by running it without any arguments. This reminder cannot use italics to distinguish the keywords from the metasyntactic variables, but you can still tell them apart because the keywords generally start with hyphens (except the program names).
genbank2exons genbank_file > output_file
This program attempts to create an exons file from a GenBank entry, and usually does a pretty good job. However, because GenBank entries don't always use certain keywords consistently, it might not work completely in every case, so you should check the output carefully. The input file needs to be in GenBank's "flat file" format, so if you are downloading it from the NCBI website, be sure to select "GenBank, Plain Text" format. Note that the gene/exon positions in the output file will be given with respect to that entry's reference sequence; if you want to translate them to some other sequence, please see the Tips and Examples page for more information. This program is written in Perl to take advantage of the Boulder module for interpreting GenBank files.
blest2exons blest_file [-min min_exons] > output_file
This program facilitates the process of building an exons file from a database of expressed sequences, such as those available from TIGR. It converts output from our blEST program (available separately) into sorted exons format. The -min option will omit sequences having fewer than the specified number of exons.
genscan2exons genscan_file [-prob] > output_file
This program converts predictions from the Genscan program to exons format. The -prob option will use each exon's probability score as a "name label" for the exon.
genscan2underlays genscan_file [-laj] [-split]  
[-fprom fwd_prom_color] [-rprom rev_prom_color]  
[-fexon fwd_exon_color] [-rexon rev_exon_color]  
[-fplya fwd_plya_color] [-rplya rev_plya_color]  
> output_file
This program converts predictions from the Genscan program to color underlay format. The -laj option will add labels for Laj to display when the user points at a color band, and the -split option will cause features in the forward strand to be colored only on the top half of the PIP, while features in the reverse strand will be colored only on the bottom half. The other options specify colors for predicted promoters, exons, and poly-A tails, in the forward and reverse directions (see PipMaker's color list). The colors are optional, but features with unspecified colors will be skipped, so you must specify at least one color to get any output.
exons2underlays exons_file [-laj] [-split]  
-fexon fwd_exon_color [-futr fwd_utr_color]  
[-rexon rev_exon_color] [-rutr rev_utr_color]  
[-intron intron_color] > output_file
This program converts an exons file into the format for color underlays. The -laj option will add labels for Laj to display when the user points at a color band, and the -split option will cause features in the forward strand to be colored only on the top half of the PIP, while features in the reverse strand will be colored only on the bottom half. The other options specify colors for exons and UTRs in the forward and reverse directions, and introns (see PipMaker's color list).
Unlike genscan2underlays, this program will always make underlays for all of the items in the exons file; if you leave some colors unspecified, it will build defaults from the ones you do specify as follows. If you specify only one exon color, it will assume the other exon color should be the same. Likewise if you specify only one UTR color, it will use that for both directions of UTRs. However, if you omit both UTR colors, then the corresponding exon colors will be used, with the result that the UTRs will not be distinguishable from the rest of their exons (except possibly by their Laj labels). Lastly, the default intron color is White, which is invisible against the white PIP background.
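The defaulting rules just described can be sketched in a few lines of Python. This is only an illustration of the rules as stated, not the program's actual code. The function name resolve_colors and the dictionary keys are hypothetical, the color names other than White are placeholders, and raising an error when no exon color is given is an assumption, since that case is not described here.

```python
def resolve_colors(fexon=None, rexon=None, futr=None, rutr=None, intron=None):
    """Fill in unspecified underlay colors following the rules described above."""
    if fexon is None and rexon is None:
        # Assumption: at least one exon color must be supplied.
        raise ValueError("at least one exon color is required")
    # Only one exon color given: use it for both directions.
    fexon = fexon or rexon
    rexon = rexon or fexon
    if futr is None and rutr is None:
        # No UTR colors at all: UTRs take the color of their exons.
        futr, rutr = fexon, rexon
    else:
        # Only one UTR color given: use it for both directions.
        futr = futr or rutr
        rutr = rutr or futr
    return {"fexon": fexon, "rexon": rexon, "futr": futr, "rutr": rutr,
            "intron": intron or "White"}


# Example: one exon color and one UTR color are enough to color everything.
print(resolve_colors(fexon="Green", futr="Yellow"))
```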
exons2mrna exons_file genomic_seq_file > output_file
This program reads an exons file and a genomic sequence, and writes out a FASTA library of putative mRNA sequences. CDS specifications are included when available, e.g.,
>gene1 2328 bp CDS=143..2089
ACGT...
>gene2 1345 bp CDS=321..1234
TGCA...
This is often run after genbank2exons, in preparation for using the sim4 program (available separately) to adjust the exon positions relative to a genomic sequence other than the reference sequence listed in the GenBank entry (please see the Tips and Examples page for more information).
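As a rough illustration of what extracting a putative mRNA involves, the following Python sketch splices exon intervals out of a genomic sequence and formats a FASTA record like the ones shown above. It is not the program's implementation. It assumes exons are given as 1-based inclusive positions on the forward strand, that ">" and "<" mark forward and reverse genes, and that the CDS range is already expressed in mRNA coordinates; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def splice_mrna(genome, exons, direction):
    """Concatenate exon intervals (1-based, inclusive) in ascending order."""
    spliced = "".join(genome[start - 1:end] for start, end in sorted(exons))
    # Assumption: "<" marks a gene on the reverse strand.
    return reverse_complement(spliced) if direction == "<" else spliced


def fasta_record(name, mrna, cds=None):
    """Format a header in the style of the example output above."""
    header = f">{name} {len(mrna)} bp"
    if cds is not None:
        header += f" CDS={cds[0]}..{cds[1]}"
    return header + "\n" + mrna


genome = "AAACCCGGGTTT"
print(fasta_record("gene1", splice_mrna(genome, [(1, 3), (7, 9)], ">"), cds=(2, 4)))
```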
genbank2repeats genbank_file > output_file
This program attempts to create a repeats file from a GenBank entry, and usually does a pretty good job. However, because GenBank entries don't always use certain keywords consistently, it might not work completely in every case, so you should check the output carefully. The input file needs to be in GenBank's "flat file" format, so if you are downloading it from the NCBI website, be sure to select "GenBank, Plain Text" format. The output will be in the simplified repeats format used by PipMaker and Laj (see rmask2repeats for more information).
Note that the repeat positions in the output file will be with respect to the GenBank entry's reference sequence; at the present time we do not have convenient tools for automatically translating them to some other sequence (though the shift-pos program may help somewhat). However, you can get the repeats for any DNA sequence directly from the RepeatMasker server; genbank2repeats is just a shortcut in case you already have a GenBank file and don't want to wait for RepeatMasker. This program is written in Perl to take advantage of the Boulder module for interpreting GenBank files.
rmask2repeats repeatmasker_file [-genbank] [-nosimple]  
[-check seq_length] > output_file
This program converts output from the RepeatMasker program (the list of repeat locations, that is -- not the masked sequence or summary) into the simpler repeats format used by PipMaker and Laj. Actually PipMaker will accept either format (as long as you don't try to mix them) and will run this program for you if necessary; however Laj expects the simplified format. You can also use this program to convert RepeatMasker results to GenBank header format by specifying the -genbank option; this is convenient when preparing new sequences for submission to GenBank. The -nosimple option instructs the program to ignore any Simple, Low Complexity, or Satellite types of repeats, while the -check option checks the repeat positions to make sure they fall within the given sequence length.
strong-hits align_file -len min_length -pct min_ident  
[-gap max_gap_size -step max_ident_step]  
[-overlaps interval_file [-inseq2] [-exons-only]] > output_file
This program examines an alignment file from PipMaker to find all gap-free segments, or groups of such segments, that meet user-specified thresholds for length and percent identity.
The grouping feature allows you to find aligning regions that are interrupted by small indels. It is invoked via the -gap and -step options, which specify the longest allowable gap between adjacent segments, and the maximum difference in their percent identities, respectively. Each segment must still meet the minimum identity criterion individually, but the length requirement is applied to the total length of the group (excluding gaps). In this case the output contains one line for each group instead of for each segment, and includes the overall endpoints, length-weighted average percent identity, total length excluding gaps, and number of segments.
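The grouping rule can be illustrated with a small Python sketch. This is a simplified reading of the description above, not the program's actual algorithm. It assumes each gap-free segment is reduced to a (start, end, percent_identity) tuple in a single sequence, that the gap is measured in that sequence, and that the identity step is compared between neighboring segments; the function name group_segments is hypothetical.

```python
def group_segments(segments, min_len, min_ident, max_gap, max_step):
    """Group nearby gap-free segments and report qualifying groups.

    Returns tuples of (start, end, weighted_pct_identity, total_length, count).
    """
    # Each segment must meet the identity threshold on its own.
    kept = sorted(s for s in segments if s[2] >= min_ident)
    groups, current = [], []
    for seg in kept:
        if current:
            prev = current[-1]
            gap = seg[0] - prev[1] - 1
            if gap <= max_gap and abs(seg[2] - prev[2]) <= max_step:
                current.append(seg)
                continue
            groups.append(current)
        current = [seg]
    if current:
        groups.append(current)

    results = []
    for group in groups:
        # The length requirement applies to the group total, excluding gaps.
        total = sum(end - start + 1 for start, end, _ in group)
        if total < min_len:
            continue
        weighted = sum((end - start + 1) * pct for start, end, pct in group) / total
        results.append((group[0][0], group[-1][1], round(weighted, 1), total, len(group)))
    return results


hits = [(1, 50, 90.0), (53, 102, 80.0), (200, 220, 95.0), (230, 260, 60.0)]
print(group_segments(hits, min_len=60, min_ident=70, max_gap=5, max_step=15))
```

In this example the first two segments are separated by a 2-base gap and differ by 10 percentage points, so they form one group of total length 100; the third segment is too short on its own, and the fourth fails the identity threshold.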
If an optional file of intervals is provided, the program will report which of these intervals, if any, overlap each qualifying segment or group. Several formats for the interval file are supported, including those for exons, repeats, hyperlinked annotations, and color underlays, or any other format having lines containing two positions separated and surrounded by spaces. If the file is in exons format, the -exons-only option can be used to ignore the intervals for genes and translated regions, and look only for overlaps with exons. By default the program will look for intervals that overlap the segment's first sequence, but this can be changed to the second sequence by using the -inseq2 option.
where-hit position align_file [-inverse]  
[-contig ">contig header"] > output_file
This program examines an alignment file from PipMaker to find all matches for a specified position. It also indicates the orientation of the match, and whether it fell in a gap between segments within a local alignment. By default the position number is given with respect to the alignment's first sequence, and the program reports the corresponding position(s) in the second sequence; however this can be inverted using the -inverse option. By specifying a contig header (or the beginning of one) you can limit the search to a particular contig in the second sequence; this is especially useful in inverse mode, to clarify the meaning of the given position number. The contig header is case sensitive, must start with the standard FASTA arrow ">", and will probably need to be enclosed in quotes.
shift-pos position_file -add offset | -reverse seq_length  
[-exons | -repeats | -other] > output_file
This program reads a file containing position intervals, and adjusts the positions in one of two ways. It can either add a given fixed offset to each position, or compute the corresponding position in the reverse complement of the sequence. Several file formats are supported, including those for exons, repeats, hyperlinked annotations, and color underlays, or any other format having lines containing two positions separated and surrounded by spaces.
Lines beginning with "#" and those that do not contain a position interval are left unchanged. Lines that cannot be converted (e.g., because one or both of the positions would end up being less than 1) are flagged with "### FAILED ###" in the output, and are also displayed on the screen (stderr) as a warning.
When computing reverse complement positions, the program will also swap the interval endpoints and make other adjustments as follows. For exons and repeats files, it will interchange the direction indicators ">" vs. "<" and "Right" vs. "Left", respectively. For exons files, it will also reverse the order of the entries, to keep them in ascending numerical order. By default the program will attempt to recognize the file format, but this can be overridden by specifying the file type on the command line.
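The coordinate arithmetic behind the two modes is simple enough to show directly. The Python sketch below is illustrative only and assumes 1-based inclusive positions; the function names are hypothetical, and returning None stands in for the "### FAILED ###" case where a position would fall below 1.

```python
FLIP = {">": "<", "<": ">", "Right": "Left", "Left": "Right"}


def add_offset(start, end, offset):
    """Shift both endpoints by a fixed offset (which may be negative)."""
    new_start, new_end = start + offset, end + offset
    if new_start < 1 or new_end < 1:
        return None  # cannot be converted
    return new_start, new_end


def reverse_positions(start, end, seq_length):
    """Map an interval onto the reverse complement of a sequence.

    Position p becomes seq_length - p + 1, so the endpoints swap roles.
    """
    new_start = seq_length - end + 1
    new_end = seq_length - start + 1
    if new_start < 1 or new_end < 1:
        return None  # cannot be converted
    return new_start, new_end


def flip_direction(indicator):
    """Interchange the direction indicators; leave anything else unchanged."""
    return FLIP.get(indicator, indicator)


print(reverse_positions(10, 20, 100), flip_direction(">"))
```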
transform-pos position_file align_file [-inverse]  
[-exons | -repeats | -other] > output_file
This program reads a file containing position intervals which are specified with respect to one sequence of a given alignment file from PipMaker, and converts them to the corresponding positions in the alignment's other sequence. Several position file formats are supported, including those for exons, repeats, hyperlinked annotations, and color underlays, or any other format having lines containing two positions separated and surrounded by spaces. By default the conversion is from the alignment's first sequence to its second sequence, but this can be inverted using the -inverse option. Multiple contigs in the second sequence are not supported, since the output is intended to be used as first-sequence annotation for a subsequent PipMaker alignment.
Lines beginning with "#" and those that do not contain a position interval are left unchanged. Lines that cannot be converted (e.g., because one or both of the positions do not align with the other sequence) are flagged with "### FAILED ###" in the output, and are also displayed on the screen (stderr) as a warning.
If a position aligns with several regions in the other sequence, then only the first one encountered is reported. Thus it is best if the alignment file is produced using Advanced PipMaker's "single coverage" or "chaining" options. If the region aligns in reverse complement, the program will swap the interval endpoints, and for exons and repeats files it will also interchange the direction indicators ">" vs. "<" and "Right" vs. "Left", respectively. By default it will attempt to recognize the file format, but this can be overridden by specifying the file type on the command line.
However, note that transform-pos will not attempt to rearrange the order of the lines. Since in general the aligning regions might not appear in the same order in the two sequences, this means the output file may not be sorted. Thus when transforming an exons file (which must be sorted in increasing order for use with PipMaker), the alignment file should be generated using Advanced PipMaker's "chaining" and "search one strand" options; otherwise the output may have to be sorted afterwards (see the sort-exons program).
A related problem arises if an interval's two endpoints align with widely-separated regions in the other sequence. The program transforms the endpoints individually, and even though it performs a few basic sanity checks, there is no guarantee that the resulting "interval" in the other sequence will make sense. Again, Advanced PipMaker's "chaining" and "search one strand" options can help to alleviate this. In addition, transform-pos will print a warning if an interval's endpoints are transformed using different local alignments (though in that case the resulting interval might still be legitimate).
sort-exons exons_file > output_file
This program sorts an exons file, to try to make it acceptable to PipMaker. It sorts both the genes within the file and the exons within each gene (though only the latter is strictly necessary for PipMaker). It sorts based on the beginning position of each gene or exon, ignoring the ending position. Note that it is still possible for the resulting file to be invalid, for example if two exons overlap, or if an exon or CDS region extends beyond the specified endpoints of the corresponding gene.
Also, note that all lines beginning with a "#" comment character are discarded, since it is not obvious where they belong once the genes and exons have been rearranged. Thus if transform-pos advises that its output needs to be sorted, be sure to fix any "### FAILED ###" lines first, or they will disappear. This program is written in Perl.
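The sorting behavior can be sketched as follows. This Python example is a rough illustration, not the program's code, and it relies on an assumed layout for the exons format that is not spelled out in this document: a gene line beginning with ">", "<", or "|" followed by start, end, and name; an optional "+" line giving the translated region; and exon lines consisting of a start and end position. The function name sort_exons is hypothetical.

```python
def sort_exons(text):
    """Sort genes by start position, and exons within each gene by start."""
    genes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment lines are discarded, as described above
        fields = line.split()
        if fields[0] in (">", "<", "|"):
            genes.append({"header": line, "start": int(fields[1]),
                          "cds": [], "exons": []})
        elif not genes:
            continue  # stray line before the first gene
        elif fields[0] == "+":
            genes[-1]["cds"].append(line)
        else:
            genes[-1]["exons"].append((int(fields[0]), line))

    out = []
    # Only the beginning position is used as the sort key.
    for gene in sorted(genes, key=lambda g: g["start"]):
        out.append(gene["header"])
        out.extend(gene["cds"])
        out.extend(line for _, line in sorted(gene["exons"], key=lambda e: e[0]))
    return "\n".join(out) + "\n"


unsorted = """# a comment that will be dropped
> 500 900 geneB
700 900
500 600
< 100 400 geneA
+ 150 350
300 400
100 200
"""
print(sort_exons(unsorted), end="")
```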
find-cpg seq_file > output_file
The PipMaker server automatically locates CpG islands in a submitted sequence, but Laj does not. If you want Laj to display CpG islands the way PipMaker would, you can run this program to find them, and add its output to the end of your repeats file. (The repeats file describes all of the non-exon feature symbols that appear in the panel above the PIP, even if they are not actually repeats.) Even if you are not using Laj, you might want to run this program to obtain the precise locations of CpG islands drawn by PipMaker.
This program uses criteria that match PipMaker's (at least, as of April 2001). That is, it finds regions of at least 50% G+C content where the ratio of CpG dinucleotides relative to GpC is at least 75 percent or at least 60 percent. Like PipMaker, Laj will draw these two classes as low gray and white boxes, respectively.
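To make the criteria concrete, the Python sketch below applies them to a single stretch of sequence. It only illustrates the two tests named above (G+C content and the CpG/GpC ratio). How the program chooses and extends candidate regions is not described here, so the sketch does not attempt that; the function name classify_region, the returned labels, and the handling of regions with no GpC dinucleotides are assumptions.

```python
def classify_region(region):
    """Return "gray", "white", or None for one candidate region."""
    seq = region.upper()
    if not seq:
        return None
    gc_fraction = (seq.count("G") + seq.count("C")) / len(seq)
    cpg = seq.count("CG")  # CpG dinucleotides
    gpc = seq.count("GC")  # GpC dinucleotides
    if gc_fraction < 0.5 or cpg == 0:
        return None
    # Assumption: with no GpC at all, treat the ratio as unbounded.
    ratio = cpg / gpc if gpc else float("inf")
    if ratio >= 0.75:
        return "gray"
    if ratio >= 0.60:
        return "white"
    return None


for example in ("CGCGCGCG", "GCGCGC", "GCGCATAT"):
    print(example, classify_region(example))
```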
reverse-comp seq_file > output_file
This program reads a sequence
file and produces a similar file containing the reverse complement of
the given sequence. A notation saying "(reverse complement)" will be
added to (or removed from) the end of the FASTA header line. If the
file contains multiple sequences (i.e., contigs), then each of them
is reversed individually.
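A minimal Python sketch of this behavior is shown below. It is not the program's code; the exact spacing of the header notation and the function names are assumptions, and characters other than A, C, G, and T are simply left as they are.

```python
TAG = " (reverse complement)"
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def toggle_header(header):
    """Add the notation to a FASTA header, or remove it if already present."""
    if header.endswith(TAG):
        return header[:-len(TAG)]
    return header + TAG


def reverse_comp_fasta(text):
    """Reverse-complement each sequence (contig) in a FASTA string."""
    records = []
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line, []])
        elif line.strip() and records:
            records[-1][1].append(line.strip())
    out = []
    for header, chunks in records:
        out.append(toggle_header(header))
        out.append("".join(chunks).translate(COMPLEMENT)[::-1])
    return "\n".join(out) + "\n"


print(reverse_comp_fasta(">contig1\nAACG\n>contig2 (reverse complement)\nTT\n"), end="")
```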
mask-seq seq_file position_file [-char mask_char]  
[-skip "string1" [-skip "string2" [...]]] > output_file
This program reads a sequence file and a file containing position intervals, and produces a new version of the sequence file with the specified intervals masked out, either by changing the nucleotide characters to lowercase (the default) or by replacing them with a specified mask character (e.g., "X" or "N"). The position file will typically be a repeats file, but could also be in the format for exons, hyperlinked annotations, or color underlays, or any other format having lines containing two positions separated and surrounded by spaces. It is also possible to skip lines in the position file that contain specified strings of characters; this makes it easy to avoid masking particular types of repeats, for example, without having to edit the repeats file manually. Each such string is case sensitive, and should be enclosed in quotes, especially if it contains any spaces.
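The masking step and the skip filter can be sketched in Python as follows. This is an illustration under stated assumptions rather than the program's implementation: positions are taken to be 1-based and inclusive, an interval is recognized as the first pair of whitespace-delimited integers on a line, and the sample repeats lines and function names are hypothetical.

```python
import re

# Two integers separated and surrounded by whitespace (or line boundaries).
INTERVAL = re.compile(r"(?:^|\s)(\d+)\s+(\d+)(?:\s|$)")


def read_intervals(lines, skip=()):
    """Yield (start, end) pairs, ignoring lines that contain a skip string."""
    for line in lines:
        if any(s in line for s in skip):  # case-sensitive substring match
            continue
        match = INTERVAL.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2))


def mask_sequence(seq, intervals, mask_char=None):
    """Lowercase the given intervals, or overwrite them with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = mask_char if mask_char else chars[i].lower()
    return "".join(chars)


repeats = ["2 4 Right Alu", "7 9 Left L1"]
print(mask_sequence("ACGTACGTAC", read_intervals(repeats, skip=["Alu"]), mask_char="N"))
```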
Cathy Riemer and Matt Weirauch, July 2002