TABLE OF CONTENTS
genbank2exons      create an exons file from a GenBank entry
blest2exons        create an exons file from blEST output
genscan2exons      create an exons file from Genscan output
genscan2underlays  create a color underlay file from Genscan output
exons2underlays    convert an exons file to underlay format
exons2mrna         use an exons file to extract mRNA from genomic sequence
genbank2repeats    create a repeats file from a GenBank entry
rmask2repeats      simplify RepeatMasker output
strong-hits        find matches in an alignment that meet specified constraints for length and strength
where-hit          given a position in one sequence of an alignment, find matching positions in the other sequence
shift-pos          shift a file's position coords by a given offset, or convert them to the reverse sequence
transform-pos      use an alignment to convert a file's position coords from one sequence to another
sort-exons         sort an exons file for use with PipMaker
find-cpg           make repeats-style entries for CpG islands
reverse-comp       compute the reverse complement of a sequence
mask-seq           mask out portions of a sequence
The syntax specifications given below for each program use the following conventions. Words that appear in plain monospaced font are keywords that you should type literally. Words that appear in italic monospaced font are metasyntactic variables, i.e., you should replace them with the actual filename, number, or other value that you want to use. Items enclosed in [square brackets] are optional; you can include them or not, but never type the brackets themselves. A vertical bar | means "or", indicating alternatives; it should not be typed either. Three dots in brackets [...] means that the previous item can be repeated as many times as you like, but don't type the brackets or the dots. You should, however, type the hyphens - at the beginning of tag keywords, and also the > symbol, which tells your operating system to write the output in the specified file instead of on the screen.
The order of the arguments is quite flexible, within reason. However, if there are two items that are not identified by preceding tag keywords (e.g., two filenames), then they must be supplied in the specified order so the program can tell them apart. Note that in this document, long lines have sometimes been wrapped for easier reading, but you should always type each command as a single line.
Three of these programs (genbank2exons, genbank2repeats, and sort-exons) are written in Perl instead of in C. If you are using a system like Windows that does not support the "#!" syntax for invoking the Perl interpreter automatically, you will need to add perl -S at the beginning of your command for these three tools. (The -S flag tells perl to use your system path to find the tool you're trying to run.) For example, the syntax for genbank2exons would become:
perl -S genbank2exons genbank_file > output_file
You can get a quick reminder of the syntax for any of these programs by running it without any arguments. The reminder cannot use italics to distinguish the keywords from the metasyntactic variables, but you can still tell them apart because the keywords generally start with hyphens (except the program names).
genbank2exons genbank_file > output_file
This program attempts to create an exons file from a GenBank entry, and usually does a pretty good job. However, because GenBank entries don't always use certain keywords consistently, it might not work completely in every case, so you should check the output carefully. The input file needs to be in GenBank's "flat file" format, so if you are downloading it from the NCBI website, be sure to select "GenBank, Plain Text" format. Note that the gene/exon positions in the output file will be given with respect to that entry's reference sequence; if you want to translate them to some other sequence, please see the Tips and Examples page for more information. This program is written in Perl to take advantage of the Boulder module for interpreting GenBank files.
blest2exons blest_file [-min min_exons] > output_file
This program facilitates the process of building an exons file from a database of expressed sequences, such as those available from TIGR. It converts output from our blEST program (available separately) into sorted exons format. The -min option will omit sequences having fewer than the specified number of exons.
genscan2exons genscan_file [-prob] > output_file
This program converts predictions from the Genscan program to exons format. The -prob option will use each exon's probability score as a "name label" for the exon.
genscan2underlays genscan_file [-laj] [-split]
  [-fprom fwd_prom_color] [-rprom rev_prom_color]
  [-fexon fwd_exon_color] [-rexon rev_exon_color]
  [-fplya fwd_plya_color] [-rplya rev_plya_color]
  > output_file
This program converts predictions from the Genscan program to color underlay format. The -laj option will add labels for Laj to display when the user points at a color band, and the -split option will cause features in the forward strand to be colored only on the top half of the PIP, while features in the reverse strand will be colored only on the bottom half. The other options specify colors for predicted promoters, exons, and poly-A tails, in the forward and reverse directions (see PipMaker's color list). The colors are optional, but features with unspecified colors will be skipped, so you must specify at least one color to get any output.
exons2underlays exons_file [-laj] [-split]
  -fexon fwd_exon_color [-futr fwd_utr_color]
  [-rexon rev_exon_color] [-rutr rev_utr_color]
  [-intron intron_color] > output_file
This program converts an exons file into the format for color underlays. The -laj option will add labels for Laj to display when the user points at a color band, and the -split option will cause features in the forward strand to be colored only on the top half of the PIP, while features in the reverse strand will be colored only on the bottom half. The other options specify colors for exons and UTRs in the forward and reverse directions, and introns (see PipMaker's color list).
Unlike genscan2underlays, this program will always make underlays for all of the items in the exons file; if you leave some colors unspecified, it will build defaults from the ones you do specify as follows. If you specify only one exon color, it will assume the other exon color should be the same. Likewise if you specify only one UTR color, it will use that for both directions of UTRs. However, if you omit both UTR colors, then the corresponding exon colors will be used, with the result that the UTRs will not be distinguishable from the rest of their exons (except possibly by their Laj labels). Lastly, the default intron color is White, which is invisible against the white PIP background.
exons2mrna exons_file genomic_seq_file > output_file
This program reads an exons file and a genomic sequence, and writes out a FASTA library of putative mRNA sequences. CDS specifications are included when available, e.g.,
>gene1 2328 bp CDS=143..2089
ACGT...
>gene2 1345 bp CDS=321..1234
TGCA...
This is often run after genbank2exons, in preparation for using the sim4 program (available separately) to adjust the exon positions relative to a genomic sequence other than the reference sequence listed in the GenBank entry (please see the Tips and Examples page for more information).
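The extraction step can be pictured with a minimal Python sketch. This is not the exons2mrna source: it assumes 1-based, inclusive exon positions, and the function names and the tuple representation of exons are hypothetical.

```python
# Illustrative sketch only, not the exons2mrna implementation.
# Assumes 1-based, inclusive exon positions on the genomic sequence.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def splice_mrna(genomic, exons, reverse_strand=False):
    """Concatenate the exon substrings; reverse-complement for reverse-strand genes."""
    pieces = [genomic[start - 1:end] for start, end in sorted(exons)]
    mrna = "".join(pieces)
    return reverse_complement(mrna) if reverse_strand else mrna

genomic = "AAACCCGGGTTT"
exons = [(1, 3), (7, 9)]          # hypothetical exon intervals
forward = splice_mrna(genomic, exons)
reverse = splice_mrna(genomic, exons, reverse_strand=True)
print(forward, reverse)
```

Positions in a CDS=... annotation like the one shown above would then be counted along the spliced string rather than along the genomic sequence.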
genbank2repeats genbank_file > output_file
This program attempts to create a repeats file from a GenBank entry, and usually does a pretty good job. However, because GenBank entries don't always use certain keywords consistently, it might not work completely in every case, so you should check the output carefully. The input file needs to be in GenBank's "flat file" format, so if you are downloading it from the NCBI website, be sure to select "GenBank, Plain Text" format. The output will be in the simplified repeats format used by PipMaker and Laj (see rmask2repeats for more information).
Note that the repeat positions in the output file will be with respect to the GenBank entry's reference sequence; at the present time we do not have convenient tools for automatically translating them to some other sequence (though the shift-pos program may help somewhat). However, you can get the repeats for any DNA sequence directly from the RepeatMasker server; genbank2repeats is just a shortcut in case you already have a GenBank file and don't want to wait for RepeatMasker. This program is written in Perl to take advantage of the Boulder module for interpreting GenBank files.
rmask2repeats repeatmasker_file [-genbank] [-nosimple]
  [-check seq_length] > output_file
This program converts output from the RepeatMasker program (the list of repeat locations, that is -- not the masked sequence or summary) into the simpler repeats format used by PipMaker and Laj. Actually PipMaker will accept either format (as long as you don't try to mix them) and will run this program for you if necessary; however Laj expects the simplified format. You can also use this program to convert RepeatMasker results to GenBank header format by specifying the -genbank option; this is convenient when preparing new sequences for submission to GenBank. The -nosimple option instructs the program to ignore any Simple, Low Complexity, or Satellite types of repeats, while the -check option checks the repeat positions to make sure they fall within the given sequence length.
strong-hits align_file -len min_length -pct min_ident
  [-gap max_gap_size -step max_ident_step]
  [-overlaps interval_file [-inseq2] [-exons-only]] > output_file
This program examines an alignment file from PipMaker to find all gap-free segments, or groups of such segments, that meet user-specified thresholds for length and percent identity.
The grouping feature allows you to find aligning regions that are interrupted by small indels. It is invoked via the -gap and -step options, which specify the longest allowable gap between adjacent segments, and the maximum difference in their percent identities, respectively. Each segment must still meet the minimum identity criterion individually, but the length requirement is applied to the total length of the group (excluding gaps). In this case the output contains one line for each group instead of for each segment, and includes the overall endpoints, length-weighted average percent identity, total length excluding gaps, and number of segments.
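The grouping rule described above can be sketched in Python as follows. This is a simplified illustration, not the strong-hits source: it measures gaps in one sequence only, represents each segment as a hypothetical (start, end, percent identity) tuple with 1-based inclusive positions, and drops segments below the identity threshold before grouping, which is an assumption about the order of the steps.

```python
# Illustrative sketch only, not the strong-hits implementation.
# A segment is (start, end, pct_identity); positions are 1-based and inclusive.

def group_segments(segments, min_length, min_ident, max_gap, max_step):
    """Group adjacent qualifying segments, then keep the groups that are long enough."""
    strong = sorted(s for s in segments if s[2] >= min_ident)
    groups = []
    for seg in strong:
        if groups:
            prev = groups[-1][-1]
            gap = seg[0] - prev[1] - 1
            if 0 <= gap <= max_gap and abs(seg[2] - prev[2]) <= max_step:
                groups[-1].append(seg)
                continue
        groups.append([seg])
    results = []
    for group in groups:
        lengths = [end - start + 1 for start, end, _ in group]
        total = sum(lengths)  # total length excluding gaps
        if total >= min_length:
            avg = sum(n * s[2] for n, s in zip(lengths, group)) / total
            results.append((group[0][0], group[-1][1], avg, total, len(group)))
    return results

segments = [(1, 50, 90.0), (54, 103, 80.0), (500, 520, 95.0), (530, 560, 60.0)]
hits = group_segments(segments, min_length=80, min_ident=70, max_gap=5, max_step=15)
print(hits)
```

Each result tuple holds the overall endpoints, the length-weighted average percent identity, the total length excluding gaps, and the number of segments, mirroring the fields listed above.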
If an optional file of intervals is provided, the program will report which of these intervals, if any, overlap each qualifying segment or group. Several formats for the interval file are supported, including those for exons, repeats, hyperlinked annotations, and color underlays, or any other format having lines containing two positions separated and surrounded by spaces. If the file is in exons format, the -exons-only option can be used to ignore the intervals for genes and translated regions, and look only for overlaps with exons. By default the program will look for intervals that overlap the segment's first sequence, but this can be changed to the second sequence by using the -inseq2 option.
where-hit position align_file [-inverse]
  [-contig ">contig header"] > output_file
This program examines an alignment file from PipMaker to find all matches for a specified position. It also indicates the orientation of the match, and whether it fell in a gap between segments within a local alignment. By default the position number is given with respect to the alignment's first sequence, and the program reports the corresponding position(s) in the second sequence; however this can be inverted using the -inverse option. By specifying a contig header (or the beginning of one) you can limit the search to a particular contig in the second sequence; this is especially useful in inverse mode, to clarify the meaning of the given position number. The contig header is case sensitive, must start with the standard FASTA arrow ">", and will probably need to be enclosed in quotes.
shift-pos position_file -add offset | -reverse seq_length
  [-exons | -repeats | -other] > output_file
This program reads a file containing position intervals, and adjusts the positions in one of two ways. It can either add a given fixed offset to each position, or compute the corresponding position in the reverse complement of the sequence. Several file formats are supported, including those for exons, repeats, hyperlinked annotations, and color underlays, or any other format having lines containing two positions separated and surrounded by spaces.
Lines beginning with "#" and those that do not contain a position interval are left unchanged. Lines that cannot be converted (e.g., because one or both of the positions would end up being less than 1) are flagged with "### FAILED ###" in the output, and are also displayed on the screen (stderr) as a warning.
When computing reverse complement positions, the program will also swap the interval endpoints and make other adjustments as follows. For exons and repeats files, it will interchange the direction indicators ">" vs. "<" and "Right" vs. "Left", respectively. For exons files, it will also reverse the order of the entries, to keep them in ascending numerical order. By default the program will attempt to recognize the file format, but this can be overridden by specifying the file type on the command line.
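The two position adjustments can be written out as a short Python sketch. It is an illustration under the assumption of 1-based, inclusive positions, not the shift-pos source; the function names and the FLIP table are hypothetical.

```python
# Illustrative sketch only, not the shift-pos implementation.
# Assumes 1-based, inclusive positions.

def add_offset(start, end, offset):
    """Shift an interval; return None if a position would fall below 1."""
    new_start, new_end = start + offset, end + offset
    if new_start < 1 or new_end < 1:
        return None
    return new_start, new_end

def reverse_interval(start, end, seq_length):
    """Map an interval onto the reverse complement, swapping the endpoints."""
    new_start, new_end = seq_length + 1 - end, seq_length + 1 - start
    if new_start < 1 or new_end < 1:
        return None
    return new_start, new_end

# Direction indicators that change places when a feature is reversed.
FLIP = {">": "<", "<": ">", "Right": "Left", "Left": "Right"}

shifted = add_offset(100, 200, -50)
failed = add_offset(10, 20, -15)
flipped = reverse_interval(100, 200, 1000)
print(shifted, failed, flipped, FLIP[">"])
```

A None result here corresponds to a line that would be flagged as failed because a position ends up below 1.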
transform-pos position_file align_file [-inverse]
  [-exons | -repeats | -other] > output_file
This program reads a file containing position intervals which are specified with respect to one sequence of a given alignment file from PipMaker, and converts them to the corresponding positions in the alignment's other sequence. Several position file formats are supported, including those for exons, repeats, hyperlinked annotations, and color underlays, or any other format having lines containing two positions separated and surrounded by spaces. By default the conversion is from the alignment's first sequence to its second sequence, but this can be inverted using the -inverse option. Multiple contigs in the second sequence are not supported, since the output is intended to be used as first-sequence annotation for a subsequent PipMaker alignment.
Lines beginning with "#" and those that do not contain a position interval are left unchanged. Lines that cannot be converted (e.g., because one or both of the positions do not align with the other sequence) are flagged with "### FAILED ###" in the output, and are also displayed on the screen (stderr) as a warning.
If a position aligns with several regions in the other sequence, then only the first one encountered is reported. Thus it is best if the alignment file is produced using Advanced PipMaker's "single coverage" or "chaining" options. If the region aligns in reverse complement, the program will swap the interval endpoints, and for exons and repeats files it will also interchange the direction indicators ">" vs. "<" and "Right" vs. "Left", respectively. By default it will attempt to recognize the file format, but this can be overridden by specifying the file type on the command line.
However, note that transform-pos will not attempt to rearrange the order of the lines. Since in general the aligning regions might not appear in the same order in the two sequences, this means the output file may not be sorted. Thus when transforming an exons file (which must be sorted in increasing order for use with PipMaker), the alignment file should be generated using Advanced PipMaker's "chaining" and "search one strand" options; otherwise the output may have to be sorted afterwards (see the sort-exons program).
A related problem arises if an interval's two endpoints align with widely-separated regions in the other sequence. The program transforms the endpoints individually, and even though it performs a few basic sanity checks, there is no guarantee that the resulting "interval" in the other sequence will make sense. Again, Advanced PipMaker's "chaining" and "search one strand" options can help to alleviate this. In addition, transform-pos will print a warning if an interval's endpoints are transformed using different local alignments (though in that case the resulting interval might still be legitimate).
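The endpoint-by-endpoint behavior can be illustrated with a small Python sketch. It is a deliberately reduced model, not the transform-pos source: it handles forward-strand matches only, represents each gap-free segment as a hypothetical (start1, end1, start2) tuple with 1-based inclusive positions, and does not distinguish between local alignments.

```python
# Illustrative sketch only, not the transform-pos implementation.
# A gap-free segment is (start1, end1, start2): forward strand, 1-based inclusive.

def map_position(pos, segments):
    """Return the second-sequence position for pos, using the first segment that covers it."""
    for start1, end1, start2 in segments:
        if start1 <= pos <= end1:
            return start2 + (pos - start1)
    return None  # pos does not align with the other sequence

def map_interval(start, end, segments):
    """Transform each endpoint individually; None marks a line that cannot be converted."""
    new_start, new_end = map_position(start, segments), map_position(end, segments)
    if new_start is None or new_end is None or new_start > new_end:
        return None
    return new_start, new_end

segments = [(10, 20, 110), (30, 40, 500)]
inside = map_interval(12, 18, segments)     # both endpoints in one segment
unaligned = map_interval(12, 25, segments)  # second endpoint is not aligned
stretched = map_interval(15, 35, segments)  # endpoints land in distant regions
print(inside, unaligned, stretched)
```

The last case shows how an interval of 21 positions can turn into one spanning several hundred positions when its endpoints are carried by different aligning regions.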
sort-exons exons_file > output_file
This program sorts an exons file, to try to make it acceptable to PipMaker. It sorts both the genes within the file and the exons within each gene (though only the latter is strictly necessary for PipMaker). It sorts based on the beginning position of each gene or exon, ignoring the ending position. Note that it is still possible for the resulting file to be invalid, for example if two exons overlap, or if an exon or CDS region extends beyond the specified endpoints of the corresponding gene.
Also, note that all lines beginning with a "#" comment character are discarded, since it is not obvious where they belong once the genes and exons have been rearranged. Thus if transform-pos advises that its output needs to be sorted, be sure to fix any "### FAILED ###" lines first, or they will disappear. This program is written in Perl.
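The sorting rule, and one way the result can still be invalid, can be sketched in Python. This is not the sort-exons source (which is written in Perl); the in-memory layout of a gene as a dict with an "exons" list is a hypothetical stand-in for the file format.

```python
# Illustrative sketch only, not the sort-exons implementation.
# Hypothetical layout: each gene is a dict with "name", "start", "end", "exons".

def sort_genes(genes):
    """Sort genes by start, and the exons within each gene by start (ends are ignored)."""
    ordered = sorted(genes, key=lambda g: g["start"])
    return [dict(g, exons=sorted(g["exons"], key=lambda e: e[0])) for g in ordered]

def overlapping_exons(gene):
    """List adjacent exon pairs that overlap; sorting alone cannot repair these."""
    exons = gene["exons"]
    return [(a, b) for a, b in zip(exons, exons[1:]) if b[0] <= a[1]]

genes = [
    {"name": "g2", "start": 900, "end": 1500, "exons": [(1200, 1500), (900, 1000)]},
    {"name": "g1", "start": 100, "end": 800, "exons": [(300, 450), (100, 350)]},
]
result = sort_genes(genes)
print([g["name"] for g in result], overlapping_exons(result[0]))
```

Here g1 comes out first with its exons in order, but the two exons still overlap, so the sorted file would remain invalid.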
find-cpg seq_file > output_file
The PipMaker server automatically locates CpG islands in a submitted sequence, but Laj does not. If you want Laj to display CpG islands the way PipMaker would, you can run this program to find them, and add its output to the end of your repeats file. (The repeats file describes all of the non-exon feature symbols that appear in the panel above the PIP, even if they are not actually repeats.) Even if you are not using Laj, you might want to run this program to obtain the precise locations of CpG islands drawn by PipMaker.
This program uses criteria that match PipMaker's (at least, as of April 2001). That is, it finds regions of at least 50% G+C content where the ratio of CpG dinucleotides relative to GpC is at least 75 or 60 percent. Like PipMaker, Laj will draw these as low gray and white boxes, respectively.
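The two statistics named above can be computed for a single candidate region with a few lines of Python. This sketch is an assumption-laden illustration, not the find-cpg source: it says nothing about how candidate regions are chosen or how long they must be, it expects a non-empty region, and the function names are hypothetical.

```python
# Illustrative sketch only, not the find-cpg implementation.
# Computes the G+C fraction and the CpG/GpC ratio for one non-empty region.

def cpg_stats(region):
    """Return (G+C fraction, CpG/GpC ratio); the ratio is None when there are no GpC pairs."""
    seq = region.upper()
    gc_fraction = sum(seq.count(base) for base in "GC") / len(seq)
    cpg = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")
    gpc = sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "GC")
    ratio = cpg / gpc if gpc else None
    return gc_fraction, ratio

def classify(region):
    """Return 75, 60, or None according to the strictest threshold the region meets."""
    gc_fraction, ratio = cpg_stats(region)
    if gc_fraction < 0.5 or ratio is None:
        return None
    if ratio >= 0.75:
        return 75
    if ratio >= 0.60:
        return 60
    return None

print(classify("CGCGCG"), classify("GCGCGC"), classify("GCGCAT"))
```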
reverse-comp seq_file > output_file
This program reads a sequence file and produces a similar file containing the reverse complement of the given sequence. A notation saying "(reverse complement)" will be added to (or removed from) the end of the FASTA header line. If the file contains multiple sequences (i.e., contigs), then each of them is reversed individually.
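The header toggling and per-contig reversal can be sketched in Python as below. This is an illustration, not the reverse-comp source: it assumes plain FASTA text, handles only the codes A, C, G, T, and N, and writes each sequence on a single line; the exact spacing of the notation is also an assumption.

```python
# Illustrative sketch only, not the reverse-comp implementation.
# Assumes plain FASTA text and the nucleotide codes A, C, G, T, N (either case).

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")
TAG = " (reverse complement)"

def toggle_header(header):
    """Add the notation to the header, or remove it if it is already present."""
    return header[:-len(TAG)] if header.endswith(TAG) else header + TAG

def reverse_fasta(text):
    """Reverse-complement every contig in a FASTA string individually."""
    records = []  # each entry is [header, list of sequence lines]
    for line in text.splitlines():
        if line.startswith(">"):
            records.append([line.rstrip(), []])
        elif records:
            records[-1][1].append(line.strip())
    out = []
    for header, parts in records:
        out.append(toggle_header(header))
        out.append("".join(parts).translate(COMPLEMENT)[::-1])
    return "\n".join(out) + "\n"

fasta = ">c1\nAACG\nT\n>c2 (reverse complement)\nGGN\n"
flipped_fasta = reverse_fasta(fasta)
print(flipped_fasta, end="")
```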
mask-seq seq_file position_file [-char mask_char]
  [-skip "string1" [-skip "string2" [...]]] > output_file
This program reads a sequence file and a file containing position intervals, and produces a new version of the sequence file with the specified intervals masked out, either by changing the nucleotide characters to lowercase (the default) or by replacing them with a specified mask character (e.g., "X" or "N"). The position file will typically be a repeats file, but could also be in the format for exons, hyperlinked annotations, or color underlays, or any other format having lines containing two positions separated and surrounded by spaces. It is also possible to skip lines in the position file that contain specified strings of characters; this makes it easy to avoid masking particular types of repeats, for example, without having to edit the repeats file manually. Each such string is case sensitive, and should be enclosed in quotes, especially if it contains any spaces.
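The masking and skipping behavior can be sketched in Python as follows. This is an illustration, not the mask-seq source: the regular expression is one possible reading of "two positions separated and surrounded by spaces", ignoring lines that begin with "#" is an assumption, and the sample position lines are hypothetical.

```python
# Illustrative sketch only, not the mask-seq implementation.
import re

# Two integers separated by whitespace and bounded by whitespace or line ends.
INTERVAL = re.compile(r"(?:^|\s)(\d+)\s+(\d+)(?=\s|$)")

def read_intervals(lines, skip=()):
    """Yield (start, end) from position lines, ignoring comments and skipped strings."""
    for line in lines:
        if line.startswith("#") or any(s in line for s in skip):
            continue
        match = INTERVAL.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2))

def mask(seq, intervals, mask_char=None):
    """Lowercase the intervals (the default) or overwrite them with mask_char."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 1) - 1, min(end, len(chars))):
            chars[i] = mask_char if mask_char else chars[i].lower()
    return "".join(chars)

lines = ["# sample positions", "2 4 Right Alu", "6 7 Simple"]
kept = list(read_intervals(lines, skip=["Simple"]))
print(kept, mask("ACGTACGT", kept), mask("ACGTACGT", kept, mask_char="N"))
```

Matching on the skip strings is case sensitive here, in line with the description above.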
Cathy Riemer, January 2002