TABLE OF CONTENTS
This page provides examples, from a biologist's perspective, of how the PipTools programs can be used in real-life situations. It does not attempt to describe every tool exhaustively, but rather to give a flavor for what the tools can do, and especially how they can be combined to accomplish more complex tasks.
We have designed this toolkit in a modular fashion, where each program performs one small function, in order to provide more flexibility to meet your needs. Some tasks can be handled in one step by a single program (these should be fairly straightforward after consulting the Program Reference page), while others will require a multi-step procedure that applies several tools in succession. The examples shown here illustrate a few common situations that can be dealt with through "recipes" of this sort; additional combinations and variations are left to your creativity.
Keep in mind that when running the Perl versions of the tools on certain systems, you need to add perl -S to the beginning of each command; see Syntax Conventions for more information. (This does not apply to sim4, which is written in C.)
If you have any suggestions for additional tools or improvements that would make your work easier, please contact us at email@example.com.
Suppose you just finished assembling some human contigs and want to show how they align with the corresponding mouse sequence, or vice-versa. You could submit the sequences to PipMaker, but the output would show only the alignments, with no annotation -- not even where the known genes are. This leaves the reader with no useful information about the location of other interesting features in the region, such as putative regulatory elements. How can you build some annotation files so PipMaker will generate more informative results, and allow you to produce a nice figure for your paper?
One place to start is with the GenBank entry for a given sequence, which often contains annotations for known features, including the locations of genes and exons. The genbank2exons program can build an exons file directly from the GenBank flat file. However, this output identifies features with respect to the reference sequence for that GenBank entry, so the coordinates may not correspond to the correct positions of these genes and exons in your sequence. What to do?
If the reference sequence is an exact subsequence of your larger one (or vice-versa), one possibility is to use the shift-pos program, which can adjust all of the coordinates in the exons file by an offset you specify. So for instance, if your output file from genbank2exons was called "human_exons", you could fix it using the following command:
shift-pos human_exons -add 144562 > new_exons
However, sometimes the relationship between the GenBank reference sequence and your genomic sequence is not so simple. For example, suppose you've assembled several overlapping contigs that each have their own numbering system (i.e., positions in each contig start with 1). In cases like this, the sim4 program (available separately) provides a more flexible and robust way to translate the coordinates. sim4 is an alignment program; it has an option to produce output in exons-file format, but you need to feed it mRNA sequences to align with the genomic sequence. You can get these using the exons2mrna tool, which turns the output from genbank2exons into putative mRNA sequences for sim4. So in this case, the steps would be:
genbank2exons genbank_file > exons_file
exons2mrna exons_file ref_seq > mrna_fileNote that if the endpoints of the coding sequences are available in the GenBank entry, they will be added to the FASTA header lines in the output.
sim4 genomic_seq mrna_file A=5 > exons_file2
Now suppose that when you submit your sequences and exons file to PipMaker, the resulting pip shows some unexpected regions of conservation. Are you sure they are not genes? Perhaps they just hadn't been identified when the GenBank sequence was annotated.
You might want to try a Genscan analysis of these regions, to see if it thinks they look like genes. And to get the most out of the results, it would be helpful to map the coordinates of Genscan's predictions onto the pip. One way to do this is to use the genscan2exons program to convert your Genscan output into a PipMaker exons file:
genscan2exons genscan_output > genscan_exonsThis is also handy for generating a putative exons file when there are no GenBank annotations for your sequence.
PipMaker's color underlay feature allows you to highlight arbitrary regions of a pip with colors of your choosing. Among other things, it can be used to compare similar information from different sources. For example, if you already have an exons file (perhaps one that you built from GenBank entries) and want to show Genscan predictions on the pip simultaneously for comparison, you can display the predictions as color underlays to avoid conflict with the exons file. The genscan2underlays program converts Genscan output into a PipMaker underlay file, using the colors you specify to represent promoters, exons, and polyA sites on the forward and reverse strands.
genscan2underlays genscan_output -fprom LightGreen -fexon LightBlue -fplya LightRed -rprom LightGreen -rexon LightBlue -rplya LightRed > genscan_colors(Note that these example commands have been wrapped for easier reading, but in practice you would type them all on one line.)
You can also generate an underlay file from an exons file, using the exons2underlays program. In this case you specify colors to represent exons, UTRs, and introns on the forward and reverse strands (unlike genscan2underlays, this program will assume that colors you omit should be the same as the ones you specify).
exons2underlays exons_file2 -fexon LightBlue -futr LightOrange -intron LightYellow > exon_colors
You may have noticed that there are two ways to generate an underlay file from Genscan output: (1) run genscan2exons followed by exons2underlays, or (2) run genscan2underlays directly. Generally the latter is preferable, since it allows you to highlight promoters and polyA sites in addition to exons. The exons2underlays program is primarily intended for use with exons files from other sources, especially those containing CDS information so the UTRs can be highlighted (e.g., exons files built from GenBank entries).
PipMaker draws exons, underlays, and other annotations relative to the first (top) sequence. This means that if you want to flip the whole comparison over and look at it with the mouse as the primary sequence, you need to map the positions in your annotation files (exons, underlays, etc.) from the human sequence to their aligning positions in the mouse sequence. A tedious task by hand, but the transform-pos program makes it much easier.
transform-pos exons_file2 lav_file > mouse_exonsThe result may not have all of the annotations if some coordinates do not align with the other sequence, but the program will warn you about these trouble spots, and you'll have a good start toward an annotated sequence that you can refine to your liking.
Note that if the original pip was made with a fragmented second sequence, you will have to split it up into separate pips, because (1) transform-pos does not support multi-contig alignments, and (2) when you exchange the sequences, PipMaker does not currently allow the first sequence to be fragmented anyway. This means that each separate alignment file from step 1 will be smaller, and so in step 2 there will be more annotations that cannot be translated using that file (because they align with some other contig). You will need to sift through the output to determine which "failures" are real (a dotplot view of the original pip, provided by PipMaker and/or Laj, could be very helpful here).
In a related situation, you might have annotation files for an alignment's second sequence, but not for the first. In this case you could use transform-pos with the "inverse" option to map the annotation coordinates from the second sequence onto the first one. In our example, you would supply a file of, say, exons in the mouse sequence:
transform-pos mouse_exons lav_file -inverse > human_exons
To complete the analysis of sequence conservation in a genomic alignment, you might like to know the positions of all elements that meet specified conservation thresholds. We have developed the strong-hits tool for this purpose. It examines an lav or concise alignment file from PipMaker, and reports all regions that meet the requirements you stipulate for the minimum length of ungapped alignment and percent identity. Furthermore, you can provide a file containing a list of particular regions, and the output will indicate if the conserved elements it finds overlap with any of the intervals in your file. So for example, you could supply an exons file in order to find out which of the strong hits are exons, and (of more interest) which are not exons and may therefore represent putative regulatory elements.
strong-hits lav_file -len 100 -pct 70 -overlaps exons_file2 -exons-only > conserved_elements
You could then use this information to create a color underlay file, using different colors to indicate the percent identity levels, e.g., 70-79% LightPink, 80-89% LightRed, and 90-100% DarkRed. When this information is added to an underlay file describing genes, you will have a description of the alignment that is quite complete, highlighting the positions of both genes and conserved non-coding elements.