File Formats for the PipTools Programs



Introduction

This page describes some of the file formats you will encounter while using the PipTools programs, particularly the formats that are specific to PipMaker and Laj. Other files, such as those obtained from GenBank, Genscan, and RepeatMasker, are not described here because (1) they should be documented at their home sites, and (2) you can usually just use them "as is" with the PipTools programs, without having to edit their contents.

All files must consist solely of plain text characters.

Sequences

Sequences are supplied in FASTA format, which looks like this:

     >Sequence name and arbitrary header text on one line
     ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
     GCGATCGATGTGCTAGATCAGATGACA
     ... etc.

At the present time, our software handles only the letters A, C, G, T, N, and X (and their lowercase versions, if you are using Advanced PipMaker's user-controlled masking). For maximum interoperability, the sequence data should consist of short lines limited to about 70 characters, and it is generally best to keep the header line to a reasonable length as well. Some of the tools allow you to include several of these sequences in a single file, each with its own header line (i.e., multiple contigs).
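
The rules above can be checked before a sequence is submitted. The following is a minimal sketch in Python, not part of PipTools; the function name read_fasta and its allow_lowercase parameter are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Optional

UPPERCASE_LETTERS = set("ACGTNX")


def read_fasta(lines: Iterable[str], allow_lowercase: bool = False) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, rejecting unsupported letters.

    allow_lowercase models the case where lowercase masking is in use
    (an assumption about how a caller would want to toggle the check).
    """
    allowed = (UPPERCASE_LETTERS | set("acgtnx")) if allow_lowercase else UPPERCASE_LETTERS
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:].strip()
            chunks = []
        else:
            if header is None:
                raise ValueError("sequence data found before the first header line")
            bad = set(line) - allowed
            if bad:
                raise ValueError(f"unsupported letters: {sorted(bad)}")
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records
```

For example, read_fasta(open("seq.fa")) would return one entry per contig, where seq.fa is a placeholder file name.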

Exons

This file lists the locations of genes, exons, and coding regions in a sequence (typically the first sequence to be aligned by PipMaker). The directionality of a gene (">" or "<"), its start and end positions, and name should be on one line, followed by an optional line beginning with a "+" character that indicates the first and last nucleotides of the translated region (including the initiation codon, Met, and the stop codon). These are followed by lines specifying the start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand ("<"). By default PipMaker and Laj will supply exon numbers, but you can override this by specifying your own name or number for individual exons. Blank lines are ignored, and you can put an optional title line at the top. Thus, the file might begin as follows:

     My favorite genomic region

     > 100 800 XYZZY
     + 150 750
     100 200
     600 800

     < 1000 2000 Plugh gene
     1000 1200 exon 1
     1400 1500 alt. spliced exon
     1800 2000 exon 2

     ... etc.
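
To make the layout of this format concrete, here is a minimal Python sketch of a reader for it. It is not the parser used by PipMaker or Laj. The names Gene and parse_exons are hypothetical, and two behaviors are assumptions: a title is recognized only as a first line that does not start with ">", "<", "+", or a digit, and "increasing address" is checked by comparing exon start positions.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Gene:
    strand: str                             # ">" or "<"
    start: int
    end: int
    name: str
    cds: Optional[Tuple[int, int]] = None   # from the optional "+" line
    exons: List[Tuple[int, int, Optional[str]]] = field(default_factory=list)


def parse_exons(text: str) -> Tuple[Optional[str], List[Gene]]:
    """Return (title, genes) parsed from the text of an exons file."""
    title: Optional[str] = None
    genes: List[Gene] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line[0] in "<>":
            fields = line[1:].split(None, 2)
            name = fields[2] if len(fields) > 2 else ""
            genes.append(Gene(line[0], int(fields[0]), int(fields[1]), name))
        elif line[0] == "+":
            if not genes:
                raise ValueError("'+' line appears before any gene line")
            first, last = line[1:].split()[:2]
            genes[-1].cds = (int(first), int(last))
        elif line[0].isdigit():
            if not genes:
                raise ValueError("exon line appears before any gene line")
            fields = line.split(None, 2)
            start, end = int(fields[0]), int(fields[1])
            label = fields[2] if len(fields) > 2 else None  # user-supplied exon name
            exons = genes[-1].exons
            if exons and start <= exons[-1][0]:
                raise ValueError(f"exons out of order at: {line!r}")
            exons.append((start, end, label))
        elif title is None and not genes:
            title = line  # optional title line at the top
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return title, genes
```

Exons without a user-supplied name are stored with a label of None, since the numbering that PipMaker and Laj supply by default is not reproduced here.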

Repeats

This file lists repeats and other features in a sequence (typically the first sequence to be aligned by PipMaker). The first line tells PipMaker that this is a simplified repeats file (as opposed to RepeatMasker output), and each subsequent line specifies the start, end, direction, and type of a particular feature.

     %:repeats

     1081 1364 Right Alu
     1365 1405 Simple
     ... etc.

The allowed types are: Alu, B1, B2, SINE, LINE1, LINE2, MIR, LTR, DNA, RNA, Simple, CpG60, CpG75, and Other. Of these, all except Simple, CpG60, and CpG75 require a direction (Right or Left).
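
The type and direction rules can be expressed as a small validating reader. This is a sketch in Python under stated assumptions, not PipMaker's own code: the name parse_repeats is hypothetical, and it assumes that a direction, when given for Simple, CpG60, or CpG75, is tolerated as long as it is Right or Left (the text above only says it is not required).

```python
from typing import List, Optional, Tuple

DIRECTED_TYPES = {"Alu", "B1", "B2", "SINE", "LINE1", "LINE2",
                  "MIR", "LTR", "DNA", "RNA", "Other"}
UNDIRECTED_TYPES = {"Simple", "CpG60", "CpG75"}
DIRECTIONS = {"Right", "Left"}

Feature = Tuple[int, int, Optional[str], str]  # (start, end, direction, type)


def parse_repeats(text: str) -> List[Feature]:
    """Parse a simplified repeats file into a list of features."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "%:repeats":
        raise ValueError("first line must be '%:repeats'")
    features: List[Feature] = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) == 4:
            start, end, direction, kind = fields
        elif len(fields) == 3:
            start, end, kind = fields
            direction = None
        else:
            raise ValueError(f"expected 3 or 4 fields: {line!r}")
        if kind not in DIRECTED_TYPES | UNDIRECTED_TYPES:
            raise ValueError(f"unknown repeat type: {kind!r}")
        if kind in DIRECTED_TYPES and direction is None:
            raise ValueError(f"type {kind} requires a direction: {line!r}")
        if direction is not None and direction not in DIRECTIONS:
            raise ValueError(f"direction must be Right or Left: {line!r}")
        features.append((int(start), int(end), direction, kind))
    return features
```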

Annotations

This file contains user-supplied annotations, i.e., links to web sites providing information about particular regions in a sequence (typically the first sequence to be aligned by PipMaker). It first defines various types of hyperlinks and associates a color with each of them, then specifies the type, position, description, and URL for each annotated feature. This is a change from the format formerly used by Laj.

     # annotations for part of the mouse MHC class II region

     %define type
     %name PubMed
     %color Blue

     %define type
     %name LocusLink
     %color Orange

     %define annotation
     %type PubMed
     %range 1 2000
     %label Yang et al. 1997.  Daxx, a novel Fas-binding protein...
     %summary Yang, X., Khosravi-Far, R., Chang, H., and Baltimore, D. (1997).
       Daxx, a novel Fas-binding protein that activates JNK and apoptosis.
       Cell 89(7):1067-76.
     %url http://www.ncbi.nlm.nih.gov:80/entrez/
     query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9215629&dopt=Abstract

     ... etc.

Here, for example, the first stanza requests that each feature subsequently identified as a PubMed entry be colored blue. The name must be a single word, perhaps containing underscore characters (e.g., Entry_in_GenBank). Colors start with capital letters, and must come from PipMaker's color list.

The third stanza associates a PubMed annotation with positions 1-2000 in the sequence. Note that summaries and URLs (but not labels) can be broken into several lines for convenience; the line breaks are removed when the file is read, but they are not replaced with spaces. Thus a continuation line for a summary typically begins with a space to separate it from the last word of the previous line, while a URL continuation does not. If the summary is omitted, it is assumed to be the same as the label.

Also note that stanzas should be separated by blank lines, and lines beginning with a "#" character are comments that are ignored by PipMaker and Laj. Several annotations can overlap at the same position with no problem; they will be displayed in multiple rows if necessary.
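
The continuation rule is the subtle part of this format, so here is a minimal Python sketch of a reader that follows it. It is an illustration and not the parser used by PipMaker or Laj; the name parse_annotations is hypothetical. It assumes that any line inside a stanza that does not begin with "%" or "#" continues the preceding %summary or %url field, and that the file has no extra indentation (unlike the indented sample above).

```python
from typing import Dict, List, Optional, Tuple

CONTINUABLE = {"summary", "url"}  # labels may not span several lines


def parse_annotations(text: str) -> Tuple[Dict[str, str], List[dict]]:
    """Return ({type name: color}, [annotation dicts]) from an annotations file."""
    types: Dict[str, str] = {}
    annotations: List[dict] = []
    stanza: Optional[dict] = None
    last_key: Optional[str] = None

    def finish() -> None:
        """Store the stanza collected so far, if any."""
        nonlocal stanza, last_key
        if stanza is not None:
            fields = {key: value.strip() for key, value in stanza.items()}
            kind = fields.pop("define")
            if kind == "type":
                types[fields["name"]] = fields["color"]
            elif kind == "annotation":
                first, last = fields["range"].split()
                fields["range"] = (int(first), int(last))
                # An omitted summary is assumed to equal the label.
                fields.setdefault("summary", fields["label"])
                annotations.append(fields)
            else:
                raise ValueError(f"unknown stanza kind: {kind!r}")
        stanza, last_key = None, None

    for raw in text.splitlines():
        if raw.startswith("#"):
            continue  # comment line
        if not raw.strip():
            finish()  # a blank line ends the current stanza
            continue
        if raw.startswith("%"):
            key, _, value = raw[1:].partition(" ")
            if key == "define":
                finish()
                stanza = {}
            elif stanza is None:
                raise ValueError(f"field outside a stanza: {raw!r}")
            stanza[key] = value.lstrip()
            last_key = key
        elif stanza is not None and last_key in CONTINUABLE:
            # The line break is removed and NOT replaced with a space.
            stanza[last_key] += raw
        else:
            raise ValueError(f"unexpected continuation line: {raw!r}")
    finish()
    return types, annotations
```

Because continuation lines are joined as-is, a summary continuation that starts with a space keeps that space, while a URL continuation is glued directly onto the previous line.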

Underlays

This file contains user-specified underlays, i.e., colored bands to be painted on the percent identity plot. Currently there are two different formats for this information: the regular format accepted by both PipMaker and Laj, and an additional labeled one that is only used by Laj. The regular format looks like this:

     # sample underlays for the BTK region

     LightYellow Gene
     Green Exon
     Red Strongly_conserved

     35324 72009 Gene
     49781 49849 Exon
     51403 51484 Exon
     50350 50513 Strongly_conserved +
     52376 52603 Strongly_conserved
     ... etc.

The first set of lines describes the intended meaning of the colors, while the second group specifies the location of each band. Colors start with capital letters and must come from PipMaker's color list, but the meaning of each color can be any single word chosen by you. A "+" or "-" character at the end of a location line will paint just the upper or lower half of the band, respectively. This allows you to differentiate between the two strands, or to plot potentially overlapping features like gene predictions and database matches. Note that if two bands overlap, the one that was specified last in the file appears "on top" and obscures the earlier one. Thus in this example, the green exons and red strongly conserved regions cover up parts of the long yellow band representing the gene. As in the annotations file, lines beginning with a "#" character are comments that will be ignored.

The second format is similar to the first one, but it allows you to specify a label for each color band which will be displayed in Laj's message box when the user points the mouse at that band. The color definition lines are the same as for the regular format, but the location lines look like this:

     35324 72009 (Here is one label) Gene
     50350 50513 (Here is another one) Strongly_conserved +

An underlay file for Laj can contain a mixture of these two formats (i.e., the label is optional). The parentheses must be present if the label is, and the label itself cannot contain any additional parentheses. (Note that the dummy item formerly required by this format is no longer necessary; it is still supported for your old files, but its use is discouraged.)
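
Both underlay formats, and the rule that later bands cover earlier ones, can be sketched in a few lines of Python. This is an illustration only, not the code used by PipMaker or Laj. The names parse_underlays and colors_at are hypothetical, the regular expression is one assumed reading of the location-line layout, and the old dummy item is not handled.

```python
import re
from typing import Dict, List, Optional, Tuple

# start end [(label)] meaning [+|-]
LOCATION = re.compile(r"^(\d+)\s+(\d+)\s+(?:\(([^()]*)\)\s+)?(\S+)(?:\s+([+-]))?$")

Band = Tuple[int, int, Optional[str], str, Optional[str]]  # start, end, label, meaning, half


def parse_underlays(text: str) -> Tuple[Dict[str, str], List[Band]]:
    """Return ({meaning: color}, [bands in file order]) from an underlay file."""
    colors: Dict[str, str] = {}
    bands: List[Band] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        match = LOCATION.match(line)
        if match:
            start, end, label, meaning, half = match.groups()
            if meaning not in colors:
                raise ValueError(f"no color defined for {meaning!r}")
            bands.append((int(start), int(end), label, meaning, half))
        else:
            fields = line.split()
            if len(fields) != 2:
                raise ValueError(f"unrecognized line: {line!r}")
            color, meaning = fields
            colors[meaning] = color
    return colors, bands


def colors_at(pos: int, colors: Dict[str, str],
              bands: List[Band]) -> Tuple[Optional[str], Optional[str]]:
    """Return the (upper half, lower half) colors visible at a position.

    Bands are applied in file order, so a later band overwrites an earlier one.
    """
    upper: Optional[str] = None
    lower: Optional[str] = None
    for start, end, _label, meaning, half in bands:
        if start <= pos <= end:
            if half in (None, "+"):
                upper = colors[meaning]
            if half in (None, "-"):
                lower = colors[meaning]
    return upper, lower
```

With the sample file above, a position inside an exon would report Green for both halves, because the Exon band is listed after the Gene band that also covers it.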

Alignments

Some of the tools require a file containing local alignments as computed by PipMaker. This can be the "concise textual form of the alignments" or the "raw blastz output"; either one will do. (These are known as concise and lav files, respectively.)

You can obtain alignments in these formats by checking the "Select output" boxes on the Advanced PipMaker form. Note that PipMaker sends back your results as email attachments, in this case using a quoted-printable MIME format. Make sure this gets decoded into true plain text when saving the attachment, or the programs are likely to report errors due to the not-quite-right file format.
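
If your mail program saves the attachment without decoding it, the quoted-printable encoding can be undone afterwards. The sketch below uses Python's standard quopri module; the function name decode_attachment is hypothetical, and decoding strictly as ASCII is an assumption based on the plain-text requirement stated in the Introduction.

```python
import quopri


def decode_attachment(raw: bytes) -> str:
    """Decode quoted-printable bytes into plain text.

    Soft line breaks ("=" at the end of a line) are removed and escapes
    such as "=3D" are turned back into the characters they stand for.
    """
    text = quopri.decodestring(raw).decode("ascii")
    # Normalize mail-style CRLF line endings to plain newlines.
    return text.replace("\r\n", "\n")
```

A typical use would be to read the saved attachment in binary mode, pass its contents to decode_attachment, and write the result to a new file before giving it to the other tools.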



Cathy Riemer, June 2001