This page describes some of the file formats you will encounter while using the PipTools programs, particularly the formats that are specific to PipMaker and Laj. Other files, such as those obtained from GenBank, Genscan, and RepeatMasker, are not described here because (1) they should be documented at their home sites, and (2) you can usually just use them "as is" with the PipTools programs, without having to edit their contents.
All files must consist solely of plain text characters.
Sequences are supplied in FASTA format, which looks like this:
>Sequence name and arbitrary header text on one line
ACGTGCGCGATCGCCTGCTAGGCGTACGTCGCAG
GCGATCGATGTGCTAGATCAGATGACA
... etc.
At the present time, our software handles only the letters A, C, G, T, N, and X
(and their lowercase versions, if you are using Advanced PipMaker's
user-controlled masking). For maximum interoperability, the
sequence data should consist of short lines limited to about 70
characters, and it is generally best to keep the header line to a
reasonable length as well. Some of the tools allow you to include
several of these sequences in a single file, each with its own
header line (i.e., multiple contigs).
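The constraints above can be checked before a sequence is submitted. The following is a minimal sketch in Python, not part of the PipTools programs; the function name `check_fasta` and its parameters are hypothetical, and the 70-character limit is treated as a warning threshold rather than a hard rule.

```python
ALLOWED = set("ACGTNX")

def check_fasta(text, allow_lowercase=False, max_line=70):
    """Return a list of (line_number, message) problems found in FASTA text."""
    # Lowercase letters are only meaningful with user-controlled masking.
    allowed = ALLOWED | (set("acgtnx") if allow_lowercase else set())
    problems = []
    seen_header = False
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">"):
            seen_header = True  # header text is arbitrary, so it is not checked
            continue
        if not line.strip():
            continue
        if not seen_header:
            problems.append((number, "sequence data before any header line"))
        bad = sorted(set(line) - allowed)
        if bad:
            problems.append((number, "unsupported characters: " + "".join(bad)))
        if len(line) > max_line:
            problems.append((number, f"line longer than {max_line} characters"))
    return problems
```

An empty result means no problems were detected by this sketch; it does not guarantee that every tool will accept the file.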
The exons file lists the locations of genes, exons, and coding regions in a sequence (typically the first sequence to be aligned by PipMaker). The directionality of a gene (">" or "<"), its start and end positions, and its name should be on one line, followed by an optional line beginning with a "+" character that indicates the first and last nucleotides of the translated region (including the initiation codon, Met, and the stop codon). These are followed by lines specifying the start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand ("<"). By default PipMaker and Laj will supply exon numbers, but you can override this by specifying your own name or number for individual exons. Blank lines are ignored, and you can put an optional title line at the top. Thus, the file might begin as follows:
My favorite genomic region
> 100 800 XYZZY
+ 150 750
100 200
600 800
< 1000 2000 Plugh gene
1000 1200 exon 1
1400 1500 alt. spliced exon
1800 2000 exon 2
... etc.
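To make the layout concrete, here is a rough Python sketch of a reader for this format. It is an illustration only, not the parser used by PipMaker or Laj. The function name `parse_exons` is hypothetical, fields are assumed to be separated by whitespace, and the title is assumed to be a first line that does not begin with ">", "<", "+", or a number. Exons without a user-supplied name are stored with `None`, since the numbering that the tools supply by default is not reproduced here.

```python
def parse_exons(text):
    """Parse exons-file text into (title, genes); each gene is a dict."""
    title = None
    genes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        fields = line.split()
        if fields[0] in (">", "<"):
            # Gene line: direction, start, end, name.
            genes.append({
                "strand": fields[0],
                "start": int(fields[1]),
                "end": int(fields[2]),
                "name": " ".join(fields[3:]),
                "cds": None,
                "exons": [],
            })
        elif fields[0] == "+":
            # Optional translated region for the most recent gene.
            if not genes:
                raise ValueError("'+' line before any gene line")
            genes[-1]["cds"] = (int(fields[1]), int(fields[2]))
        elif fields[0].isdigit():
            # Exon line: start, end, optional name or number.
            if not genes:
                raise ValueError("exon line before any gene line")
            exons = genes[-1]["exons"]
            start, end = int(fields[0]), int(fields[1])
            if exons and start < exons[-1][0]:
                raise ValueError("exons must be listed in increasing order")
            exons.append((start, end, " ".join(fields[2:]) or None))
        elif title is None and not genes:
            title = line  # assumed: free text before the first gene
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return title, genes
```

Applied to the example above (without the "... etc." line), this returns the title "My favorite genomic region" and two genes, the second of which has three named exons.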
The repeats file lists repeats and other features in a sequence (typically the first sequence to be aligned by PipMaker). The first line tells PipMaker that this is a simplified repeats file (as opposed to RepeatMasker output), and each subsequent line specifies the start, end, direction, and type of a particular feature.
%:repeats
1081 1364 Right Alu
1365 1405 Simple
... etc.
The allowed types are: Alu, B1, B2, SINE, LINE1, LINE2, MIR, LTR, DNA, RNA, Simple, CpG60, CpG75, and Other. Of these, all except Simple, CpG60, and CpG75 require a direction (Right or Left).
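The rules for types and directions can be expressed as a small validating reader. This Python sketch is an illustration under stated assumptions, not PipMaker's own code. The name `parse_repeats` is hypothetical, blank lines are assumed to be skippable, and a direction on a Simple, CpG60, or CpG75 feature is accepted rather than rejected because the text above only says it is not required.

```python
ALLOWED_TYPES = {
    "Alu", "B1", "B2", "SINE", "LINE1", "LINE2", "MIR",
    "LTR", "DNA", "RNA", "Simple", "CpG60", "CpG75", "Other",
}
NO_DIRECTION_NEEDED = {"Simple", "CpG60", "CpG75"}

def parse_repeats(text):
    """Return a list of (start, end, direction, type) tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "%:repeats":
        raise ValueError("first line must be %:repeats")
    features = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) == 4:
            direction, kind = fields[2], fields[3]
        elif len(fields) == 3:
            direction, kind = None, fields[2]
        else:
            raise ValueError(f"expected 3 or 4 fields: {line!r}")
        if kind not in ALLOWED_TYPES:
            raise ValueError(f"unknown type: {kind}")
        if direction not in (None, "Right", "Left"):
            raise ValueError(f"unknown direction: {direction}")
        if direction is None and kind not in NO_DIRECTION_NEEDED:
            raise ValueError(f"type {kind} requires Right or Left")
        features.append((int(fields[0]), int(fields[1]), direction, kind))
    return features
```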
The annotations file contains user-supplied annotations, i.e., links to web sites providing information about particular regions in a sequence (typically the first sequence to be aligned by PipMaker). It first defines various types of hyperlinks and associates a color with each of them, then specifies the type, position, description, and URL for each annotated feature. This format differs from the one formerly used by Laj.
# annotations for part of the mouse MHC class II region

%define type
%name PubMed
%color Blue

%define type
%name LocusLink
%color Orange

%define annotation
%type PubMed
%range 1 2000
%label Yang et al. 1997. Daxx, a novel Fas-binding protein...
%summary Yang, X., Khosravi-Far, R. Chang, H., and Baltimore, D. (1997).
 Daxx, a novel Fas-binding protein that activates JNK and apoptosis.
 Cell 89(7):1067-76.
%url http://www.ncbi.nlm.nih.gov:80/entrez/
query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9215629&dopt=Abstract
... etc.
Here, for example, the first stanza requests that each feature
subsequently identified as a PubMed entry be colored blue. The name
must be a single word, perhaps containing underscore characters (e.g.,
Entry_in_GenBank). Colors start with capital letters and must come
from PipMaker's color list.
The third stanza associates a PubMed annotation with positions 1-2000 in the sequence. Note that summaries and URLs (but not labels) can be broken into several lines for convenience; the line breaks are removed when the file is read, but they are not replaced with spaces. Thus a continuation line for a summary typically begins with a space to separate it from the last word of the previous line, while a URL continuation does not. If the summary is omitted, it is assumed to be the same as the label.
Also note that stanzas should be separated by blank lines, and lines beginning with a "#" character are comments that are ignored by PipMaker and Laj. Several annotations can overlap at the same position with no problem; they will be displayed in multiple rows if necessary.
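The continuation rule is the part of this format that is easiest to get wrong, so here is a short Python sketch of a reader that applies it. This is an illustration, not the code used by PipMaker or Laj. The function name `parse_annotations` and the dictionary layout are hypothetical, and as an extra assumption a `%define` line is also allowed to start a new stanza when the separating blank line is missing.

```python
def parse_annotations(text):
    """Return (types, annotations) parsed from annotations-file text."""
    types = {}        # type name -> color name
    annotations = []  # one dict per "%define annotation" stanza
    stanza = None     # fields of the stanza currently being read
    last_key = None   # most recent %key, used for continuation lines

    def finish(fields):
        if fields is None:
            return
        if fields["define"] == "type":
            types[fields["name"]] = fields["color"]
        elif fields["define"] == "annotation":
            start, end = (int(n) for n in fields["range"].split())
            annotations.append({
                "type": fields["type"],
                "range": (start, end),
                "label": fields["label"],
                # An omitted summary defaults to the label.
                "summary": fields.get("summary", fields["label"]),
                "url": fields.get("url"),
            })
        else:
            raise ValueError(f"unknown stanza kind: {fields['define']}")

    for raw in text.splitlines():
        line = raw.rstrip()  # keep leading spaces: they matter below
        if line.startswith("#"):
            continue  # comment
        if not line:
            finish(stanza)  # a blank line ends the stanza
            stanza, last_key = None, None
        elif line.startswith("%"):
            key, _, value = line[1:].partition(" ")
            if key == "define":
                finish(stanza)
                stanza = {}
            if stanza is None:
                raise ValueError(f"field outside a stanza: {line!r}")
            stanza[key] = value.strip()
            last_key = key
        elif last_key in ("summary", "url"):
            # Line breaks are removed and NOT replaced with spaces.
            stanza[last_key] += line
        else:
            raise ValueError(f"unexpected continuation line: {line!r}")
    finish(stanza)
    return types, annotations
```

Because the pieces are joined with nothing in between, a summary continuation needs its own leading space, while the two halves of a URL join cleanly.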
The underlay file contains user-specified underlays, i.e., colored bands to be painted on the percent identity plot. Currently there are two different formats for this information: the regular format accepted by both PipMaker and Laj, and an additional labeled one that is only used by Laj. The regular format looks like this:
# sample underlays for the BTK region
LightYellow Gene
Green Exon
Red Strongly_conserved
35324 72009 Gene
49781 49849 Exon
51403 51484 Exon
50350 50513 Strongly_conserved +
52376 52603 Strongly_conserved
... etc.
The first set of lines describes the intended meaning of the colors, while the second group specifies the location of each band. Colors start with capital letters and must come from PipMaker's color list, but the meaning of each color can be any single word chosen by you. A "+" or "-" character at the end of a location line will paint just the upper or lower half of the band, respectively. This allows you to differentiate between the two strands, or to plot potentially overlapping features like gene predictions and database matches. Note that if two bands overlap, the one that was specified last in the file appears "on top" and obscures the earlier one. Thus in this example, the green exons and red strongly conserved regions cover up parts of the long yellow band representing the gene. As in the annotations file, lines beginning with a "#" character are comments that will be ignored.
The second format is similar to the first one, but it allows you to specify a label for each color band which will be displayed in Laj's message box when the user points the mouse at that band. The color definition lines are the same as for the regular format, but the location lines look like this:
35324 72009 (Here is one label) Gene
50350 50513 (Here is another one) Strongly_conserved +
An underlay file for Laj can contain a mixture of these two formats
(i.e., the label is optional). The parentheses must be present if
the label is, and the label itself cannot contain any additional
parentheses. (Note that the dummy item formerly
required by this format is no longer necessary; it is still supported
for your old files, but its use is discouraged.)
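Since the label is optional, one reader can handle both underlay formats. The Python sketch below shows one way to do it with a regular expression. It is an illustration only, not the parser used by PipMaker or Laj. The name `parse_underlays` is hypothetical, a meaning is assumed to be defined before any band uses it, and the older dummy-item variant is not handled.

```python
import re

LOCATION = re.compile(
    r"^(\d+)\s+(\d+)\s+"      # start and end positions
    r"(?:\(([^()]*)\)\s+)?"   # optional label in parentheses (Laj only)
    r"(\S+)"                  # meaning: a single word
    r"(?:\s+([+-]))?$"        # optional "+" or "-" half-band marker
)

def parse_underlays(text):
    """Return (colors, bands) parsed from underlay-file text."""
    colors = {}  # meaning -> color name
    bands = []   # (start, end, label, meaning, half) in file order
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line (assumed skippable) or comment
        match = LOCATION.match(line)
        if match:
            start, end, label, meaning, half = match.groups()
            if meaning not in colors:
                raise ValueError(f"no color defined for {meaning}")
            bands.append((int(start), int(end), label, meaning, half))
        else:
            fields = line.split()
            if len(fields) != 2:
                raise ValueError(f"unrecognized line: {line!r}")
            color, meaning = fields
            colors[meaning] = color
    return colors, bands
```

The bands are kept in file order because later bands are painted on top of earlier ones. A label that contains extra parentheses fails to match the pattern and is reported as an unrecognized line.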
Some of the tools require a file containing local alignments as computed by PipMaker. This can be the "concise textual form of the alignments" or the "raw blastz output"; either one will do. (These are known as concise and lav files, respectively.)
You can obtain alignments in these formats by checking the "Select output" boxes on the Advanced PipMaker form. Note that PipMaker sends back your results as email attachments, in this case using a quoted-printable MIME format. Make sure the attachment is decoded into true plain text when you save it; otherwise the programs are likely to report errors because the file format is subtly wrong.
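If your mail program saves the attachment without decoding it, the decoding can be done afterwards. The following Python sketch uses the standard library's `quopri` module. The function name `decode_attachment` is hypothetical, and the sketch assumes the saved file holds only the quoted-printable body, with no mail headers.

```python
import quopri

def decode_attachment(data: bytes) -> str:
    """Decode quoted-printable bytes into plain text."""
    # quopri rejoins lines split by a trailing "=" (soft line breaks)
    # and turns escapes such as "=3D" back into the original characters.
    decoded = quopri.decodestring(data)
    # The alignment files are plain text, so ASCII decoding is assumed.
    return decoded.decode("ascii")
```

To use it, read the saved attachment in binary mode, pass the bytes to this function, and write the returned text to a new file for the tools to read.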