Input Files for Gmaj

TABLE OF CONTENTS

Introduction

This page describes the input files supported by Gmaj, and their formats. Only the alignment file is required; the others are optional. Except where noted, all information applies to both the stand-alone and applet modes of Gmaj.

For annotations, Gmaj supports two broad categories of file formats. The original set of formats is essentially the same as those used by PipMaker and Laj, where each destination for the data (exons panel, color underlays, etc.) has its own file format tailored for the needs of that display. These files can be cumbersome to prepare manually, though PipMaker's associated utilities, such as PipHelper and the PipTools, can significantly reduce the burden.

However, since sequence annotations are increasingly becoming available in standardized formats from on-line resources such as the UCSC Table Browser, Gmaj can now accept some of these formats as well. These are referred to here as "generic" formats because they are not restricted to a particular biological data type or Gmaj display panel.

The PipMaker-style formats are described below in the sections for each panel, while the generic ones are discussed in a separate section, Generic Annotation Formats.

  • All files must consist solely of plain text ASCII characters.  (For example, no Word documents.)

  • All coordinates for PipMaker-style annotations are 1-based, closed interval.  Those for generic annotations may be either 1-based or 0-based and closed or half-open, depending on the format.

Parameters File

The annotation files are optional, but because in some alignments any of the sequences can be viewed as the reference sequence, there are potentially a large number of annotation files to provide, too many to type their names on the command line or paste them into a dialog box every time you want to view the data. For this reason, Gmaj uses a meta-level parameters file that lists the names of all the data files, plus a few other data-related options. Then when running Gmaj, you only have to specify that one file name. However, if you don't want to use any of these annotations or options, you can specify a single alignment file directly in place of a parameters file.

A sample parameters file that you can use as a template is provided at sample.gmaj. It contains detailed comments at the bottom explaining the syntax and meaning of the parameters.

Compression and Bundling

Gmaj supports a "bundle" option, which allows you to collect and compress some or all of the data files into a single file in .zip or .jar format (not .tar, sorry). This is especially useful for streamlining the applet's data download, but is also supported in stand-alone mode. A few tips:

As an alternative to bundling, data files can be compressed individually in .zip, .jar, or .gz format; this gains the compact size for storage and transfer, but still requires overhead for multiple HTTP connections in applet mode. The file name must end with the corresponding extension for the compression format to be recognized. (Such files can also be included in the bundle if desired; though little if any additional compression is typically achieved, this may be more convenient than unzipping a large file just to bundle it.)

Coordinate Systems

If you supply any annotations for Gmaj to display, these files must all use position coordinates that refer to the same original sequences identified in the MAF alignment files (ignoring any display offsets specified in the parameters file). However, even though the MAF coordinates are 0-based, the PipMaker-style annotation files all use a 1-based, closed-interval coordinate system (i.e., the first nucleotide in the sequence is called "1", and specified ranges include both endpoints). This is for consistency with PipMaker, so the same files can be used with both programs, and the same tools can be used to prepare them. Coordinates for generic annotations may be either 1-based or 0-based and closed or half-open, depending on the format, but Gmaj always adjusts them as needed (including the ones in the MAF files) to convert everything to a 1-based, closed-interval system for display.

Alignments

Gmaj is designed to display multiple-sequence alignments in MAF format. It is especially suited for sequence-symmetric alignments from programs such as TBA, but can also display MAF files that have a fixed reference sequence. (In the latter case it is a good idea to set the refseq field in your parameters file, to prevent displaying the alignments with an inappropriate reference sequence.) It is possible to display several alignment files simultaneously on the same plots, e.g. for comparing output from different alignment programs.

Gmaj normally requires that each sequence name appears at most once in each MAF block, i.e., that the values of the "src" field are unique across all of the s lines within the same block. However, there is a special exception for the case of pairwise self-alignments: if all of the blocks have just two rows, then all of the sequence names can be the same. In this case Gmaj distinguishes the rows in each block by internally adding a ~ suffix to the second row's sequence name; the ~ does not show in the main display, but you may occasionally see it in an error message.

The downside of this feature is that sequence names in the MAF files must not end with ~, even for non-self alignments.

Exons

Each of these files lists the locations of genes, exons, and coding regions in a particular reference sequence. The exons and UTRs are displayed as black and gray boxes in a separate panel above the alignment plots.

In the PipMaker-style exons format, the directionality of a gene (>, <, or |), its start and end positions, and name should be on one line, followed by an optional line beginning with a + character that indicates the first and last nucleotides of the translated region (including the initiation codon, Met, and the stop codon). These are followed by lines specifying the start and end positions of each exon, which must be listed in order of increasing address even if the gene is on the reverse strand (<). By default Gmaj will supply exon numbers, but you can override this by specifying your own name or number for individual exons. Blank lines are ignored, and you can put an optional title line at the top. Thus, the file might begin as follows:

     My favorite genomic region

     < 100 800 XYZZY
     + 150 750
     100 200
     600 800

     > 1000 2000 Frobozz gene
     1000 1200 exon 1
     1400 1500 alt. spliced exon
     1800 2000 exon 2

     ... etc.

Repeats

Each of these files lists interspersed repeats (and possibly other features such as CpG islands) in a particular reference sequence. These are displayed in a separate panel just below the exons, using the same shapes and shading as PipMaker if possible.

In the PipMaker-style repeats format, the first line identifies this as a simplified repeats file (as opposed to RepeatMasker output, which Gmaj does not yet support). Each subsequent line specifies the start, end, direction, and type of an individual feature.

     %:repeats

     1081 1364 Right Alu
     1365 1405 Simple
     ... etc.
The allowed PipMaker types are: Alu, B1, B2, SINE, LINE1, LINE2, MIR, LTR, DNA, RNA, Simple, CpG60, CpG75, and Other. Of these, all except Simple, CpG60, and CpG75 require a direction (Right or Left).

Linkbars

Each of these files contains reference annotations, i.e., noteworthy regions in a particular reference sequence, which are drawn in a separate panel as colored bars. Typically each bar has an associated URL pointing to a web site with more information about the region, but this is not required. In applet mode Gmaj opens a new browser window to visit the linked site when the user clicks on a bar; in stand-alone mode Gmaj is not running within a web browser, so it just displays the URL for the user to visit manually via copy-and-paste.

The PipMaker-style format first defines various types of links and associates a color with each of them, then specifies the type, position, description, and URL for each annotated region.

     # linkbars for part of the mouse MHC class II region

     %define type
     %name PubMed
     %color Blue

     %define type
     %name LocusLink
     %color Orange

     %define annotation
     %type PubMed
     %range 1 2000
     %label Yang et al. 1997.  Daxx, a novel Fas-binding protein...
     %summary Yang, X., Khosravi-Far, R. Chang, H., and Baltimore, D. (1997).
       Daxx, a novel Fas-binding protein that activates JNK and apoptosis.
       Cell 89(7):1067-76.
     %url http://www.ncbi.nlm.nih.gov:80/entrez/
     query.fcgi?cmd=Retrieve&db=PubMed&list_uids=9215629&dopt=Abstract

     ... etc.
Here, for example, the first stanza requests that each feature subsequently identified as a PubMed entry be colored blue. The name must be a single word, perhaps containing underline characters (e.g., Entry_in_GenBank), and the color must come from Gmaj's color list.

The third stanza associates a PubMed link with positions 1-2000 in this sequence. The label should be kept fairly short, as it will be displayed on Gmaj's position indicator line when the user points at this linkbar. The summary is optional; it is used only by PipMaker and will be ignored by Gmaj. Also, while PipMaker allows several summary/URL pairs within a single annotation, Gmaj expects each field to occur at most once. If Gmaj encounters extra URLs, it will just use the first one and display a warning message.

Note that summaries and URLs (but not labels) can be broken into several lines for convenience; the line breaks are removed when the file is read, but they are not replaced with spaces. Thus a continuation line for a summary typically begins with a space to separate it from the last word of the previous line, while a URL continuation does not.

Also note that stanzas should be separated by blank lines, and lines beginning with a # character are comments that will be ignored. The linkbars can appear in the file in any order, and several can overlap at the same position with no problem, since Gmaj will display them in multiple rows if necessary. In PipMaker this format is called "annotations with hyperlinks".

Underlays

Each of these files specifies underlays (colored bands) to be painted on a particular pairwise pip and its corresponding dotplot. The bands are specified as regions in the reference sequence and are normally drawn vertically; however for a dotplot, Gmaj will also look to see if you have specified an underlay file for the transposed situation where the reference and secondary sequences are swapped, and if so, will draw those underlays as horizontal bands in the secondary sequence.

The PipMaker-style underlay format supported by Gmaj looks like this:

     # partial underlays for the BTK region

     LightYellow Gene
     Green Exon
     Red Strongly_conserved

     35324 72009 (BTK gene) Gene
     49781 49849 (exon 4) Exon
     51403 51484 Exon
     50350 50513 (conserved 84%) Strongly_conserved 84
     52376 52603 (Kilroy was here) Strongly_conserved 92 +
     ... etc.
The first group of lines describes the intended meaning of the colors, while the second group specifies the location of each band. Colors must come from Gmaj's color list, but the meaning of each color can be any single word chosen by you. The text in parentheses is an optional label which will be displayed on Gmaj's position indicator line when the user points the mouse at that band. The parentheses must be present if the label is, and the label itself cannot contain any additional parentheses. The number following the color category is an optional integer score that can be used to interactively adjust which underlays are displayed; see "Underlays Box" in the Menus and Widgets section of Starting and Running Gmaj for more information. (The label and score are extra features not supported by PipMaker.) A + or - character at the end of a location line will paint just the upper or lower half of the band on the pip (but is ignored for dotplots). This allows you to differentiate between the two strands, or to plot potentially overlapping features like gene predictions and database matches.

Note that if two bands overlap, the one that was specified last in the file appears "on top" and obscures the earlier one (except for the special Hatch color). Thus in this example, the green exons and red strongly conserved regions cover up parts of the long yellow band representing the gene. As in the links file, lines beginning with a # character are comments that will be ignored.

Highlights

Highlight files are analogous to the underlay files, but each of these specifies colored regions for a particular sequence in the text view, rather than for a plot. If you do not specify a highlight file for a particular sequence, Gmaj will automatically provide default highlights based on the exons file (if you provided one). These will use one color for whole genes, overlaid with different colors to indicate exons on the forward vs. reverse strand. If the exons file specifies a gene's translated region, then the 5´ and 3´ UTRs will be shaded using lighter colors. These default highlights make it easy to examine the putative start/stop codons and splice junctions, as well as providing a visual connection between the graphical and text views. But if for some reason you do not want any text highlights, you can suppress them by specifying an empty highlight file.

The PipMaker-style format for highlights is the same as for underlays, except that any + or - indicators will be ignored, and the Hatch color is not supported for highlights. Just as with underlays, labels can be included which will be shown when the user points at the highlight, scores can be used to limit which entries are displayed, and highlights that are listed later in the file will cover up those that appear earlier.

Color List

For Gmaj's PipMaker-style annotations, the available colors are:

    Black   White        Clear
    Gray    LightGray    DarkGray
    Red     LightRed     DarkRed
    Green   LightGreen   DarkGreen
    Blue    LightBlue    DarkBlue
    Yellow  LightYellow  DarkYellow
    Pink    LightPink    DarkPink
    Cyan    LightCyan    DarkCyan
    Purple  LightPurple  DarkPurple
    Orange  LightOrange  DarkOrange
    Brown   LightBrown   DarkBrown
These names are case-sensitive (i.e., capitalization matters). Not all of these are supported by PipMaker. Also, be aware that the appearance of the colors may vary between PipMaker and Gmaj, and from one printer or monitor to the next.

Hatch

In addition to the regular colors listed above, Gmaj supports a special "color" for underlays called Hatch, which is drawn as a pattern of diagonal gray lines. Normally if two underlays overlap, the one that was specified last in the file appears "on top" and obscures the earlier one. However, Hatch underlays have the special property that they are always drawn after the other colors, and since the space between the diagonal lines is transparent, they allow the other colors to show through. Currently Hatch is only supported for underlays, not for highlights or linkbars.

Generic Annotation Formats

The standardized generic formats currently supported by Gmaj include GFF (v1 & v2), GTF, and various flavors of BED (including the full BED12 format, a.k.a. "gene BED"). For details on these formats, please see the specifications at the above links; this document will mainly discuss their use by Gmaj.

These formats are all tab-separated, and despite their differences are similar enough that Gmaj can extract comparable fields and treat them more or less the same. Note that Gmaj is not intended as a format validator: parsing is more lenient in some respects than the official format specifications, and Gmaj will ignore fields it has no use for. Also, interpretation of these open-ended formats depends partly on what type of annotation is expected; e.g. if Gmaj is trying to read exons from a GFF v1 file, it will assume that the group field is the gene name. It will generally show warning messages to keep the user apprised of any such assumptions it is making (if these become too annoying they can be individually suppressed in the parameters file; see sample.gmaj for details). Because one of the main reasons for supporting these formats is to enable the use of annotation files obtained from public sources, Gmaj tries not to balk at anomalies that are probably not the user's fault, and when practical will simply skip questionable items with a warning message. Each type of message will generally be displayed only once, and not repeated for every item with the same problem.

Filename Extensions

In order to distinguish generic files from PipMaker-style ones and handle them appropriately, Gmaj requires that files in generic formats have names ending with any of certain extensions. The default list is .gff, .gtf, .bed, .ct, and .trk, but this can be customized (see sample.gmaj).

Quoting

Some of the generic formats require text values to be enclosed in double quotes (" "). Even when not strictly required it is usually a good idea to do so, especially if the value contains spaces. The official specifications generally don't say what to do if a value contains embedded quote characters, but Gmaj supports a rudimentary mechanism for escaping them with a backslash (\). However it does not provide for escaping the backslash: quoted values should not end with \ (insert a space before the final quote if necessary).

Empty Fields

When reading the generic formats, Gmaj treats two adjacent tab characters as an empty field. However, your files will be easier for humans to read if you do not leave fields completely empty. Gmaj recognizes a value of . (the dot character) to mean "unspecified" for fields such as strand, score, feature, and color, in some cases even when the official formats don't. For instance, GFF v2 explicitly calls for using . when there is no score, but Gmaj allows you to do this with the other generic formats as well, in order to distinguish between "no score" and a score that is truly zero. For colors, in addition to . Gmaj also interprets 0 to mean "unspecified", in keeping with examples at UCSC.

Coordinates

The GFF and GTF formats use 1-based, closed-interval coordinates (i.e., sequence numbering starts with "1", and specified ranges include both endpoints), while BED uses a 0-based, half-open system (the first nucleotide of the sequence is numbered "0", and the ending position is not included in the region). For all of these formats, positions are given relative to the beginning of the named sequence regardless of which strand the feature is on (unlike MAF), and start must be less than or equal to end.

GFF Conventions

BED format is relatively fixed in how its fields are used, but GFF and GTF are more variable and require additional conventions for most effective use with Gmaj. In particular, the values of the "feature" field and the optional "attributes" affect how Gmaj will interpret and display an item.

Values of the feature field that are recognized for special treatment include:

Of these, only the PipMaker types are case-sensitive.

For GFF v2 and GTF, the currently recognized attribute tags are:

These keywords are not case-sensitive, but they cannot have multiple values.

Custom Tracks

Along with the basic formats listed above, Gmaj also supports UCSC custom track headers. Track lines can specify certain settings for an entire track; currently color, itemRgb, offset, and url are supported. They also allow several tracks (even in mixed formats) to be combined in a single file. Gmaj does not currently provide a way to use just one particular track from such a file (it will be treated as one big bag of annotations), but lines in unsupported formats such as WIG are gracefully skipped. Browser lines are also skipped; Gmaj's initial zoom position is controlled by command-line or applet parameters rather than by individual annotation files.

Multiple Sequences

Generic files can also contain annotations for several sequences, because unlike the PipMaker-style formats, they all have a "seqname" or "chrom" field that Gmaj can use to select the appropriate lines. Ideally Gmaj expects this field to match the sequence name from the alignment files, but has two ways to deal with exceptions. If there is only one seqname in the annotation file, then Gmaj will go ahead and use it, but will display a warning (unless the mismatch can be fixed by prepending the organism name, or the organism name plus chr, to the annotation seqname). But if the file has annotations for several sequences and some don't match the alignment files, you need to tell Gmaj which is which by adding an alias in the parameters file (see sample.gmaj).

Reusing Files

One of the advantages of using generic formats is that files can be reused in multiple panels without reformatting, e.g. as both exons and underlays. Normally linkbars, underlays, and text highlights are simply handled as arbitrary regions of a specified color, since they could represent any type of biological feature. However, you can ask Gmaj to interpret them as exons or repeats by adding a type hint in the parameters file (see sample.gmaj). Note that currently this will also cause any specified colors in that file to be overridden with Gmaj's defaults.

Combining several biological types of annotations (e.g. exons and repeats) in one file is possible, but not recommended. Gmaj will try to skip lines that are not appropriate for the type it is seeking, but it may draw more than you want.

Coding Sequence

Currently Gmaj has no special support for multiple transcripts. When inferring UTRs, all of the CDS-related items for a single gene name are combined, and the interval from the lowest coordinate to the highest is used as the CDS. Also, some of the formats' rules specify whether or not the initiation and stop codons should be included in the CDS, but Gmaj does not make adjustments to compensate for that; instead it simply includes all of the given endpoints in the CDS.

Colors

Colors can be specified for individual annotation lines via the itemRgb field (for BED) or a color attribute (for GFF v2 or GTF). However, for custom tracks, these are governed by the track line's itemRgb attribute, which defaults to off per the UCSC specification. Thus if you have track lines and want to use the per-item colors, you need to include itemRgb=On in the track attributes.

Track lines can also have a color attribute for the entire track, which will be used if itemRgb is off, or if an individual item does not have its own color. However in a rare break from the UCSC specification, Gmaj does not use black as the default if the track color is unspecified (black underlays and highlights just don't work with black plots and text). Instead it uses its own default colors, which for genes/exons are the same as the colors for default highlights, or light gray for other annotations. Note that these defaults will also override your colors when type hints are used.

All of the above-mentioned color values are specified in UCSC format, which consists of three comma-separated RGB values from 0-255 (e.g. 0,0,255).

Sorting

The order of the lines is not supposed to matter in these generic formats, but for most of the Gmaj panels it does matter: exons need to be grouped by gene and ordered by position so UTRs can be inferred and exon numbers assigned, early underlays are covered up by later ones, etc. Gmaj solves this problem by sorting the data before it is displayed. Exons are sorted first by gene name in ascending order, and then within each gene by start position (ascending) and lastly in case of a tie, by end position (descending). All other annotation types are sorted first by length in descending order, and then in case of a tie by start position (ascending). This usually produces a reasonable display, but if you need direct control of the order, you can use the PipMaker-style formats instead.


Cathy Riemer, June 2008