Information About blEST


blest is a version of the sim4 program that is specially tailored for finding near-identity matches between a genomic sequence and a database of ESTs or other expressed sequences. It is often used in conjunction with two other programs: mb (megablast), which speeds up execution by using a much faster algorithm to weed out unsuitable database entries in advance, and summarize, which organizes the matches into putative exons and introns so that they are easier to read.

These three programs are supplied as compiled executables for Linux and Solaris (sorry, source code is not available at this time). They are provided "as is", with no warranty of any kind.

Linux 2.2 (glibc) / Intel x86:   blest  mb  summarize
Solaris 7 / Sparc Ultra:   blest  mb  summarize


Example:

Suppose you have a file (called, say, genseq) that contains a genomic sequence in FASTA format, and another file (e.g., estlib) containing a FASTA library of many expressed sequences, some of which correspond to the genomic sequence. (The TIGR Gene Indices make good libraries for this purpose, since most of the EST redundancy has been removed.)

  1. If estlib is large and you want faster execution, you can use mb to extract candidate sequences relatively quickly. To be on the safe side, mb generally keeps more entries than necessary, but still far fewer than the original file contains. You will have to prepare your input files first, though, because mb expects the file of expressed sequences to be formatted as a compressed database by NCBI's formatdb program (which comes with Standalone BLAST). The command
         formatdb -i estlib -p F
    
    will produce files called estlib.nhr, estlib.nin, and estlib.nsq, which will be used by mb when you refer to "estlib". Also, the genomic sequence should have its interspersed repeats masked with 'N's to avoid spurious matches (you can use RepeatMasker to accomplish this, or genbank2repeats and mask-seq from our PipTools package). If the masked version is called genseq.masked, then the command
         mb -e -i genseq.masked -d estlib > mb.out
    
    will store the selected sequences from estlib in a new FASTA library file called mb.out.

  2. Run blest:
         blest genseq mb.out > blest.out
    
    or, if you skipped step 1, use estlib in place of mb.out.

  3. The blest.out file can be hard to read, especially if estlib had a lot of redundancy. To obtain a more human-friendly version of the output, run
         summarize blest.out > summarize.out
    
    The first part of summarize.out will look a lot like blest.out, but at the end you will find a nice summary of the putative exons and introns in sorted order, along with a list of apparent inconsistencies among these conclusions.

  4. Another possibility is to use the blest2exons program in the PipTools package to convert blest.out into exon annotations for PipMaker.
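
Steps 1 through 3 can be chained in a small shell script. The following is only a sketch, not something supplied with the programs: it assumes that formatdb, mb, blest, and summarize are all on your PATH and that the repeat-masked copy of the genomic sequence has already been made. The names run, blest_pipeline, and DRY_RUN are hypothetical helpers introduced for illustration; the commands themselves are the ones shown in the steps above.

```shell
#!/bin/sh
# Sketch of steps 1-3 as one pipeline (helper names are illustrative only).

# run DEST CMD [ARGS...]
# Runs CMD, sending its standard output to DEST if DEST is non-empty.
# With DRY_RUN=1 the command line is printed instead of being executed.
run() {
    dest=$1
    shift
    if [ "${DRY_RUN:-0}" = 1 ]; then
        if [ -n "$dest" ]; then
            echo "$* > $dest"
        else
            echo "$*"
        fi
    elif [ -n "$dest" ]; then
        "$@" > "$dest"
    else
        "$@"
    fi
}

# blest_pipeline GENSEQ MASKED ESTLIB
#   GENSEQ  genomic sequence in FASTA format
#   MASKED  the same sequence with interspersed repeats masked
#   ESTLIB  FASTA library of expressed sequences
# Each step runs only if the previous one succeeded.
blest_pipeline() {
    genseq=$1
    masked=$2
    estlib=$3
    run "" formatdb -i "$estlib" -p F &&
    run mb.out mb -e -i "$masked" -d "$estlib" &&
    run blest.out blest "$genseq" mb.out &&
    run summarize.out summarize blest.out
}
```

After loading these definitions into your shell, "blest_pipeline genseq genseq.masked estlib" runs the whole sequence and leaves the final result in summarize.out. Setting DRY_RUN=1 beforehand prints the four commands without executing them, which is a cheap way to check the file names before starting a long run.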

You can find out about additional command-line options for these programs by running them without any arguments. For further discussion about sim4 and blest, please see Florea et al. 1998.

These programs are copyright (C) 1998-2000 by Liliana Florea, Zheng Zhang, Scott Schwartz, and Webb Miller.



Cathy Riemer, May 2001