/*======================================================================================*/ Documentation for tffind release 1.1: Basic program flow and overview See source code for more in-depth documentation Matt Weirauch (weirauch@soe.ucsc.edu) Nov 3, 2003 Tffind takes three formats of MultiPipMaker files as input. The first format (heretofore referred to as 'type m' format) is MultiPipMaker format, which looks like: 3 77 HumanD MouseD RatD TTAATAGATAAGTGAGATGAGAGTGATAAAGAGAGAA----GGGGGGA-AA------GACTTCCGGAAGTGACGTCA AAAAAAGATAAGTGAGGGAGAGATGATAA-AGAGAAAGAAAAG-TAGGATGATAACAC--TTCCGGAAGTGACGTCA AAAAAAGATAAGTGAGGGAGAGATTTTTTGAGAGAGAGAAAAG-TAGGATGATAACAC--TTCCGGAAGTGACGTCA The second format (referred to as 'type b') is blockbuilder format, and looks like this: As of now, this format has become obsolete and is no longer included in the documentation. However, its functionality still remains in the program in the event that it is resurrected some day. #:bb verbose blockbuilder "(human baboon)(mouse rat)" hum.bab.bltz mus.rat.bltz hum.mus.bltz 4 4 human ">dna 240000,420000 human1" 180001 baboon ">dna 60001,280000 baboon.old" 220000 mouse ">mouse" 201052 rat ">rat" 179567 4 20 0 + 1-20 T--ATGATAA--AGATAA-- 1 + 1-20 ATCCAGATAACGATGCGATG 2 + 1-20 A--ATTATCT--ATTATCTG 3 + 1-20 ACCCAGATAACGATGCGATG 2 20 0 + 21-40 A-A----GATAAGGAGAT-A 1 + 21-40 AAAAAAAGATAAGGAATTAT 2 10 2 + 21-30 CTAGATAAAC 3 + 21-30 CCAGATAACC 3 50 0 + 41-90 GATAA-AAAAAAA-GATA-----AAAAACGAGAATTTAGATAAAGTGATG 1 + 41-90 CTGATAGAAAAAAAGATAGAAAATTATCTAAGAATTTAGATAAAGTGATG 2 + 31-80 AGATAGAAAAAAAAGATAGAAAATTATCTAAGAATTTAGATAAAGTGATG 1 10 3 + 31-40 AAAAAAAAAA The third format is standard FASTA format (type 'f'), which looks like: >sequence header ACGTAGGCTGCGCGCTCGATCGCTAGCATCAGCATCAGCAGCGGATCAGCTACGACTATG GCATCAGCGGCGATCAGCGATCTATCTAGCGATCAGCGACATCAGCGACTACAGCGATCC Make sure that any b-type file used is 'verbose' (it actually shows the alignment.) The program utilizes two major linked lists of structures, the MATCH and ITEM linked lists. An ITEM constitutes either a motif (a transcription factor binding site, represented as a position-weight matrix, or PWM) or a spacer (in the format <2> or <2,5>, meaning "look for the next motif exactly 2 bps downstream" and "look for the next motif between 2 and 5 bps downstream," respectively.) The ITEM linked list is iteratively built by the user in the process_main_menu_choices function, or via the command line, in the process_cmd_line_options function. A MATCH constitutes a match to the pattern entered by the user. Each MATCH contains a score, length (in columns), positions (in nucleotides), which seqs match, etc. For information on blocks, segments, and other constructs used for 'b-type' files, see the documentation in bb_util.h and bb_util.c. Basic program flow (function names in CAPS, with their main purpose listed after the '-'): MAIN initialize vars, etc PROCESS_MAIN_MENU_CHOICES or PROCESS_CMD_LINE_OPTIONS -iteratively build pattern and other settings as prompted by user for each strand SEARCH_FOR_PATTERN -search for the pattern in the cur dir, on the proper seqs,in the proper range RECORD_PATTERN_MATCHES -record all matches to the entire pattern at the given seq and pos while (GET_NEXT_WINDOW -get next spacer window to look in for next motif) RECORD_MATCHES_IN_WINDOW -look in the given window for all matches to the current motif if we found at least one match to the entire pattern for each match GET_ENDPOINTS -given the column in the alignment, get the hit's start and end positions in relation to the proper seq ADD_TO_MATCHES -add the given match to the linked list of matches OUTPUT_MATCHES -step through MATCH list, outputting all matches that meet given requirements if it meets the requirements OUTPUT_MATCH -output one match in the proper format All important functions are listed alphabetically by name. Any function not listen here should be fairly self-explanatory from reading the code. Each function has a basic description, as well as a list of any problem areas. add_motif_to_list: Called whenever a new motif to be added to the pattern is entered by the user. Fills the given information into a new ITEM structure and adds it to the ITEM linked list. add_to_matches: Updates info in proper MATCH structure if it is a duplicate match. Otherwise, creates a new MATCH structure, fills in the relevant information, and calls add_to_match_list to place it in the proper place in the linked list. add_to_match_list: Called by add_to_matches. Physically adds the new MATCH structure to the proper place in the MATCH linked list. adjust_from_and_to: Because of gaps in alignments, there is a difference between column in an alignment and position in a sequence. When a user enters a range to search in, they are specifying where to search by position. This function adjusts this range to be relative to the column position in type 'm' files. This is not neccessary for type 'b' files, as their positions are given at the beginning of each block. adjust_vals / convert_to_log_odds: Adjust the values in the scoring matrix to be in log-odds form. The basic formula is taken from (Hertz GZ, Hartzell GW, and Stormo GD Comput. Appl. Biosci. 6:81-92 (1990).) get_endpoints: Given the column in the alignment (or block of the alignment), returns the start and end positions of the match in relation to the relevant sequence. By default, the relevant sequence is the sequence that the match occurs on. When the 'report_to_top' option is turned on in the main menu by the user, the relevant sequence is the top-most sequence in the alignment (sequence 0). Get_endpoints calls one of 4 functions, depending on the current settings: get_ends_b_rev: 'b-type' file, reverse strand get_ends_b_for: 'b-type' file, forward strand get_ends_m_rev: 'm-type' or 'f-type' file, reverse strand get_ends_m_for: 'm-type' or 'f-type' file, forward strand These functions proved to be among the more challenging parts of the program, as many issues must be considered, among them: Converting from column in alignment (or block of the alignment) to position: This is trivial for 'f-type' files, as there are no gaps in these. Essentially, for 'm-type' files, this means doing a 'col2pos' lookup, which means decrementing start and end by 1 for every gap that precedes the beginning of the match in the relevant sequence. For 'b-type' files, start and end are manually decremented for every gap that occurs in the block before the match in the relevant sequence. Checking if start or end align to nothing in the top seq (report_to_top only): A '-1' is reported for the start or end endpoint when the 'report_to_top' option is turned on and the match does not align to the top-most sequence (note that this can only happen in 'b type' files, namely in blocks that do not contain sequence 0.) Incrementing the start or end if either aligns to a gap (report_to_top only): This can best be shown by a simple example. Consider searching for the GATA-1 binding site (consensus of AGATAA) in the following alignment: Pos in Seq0: 12.......3456789 Sequence 0: AT-------TATTATC Sequence 1: CGG--AGATAA--CGC In this case, the start of the match aligns to a gap in sequence 0. If the start position is not decremented, when the user tries to slice the alignment at the given positions (3 and 4), he will end up with: Sequence 0: TA Sequence 1: AA By decrementing the start position, the user will slice the alignment at positions 2 and 4, getting: Sequence 0: T-------TA Sequence 1: GG--AGATAA The same holds true for the end position as well. Decrementing the end for every gap in the sequence: For every gap that occurs in the relevant sequence, the end position must be decremented. Consider searching for GATA-1 <0,20> GATA-1 in the following sequence: Position: 123456..789012 Sequence: AGATAA--AGATAA As is readily apparent, for each gap that occurs inside the match, the end position must be decremented (in this case, there are two gaps, so the end position will be 14 - 2 = 12.) Making sure that start and end don't go beyond their boundaries in top seq: After all of the preceding things have been taken into consideration, it is possible that the start or end has been incremented or decremented straight off of the alignment. An example (assuming a GATA-1 search with the report_to_top option turned on): Pos in Seq0: 12345678.... Sequence 0: ATTATCGC---- Sequence 1: AT--AGATAA-C Without the boundary check, tffind would report the positions of the match as 5 to 9 after the end has been incremented due to the fact that it aligns to a gap. However, sequence 0's positions only go up to 8! To fix this problem, check_boundaries is called. After it is called, the preceding hit would be listed as 5 to 8. get_next_window: Gets the next window to look for the next motif in. For example, the pattern {GATA-1}<300,500>{erg(c)}<200,600>{GATA-1(c)} has three windows. The first window is implied as merely <0> (you start looking for the first motif at the current position in the alignment.) main: Checks command line parameters Determines type of alignment file Opens alignment file Reads in alignment seq (type m), seq (type f), or blocks (type b) Calls process_main_menu_choices or process_cmd_line_options Prints header For each strand Searches for pattern Outputs matches Prints trailer Performs clean up motif_score: Given the current position and sequence and the current motif in the pattern, this function returns its score (obtained from the current item's scoring matrix.) NEG_INF (negative infinity) is returned if the motif runs off the edge of the alignment or sequence. move_to_adjacent_block: When called, this function moves to the adjacent block for the given sequence in the alignment. For the forward strand, this is the next block, whereas this is the previous block for the reverse strand. This function also updates ncol (the number of columns in the block) and which_row (the row of the block that the current sequence, as determined by which_seq, is located in.) NULL is returned if we have run off the end or beginning of the alignment. output_match: Output one match in the proper format if the match does not occur outside the search range or inside a filtered-out exonic region. Returns 1 if the match meets these criteria, and 0 if it does not. Two formats are supported: Tffind format (default) and GFF format (documented at the Sanger Institute website: http://www.sanger.ac.uk/Software/formats/GFF/). output_matches: Step through the match list and output all matches with at least the minimum number of sequence matches. For 'b' type files, move b_ptr to the proper block in the alignment for use by output_match. process_cmd_line_options: process_main_menu_choices: The main user interfaces for tffind, these functions iteratively build the user-entered ITEM linked list that constitutes the pattern of t.f.'s. One function is called in batch mode, the other in interactive mode. Upon returning from this function, the pattern and all associated vars (such as the minimum number of matches required, the range to search in, etc.) are set. A 'dummy' spacer value of <0,0> is placed at the beginning of the linked list to simplify the search later. record_match: Record position and percentage of one match. This function makes sure each match is recorded in the proper row and column in the arrays (making new rows and columns as necessary.) If a match is new (i.e. it is not merely a duplicate match from a different sequence at the same place.), which_match is incremented to keep it up to date. Match_arr and bp_arr look like this: |-------which_motif-------| 0 1 2 3 4 ... _ | 0 | 1 | 2 which_match 3 | 4 | . | . - . Which_motif is the number of the current motif in the pattern that we are searching for. Which_match is the number of the current match that we are working on. record_matches_in_window: Look in the given window for the current motif, recording all matches in match_arr and bp_arr. This was another tricky section, as the initialization of cp and bp was different for several different cases involving b_ptr, cur_dir, and which_motif. Other than that, this function is fairly straightforward. assign cp and bp for each potential motif in the window get score if (score > minimum required score) record match in match_arr and bp_arr record_pattern_matches: Record all matches to the pattern at the given position in the given sequence. while we have found at least one match for the current motif and there are still motifs left in the pattern for each of the potential matches we have found so far assign offset (how far into the sequence or block we should begin looking for the first motif in the pattern) record_matches_in_window() if we found at least one match record the matches in the MATCH list remove_trailing_spacers: This function removes all trailing spacers from the user-built pattern string. Trailing spacers are all spacers that occur after the final motif. For example, the pattern {GATA-1}<0,20>{GATA-1(c)}<40><0,200> has two trailing spacers. same_match: Returns 1 if the two matches occur at the same place in the alignment on two different sequences. Used by function add_to_matches to determine whether or not a new MATCH structure should be created for the current match. search_for_pattern: Search for the pattern in the current dir, in the given range, on the proper sequences. For type 'm' and 'f' alignment files, this function goes through the proper range of each sequence in the alignment to be searched in, calling record_pattern_matches. For type 'b' alignment files, it does this for each block in the alignment that has a at least as many rows in it as the minimum required matches. too_far_in_match_list: Returns 1 if we have passed up the spot where the current match belongs in the MATCH linked list. The first criterion is to sort by the top-most sequence that the match occurs on. This way, all matches to sequence 0 are at the top of the output, followed by sequence 1, etc. The second criterion is to sort by start position. This way, all matches are sorted increasingly by position within each sequence. valid_match: A match is valid if it has a hit in all of the motifs of the pattern. The match array is initialized to all -1's, so that if a hit is not found at any position, its value will remain a -1. Accordingly, valid_match returns a 1 if there are no -1's in the given match array. /*======================================================================================*/