/*======================================================================================*/

Documentation for tffind release 1.1: Basic program flow and overview
See source code for more in-depth documentation
Matt Weirauch (weirauch@soe.ucsc.edu)
Nov 3, 2003


Tffind takes three formats of MultiPipMaker files as input.  The first format (heretofore
referred to as 'type m' format) is MultiPipMaker format, which looks like:


	3 77
	HumanD
	MouseD
	RatD
	TTAATAGATAAGTGAGATGAGAGTGATAAAGAGAGAA----GGGGGGA-AA------GACTTCCGGAAGTGACGTCA
	AAAAAAGATAAGTGAGGGAGAGATGATAA-AGAGAAAGAAAAG-TAGGATGATAACAC--TTCCGGAAGTGACGTCA
	AAAAAAGATAAGTGAGGGAGAGATTTTTTGAGAGAGAGAAAAG-TAGGATGATAACAC--TTCCGGAAGTGACGTCA


The second format (referred to as 'type b') is blockbuilder format, and looks like this:
As of now, this format has become obsolete and is no longer included in the documentation.
However, its functionality still remains in the program in the event that it is 
resurrected some day.


	#:bb verbose
	blockbuilder "(human baboon)(mouse rat)" hum.bab.bltz mus.rat.bltz hum.mus.bltz

	4 4
	human ">dna 240000,420000 human1" 180001
	baboon ">dna 60001,280000 baboon.old" 220000
	mouse ">mouse" 201052
	rat ">rat" 179567

	4 20
	0 + 1-20 T--ATGATAA--AGATAA--
	1 + 1-20 ATCCAGATAACGATGCGATG
	2 + 1-20 A--ATTATCT--ATTATCTG
	3 + 1-20 ACCCAGATAACGATGCGATG

	2 20
	0 + 21-40 A-A----GATAAGGAGAT-A
	1 + 21-40 AAAAAAAGATAAGGAATTAT

	2 10
	2 + 21-30 CTAGATAAAC
	3 + 21-30 CCAGATAACC

	3 50
	0 + 41-90 GATAA-AAAAAAA-GATA-----AAAAACGAGAATTTAGATAAAGTGATG
	1 + 41-90 CTGATAGAAAAAAAGATAGAAAATTATCTAAGAATTTAGATAAAGTGATG
	2 + 31-80 AGATAGAAAAAAAAGATAGAAAATTATCTAAGAATTTAGATAAAGTGATG

	1 10
	3 + 31-40 AAAAAAAAAA


The third format is standard FASTA format (type 'f'), which looks like:


	>sequence header
	ACGTAGGCTGCGCGCTCGATCGCTAGCATCAGCATCAGCAGCGGATCAGCTACGACTATG
	GCATCAGCGGCGATCAGCGATCTATCTAGCGATCAGCGACATCAGCGACTACAGCGATCC


Make sure that any b-type file used is 'verbose' (it actually shows the alignment.)
The program utilizes two major linked lists of structures, the MATCH and ITEM linked
    lists.
An ITEM constitutes either a motif (a transcription factor binding site, represented as
    a position-weight matrix, or PWM) or a spacer (in the format <2> or <2,5>, meaning
    "look for the next motif exactly 2 bps downstream" and "look for the next motif
    between 2 and 5 bps downstream," respectively.)  The ITEM linked list is iteratively
    built by the user in the process_main_menu_choices function, or via the command
    line, in the process_cmd_line_options function.
A MATCH constitutes a match to the pattern entered by the user.  Each MATCH contains
    a score, length (in columns), positions (in nucleotides), which seqs match, etc.
For information on blocks, segments, and other constructs used for 'b-type' files, see
    the documentation in bb_util.h and bb_util.c.


Basic program flow (function names in CAPS, with their main purpose listed after the '-'):

MAIN
    initialize vars, etc
    PROCESS_MAIN_MENU_CHOICES or PROCESS_CMD_LINE_OPTIONS
      	-iteratively build pattern and other settings as prompted by user

    for each strand
	SEARCH_FOR_PATTERN
	    -search for the pattern in the cur dir, on the proper seqs,in the proper range
	    RECORD_PATTERN_MATCHES
	        -record all matches to the entire pattern at the given seq and pos
		while (GET_NEXT_WINDOW -get next spacer window to look in for next motif)
		    RECORD_MATCHES_IN_WINDOW
			-look in the given window for all matches to the current motif
	        if we found at least one match to the entire pattern
		    for each match
			GET_ENDPOINTS
			    -given the column in the alignment, get the hit's start and
			     end positions in relation to the proper seq
			ADD_TO_MATCHES
			    -add the given match to the linked list of matches

	OUTPUT_MATCHES
	    -step through MATCH list, outputting all matches that meet given requirements
	    if it meets the requirements
		OUTPUT_MATCH
		    -output one match in the proper format


All important functions are listed alphabetically by name.  
Any function not listen here should be fairly self-explanatory from reading the code.
Each function has a basic description, as well as a list of any problem areas.


add_motif_to_list:

	Called whenever a new motif to be added to the pattern is entered by the user.  
	Fills the given information into a new ITEM structure and adds it to the ITEM 
	linked list.

add_to_matches:

	Updates info in proper MATCH structure if it is a duplicate match.  Otherwise,
	creates a new MATCH structure, fills in the relevant information, and calls 
	add_to_match_list to place it in the proper place in the linked list.

add_to_match_list:

	Called by add_to_matches.  Physically adds the new MATCH structure to the 
	proper place in the MATCH linked list.

adjust_from_and_to:

	Because of gaps in alignments, there is a difference between column in an 
	alignment and position in a sequence.  When a user enters a range to search in,
	they are specifying where to search by position.  This function adjusts this
	range to be relative to the column position in type 'm' files.  This is not
	neccessary for type 'b' files, as their positions are given at the beginning of
	each block.

adjust_vals / convert_to_log_odds:

	Adjust the values in the scoring matrix to be in log-odds form.  The basic formula
	is taken from (Hertz GZ, Hartzell GW, and Stormo GD Comput. Appl. Biosci. 6:81-92
	(1990).)

get_endpoints:

	Given the column in the alignment (or block of the alignment), returns the start
	and end positions of the match in relation to the relevant sequence.  By default,
	the relevant sequence is the sequence that the match occurs on.  When the 
	'report_to_top' option is turned on in the main menu by the user, the relevant
	sequence is the top-most sequence in the alignment (sequence 0).  Get_endpoints
	calls one of 4 functions, depending on the current settings:

		get_ends_b_rev:  'b-type' file, reverse strand
		get_ends_b_for:  'b-type' file, forward strand
		get_ends_m_rev:  'm-type' or 'f-type' file, reverse strand
		get_ends_m_for:  'm-type' or 'f-type' file, forward strand

	These functions proved to be among the more challenging parts of the program,
	as many issues must be considered, among them:

	    Converting from column in alignment (or block of the alignment) to position:

		This is trivial for 'f-type' files, as there are no gaps in these.
		Essentially, for 'm-type' files, this means doing a 'col2pos' lookup,
		which means decrementing start and end by 1 for every gap that precedes
		the beginning of the match in the relevant sequence.  For 'b-type' files,
		start and end are manually decremented for every gap that occurs in the
		block before the match in the relevant sequence.

	    Checking if start or end align to nothing in the top seq (report_to_top only):

        	A '-1' is reported for the start or end endpoint when the 'report_to_top' 
        	option is turned on and the match does not align to the top-most sequence
        	(note that this can only happen in 'b type' files, namely in blocks that
        	do not contain sequence 0.)  

	    Incrementing the start or end if either aligns to a gap (report_to_top only):

		This can best be shown by a simple example.  Consider searching for the
		GATA-1 binding site (consensus of AGATAA) in the following alignment:
	
			Pos in Seq0:	12.......3456789
			Sequence 0:	AT-------TATTATC
			Sequence 1: 	CGG--AGATAA--CGC

		In this case, the start of the match aligns to a gap in sequence 0.
		If the start position is not decremented, when the user tries to slice
		the alignment at the given positions (3 and 4), he will end up with:

			Sequence 0:	TA
			Sequence 1:	AA


		By decrementing the start position, the user will slice the alignment at
		positions 2 and 4, getting:
		
			Sequence 0:	T-------TA
			Sequence 1:	GG--AGATAA

		The same holds true for the end position as well.

	    Decrementing the end for every gap in the sequence:
	
		For every gap that occurs in the relevant sequence, the end position
		must be decremented.  Consider searching for GATA-1 <0,20> GATA-1 in the
		following sequence:

			Position:	123456..789012
			Sequence:	AGATAA--AGATAA

		As is readily apparent, for each gap that occurs inside the match,
		the end position must be decremented (in this case, there are two gaps,
		so the end position will be 14 - 2 = 12.)

	    Making sure that start and end don't go beyond their boundaries in top seq:

		After all of the preceding things have been taken into consideration,
		it is possible that the start or end has been incremented or decremented
		straight off of the alignment.  An example (assuming a GATA-1 search with
		the report_to_top option turned on):

			Pos in Seq0:	12345678....
			Sequence 0:	ATTATCGC----
			Sequence 1:	AT--AGATAA-C

		Without the boundary check, tffind would report the positions of the match
		as 5 to 9 after the end has been incremented due to the fact that it
		aligns to a gap.  However, sequence 0's positions only go up to 8!  To
		fix this problem, check_boundaries is called.  After it is called, the
		preceding hit would be listed as 5 to 8.
	
get_next_window:

	Gets the next window to look for the next motif in.  For example, the pattern
	{GATA-1}<300,500>{erg(c)}<200,600>{GATA-1(c)} has three windows.  The first
	window is implied as merely <0> (you start looking for the first motif at the
	current position in the alignment.)

main: 

	Checks command line parameters
	Determines type of alignment file
	Opens alignment file
	Reads in alignment seq (type m), seq (type f), or blocks (type b)
	Calls process_main_menu_choices or process_cmd_line_options
	Prints header
	For each strand
		Searches for pattern
		Outputs matches
	Prints trailer
	Performs clean up

motif_score:

	Given the current position and sequence and the current motif in the pattern, this
	function returns its score (obtained from the current item's scoring matrix.)
	NEG_INF (negative infinity) is returned if the motif runs off the edge of the
	alignment or sequence.

move_to_adjacent_block:

	When called, this function moves to the adjacent block for the given sequence in
	the alignment.  For the forward strand, this is the next block, whereas this is
	the previous block for the reverse strand.  This function also updates ncol
	(the number of columns in the block) and which_row (the row of the block that the
	current sequence, as determined by which_seq, is located in.)  NULL is returned
	if we have run off the end or beginning of the alignment.

output_match:

	Output one match in the proper format if the match does not occur outside the 
	search range or inside a filtered-out exonic region.  Returns 1 if the match
	meets these criteria, and 0 if it does not.  Two formats are supported: Tffind
	format (default) and GFF format (documented at the Sanger Institute website:
	http://www.sanger.ac.uk/Software/formats/GFF/).

output_matches:

	Step through the match list and output all matches with at least the minimum
	number of sequence matches.  For 'b' type files, move b_ptr to the proper block
	in the alignment for use by output_match.

process_cmd_line_options:
process_main_menu_choices:

	The main user interfaces for tffind, these functions iteratively build the
	user-entered ITEM linked list that constitutes the pattern of t.f.'s.  One
	function is called in batch mode, the other in interactive mode.
	Upon returning from this function, the pattern and all associated vars (such as
	the minimum number of matches required, the range to search in, etc.) are set.
	A 'dummy' spacer value of <0,0> is placed at the beginning of the linked list
	to simplify the search later.

record_match:

	Record position and percentage of one match.  This function makes sure each
	match is recorded in the proper row and column in the arrays (making new rows
	and columns as necessary.)  If a match is new (i.e. it is not merely a duplicate
	match from a different sequence at the same place.), which_match is incremented
	to keep it up to date.	Match_arr and bp_arr look like this:

		 	  		|-------which_motif-------|

				0	1	2	3	4	...
	  	_ 
		|	0
		|	1
		|	2
	   which_match	3	
		|	4
		|	.
		|	.
		-	.

	Which_motif is the number of the current motif in the pattern that we are
	searching for.  Which_match is the number of the current match that we are
	working on.

record_matches_in_window:

	Look in the given window for the current motif, recording all matches in
	match_arr and bp_arr.  This was another tricky section, as the initialization
	of cp and bp was different for several different cases involving b_ptr, cur_dir,
	and which_motif.  Other than that, this function is fairly straightforward.

	assign cp and bp
	for each potential motif in the window
	    get score
	    if (score > minimum required score)
		record match in match_arr and bp_arr

record_pattern_matches:

	Record all matches to the pattern at the given position in the given sequence.
	
	while we have found at least one match for the current motif and there are still
	      motifs left in the pattern
	    for each of the potential matches we have found so far
	    	assign offset (how far into the sequence or block we should begin looking
			       for the first motif in the pattern)
		record_matches_in_window()
	if we found at least one match
	    record the matches in the MATCH list 

remove_trailing_spacers:

	This function removes all trailing spacers from the user-built pattern string.
	Trailing spacers are all spacers that occur after the final motif.  For example,
	the pattern {GATA-1}<0,20>{GATA-1(c)}<40><0,200> has two trailing spacers.

same_match:

	Returns 1 if the two matches occur at the same place in the alignment on two
	different sequences.  Used by function add_to_matches to determine whether or
	not a new MATCH structure should be created for the current match.

search_for_pattern:

	Search for the pattern in the current dir, in the given range, on the proper
	sequences.  For type 'm' and 'f' alignment files, this function goes through the
	proper range of each sequence in the alignment to be searched in, calling 
	record_pattern_matches.  For type 'b' alignment files, it does this for each block
	in the alignment that has a at least as many rows in it as the minimum required
	matches.

too_far_in_match_list:

	Returns 1 if we have passed up the spot where the current match belongs in the 
	MATCH linked list.  The first criterion is to sort by the top-most sequence that
	the match occurs on.  This way, all matches to sequence 0 are at the top of the
	output, followed by sequence 1, etc.  The second criterion is to sort by start
	position.  This way, all matches are sorted increasingly by position within each 
	sequence.

valid_match:

	A match is valid if it has a hit in all of the motifs of the pattern.  The
 	match array is initialized to all -1's, so that if a hit is not found at any
	position, its value will remain a -1.  Accordingly, valid_match returns a 1 if
	there are no -1's in the given match array.

/*======================================================================================*/