Pairwise Alignment Examples

Introduction

This page illustrates the utility of our pairwise alignment tools by presenting several examples, most of which compare some region of the human genome with the syntenic region from a rodent genome. In each case, the two DNA sequences were aligned using a new variant of our sim program, and the resulting alignment was processed by another tool called pipper to produce a PIP ("percent identity plot") showing how well the two sequences matched within each aligned segment.

That is, the sim program inserts gaps into the sequences to make them align as closely as possible, thus breaking the alignment into a series of short segments. Then, pipper draws each segment as a horizontal bar whose horizontal position on the diagram corresponds to the segment's location in the human sequence, and whose vertical position indicates the percentage of nucleotides in the segment that were identical between the two sequences. However, only segments with an identity of 50% or more are actually plotted, so regions that match poorly appear blank on the diagram. Also, since the diagram is quite wide for long sequences, it usually doesn't fit on the page directly and must be broken into several lines.
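The segment-and-threshold idea described above can be sketched in a few lines of Python. This is only an illustration, not the actual sim or pipper code: the function name `pip_segments`, the use of `-` as the gap character, and the tuple layout are assumptions made for this example. Only the 50% cutoff comes from the description above.

```python
from typing import List, Optional, Tuple

# (start, end, percent identity); start and end are 1-based, inclusive
# positions in the first (e.g. human) sequence.
Segment = Tuple[int, int, float]


def pip_segments(aligned_a: str, aligned_b: str,
                 min_identity: float = 50.0, gap: str = "-") -> List[Segment]:
    """Split a gapped pairwise alignment into gap-free segments and return
    those whose percent identity is at least min_identity."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    segments: List[Segment] = []
    pos_a = 0                          # residues of sequence A consumed so far
    seg_start: Optional[int] = None    # start of the current segment in A
    seg_len = 0                        # columns in the current segment
    matches = 0                        # identical columns in the current segment

    def flush() -> None:
        """Close the current segment and keep it if it passes the cutoff."""
        nonlocal seg_start, seg_len, matches
        if seg_start is not None and seg_len > 0:
            identity = 100.0 * matches / seg_len
            if identity >= min_identity:
                segments.append((seg_start, seg_start + seg_len - 1, identity))
        seg_start, seg_len, matches = None, 0, 0

    for ca, cb in zip(aligned_a, aligned_b):
        if ca == gap or cb == gap:
            flush()                    # a gap in either sequence ends the segment
            if ca != gap:
                pos_a += 1             # A still advances when the gap is in B
            continue
        pos_a += 1
        if seg_start is None:
            seg_start = pos_a
        seg_len += 1
        if ca.upper() == cb.upper():
            matches += 1
    flush()
    return segments


if __name__ == "__main__":
    # Toy alignment, not real data.
    for start, end, identity in pip_segments("ACGTAC--GTTTAA", "ACGAACTTGT-CCC"):
        print(start, end, round(identity, 1))
```

Each returned tuple corresponds to one horizontal bar: the first two values give its horizontal extent and the third its height. A segment below the cutoff is simply dropped, which is why poorly matching regions leave the plot blank.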

For each example, links and/or references are provided for the original sequence data, relevant literature citations, and a PostScript file containing the PIP diagram of the pairwise alignment. If you don't have a PostScript viewer to display the diagram, you can either obtain one or, if you have a PostScript printer, send the file you receive from our server directly to your printer.


The alpha-globin genes, or HBA

Comparison of the human and rabbit HBA sequences yields a PIP diagram showing little sequence conservation outside of the coding regions.

The HBA gene encodes the alpha-globin polypeptide chain. The alpha-globin chain binds heme and forms a heterotetramer with heme-bound beta-globin polypeptides. This heterotetramer is the hemoglobin found in erythrocytes, which is responsible for transport of oxygen from the lungs to the peripheral tissues. Mutations in the alpha- and beta-globin genes cause the most common inherited diseases of man, including sickle cell disease and the thalassemias. (For more background information and references, see the entries in the OMIM and GDB databases.)

Examination of mammalian alpha-globin genes shows that even the absence of expected sequence matches can lead to productive, testable hypotheses. Despite their descent from a common ancestral gene and the requirement for coordinated, tissue-specific regulation, most mammalian alpha- and beta-globin genes are in very different genomic DNA contexts and are regulated in distinctly different ways. In particular, the alpha-globin gene clusters are highly G+C rich with several CpG islands and are in constitutively active chromatin (Craddock et al., 1995; Fischel-Ghodsian et al., 1987; Hardison et al., 1991), whereas the beta-globin gene clusters are more A+T rich, are devoid of CpG islands, and undergo chromatin domain opening only in erythroid tissues (Groudine et al., 1983; Margot et al., 1989; Stamatoyannopoulos et al., 1994). The flanking and internal sequences of the rabbit and human alpha-globin genes comprise prominent CpG rich islands that serve as strong, enhancer-independent promoters in a variety of transfected mammalian cells (Charnay et al., 1984; James-Pederson et al., 1995). Although the CpG islands are present in orthologous positions in the rabbit and human alpha-globin gene clusters, sequence alignments show the unexpected result that specific protein binding sites are conserved only in the 100 bp of proximal 5' flanking regions, not throughout the CpG islands (Hardison et al., 1991; Yost et al., 1993), as illustrated in the PIP diagram. This result suggested that the CpG islands outside the proximal promoter provide a more permissive environment for promoter activity than does bulk A+T rich DNA, but that this effect does not depend on binding of specific trans-activators at discrete locations.
The postulated general effect of CpG islands is supported by three lines of evidence: (1) the level of gene expression increases with increasing size of the CpG island included in transfection constructs, (2) deletion of prominent binding sites for Sp1 and YY1 has no effect, and (3) addition of alpha-globin gene promoter fragments to a transcriptionally inactive CpG island gives a much higher level of expression after integration into the genome than does addition of these fragments to an A+T rich DNA fragment (Shewchuk and Hardison, 1997). This more permissive effect of CpG islands may be exerted at least in part at the level of chromatin structure, since CpG island DNA from the alpha-globin gene has a much lower affinity for nucleosome reconstitution in vitro than does the A+T rich DNA from the beta-globin gene (Shewchuk 1997).
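For readers who want to check whether a stretch of sequence is "G+C rich" or CpG-island-like, the two statistics usually computed are the G+C fraction and the ratio of observed to expected CpG dinucleotides. The sketch below is an illustration only: the function names are hypothetical, and the thresholds in the final comment (length of at least 200 bp, G+C of at least 50%, observed/expected CpG of at least 0.6) are a commonly used working definition, not one stated on this page.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G).

    Returns 0.0 when the sequence contains no C or no G, since the
    expected count is then zero and the ratio is undefined.
    """
    seq = seq.upper()
    n_c, n_g = seq.count("C"), seq.count("G")
    if n_c == 0 or n_g == 0:
        return 0.0
    n_cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    return n_cpg * len(seq) / (n_c * n_g)


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5, min_ratio: float = 0.6) -> bool:
    """Apply the assumed working thresholds to a single candidate window."""
    return (len(seq) >= min_len
            and gc_fraction(seq) >= min_gc
            and cpg_obs_exp(seq) >= min_ratio)
```

In practice these statistics would be evaluated in a sliding window along the sequence, and the window size is a further assumption that the user must choose.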

References not found in Medline

Shewchuk, B.M. (1997) Effect of CpG islands from the alpha-globin gene cluster on gene expression: Evidence for a chromatin dependent activity. Ph.D. thesis, The Pennsylvania State University.

Shewchuk, B.M. and Hardison, R.C. (1997) CpG islands from the alpha-globin gene cluster increase gene expression in an integration-dependent manner. Mol. Cell. Biol. 17: 5856-5866.

Stamatoyannopoulos, G. and Nienhuis, A.W. (1994) Hemoglobin Switching. The Molecular Basis of Blood Diseases. G. Stamatoyannopoulos, A.W. Nienhuis, P.W. Majerus and H. Varmus, ed. (W.B. Saunders Co., Philadelphia) 107-155.


The beta-like globin gene cluster

Comparison of the human sequence HUMHBB with a mouse sequence assembled from three GenBank entries (MMBGCXD, MMCONREG, and MMMLCRHS4) yields a PIP diagram showing some matches in the 5' and 3' flanks of the beta gene, and matches extending for over 16 kb upstream of the epsilon gene, in an area known as the Locus Control Region, or LCR.

The existence of duplicated genes in this cluster complicates the analysis, but inspection of the PIP diagram suggests several hypotheses:

Before long, this page may discuss these points in more detail.

References

Hardison, R. and Miller, W. (1993) Use of long sequence alignments to study the evolution and regulation of mammalian globin gene clusters. Molecular Biology and Evolution 10, 73-102.


Bruton's tyrosine kinase, or BTK

Defects in the Bruton's tyrosine kinase (BTK) gene lead to X-linked agammaglobulinemia (Tsukada et al. 1993; Vetrie et al. 1993), a disorder characterized by a severe deficiency of circulating immunoglobulins and mature B cells (OMIM 300300). Examination of the B cells present in the bone marrow of XLA patients demonstrates increased levels of pro B cells but reduced levels of both pre B and mature B cells (Campana et al. 1990), suggesting that BTK function is crucial for the maturation of B cells. The expression pattern of BTK further suggests its involvement in B cell development: expression is restricted to B cells and myeloid cells and is not seen in T cells. In addition, BTK is expressed in the early stages of B cell differentiation, before immunoglobulin heavy- or light-chain rearrangements. This expression continues throughout B cell development but is down-regulated once the B cell matures into a plasma cell (de Weers et al. 1993; Smith et al. 1994).

Understanding elements regulating BTK expression is crucial to understanding its involvement in the complexities of B cell development. Both in vitro and in vivo experiments have demonstrated the contribution of binding sites for Spi-1/PU.1, SpiB, and Sp1 within the 280 bp 5' of BTK exon 1 to the hematopoietic cell lineage-specific expression (Sideras et al. 1994; Himmelmann et al. 1996; Muller et al. 1996). Under the hypothesis that elements important in regulating the expression of BTK are conserved between species, the human (GenBank U78027) and murine (GenBank U58105) genomic sequences in the region have been compared (Oeltjen et al. 1997).

Genomic sequencing of the BTK region has shown it to be both gene-rich and repeat-dense (Oeltjen et al. 1995). In addition to BTK, four genes previously mapped to the region (Vorechovsky et al. 1994) were localized: the single-exon RNA-binding gene, FTP-3; the seven-exon gene, alpha-D-galactosidase A (GLA), defects in which result in Fabry disease (a lysosomal storage disease); a five-exon ribosomal gene, L44L; and a two-exon gene of unknown function, FCI-12. As shown in the PIP diagram (PostScript) and in the Laj applet, the comparison of the mouse and human genomic sequences demonstrates not only conservation of the entire coding sequence, but also extensive conservation of the noncoding sequence. When the two ubiquitously expressed genes, L44L and GLA, are compared with the more specifically expressed BTK gene, the noncoding sequence within the BTK locus appears to be more conserved (Oeltjen et al. 1997). While conservation within both the L44L and GLA loci is primarily restricted to the regions flanking the first exons, conservation within the BTK locus occurs in clusters throughout the locus. These clusters include the region flanking the first exon, the 3' end of the first intron, the fourth and fifth introns, the region between the eighth and tenth exons, and the region between the thirteenth and sixteenth exons.

Transient transfection experiments including the conserved sequence regions upstream and downstream of the first exon have demonstrated the contribution of both of these regions to the cell lineage-specific expression pattern of BTK (Oeltjen et al. 1997). These data suggest the hypothesis that other conserved regions within the locus are also important in the regulation of BTK.


The ERCC2 gene region

Comparison of the human and mouse sequences (Lamerdin et al. 96) yields a PIP diagram showing little sequence conservation outside of the coding regions.

ERCC2, also known as XPD, encodes an enzyme involved in excision repair of DNA. The names are acronyms for "excision repair cross-complementing rodent repair deficiency, complementation group 2" and "xeroderma pigmentosum D". When introduced into xeroderma pigmentosum cells of complementation group D, the ERCC2 gene corrects the sensitivity to UV radiation and defective nucleotide excision repair. The enzyme encoded by ERCC2 is a single-stranded DNA-dependent ATPase and a DNA helicase. Mutations in ERCC2 have been associated with at least 3 different disease phenotypes: xeroderma pigmentosum, Cockayne syndrome, and trichothiodystrophy. ERCC2 is homologous to RAD3 in yeast and is located on the boundary of 19q13.2-q13.3. For a more complete description and references, see the entries in OMIM and GDB.

The comparison of the ERCC2 locus shows matches in the 5' flank of exon 1 for only about 200 bp, indicative of a very small regulatory region. One plausible explanation is that ERCC2 may be expressed at about the same level in all cells, given the ubiquitous need for excision-repair of the DNA. Thus it may be under relatively simple control, manifested in this analysis as a limited number of cis-regulatory sequences. The adjacent, oppositely transcribed KLC gene shows a series of short matching segments for about 1000 bp 5' to the cap site. In these cases where the matching sequences are mostly restricted to exons, and especially when the pattern of expression differs in some respects between human and rodent, examination of the homologous locus in a species more closely related to humans, such as a prosimian primate, could be informative. For instance, regulatory elements that are conserved in primates but divergent in some other mammalian order should be readily detectable.


The cardiac myosin heavy chain genes

The beta- and alpha-myosin cardiac heavy chain genes are tandemly arrayed on chromosome 14q12, separated by 4.5 kb. Comparison of the human and hamster sequences (Jaenicke et al. 90, Liew et al. 90, Epp et al. 93, Wang et al. 94 and Wang et al. 95) yields a PIP diagram showing very high conservation of coding regions, together with somewhat less pronounced matches throughout the entire region.

The strongest matches in non-coding regions are in the first few hundred bp upstream of (non-coding) exon 1 of the beta gene and the first 100 bp upstream of (non-coding) exon 1 of the alpha gene; other potentially interesting matches are spread for more than 2000 bp further upstream of the alpha gene. Early experimental results based largely on transient transfection of cultured cells indicated that regulatory signals necessary for high level transcription of both genes, as well as responsiveness to thyroid hormone and contractile activity, are located in the first 400 bp upstream of the transcription start site. However, recent experiments using transgenic mice suggest that in fact, several thousand bp of upstream sequence is needed for proper regulation of these genes. For a review of these studies, see Robbins 96.

The tissue- and developmental stage-specificities of the alpha and beta cardiac myosin heavy chain genes differ between humans and rodents. In all mammals studied, alpha is the major atrial isoform. In small mammals, it is also the major isoform in ventricular myocytes from postpartum to adulthood (beta is predominately expressed in the embryonic and fetal ventricle), whereas in the human ventricle the beta isoform is expressed almost exclusively. (The alpha protein has a higher speed of contraction, and rodent hearts beat much faster than does a human heart.) This difference may limit what can be learned about gene regulation by comparing the human and rodent sequences.

[Note: The human genomic sequences for the beta and alpha genes, respectively, are in GenBank entries HSCBMYHC (25000 bp) and HSCAMHCA (31462 bp). The first 271 bp of the latter are identical to positions 24359-24629 of the former, but the final 25000 - 24629 = 371 nucleotides of the former do not match the latter. To fuse these entries, we equated position 24359 of HSCBMYHC with position 1 of HSCAMHCA. This produces a sequence of 55,820 bp, which agrees with the figure cited by Epp et al. 93. Fusing the hamster entries was more straightforward because the final 60 nucleotides of HAMBMHC (33960 bp) are identical to the first 60 of HAMSHCA (32415 bp).]
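The coordinate arithmetic in the note above can be checked with a short Python sketch. The helper names (`fused_length`, `fuse`, `suffix_prefix_overlap`) are hypothetical and were introduced only for this illustration. Positions are 1-based, as in GenBank entries, and the sequences in the toy example are made up.

```python
def fused_length(len_second: int, anchor_in_first: int) -> int:
    """Length of the fused sequence when 1-based position anchor_in_first
    of the first entry is equated with position 1 of the second entry."""
    if anchor_in_first < 1:
        raise ValueError("anchor must be a 1-based position")
    return (anchor_in_first - 1) + len_second


def fuse(first: str, second: str, anchor_in_first: int) -> str:
    """Keep the first entry up to (not including) the anchor, then append
    the whole second entry. Anything in the first entry at or beyond the
    anchor is replaced by the second entry."""
    return first[:anchor_in_first - 1] + second


def suffix_prefix_overlap(first: str, second: str) -> int:
    """Length of the longest suffix of first that is also a prefix of second."""
    for k in range(min(len(first), len(second)), 0, -1):
        if first.endswith(second[:k]):
            return k
    return 0


if __name__ == "__main__":
    # Human entries: position 24359 of HSCBMYHC equated with position 1
    # of HSCAMHCA (31462 bp).
    print(fused_length(31462, 24359))

    # Hamster entries: a 60-nucleotide end overlap puts the anchor at
    # 33960 - 60 + 1 in HAMBMHC.
    print(fused_length(32415, 33960 - 60 + 1))

    # Toy sequences (not real data) showing the simple end-overlap case.
    a, b = "AAACCCGG", "CCGGTT"
    k = suffix_prefix_overlap(a, b)
    print(fuse(a, b, len(a) - k + 1))
```

For the human entries the anchor has to be supplied explicitly, because the end of the first entry does not match the second and a plain suffix-prefix search would not find the overlap.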


The T-cell receptor alpha/delta constant region

Comparison of the human and mouse sequences in the C-delta to C-alpha region (Wilson et al. 92, Koop et al. 92, Koop et al. 94) yields a PIP diagram in which regions of high sequence conservation do not correspond to coding regions. Note that in all of our other examples, sequence conservation tends to drop off very sharply at coding-region boundaries.


The XRCC1 DNA repair gene

Comparison of the human and mouse sequences (Lamerdin et al. 95) yields a PIP diagram showing extensive matches extending for over 3000 bp in the 5' flank of the gene.

XRCC1 encodes an enzyme involved in repair of X-ray damage. Its name is an acronym for "X-ray-repair, complementing defective, in Chinese hamster, 1". The normal gene will correct defects in repair of DNA strand breaks and sister chromatid exchange in the repair-defective mutant Chinese hamster ovary cell line EM9 treated with ionizing radiation and alkylating agents. The encoded protein is apparently required for optimal activity of DNA ligase III. XRCC1 maps to human chromosome 19q13.2. Three DNA repair genes are located on chromosome 19, with ERCC2 distal to XRCC1 and in the same region as ERCC1, but on different large restriction fragments generated by MluI. The human gene has 17 exons spanning about 31.9 kb. For further information and references, see the entries in OMIM and GDB.

Comparison of the XRCC1 genes shows very few matches in the introns, but long, high-scoring alignments extend for 3000 bp 5' to exon 1, and there are some shorter matches at the 3' end of the gene. At the present time, no experimental tests of the role of this substantial 5' flanking region have been reported. We tested the hypothesis that some other gene could be located in this region by searching for matches in dbEST. Indeed, a very strong match to one cDNA sequence is found with the mouse 5' flanking region (positions 7732-8500, which are homologous to human positions 1001-1800). The presumptive exon sequence is identical to that of the cDNA, and we conclude that one of the reasons for the strong conservation is the presence of a coding region of a currently unknown gene. Presumably the homologous human cDNA has not yet been sequenced. Note that the XRCC1 gene is linked to the ERCC2 gene, whose PIP diagram is also available.


Mycoplasma

(Copyright 1997, Ross Hardison)

Comparison of the M. pneumoniae and M. genitalium sequences yields a PIP diagram showing matches over the full length of the shorter sequence. The PIP shows the aligning genes that were previously recognized as homologous between these two species, following the colored diagram from the Herrmann laboratory (see Herrmann, 1992). However, several genes in M. pneumoniae that are thought to be amplified from genes orthologous to those in M. genitalium are not shown as having matching sequences. These are represented as the paler colors in Herrmann's diagram. These matches may not have been displayed in the PIP because the percent match was below 50%, or because the matches were so short that they were not collected from the dot-plot for display in the PIP, or both. Some of these are paralogous duplications of either orf5 or orf6 of the P1 adhesin operon; the others are paralogous to a putative lipoprotein homologous to MG260 (i.e. orf260 in the M. genitalium genome). The P1 adhesin gene shows a diverse and segmented pattern of matches, i.e. many different segments match, separated by gaps, and the percent identity of each segment differs over a wide range, from 50% to 80%. Since the protein encoded by this gene is involved in attachment to the host cells, it is possible that some of the sequence differences are involved in determining host range. Also, it has been proposed that the duplicated copies of such cell-surface proteins have been altered to provide antigenic variation to allow escape from immune surveillance (Herrmann, 1992). If so, the positive selection on variation at these loci within each independent Mycoplasma lineage could contribute to the sequence divergence seen between species. In general, many of the paralogous gene assignments from the Herrmann diagram are not depicted on the PIP. Another example is the large set of "duplicate" genes between about 308500 and 322800.

Between 136060 and 139376 is a region almost devoid of genes in M. pneumoniae. A short ORF that has been assigned as a putative lipoprotein is located in this region, but no other hypothetical genes. It is unusual to have over 3 kb with such limited coding potential in this compact genome.

The most highly conserved region is the rRNA operon, rrn, between 83358 and 88155 in M. pneumoniae. It is separated from other strongly matching regions by several genes found in M. pneumoniae but not M. genitalium.

High-scoring sequence matches in noncoding regions are candidates for cis-regulatory elements. Many of the genes are adjacent or very close to each other, and many gene sets are expected to be organized into co-transcribed operons. However, in several instances the coding regions of genes transcribed in the same direction are about 100 bp apart or more, and matches in these "coding gaps" should be investigated for transcriptional regulatory elements. Two examples include the segments (1) between 110057 (beginning of R02_orf1386V, homologous to MG064) and 110294 (end of R02_orf300, encoding 1-phosphofructokinase) and (2) between 386304 (end of G12_orf166a, homologous to MG342) and 386397 (beginning of G12_orf1391o, encoding the beta subunit of RNA polymerase). One could then test whether these contain the promoters for R02_orf1386V and an rpoBC operon, respectively. Another example with longer matches is between 717293 and 717814, which could be the presumptive promoter for A65_orf251b, a homolog of MG116. Other good examples are centered around 728800 and 738000. Some "apparent" gaps in the coding regions that also show matches simply contain tRNA genes; an example is between 460000 and 461000, with seven tRNA genes and five high-scoring matches (presumably some with more than one tRNA gene).

One can also see matches in noncoding regions for divergently transcribed genes, and in this case one expects to find two promoters in opposite orientation. The PIP shows some short, high scoring matches in the noncoding region between 141633 and 141816, which are the beginnings of the divergently transcribed genes D09_orf657 and D09_orf384 (homolog of the glpD gene in E. coli). These encode, respectively, a putative lipoprotein homologous to MG040 and the aerobic glycerol-3-phosphate dehydrogenase. Two excellent examples of this and other noncoding matches are seen between 362000 and 380000. Divergent transcription begins between 363076 and 363194, transcribing a homolog of rpL10 leftward and a homolog of mucB (encoding UV protection protein) rightward. Two matches that appear to have greater than 90% identity are in this region (or close to it). In the second example, divergent transcription begins between 366364 and 366705, transcribing an unidentified ORF and homologs of ruvA and ruvB (needed for branch migration during recombination) leftward and a homolog of ackA (encoding acetate kinase) rightward. This larger intergenic region still matches in the two Mycoplasma genomes, almost throughout its length, although only at an average of about 70% identity. Regions between converging transcripts are also potential non-coding regulatory regions, and an example is seen in this same region. The segment between 371941 and 372465 is between the ends of rightward and leftward transcripts, and has several notable matches. The high-scoring match around 378900 is another tRNA gene.
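The three kinds of intergenic region discussed above (between genes transcribed in the same direction, between divergently transcribed genes, and between converging transcripts) can be enumerated from a gene annotation with a sketch like the following. The `Gene` and `Gap` record layouts, the labels, and the function name are assumptions made for this example. The default minimum length of 100 bp follows the "about 100 bp apart or more" rule of thumb mentioned earlier, and the coordinates in the usage example are made up.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    name: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str   # "+" = transcribed rightward, "-" = transcribed leftward


class Gap(NamedTuple):
    left: str     # gene on the left side of the gap
    right: str    # gene on the right side of the gap
    start: int
    end: int
    kind: str     # "same-direction", "divergent" or "convergent"


def intergenic_gaps(genes: List[Gene], min_len: int = 100) -> List[Gap]:
    """List noncoding gaps of at least min_len bp between neighbouring
    genes, labelled by the relative orientation of the flanking genes."""
    ordered = sorted(genes, key=lambda g: g.start)
    gaps: List[Gap] = []
    if not ordered:
        return gaps
    prev = ordered[0]               # gene reaching furthest right so far
    for gene in ordered[1:]:
        start, end = prev.end + 1, gene.start - 1
        if end - start + 1 >= min_len:
            if prev.strand == gene.strand:
                kind = "same-direction"
            elif prev.strand == "-" and gene.strand == "+":
                kind = "divergent"  # both genes start at the gap
            else:
                kind = "convergent" # both genes end at the gap
            gaps.append(Gap(prev.name, gene.name, start, end, kind))
        if gene.end > prev.end:     # tolerate overlapping or nested genes
            prev = gene
    return gaps


if __name__ == "__main__":
    toy = [Gene("A", 1, 1000, "-"), Gene("B", 1200, 2000, "+"),
           Gene("C", 2050, 3000, "+"), Gene("D", 3200, 4000, "-"),
           Gene("E", 4150, 5000, "-")]
    for gap in intergenic_gaps(toy):
        print(gap)
```

Gaps found this way are only candidates. As noted above, some apparent gaps simply contain tRNA genes, so the annotation passed in should include RNA genes as well as protein-coding ones.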

This whole exercise of finding non-coding sequence matches could be done automatically in a more complete fashion, of course eliminating all genes, including those for tRNAs and rRNAs. Once all matches in noncoding regions had been collected, one could look for conserved, or frequently observed motifs. This could provide unique insight into mycoplasmal promoters. One might predict that they would look like Bacillus promoters, and this is of course testable.
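A first step toward automating this would be to discard every match that overlaps an annotated gene, including tRNA and rRNA genes. The sketch below shows one way to do that with plain interval arithmetic. The function names and the representation of matches and genes as `(start, end)` tuples are assumptions for illustration, and the coordinates in the usage example are made up.

```python
import bisect
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals into a sorted,
    disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def noncoding_matches(matches: List[Interval],
                      genes: List[Interval]) -> List[Interval]:
    """Return the matches that do not overlap any gene interval."""
    merged = merge_intervals(genes)
    starts = [s for s, _ in merged]
    kept: List[Interval] = []
    for m_start, m_end in matches:
        # Index of the last merged gene that starts at or before m_end.
        i = bisect.bisect_right(starts, m_end) - 1
        # Because merged genes are disjoint and sorted, that gene is the
        # only one that can still reach m_start.
        if i >= 0 and merged[i][1] >= m_start:
            continue
        kept.append((m_start, m_end))
    return kept


if __name__ == "__main__":
    genes = [(100, 500), (450, 800), (1000, 1200)]
    matches = [(10, 50), (90, 120), (850, 950), (1150, 1300), (1250, 1400)]
    print(noncoding_matches(matches, genes))
```

This version drops a match entirely if any part of it touches a gene. Trimming matches to their noncoding portion instead would be an equally reasonable design choice, and the surviving intervals could then be passed to a motif-finding step.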

Although the genes for many proteins show a largely continuous match or a few matching segments that cover most of the protein, some show a significantly different pattern, one of many very short matches. Examples of these patterns are seen in the region between 470000 and 473500, with H08_orf369 (homologous to competence locus E in Bacillus subtilis) showing a single continuous match, whereas H08_orf672, encoding a cytoadherence protein, shows a highly broken pattern of matches. The latter is seen for other genes encoding cytoadherence proteins (such as H08_orf1018 between 476000 and 479500). As mentioned above, there may be a selective advantage to accumulating more changes in some cell-surface proteins for adjusting to the particular types of host cells supporting this parasitic bacterium. Also, changes in cell-surface proteins can help the bacterium escape the immune system of the host.

(More background about Mycoplasma)

References

Herrmann, R. (1992) Genome Structure and Organization. In J. Maniloff, R.N. McElhaney, L.R. Finch and J.B. Baseman, ed., Mycoplasmas: Molecular Biology and Pathogenesis. (American Society for Microbiology, Washington, D.C.) 157-168.