Major concepts. (See The origin of interspersed repeats in the human genome, by Smit AF.) Almost all mammalian interspersed repeats fall into three categories:
LINE and SINE repeats. A LINE (long interspersed nuclear element) encodes a reverse transcriptase (RT) and perhaps other proteins. Mammalian genomes contain an old LINE family, called LINE2, which apparently stopped transposing before the mammalian radiation, and a younger family, called L1 or LINE1, many of which were inserted after the mammalian radiation (and are still being inserted). A SINE (short interspersed nuclear element) generally moves using RT from a LINE. Examples include the MIR elements, which co-evolved with the LINE2 elements. Since the mammalian radiation, each lineage has evolved its own SINE family. Primates have Alu elements and mice have B1, B2, etc. The process of insertion of a LINE or SINE into the genome causes a short sequence (7-21 bp for Alus) to be repeated, with one copy (in the same orientation) at each end of the inserted sequence. Alus have accumulated preferentially in GC-rich regions, L1s in GC-poor regions.
LTR retroposons. These elements are characterized by a region of several hundred bp, called the long terminal repeat, that appears at each end. Some autonomous elements are cousins of retroviruses (e.g., HIV) but are unable to survive outside of the cell, and are called endogenous retroviruses. None are known to be currently active in humans, though some are still mobile in mice. The so-called MaLR (mammalian LTR) elements, which arose before the mammalian radiation, seem to be non-autonomous repeats that move via proteins from endogenous retroviruses.
DNA transposons. Full-length autonomous elements encode a protein, called transposase, by which an element can be removed from one position and inserted at another. Transposons typically have short inverted repeats at each end.
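The target-site duplication described above for LINE and SINE insertions (a short direct repeat, one copy at each end of the inserted element) can be checked with a simple flank comparison. The following is a minimal sketch, not a published method: the function name target_site_duplication and its parameters are hypothetical, the default lengths reuse the 7-21 bp range quoted above for Alus, and the two copies are assumed to be still identical, which will often fail for old insertions.

```python
def target_site_duplication(seq, start, end, min_len=7, max_len=21):
    """Return the longest exact direct repeat flanking seq[start:end], or None.

    seq        -- genomic sequence as a string
    start, end -- 0-based, end-exclusive coordinates of the inserted element
    min_len, max_len -- duplication lengths to try (assumed range, in bp)
    """
    seq = seq.upper()
    # Try the longest candidate first so the longest match is reported.
    for k in range(max_len, min_len - 1, -1):
        if start - k < 0 or end + k > len(seq):
            continue  # not enough flanking sequence for this length
        left = seq[start - k:start]   # k bases just before the element
        right = seq[end:end + k]      # k bases just after the element
        if left == right:
            return left
    return None


if __name__ == "__main__":
    tsd = "ACGTTGCA"
    example = "GG" + tsd + "TTTTTTTTTT" + tsd + "CC"
    print(target_site_duplication(example, 10, 20))
```

Allowing a few mismatches between the two flanks would make the check more realistic for diverged copies; exact matching is used here only to keep the sketch short.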
More on L1s. Full-length L1s are about 6-7 kb long. They have a 5' UTR, two ORFs (ORF2 encodes an RT), a 3' UTR and an A-rich tail. L1s, and perhaps the non-autonomous Alu and B1 elements and processed pseudogenes that use the L1's RT, preferentially insert at TT|AAAA (the vertical bar denotes the point of insertion). Insertion of a transcribed element starts at the element's 3' end, and often results in only a portion being copied, so many copies lack their 5' part.
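As an illustration of the TT|AAAA preference, the sketch below lists the positions in a sequence where that motif occurs and reports the coordinate of the vertical bar. The function name find_insertion_sites and the cut_offset parameter are hypothetical, only the forward strand is scanned, and only exact matches are reported, so this shows candidate sites rather than a model of where insertions actually occur.

```python
def find_insertion_sites(seq, motif="TTAAAA", cut_offset=2):
    """Return 0-based insertion points for every exact motif match.

    The insertion point is the index of the first base after the bar
    in TT|AAAA, i.e. match start + cut_offset.
    """
    seq = seq.upper()
    sites = []
    start = seq.find(motif)
    while start != -1:
        sites.append(start + cut_offset)
        # Resume one base further on so that no match is skipped.
        start = seq.find(motif, start + 1)
    return sites


if __name__ == "__main__":
    print(find_insertion_sites("GGTTAAAACC"))  # [4]
```

To cover the opposite strand as well, the same scan could be repeated with the reverse-complemented motif (TTTTAA) and a cut offset of 4.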
Many people believe that only a few "master" or "source" L1s are active at any time in evolution. The 3' UTRs of these master copies change extremely rapidly, far more quickly than the drift of neutral genomic sequence. (This may be driven by the competition with SINEs for the L1's RT. That is, changing the patterns used by RT to distinguish L1 transcripts keeps the SINEs from hogging the RT.) The 3' UTR varies in length from 500 bp to over 2000 bp, and the length of the 5' UTR is even more diverse. The 5' or 3' part of an ancient L1 is completely different from that of a recently inserted L1. On the other hand, the 3-kb ORF2 is extremely well conserved.
The rapid turnover in 3' UTR sequence makes it easy to estimate the insertion date of any particular L1 element. Another indicator of insertion age is the amount of nucleotide divergence from the consensus sequence. The following diagram indicates the age of major subclasses of L1 elements. It is adapted from Ancestral, mammalian-wide subfamilies of LINE-1 repetitive sequences, by Smit AF, Toth G, Riggs AD, Jurka J. J Mol Biol 1995 Feb 24;246(3):401-417. Approximate times of divergence of humans from (A) new-world monkeys, (B) prosimian primates (e.g., lemur) and (C) other mammals are indicated. For instance, the L1MA6 family was active around the time of the mammalian radiation.
[Diagram: L1 subfamilies placed along an axis of percent nucleotide substitutions (ticks at 0, 5, 10, 15, 20, 25, 30 and 35), with the divergence times A, B and C marked along it. Upper row of labels: PA2, PA6, PA12, PA16, MB1, ME1, ME2, ME3. Lower row of labels: PA4, PA9, PB3, MB8, MA6, MC1, MD1.]
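The divergence measure used as an age indicator can be sketched as a plain mismatch percentage between a repeat copy and its subfamily consensus. This is a minimal illustration under stated assumptions: the two sequences are already aligned to equal length with "-" for gaps, gap columns are skipped, and no correction is made for multiple substitutions at one site. The function name percent_divergence is hypothetical, and the figures reported by RepeatMasker or in the cited paper may be computed differently.

```python
def percent_divergence(copy, consensus):
    """Percent of aligned, ungapped columns where copy differs from consensus."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    mismatches = 0
    for a, b in zip(copy.upper(), consensus.upper()):
        if a == "-" or b == "-":
            continue  # skip insertion/deletion columns
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * mismatches / compared


if __name__ == "__main__":
    print(percent_divergence("ACGTACGTAC", "ACGTACGTAA"))  # 10.0
```

A value computed this way could be read against the percent-substitution axis of the diagram to get a rough idea of which divergence time (A, B or C) an insertion predates.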
The fact that L1s have a very well conserved ORF2 and highly variable UTRs has obvious repercussions for programs that attempt to identify repeat elements. RepeatMasker performs comparisons with three consensus sequences for ORF2 regions, 25 sequences for 5' UTRs and 50 sequences for 3' UTRs, then post-processes the results so as to predict entire L1s. Similarly, LINE2 elements are represented in RepeatMasker's database by one ORF2 consensus and a collection of 3' UTRs.
More on Alu elements. The Alu family descended from 7SL RNA (a component of the signal recognition particle), as did the mouse B1 repeats, and because of this, an Alu and a B1 will be recognizably similar in a sequence comparison. The Alu family arose from the tandem duplication of older 130-bp repeats, called FLAM and FRAM (free left/right Alu monomer), with an insertion of 31 bp of unrelated sequence in the "right half", making an Alu more than twice the length of a B1 (about 300 bp vs. 130 bp). Dimeric Alus divide neatly into the AluJ and AluS families, based on differences in at least 14 positions. AluJ, AluS and Alu-type monomers account for about 29%, 68% and 3%, respectively, of Alu-type repeats in the human genome. The AluJ family includes the oldest Alu dimers, with the AluS family arising more recently. The AluS elements that were inserted very recently (e.g., after the human-chimp split) form a subfamily that is now called AluY (Y = young).
Because all Alu elements are relatively young, they are very similar to the consensus sequences in RepeatMasker's database. When RepeatMasker finds one (or certain other young repeats), it deletes the copy from the sequence and continues searching the remainder. In cases where the Alu inserted into an older repeat, the parts on either side are thereby brought back together and can be recognized as belonging to the same element. Moreover, for short or highly diverged repeats, the joined portions may be detectable when the separated pieces are not.
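The idea of deleting a young repeat and searching the joined flanks can be sketched as follows. This is only an illustration of the principle, not RepeatMasker's implementation: the function names excise and to_original are hypothetical, coordinates are 0-based and end-exclusive, and a single excised interval is assumed.

```python
def excise(seq, start, end):
    """Remove seq[start:end] (the young repeat) and join the two flanks."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("interval out of range")
    return seq[:start] + seq[end:]


def to_original(pos, start, end):
    """Map a position in the joined sequence back to the original sequence.

    Positions left of the excision point are unchanged; positions at or
    right of it are shifted by the length of the removed interval.
    """
    return pos if pos < start else pos + (end - start)


if __name__ == "__main__":
    original = "AAAACCCCGGGG"   # CCCC stands in for a young inserted repeat
    joined = excise(original, 4, 8)
    print(joined)                 # AAAAGGGG
    print(to_original(4, 4, 8))   # 8
```

Any match found in the joined sequence that spans the excision point would then be reported as two fragments of one older element once its coordinates are mapped back.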
MIRs (Mammalian-wide Interspersed Repeats). For some unknown reason, MIR relics frequently mutate more slowly than neutral DNA, particularly in their middle 70 bp. (See: MIRs are classic, tRNA-derived SINEs that amplified before the mammalian radiation, Smit AF, Riggs AD. Nucleic Acids Res 1995 Jan 11;23(1):98-102.) In contrast, copies of other types of repeats accumulate mutations at the neutral rate.