Sequence analysis and gene content
Analysis of the genomic sequence of the model organisms has made extensive use of predictive computational analysis to identify genes4,5,6. In human DNA, identification of genes by these methods is more difficult because of extensive splicing, lower density of exons and the high proportion of interspersed repetitive sequences. The accuracy of ab initio gene prediction on vertebrate genomic sequence has been difficult to determine because of the lack of sequence that has been completely annotated by experiment. To determine the degree of overprediction made by such algorithms, all genes within a region need to be experimentally identified and annotated; however, it is virtually impossible to know when this job is complete. A 1.4-Mb region of human genomic sequence around the BRCA2 locus has been subjected to extensive experimental investigation, and the 170 exons identified there are believed to be close to the total number expressed in the region.
The most recent calibration of ab initio methods against this region (R.B.S.K. and T.H., manuscript in preparation) shows that with the best methods7,8 more than 30% of exon predictions do not overlap any experimental exons; in other words, they are overpredictions. Furthermore, having now applied this analysis to larger amounts of data (more than 15 Mb from the Sanger Annotated Genome Sequence Repository, which can be obtained as part of the Genesafe collection (http://www.hgmp.mrc.ac.uk/Genesafe/)), we can confirm that prediction accuracy also varies considerably between different regions of sequence. It was hoped that these calibration efforts would lead to rules for reliable gene prediction based on ab initio methods alone, perhaps on the basis of combining several different methods, GC content and so on. However, so far this has not been possible. The same analysis also shows that although 95% of genes are at least partially predicted by ab initio methods, few gene structures are completely correct (none in BRCA2) and more than 20% of experimental exons are not predicted at all. The comparison of ab initio predictions and the annotated gene structures (see below) in the chromosome 22 sequence is consistent with this, with 94% of annotated genes at least partially detected by a Genscan gene prediction, but only 20% of annotated genes having all exons predicted exactly. Sixteen per cent of all the exons in annotated genes were not predicted at all, although this figure falls to 10% for internal exons (that is, excluding 5' and 3' terminal exons). As a result, we do not consider that ab initio gene prediction software can currently be used on its own to annotate genes reliably in human sequence, although it is useful when combined with other evidence (see below), for example, to define splice-site boundaries, and as a starting point for experimental studies.
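The calibration measures quoted above (predictions that overlap no experimental exon, experimental exons that are not predicted at all, and exons predicted exactly) can be illustrated with a minimal sketch. The function names and the toy coordinates below are assumptions made for illustration only; they are not the procedure or data used in the calibration study, and a real comparison would also need to account for strand and sequence identity.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive, on one strand of one sequence


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two inclusive intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def calibrate(predicted: List[Interval], experimental: List[Interval]) -> Dict[str, float]:
    """Compare predicted exons with experimentally confirmed exons."""
    if not predicted or not experimental:
        raise ValueError("both exon lists must be non-empty")
    # Predictions that touch no experimental exon count as overpredictions.
    overpredicted = [p for p in predicted
                     if not any(overlaps(p, e) for e in experimental)]
    # Experimental exons with no overlapping prediction are missed entirely.
    missed = [e for e in experimental
              if not any(overlaps(e, p) for p in predicted)]
    # Exact predictions match both boundaries of an experimental exon.
    predicted_set = set(predicted)
    exact = [e for e in experimental if e in predicted_set]
    return {
        "overprediction_rate": len(overpredicted) / len(predicted),
        "missed_rate": len(missed) / len(experimental),
        "exact_rate": len(exact) / len(experimental),
    }


# Toy coordinates, invented purely for illustration.
toy_predicted = [(100, 200), (300, 400), (900, 950)]
toy_experimental = [(100, 200), (350, 420), (600, 700)]
toy_result = calibrate(toy_predicted, toy_experimental)
```

In this toy case one of three predictions is an overprediction, one of three experimental exons is missed, and one is predicted exactly, so each rate is one third.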
Fortunately, a vast resource of experimental data on human genes in the form of complementary DNA and protein sequences and expressed sequence tags (ESTs) is available and can be used to identify genes within genomic DNA. Furthermore, about 60% of human genes have distinctive CpG island sequences at their 5' ends9, which can also be used to identify potential genes. Thus, the approach we have taken to annotating genes in the chromosome 22 sequence relies on a combination of similarity searches against all available DNA and protein databases, as well as a series of ab initio predictions. Upon completion of the sequence of each clone in the tile path, the sequence was subjected to extensive computational analysis using a suite of similarity searches and prediction tools. Briefly, the sequences were analysed for repetitive sequence content, and the repeats were masked using RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html). Masked sequence was compared to public domain DNA and protein databases by similarity searches using the BLAST family of programs10. Unmasked sequence was analysed for
Gene features were identified by a combination of human inspection and software procedures. Figure 1 shows the 679 gene sequences annotated across 22q. They were grouped according to the evidence that was used to identify them as follows: genes identical to known human gene or protein sequences, referred to as 'known genes' (247); genes homologous, or containing a region of similarity, to gene or protein sequences from human or other species, referred to as 'related genes' (150); sequences homologous to only ESTs, referred to as 'predicted genes' (148); and sequences homologous to a known gene or protein, but with a disrupted open reading frame, referred to as 'pseudogenes' (134). (See Supplementary Information, Table 1, for details of these genes.) The ab initio gene prediction program, Genscan, predicted 817 genes (6,684 exons) in the contiguous sequence, of which 325 do not form part of the annotated genes categorized above. Given the calibration of ab initio prediction methods discussed above, we estimate that of the order of 100 of these will represent parts of 'real' genes for which there is currently no supporting evidence in any sequence database, and that the remainder are likely to be false positives.
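The four evidence categories can be read as a simple decision rule, sketched below. This is an illustration only: the annotation described here combined human inspection with software, and the `Evidence` fields, the `classify` function and the order in which the rules are applied are all assumptions introduced for this sketch. The category counts are those reported in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evidence:
    """Hypothetical summary of the evidence supporting one annotated feature."""
    identical_to_known_human: bool = False    # identical to a known human gene or protein
    similar_to_gene_or_protein: bool = False  # homology to a gene or protein, any species
    est_match: bool = False                   # homology to ESTs
    orf_disrupted: bool = False               # open reading frame is disrupted


def classify(ev: Evidence) -> Optional[str]:
    """Assign one of the four categories; the rule order is an assumption."""
    has_homology = ev.identical_to_known_human or ev.similar_to_gene_or_protein
    if has_homology and ev.orf_disrupted:
        return "pseudogene"
    if ev.identical_to_known_human:
        return "known"
    if ev.similar_to_gene_or_protein:
        return "related"
    if ev.est_match:
        return "predicted"  # supported by ESTs only
    return None  # no supporting evidence; not annotated


# Category counts as reported in the text.
REPORTED_COUNTS = {"known": 247, "related": 150, "predicted": 148, "pseudogene": 134}
TOTAL_ANNOTATED = sum(REPORTED_COUNTS.values())
```

The four reported counts sum to the 679 annotated gene sequences, which serves as a quick consistency check on the figures.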
The total length of the sequence occupied by the annotated genes, including their introns, is 13.0 Mb (39% of the total sequence). Of this, only 204 kb contain pseudogenes. About 3% of the total sequence is occupied by the exons of these annotated genes. This contrasts sharply with the 41.9% of the sequence that represents tandem and interspersed repeat sequences. There is no significant bias towards genes encoded on one strand at the 5% level (
A striking feature of the genes detected is their variety in terms of both identity and structure. There are several gene families that appear to have arisen by tandem duplication. The immunoglobulin locus is a well-known example, but there are also other immunoglobulin-related genes on the chromosome outside the immunoglobulin region. These include the three genes of the immunoglobulin λ-like (IGLL) family plus a fourth possible member of the family (AC007050.7). There are five clustered immunoglobulin variable region pseudogenes in AC006548, and an immunoglobulin variable-related sequence (VpreB3) in AP000348. Much further away from the genes is a variable region pseudogene, 123 kb telomeric of IGLL3 in sequence AL008721 (coordinates 9,420-9,530 kb from the centromeric end of the sequence), and a cluster of two constant region pseudogenes and a variable region pseudogene in sequences AL008723/AL021937 (coordinates 16,060-16,390 kb from the centromeric end).
Human chromosome 22 also contains other duplicated gene families that encode glutathione S-transferases, Ret-finger-like proteins3, phorbolins or APOBECs, apolipoproteins and β-crystallins. In addition, there are families of genes that are interspersed among other genes and distributed over large chromosomal regions. The γ-glutamyl transferase genes represent a family that appears to have been duplicated in tandem along with other gene families, for instance the BCR-like genes, that span the 22q11 region and together form the well-known LCR22 (low-copy repeat 22) repeats (see below).
The size of individual genes encoded on this chromosome varies over a wide range. The analysis is incomplete as not all 5' ends have been defined. However, the smallest complete genes are only of the order of 1 kb in length (for example, HMG1L10 is 1.13 kb), whereas the largest single gene (LARGE2) stretches over 583 kb. The mean genomic size of the genes is 19.2 kb (median 3.7 kb). Some complete gene structures appear to contain only single exons, whereas the largest number of exons in a gene (PIK4CA) is 54. The mean exon number is 5.4 (median 3). The mean exon size is 266 bp (median 135 bp). The smallest complete exon we have identified is 8 bp in the PITPNB gene. The largest single exon is 7.6 kb in PKDREJ, which is an intron-less gene with a 6.7-kb open reading frame. In addition, two genes occur within the introns of other expressed genes. The 61-kb TIMP3 gene, which is involved in Sorsby fundus macular degeneration, lies within a 268-kb intron of the large SYN3 gene, and the 8.5-kb HCF2 gene lies within a 27.5-kb intron of the PIK4CA gene. In each case, the inner gene is transcribed in the opposite orientation to the outer gene. We also observe pseudogenes frequently lying within the introns of other functional genes.
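The 'gene within a gene' arrangement described above amounts to two checks: the inner gene lies entirely inside one intron of the host gene, and the two are on opposite strands. The sketch below illustrates this. The `Feature` class, the function name, the coordinates and the strand assignments are hypothetical; only the 61-kb and 268-kb lengths are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    """A genomic feature with inclusive coordinates in bp (hypothetical model)."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def is_nested_antisense(inner: Feature, host_intron: Feature) -> bool:
    """True if `inner` lies wholly within `host_intron` on the opposite strand."""
    contained = host_intron.start <= inner.start and inner.end <= host_intron.end
    return contained and inner.strand != host_intron.strand


# Illustrative coordinates only: the lengths follow the text (268-kb intron,
# 61-kb gene), but the positions and strands are invented for this example.
syn3_intron = Feature("SYN3 intron", 100_001, 368_000, "-")
timp3 = Feature("TIMP3", 150_001, 211_000, "+")
nested = is_nested_antisense(timp3, syn3_intron)
```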
Peptide sequences for the 482 annotated full-length and partial genes with an open reading frame of at least 50 amino acids were analysed against the protein family (PFAM)11, Prosite12 and SWISS-PROT13 databases. These data were processed and displayed in an implementation of ACEDB. Overall, 240 (50%) predicted proteins had matching domains in the PFAM database, encompassing a total of 164 different PFAM domains. Of the residues making up these 482 proteins, 25% were part of a PFAM domain. This compares with PFAM's residue coverage of SWISS-PROT/TrEMBL, which is more than 45%, and indicates that the human genome is enriched in new protein sequences. Sixty-two PFAM domains were found to match more than one protein, including ten predicted proteins containing the eukaryotic protein kinase domain (PF00069), nine matching the Src homology domain 3 (PF00018) and eight matching the RhoGAP domain (PF00620). Fourteen predicted proteins contain zinc-finger domains. (See Supplementary Information, Table 2, for details of the PFAM domains identified in the predicted proteins.)
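Residue coverage of the kind quoted above is the fraction of all amino-acid residues that fall inside at least one domain match. A minimal sketch of that calculation follows, assuming domain hits are given as 1-based inclusive residue ranges and that overlapping hits should be counted once. The function names and the toy proteins are assumptions for illustration, not the pipeline or data used in this analysis.

```python
from typing import List, Tuple

DomainHit = Tuple[int, int]  # (first residue, last residue), 1-based inclusive


def covered_residues(hits: List[DomainHit]) -> int:
    """Count residues covered by at least one hit, counting overlaps once."""
    total = 0
    covered_up_to = 0  # last residue already counted
    for start, end in sorted(hits):
        start = max(start, covered_up_to + 1)  # skip residues counted earlier
        if end >= start:
            total += end - start + 1
            covered_up_to = end
    return total


def residue_coverage(proteins: List[Tuple[int, List[DomainHit]]]) -> float:
    """Fraction of residues across all proteins that lie inside a domain hit."""
    total_length = sum(length for length, _ in proteins)
    if total_length == 0:
        raise ValueError("no residues to analyse")
    return sum(covered_residues(hits) for _, hits in proteins) / total_length


# Toy input: (protein length, domain hits). Values are invented for illustration.
toy_proteins = [
    (200, [(10, 50), (40, 80)]),  # two overlapping hits covering residues 10-80
    (100, []),                    # a protein with no domain match
]
toy_coverage = residue_coverage(toy_proteins)
```

Here 71 of 300 residues are covered, so the toy coverage is about 24%.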
Nineteen per cent of the coding sequences identified were designated as pseudogenes because they had significant similarity to known genes or proteins but had disrupted protein coding reading frames. Because 82% of the pseudogenes contained single blocks of homology and lacked the characteristic intron-exon structure of the putative parent gene, they probably are processed pseudogenes. Of the remaining spliced pseudogenes, most represent segments of duplicated gene families such as the immunoglobulin variable genes, the β-crystallins, CYP2D7 and CYP2D8, and the GGT and BCR genes. The pseudogenes are distributed over the entire sequence, interspersed with and sometimes occurring within the introns of annotated expressed genes. However, there is also a dense cluster of 26 pseudogenes in the 1.5-Mb region immediately adjacent to the centromere; the significance of this cluster is currently unclear.
Given that the sequence of 33.4 Mb of chromosome 22q represents 1.1% of the genome and encodes 679 genes, then, if the distribution of genes on the other chromosomes is similar, the entire human genome would contain at least 61,000 genes. Previous work has suggested that chromosome 22 is gene rich1 by a factor of 1.38 (http://www.ncbi.nlm.nih.gov/genemap/page.cgi?F=GeneDistrib.html), which would reduce this estimate to 45,000 genes. It is important, however, to recognize that the analysis described here only provides a minimum estimate for the gene content of chromosome 22q, and that further studies will probably reveal additional coding sequences that could not be identified with the current approaches.
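The extrapolation above is a simple proportional scaling, which the following sketch reproduces from the figures given in the text. The function and constant names are introduced here for illustration. The unrounded results differ slightly from the rounded values quoted in the text.

```python
GENES_ON_22Q = 679         # annotated genes, from the text
GENOME_FRACTION = 0.011    # 22q sequence as a fraction of the genome (1.1%)
GENE_RICHNESS = 1.38       # reported gene enrichment of chromosome 22


def extrapolate_gene_count(genes: int, fraction: float, enrichment: float = 1.0) -> float:
    """Scale a regional gene count to the whole genome.

    Dividing by `fraction` assumes uniform gene density across the genome;
    dividing by `enrichment` corrects for a region that is richer in genes
    than the genome average.
    """
    return genes / fraction / enrichment


uniform_estimate = extrapolate_gene_count(GENES_ON_22Q, GENOME_FRACTION)
adjusted_estimate = extrapolate_gene_count(GENES_ON_22Q, GENOME_FRACTION, GENE_RICHNESS)

print(f"uniform density:  {uniform_estimate:,.0f} genes")
print(f"after correction: {adjusted_estimate:,.0f} genes")
```

The uniform-density figure comes to roughly 61,700 and the corrected figure to roughly 44,700, in line with the 'at least 61,000' and 45,000 quoted in the text.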
Two lines of evidence point to the existence of additional genes that are not detected in this analysis. First, the 553 predicted CpG islands, which typically lie at the true 5' ends of about 60% of human genes9, are in excess of 60% of the number of genes identified (
References