Genome Organisation
Synopsis: C-value paradox, different classes of DNA, repetitive DNA and disease. If protein-coding portions of the human genome make up only 1.5% what is the rest doing? Transcription of rRNA and tRNA by RNA pol I and III. Other non-coding RNAs, snoRNAs, miRNAs, natural antisense RNAs and their processing and function.
Definitions:
Genome: the total amount of genetic material, stored as DNA. The nuclear genome refers to the DNA in the chromosomes contained in the nucleus; in the case of humans, the DNA in the 46 chromosomes. It is the nuclear genome that defines a multicellular organism; it will be the same for almost all cells of the organism. You can have organelle genomes as well, such as the mitochondrial genome. When you want to identify or distinguish one organism from another, such as in forensic testing, you investigate the genome.
Transcriptome: The total amount of genetic information which has been transcribed by the cell. This information will be stored as RNA. The transcriptome is unique to a cell type and is a measure of the gene expression. Different cells within an organism will have different transcriptomes. Cell types can be identified by their transcriptome.
Proteome: The cell’s complete protein output. This reflects all the mRNA sequences translated by the cell. Cell types have different proteomes and these can be used to identify a particular cell.
Genome Investigations: what can we learn?
There are a number of investigations that can be carried out on the genomes of various organisms; this is known as comparative genomics and is an emerging field of study in its own right. Having access to the sequence of a number of genomes has opened this area up enormously. Even before complete sequences were available, a number of parameters could be measured in various genomes and some information about genome organisation in multicellular organisms could be gleaned. Some of the measurements which have been made are:
The base composition of a vast array of organisms; both prokaryotic and eukaryotic
The repetitive and unique sequence content
The number of genes and their distribution throughout the genome
The genome organisation: base composition
The base composition of DNA in an organism is a fixed value and it is expressed as the % of (G + C) of the total genome. The variation of this value between different prokaryotes is large; this is surprising given that many of the individual proteins produced by the species have similar amino acid sequences. Prokaryotic genomic DNA can have as little as 25 % (G + C) in Mycoplasma genitalium to as high as ~72 % in Micrococcus lysodeikticus. Eukaryotic genomic DNA does not display the same variation between species. The % (G + C) composition of most plant and animal species falls within a narrow range, averaging at 39% with a variation of only ± 6%.
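As a minimal sketch of how the % (G + C) figure is obtained from a sequence, the function below counts bases in a string. The function name and the short test sequence are made up for illustration; real genomes would be read from a sequence file.

```python
def gc_percent(seq: str) -> float:
    """Return the % (G + C) of a DNA sequence, ignoring any non-ACGT symbols."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return 100.0 * gc / total


# Toy sequence (hypothetical): 5 of its 10 bases are G or C.
print(gc_percent("ATGCGCGTTA"))  # prints 50.0
```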
Base composition within the genome.
In prokaryotes the bases are distributed evenly throughout the genome with a slightly lower (G + C) content in promoter and intergenic regions; these often have A + T rich segments which melt more readily than G+C rich regions. The relatively constant base distribution within a given bacterial genome suggests that although there may be unequal nucleotide pool sizes inside the cell, the system will have evolved over many generations to be like this, and the rate of DNA replication is constant.
The distribution of the (G + C) content throughout each genome in eukaryotes, however, varies significantly, unlike in prokaryotes. Whereas the mean variation in % (G + C) content throughout the E. coli genome is only 8.6 %, in eukaryotes this variation is over 30 %. Certain regions of eukaryotic genomic DNA are found to be (A + T) rich, with a % (G + C) content as low as 18 %, while other regions have a (G + C) content as high as 70 %.
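One way to look at variation within a genome is to measure % (G + C) in consecutive windows and compare them. The sketch below is only an illustration: the window size, the toy sequence and the choice of "highest minus lowest window" as the measure of spread are all assumptions, and need not be how the figures quoted above were calculated.

```python
def windowed_gc(seq, window, step=None):
    """% (G + C) in each window of length `window`, moving `step` bases at a time."""
    step = step or window  # default: non-overlapping windows
    seq = seq.upper()
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        values.append(100.0 * gc / window)
    return values


def gc_spread(seq, window):
    """Difference between the most GC-rich and the most AT-rich window."""
    values = windowed_gc(seq, window)
    return max(values) - min(values)


# Toy sequence (hypothetical): an AT-only block, a GC-only block, a mixed block.
toy = "AT" * 10 + "GC" * 10 + "ATGC" * 5
print(windowed_gc(toy, 20))
print(gc_spread(toy, 20))
```

An evenly mixed genome gives windows that are all close to the genome-wide average; a patchy one gives a wide spread.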
The genome organisation: repetitive and unique sequences
Recapping: the C-value paradox
The C-value is the total number of DNA bases in the genome (per haploid set of chromosomes). When you compare this to the complexity of the organism you find a massive disparity. Some organisms seem to have far too much DNA for their complexity e.g. the carp has 52 chromosomes while the alligator that eats it has 16! Some flowers have far more genetic material than humans!! Clearly the amount of DNA is not proportional to that required to produce all the proteins made by the organism or to their position on the food chain.
Table 1 Examples of organisms, their genome size, number of genes, and % of genome single copy.
Group | Organism | Size of genome (kbp) | Estimated # genes | % single copy |
Viruses | Simian Virus 40 (SV40) | 5.1 | | 100 |
Viruses | Bacteriophage φX174 | 5.4 | | 100 |
Viruses | Bacteriophage λ | 48.5 | | 100 |
Bacteria | M. genitalium | 580 | 470 | >90 |
Bacteria | E. coli | 4,639 | 4,405 | 92 |
Yeast | S. cerevisiae | 12,100 | 6,200 | 90 |
Round worm | C. elegans | 97,000 | 19,000 | |
Fruit fly | D. melanogaster | 180,000 | 13,600 | 60 |
Mammals | M. musculus | 2,500,000 | 30,000 | 70 |
Mammals | H. sapiens | 3,240,000 | 30,000 | 64 |
Mustard plant | A. thaliana | 125,000 | 25,500 | 80 |
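To put a number on the disparity in Table 1, you can divide genome size by the estimated gene count. The sketch below does only that; the figures are copied from Table 1 and the function name is made up for illustration.

```python
# (genome size in kbp, estimated number of genes), copied from Table 1
table1 = {
    "M. genitalium": (580, 470),
    "E. coli": (4_639, 4_405),
    "S. cerevisiae": (12_100, 6_200),
    "C. elegans": (97_000, 19_000),
    "D. melanogaster": (180_000, 13_600),
    "A. thaliana": (125_000, 25_500),
    "M. musculus": (2_500_000, 30_000),
    "H. sapiens": (3_240_000, 30_000),
}


def kbp_per_gene(size_kbp, genes):
    """Average amount of genomic DNA per gene, in kbp."""
    return size_kbp / genes


for organism, (size_kbp, genes) in table1.items():
    print(f"{organism:16s} {kbp_per_gene(size_kbp, genes):8.1f} kbp per gene")
```

E. coli comes out at roughly 1 kbp of DNA per gene and humans at roughly 100 kbp per gene, which is the gap the rest of this section tries to account for.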
Only about 1.5% of the DNA in the genome actually codes for proteins. What is the rest of it doing?
A large amount of non-coding DNA is a feature of the genomes of complex eukaryotes. As I mentioned in the last lecture, the selection pressure for prokaryotes is to be able to proliferate rapidly when conditions are suitable, i.e. when there is a good supply of nutrients. They need to be able to swing into action quickly under these circumstances and “make hay while the sun shines”, so to speak. Complex multicellular organisms do not have the same selection pressure. In fact, if a particular cell in a multicellular organism takes it upon itself to divide rapidly and uncontrollably, we call it cancer. Because the drive to genome efficiency is not so prevalent in eukaryotes (a bacterium that wants to divide rapidly has to copy its genome rapidly, so it doesn’t want any redundant sequences in the DNA), they can afford to have larger amounts of non-coding DNA. While this explains why the extra DNA can persist without causing too much trouble to the organism, it doesn’t account for why it is there.
Let’s take a historical perspective for a moment and see when the extra DNA was first discovered. Let’s go back to the melting and re-annealing of DNA covered last year. An interesting technique was pioneered 20-odd years ago which gained the name of Cot plots; it gave the first hint as to the abundance and diversity of sequences in the genome.
If, however, your DNA sample contains some sequences that are represented more than once in the genome then you get a much more interesting (and more difficult to analyse) plot. Those sequences that have the most repeats will hybridise quickest. Those with multiple copies will also hybridise quickly, while the more complex single copy sequences take the longest to re-anneal. You plot the curve with A260 on the y axis and, basically, time (actually the initial concentration × time, or Cot) on a log scale on the x axis. The plot is then analysed by computer (assuming second order kinetics for re-hybridisation). It is second order kinetics because it is a bimolecular reaction and the rate of association is proportional to the concentration of both strands.
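For a single class of sequence, ideal second order kinetics gives the fraction of DNA still single stranded as C/C0 = 1 / (1 + k × Cot), where k is the rate constant. Half of the DNA has re-annealed when Cot = 1/k, and this value is the Cot½. The sketch below just evaluates that expression; the rate constants used are hypothetical and the function names are made up for illustration.

```python
def fraction_single_stranded(cot, k):
    """Fraction of DNA still single stranded at a given Cot (ideal second order kinetics)."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which half of the DNA has re-annealed."""
    return 1.0 / k


# Hypothetical rate constants: a fast (repetitive) and a slow (single copy) class.
for k in (10.0, 0.001):
    c_half = cot_half(k)
    print(k, c_half, fraction_single_stranded(c_half, k))
```

A large k (fast re-annealing) means a small Cot½, which is why repetitive sequences sit at the left-hand end of the log scale.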
The rate of rehybridisation is dependent on the complexity of the DNA. Complexity is defined as the number of bases in a unique sequence. For example Poly U has a complexity of 1; the sequence AGTTCAGTTCAGTTCAGTTC has a complexity of 5. The Cot½ for a given sequence of DNA is dependent on its complexity.
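The definition of complexity can be sketched in a few lines: find the shortest unit which, repeated in tandem, rebuilds the whole sequence. This is a simplification that only handles perfect tandem repeats of a whole number of units; the function name is made up for illustration.

```python
def complexity(seq):
    """Length of the shortest unit that, repeated in tandem, rebuilds seq exactly."""
    n = len(seq)
    for unit in range(1, n + 1):
        if n % unit == 0 and seq[:unit] * (n // unit) == seq:
            return unit
    return n  # only reached for an empty sequence


print(complexity("U" * 12))                # poly U: prints 1
print(complexity("AGTTCAGTTCAGTTCAGTTC"))  # prints 5
```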
The plot of a complex mixture of DNA sequences has a biphasic or triphasic look about it.
Essentially the human genomic DNA Cot curve has 40% fast-annealing, low-Cot sequence elements and 60% slowly renaturing, high-Cot unique sequences, i.e. one copy per genome. The human genome can be divided up into 4 classes: highly repetitive (hundreds to millions of copies), moderately repetitive (10s to hundreds of copies), slightly repetitive (2 – 10 copies) and single copy sequences.
The repetitive DNA, by sequence analysis, was found to contain short repeat sections found in satellite DNA (satellite DNA was so named from its behaviour on CsCl density ultracentrifugation), and sequences of normal length for genes that have large numbers of copies.
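The biphasic shape can be sketched by adding together one ideal second order curve per class, each weighted by its share of the genome. In the sketch below the 40%/60% split comes from the text, but the two Cot½ values are made-up placeholders chosen only to separate the two phases, and the function name is hypothetical.

```python
def mixture_single_stranded(cot, components):
    """Fraction of DNA still single stranded for a mixture of sequence classes.

    components: list of (fraction_of_genome, cot_half) pairs.
    """
    return sum(frac / (1.0 + cot / c_half) for frac, c_half in components)


# 40 % repetitive, 60 % single copy (from the text); Cot½ values are placeholders.
human_like = [(0.40, 1e-1), (0.60, 1e3)]

for exponent in range(-3, 6):
    cot = 10.0 ** exponent
    print(f"Cot = 1e{exponent:+d}   single stranded = "
          f"{mixture_single_stranded(cot, human_like):.3f}")
```

Between the two Cot½ values the curve flattens out at about 0.6: the repetitive 40% has re-annealed and the single copy 60% has hardly started.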
Highly repetitive sequences: One group of highly repetitive DNA is the simple sequence DNA which contains thousands of copies of a simple sequence repeated in tandem. The repeat sequence can be as short as 5 bases. The repeat sequences are known as:
Short tandem repeats (STRs)
Microsatellites (1 – 13 bases)
Minisatellites (14 – 500 bases)
These sequences are often found clustered around the centromeres or telomeres. The term satellite came about from its behaviour in a CsCl gradient. Because much of the simple sequence repeat DNA is AT rich, it has a lower density than the more GC rich genomic DNA and therefore bands separately on the gradient, as a satellite. These repeats account for 10 – 20 % of higher eukaryotic DNA.
The moderately repetitive DNA includes those sequences which have multiple copies in the genome, designed to increase the rate or amount of gene product, and some regulatory sequences found scattered throughout the genome. Sequences such as the ribosomal RNA and transfer RNA, which are required in large amounts for protein synthesis have many copies on the genome. The histone sequences also have large copy numbers in the genome.
Another group of moderately repetitive DNA sequences are those scattered throughout the genome, known as SINEs (short interspersed elements) and LINEs (long interspersed elements). Some famous SINEs and LINEs: Alu repeats are the major SINE in primate genomes. They are ~300 bp long and about a million such sequences exist scattered throughout the human genome. They account for ~10% of the genome. They are transcribed into RNA but have no known function. They are known as Alu repeats because they contain the recognition sequence for the restriction enzyme AluI. The most common LINE in the human genome is L1, a 6 000 bp sequence which is repeated some 50 000 times in the human genome. L1 sequences are also transcribed and some even encode proteins! Their function in the cell is unknown. Both the Alu and L1 sequences are transposable elements, capable of moving to different sites in the genome.
The group of sequences with a small number of copies in the genome includes such sequences as the globins. This family of genes contains a number of closely related sequences which vary by only a few bases and will cross-hybridise. These are also known as gene clusters.
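The figures quoted for Alu and L1 can be checked with simple arithmetic: repeat length × copy number ÷ genome size. The sketch below uses the H. sapiens genome size from Table 1. It assumes every copy is full length, so treat the results as upper estimates; the function name is made up for illustration.

```python
GENOME_BP = 3_240_000 * 1_000  # H. sapiens genome size from Table 1, kbp -> bp


def percent_of_genome(unit_bp, copies, genome_bp=GENOME_BP):
    """Share of the genome taken up by `copies` repeats of length `unit_bp`.

    Assumes every copy is full length, so this is an upper estimate.
    """
    return 100.0 * unit_bp * copies / genome_bp


alu = percent_of_genome(300, 1_000_000)  # ~300 bp, about a million copies
l1 = percent_of_genome(6_000, 50_000)    # 6 000 bp, some 50 000 copies

print(f"Alu: {alu:.1f} % of the genome")
print(f"L1:  {l1:.1f} % of the genome")
```

The Alu result (a little over 9%) agrees with the ~10% quoted above.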
The final group, the single copy sequences make up the vast majority of genes on the genome (gene being a functional unit which codes for a single polypeptide chain). This group is the most complex and takes the longest to re-anneal, hence the log scale on the time.
The highly repetitive DNA re-anneals in seconds while the most complex single copy group takes hours or days to re-anneal.
So our genome contains highly repetitive DNA which doesn’t code for proteins. It also contains some multi-gene families and multiple copies of some genes. This makes up 40% of the genome by Cot plot analysis. What of the other 60% unique sequences?? Remember only 1 – 2% of the genome is coding sequence. How do we account for this discrepancy??
Further investigations of eukaryotic genes (the 60%) found they were interrupted by large sections of non-coding sequence called introns. This does not happen in bacteria. These stretches of DNA, which can make up over 90% of the gene by base count, are transcribed but then cut out when the mRNA is processed. The coding sections are called exons.
Pseudogenes
The eukaryotic genome also contains pseudogenes that occupy a significant proportion of the genome. There are actually two classes of pseudogenes. Class I pseudogenes have arisen by gene duplication and have subsequently been inactivated by various mutations (insertions, substitutions or deletions). These pseudogenes are often found near their functional gene counterpart. The second class, class II pseudogenes, are processed sequences (lacking introns and often containing a vestigial poly(A) tail) and have originated during evolution from mRNA that was copied by reverse transcriptase back into DNA. The sequence was then inserted into the genome by a retrotransposition event. The footprints of this event are evident in the direct repeats that flank the pseudogene; the repeats have facilitated its insertion back into the genome. These pseudogenes are usually found a long way from the functional parent gene. Pseudogenes do not code for functional proteins and they are not translated. The exact number of pseudogenes in the human genome is unknown although estimates, using various search criteria, have identified some 2,900 regions which probably represent processed pseudogenes. The pattern that has emerged from analysis of the human genome is that those sequences which tend to give rise to more pseudogenes have shorter than average transcripts and are sequences that are involved in nuclear regulation and translation (ribosomal proteins account for 67%, lamin receptors 10%, translation elongation factors 5%). The common theme amongst these sequences may actually be their increased level of transcription.
Regulatory regions
Other forms of non-coding DNA also play an important role in gene transcription and contribute to the increased non-coding DNA found in eukaryotes. The “promoter” regions of eukaryotic genes are substantially larger than their prokaryotic counterparts. Transcriptional regulation in eukaryotes is a very complex process, often involving enhancers and upstream binding sites for regulatory elements. These regions can cover thousands of bp. Intergenic DNA is estimated to occupy between 63 and 75% of the total base-pairs in the human genome. The longest stretch of non-coding DNA, termed a gene desert, is on chromosome 13. It is 3 038 416 bp long.
What have we learnt from sequencing the human genome?
This was the big event of the last decade of last century. Whole genome analysis has confirmed the earlier observations concerning gene density in eukaryotes. Below is the table with some general statistics gleaned from the human genome project.
Table 2 General characteristics of the Human Genome.
Human Genome : General Statistics | |
Approximate size of the genome | 2.9 Gbp |
% (A + T) | 54 |
% (G+C) | 38 |
% undetermined bases in genome | 9 |
Most GC-rich 50 kb region | Chr 2 (66%) |
Most AT-rich 50 kb region | Chr X (25%) |
Number of genes | 26,383 – 39,114 |
Most gene-rich chromosome | Chr. 19 (23 genes/Mb) |
Least gene-rich chromosomes | Chr. 13 (5 genes/Mb) and Chr. Y (5 genes/Mb) |
Average gene length | 27 kbp |
Gene with the most exons | Titin (234 exons) |
% of genome containing repeat sequences | 35 |
% exon base pairs | 1.1 – 1.4 |
% intron base pairs | 24 - 36 |
% intergenic DNA (bp) | 64 - 75 |
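The figures in Table 2 can be checked against each other with a little arithmetic: number of genes × average gene length ÷ genome size should roughly equal the exon and intron percentages added together. The sketch below uses only values copied from Table 2; the names are made up for illustration.

```python
GENOME_BP = 2.9e9                   # approximate size of the genome
AVERAGE_GENE_BP = 27_000            # average gene length, 27 kbp
GENE_COUNT_RANGE = (26_383, 39_114)


def genic_percent(n_genes):
    """% of the genome covered by n_genes genes of average length."""
    return 100.0 * n_genes * AVERAGE_GENE_BP / GENOME_BP


low, high = (genic_percent(n) for n in GENE_COUNT_RANGE)
print(f"genes cover about {low:.1f} to {high:.1f} % of the genome")

# Exon + intron percentages from the same table, for comparison
print(f"exon + intron rows give {1.1 + 24:.1f} to {1.4 + 36:.1f} %")
```

The two ranges agree to within about a percentage point, and what is left over is the intergenic DNA.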
Repetitive DNA and disease.
Trinucleotide repeats (TNRs: multiple tandem copies of a 3 nucleotide sequence) are a specialised type of repeat sequence found in the genome. They expand through mutations during replication, recombination or repair in both somatic and germline cells. This process, known as dynamic mutation, gives rise to unstable repeats. There are an increasing number of genetic disorders which result from expansion of trinucleotide repeats; many of them are neurological disorders. A static mutation (and I will discuss this more in a later lecture) is one that occurs in the germline (the sperm or egg cells, which are produced by meiosis), is passed on to the next generation and is stably retained in the genome of the somatic (mitotic) cells. This mutation is present in the genome of all somatic cells to the same level. Unlike static mutations, dynamic mutations change; they continue to mutate between different tissues and across generations. The longer the tract length (i.e. number of repeats), the more likely the repeat is to continue to mutate. This leads to increased severity with successive generations or, in some diseases, a decrease in the age of onset. In other words, with each generation the disorder becomes worse and/or you start to get the symptoms earlier. This is known as genetic anticipation. What causes this continuing mutation is unknown, but some unusual single stranded DNA structures are thought to form during repair, recombination and replication. The longer the tract, the more likely these aberrant structures are to form, hence perpetuating the mutation process.
The best known are fragile X syndrome (FRAXA), one of the most common inherited forms of intellectual disability, and Huntington’s disease. The expansion of these repeat sequences is sometimes found in the coding region of the gene, leading to altered protein function (or gain of function, as it is described in the literature); sometimes in a non-coding region, where you see loss of function; and, as found very recently, some repeats act at the RNA level, producing a pathogenic RNA species which results in aberrant RNA-protein interactions. This also leads to neuronal dysfunction.
I would like to briefly consider 2 diseases. The fragile X syndrome (FRAXA) results from multiple copies of the sequence CGG (the expansion) in the 5’ UTR of the fragile X syndrome gene, FMR1, which causes transcriptional silencing of the gene so that its protein product, FMRP, is not made. The number of repeats is very important to the final severity of the disease: 5 – 50 copies have no effect, 50 – 200 result in an intermediate and distinct syndrome, fragile X-associated tremor/ataxia syndrome (FXTAS), while >200 copies give rise to the full-blown mutation. Mutations in certain regions of the FMR1 coding region produce a defective protein, which gives rise to the same phenotype as the gene silencing.
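The CGG repeat ranges can be written as a simple lookup. The sketch below is for illustration only: the ranges quoted overlap at exactly 50 copies, so it assumes 50 counts as "no effect", and it reports counts below 5 as outside the quoted ranges. The function name and labels are made up.

```python
def classify_cgg_repeats(n_repeats: int) -> str:
    """Map a CGG repeat count in the FMR1 5' UTR to the outcome quoted in the text.

    Assumption: exactly 50 repeats is treated as 'no effect'.
    """
    if n_repeats < 5:
        return "outside the ranges quoted"
    if n_repeats <= 50:
        return "no effect"
    if n_repeats <= 200:
        return "intermediate (FXTAS)"
    return "full mutation"


for n in (30, 120, 250):
    print(n, classify_cgg_repeats(n))
```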
Obviously this protein is important to the cell, particularly neurons. The protein and its mRNA localise in dendritic spines. The expression of the protein is up-regulated in response to stimulation from glutamate receptors and it is involved in translational repression at synapses. The absence of the protein results in the neuronal dysfunction. Within the cell it is located largely in the cytoplasm but does move to the nucleus also. The protein, FMRP has 3 RNA binding domains, it associates with polyribosomes and seems to be involved in translational repression of a group of mRNA targets i.e. this protein binds to other mRNA sequences and regulates their translation, probably by mediating ribosome association and recruiting interfering RNA processes. The target mRNA species it binds to are sequences involved in cytoskeleton, neuronal development and synaptic transmission.
FMRP is also expressed in the liver, lung, kidney, spinal cord and gastrointestinal tract. These are not areas of significant problems for sufferers of fragile X syndrome. Two other proteins, FXR1 and FXR2, have similar functional and structural features and may be able to compensate for a lack of FMRP in those tissues. Evidently, FXR1 and FXR2 are not able to compensate for an absence of FMRP in the brain and testes.
Huntington’s disease.
“George Huntington (1850-1916) described the condition while working as a newly qualified doctor in the rural general practice of his father and grandfather on Long Island, New York State. Together their observations covered 78 years. George Huntington did not continue working on Hereditary chorea but went into general practice in Ohio.” He never published another paper in his life, yet his name is remembered from the single one he did write.
A second example is the expansion of the repeats in the coding region of a gene. This is obviously going to alter the function of the protein product. To date 9 separate disorders are associated with the expansion of a CAG repeat in the coding region of various genes. CAG codes for glutamine and the expansion results in a run of extra glutamines in the affected protein (polyglutamine disorders). This gain of function outcome accounts for the neurodegenerative symptoms of Huntington’s disease. The affected protein is expressed widely in the CNS, particularly in certain neuron populations. The result of the polyglutamine in the protein is a misfolded protein which, in the case of Huntington’s disease, aggregates and is sequestered into inclusion bodies complete with the chaperones. This is thought to eventually overload the chaperone and ubiquitin systems. Other evidence suggests that the inclusion bodies are a protective response and that the mutant protein actually initiates a cascade of aberrant protein-protein interactions which affect many processes, resulting in neuronal dysfunction and death (as always).
Table 3
No. of CAG repeats | Outcome |
≤ 28 | Normal range; individual will not develop HD |
29-34 | Individual will not develop HD but the next generation is at risk |
35-39 | Some, but not all, individuals in this range will develop HD; next generation is also at risk |
≥ 40 | Individual will develop HD |
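As a sketch of how Table 3 would be applied to a sequence, the code below finds the longest uninterrupted run of CAG in a DNA string and looks up the outcome. It assumes that 28 repeats counts as normal and that 40 or more counts as affected. The function names and the toy sequence are made up, and the search ignores reading frame, so this is an illustration and not a diagnostic method.

```python
import re


def longest_cag_run(seq: str) -> int:
    """Number of repeats in the longest uninterrupted CAG tract (reading frame ignored)."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def classify_hd(n_repeats: int) -> str:
    """Outcome from Table 3, assuming 28 is normal and 40 or more is affected."""
    if n_repeats <= 28:
        return "normal"
    if n_repeats <= 34:
        return "will not develop HD; next generation at risk"
    if n_repeats <= 39:
        return "may develop HD; next generation at risk"
    return "will develop HD"


# Toy sequence (hypothetical): a start codon, 42 CAG repeats, then CCG.
toy = "ATG" + "CAG" * 42 + "CCG"
n = longest_cag_run(toy)
print(n, classify_hd(n))
```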
The third problem associated with these trinucleotide expansions concerns the production of a pathogenic RNA species. I will cover this later.