Relationship between expression and function

Determining a gene’s pattern of expression is a key step towards understanding its function during development. One way to use genome-wide gene expression data is to relate each gene’s expression specificity to its function. The functions of many genes have been determined either by direct experimental analysis or on the basis of sequence homology. The Gene Ontology (GO) consortium (Ashburner, Ball et al. 2000) has assembled the available data on gene function and localization. Additionally, the Uniprot database provides conserved domain compositions for annotated genes in sequenced genomes (2007).
We extracted the GO designation and the Uniprot domains for each gene in our dataset and, in Figure 10, plotted on a number line the percentage of genes in each category that are found in broad or restricted clusters. We highlight categories that are at least five times more abundant in either broad (1-20%) or restricted (80-100%) clusters. As discussed before, broad clusters are heavily enriched for genes involved in core cellular processes such as translation, protein degradation, cell division and energy metabolism. Interestingly, RNA binding proteins show a strong bias for broad clusters. The majority of transcripts for RNA binding proteins are deposited maternally into the early embryo, highlighting the necessity for mRNA regulation prior to the onset of zygotic transcription. Restricted clusters are enriched in genes with sequence-specific DNA-binding domains and in signaling molecules, and also contain a large number of the genes involved in cuticle formation.
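The number-line placement described above can be sketched as follows. This is a minimal illustration, not the pipeline used for Figure 10: the function names, the input dictionaries, and the use of the 20% and 80% band edges as highlighting cutoffs are all assumptions made for the example.

```python
from collections import defaultdict


def restricted_percentage(gene_categories, gene_cluster_type):
    """Return {category: percent of its genes that sit in restricted clusters}.

    gene_categories:   {gene: [category, ...]}  (hypothetical GO/Uniprot labels)
    gene_cluster_type: {gene: "broad" or "restricted"}
    Genes without a broad/restricted assignment are ignored.
    """
    counts = defaultdict(lambda: {"broad": 0, "restricted": 0})
    for gene, categories in gene_categories.items():
        kind = gene_cluster_type.get(gene)
        if kind not in ("broad", "restricted"):
            continue
        for category in categories:
            counts[category][kind] += 1
    return {
        category: 100.0 * n["restricted"] / (n["broad"] + n["restricted"])
        for category, n in counts.items()
    }


def classify_bias(percent_restricted, low=20.0, high=80.0):
    """Label a category by its position on the number line (assumed cutoffs)."""
    if percent_restricted <= low:
        return "broad-biased"
    if percent_restricted >= high:
        return "restricted-biased"
    return "unbiased"


# Toy input; gene and category names are placeholders, not real annotations.
toy_categories = {
    "geneA": ["translation"],
    "geneB": ["translation", "kinase"],
    "geneC": ["cuticle"],
    "geneD": ["kinase"],
}
toy_types = {
    "geneA": "broad",
    "geneB": "broad",
    "geneC": "restricted",
    "geneD": "restricted",
}
toy_positions = restricted_percentage(toy_categories, toy_types)
toy_labels = {c: classify_bias(p) for c, p in toy_positions.items()}
```

Each category gets a single coordinate between 0 and 100, so broad-biased and restricted-biased categories fall at opposite ends of the line.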
To examine the enrichment of GO and Uniprot categories in individual gene expression clusters, we performed exhaustive pairwise comparisons. We used the binomial z-score to evaluate the statistical significance of overlaps between pairs of gene lists derived from the different data sources. Because we tested significance for all functional categories against all gene expression categories, we effectively tested multiple hypotheses; we therefore applied a correction to the significance estimates. Because many GO categories are related to one another, the standard Bonferroni adjustment (ref) provides a relatively poor correction factor. Instead, we determined the empirical chance distribution by performing a large number of random permutations of gene functional assignments and determining the rate at which we attained particular p-values. We interpolated these results using a log-linear regression function fitted to the empirical distribution (Materials and methods).
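The overlap test and the permutation-based correction can be sketched as below. The text does not give the exact formula, so the z-score here is the usual normal approximation to the binomial, which is an assumption. All function names and parameters are hypothetical, and the log-linear interpolation step is omitted.

```python
import math
import random


def binomial_z(overlap, cluster_size, category_size, total_genes):
    """z-score of an observed overlap under a binomial null (assumed form).

    Each of the cluster's genes is treated as landing in the category with
    probability p = category_size / total_genes.
    """
    p = category_size / total_genes
    expected = cluster_size * p
    sd = math.sqrt(cluster_size * p * (1.0 - p))
    if sd == 0.0:
        return 0.0
    return (overlap - expected) / sd


def all_pair_z(clusters, categories, total_genes):
    """z-scores for every cluster x category pair (sets of gene names)."""
    return [
        binomial_z(len(c & g), len(c), len(g), total_genes)
        for c in clusters.values()
        for g in categories.values()
    ]


def permuted_categories(categories, genes, rng):
    """Shuffle gene labels; category sizes and mutual overlaps are preserved."""
    shuffled = list(genes)
    rng.shuffle(shuffled)
    relabel = dict(zip(genes, shuffled))
    return {name: {relabel[g] for g in members}
            for name, members in categories.items()}


def empirical_false_positive_rate(clusters, categories, genes, z_cutoff,
                                  n_perm=200, seed=0):
    """Fraction of permutations in which any pair reaches z_cutoff by chance."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        perm = permuted_categories(categories, genes, rng)
        if max(all_pair_z(clusters, perm, len(genes))) >= z_cutoff:
            hits += 1
    return hits / n_perm
```

Relabeling genes, rather than redrawing each category independently, keeps related categories correlated in every permutation, which is the property that makes an empirical null preferable to a Bonferroni factor in this setting.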
The results of this analysis are shown in Supplementary Table 3, which lists all GO and Uniprot categories significantly enriched in gene expression clusters. To summarize the functional associations of gene expression clusters, we used a force-directed layout that brings into close proximity clusters and GO categories that share a significant number of genes. In the force-directed layout, restricted and broad clusters separate robustly, with the notable exception of germ cell cluster 22R, which associates strongly with functions typical of broad maternal genes. This connection may arise because restriction of transcripts to the germ line lineage is often a consequence of protection of maternal message from degradation in early forming pole cells. Another cluster that violates the broad versus restricted separation is cluster 8B, which is enriched for genes involved in cuticle metabolism. Since formation of the cuticle effectively prevents RNA in situ hybridization, we propose that the genes in cluster 8B are likely expressed during late embryogenesis in a pattern resembling epidermal expression (similar to clusters 5R and 6R), but that this pattern cannot be visualized by the standard in situ protocol. The late spike in the average array profile of cluster 8B genes supports this notion.
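The idea behind a force-directed layout can be illustrated with the following sketch. The text does not name the algorithm or software used, so this is a generic Fruchterman-Reingold-style scheme chosen for illustration. The node names, edge weights, and parameters are assumptions; an edge weight might stand for the number of shared genes or for the significance of an association.

```python
import math
import random


def force_directed_layout(nodes, edges, iterations=200, seed=0):
    """Return {node: (x, y)} for a weighted graph.

    nodes: list of node names (clusters and categories)
    edges: {(node_a, node_b): weight}, one entry per significant association
    All nodes repel one another; linked nodes are additionally pulled together.
    """
    rng = random.Random(seed)
    pos = {n: [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for n in nodes}
    k = 1.0 / math.sqrt(len(nodes))      # preferred distance between linked nodes
    temperature = 0.1                    # maximum step size, reduced each round
    cooling = temperature / (iterations + 1)

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Repulsion between every pair of nodes.
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                dx = pos[a][0] - pos[b][0]
                dy = pos[a][1] - pos[b][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[a][0] += dx / dist * force
                disp[a][1] += dy / dist * force
                disp[b][0] -= dx / dist * force
                disp[b][1] -= dy / dist * force

        # Attraction along edges, scaled by the edge weight.
        for (a, b), weight in edges.items():
            dx = pos[a][0] - pos[b][0]
            dy = pos[a][1] - pos[b][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = weight * dist * dist / k
            disp[a][0] -= dx / dist * force
            disp[a][1] -= dy / dist * force
            disp[b][0] += dx / dist * force
            disp[b][1] += dy / dist * force

        # Move each node along its net force, capped by the temperature.
        for n in nodes:
            length = max(math.hypot(disp[n][0], disp[n][1]), 1e-9)
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature -= cooling

    return {n: (p[0], p[1]) for n, p in pos.items()}
```

In such a layout, clusters and categories that share many genes settle close together, while unconnected groups drift apart, so a cluster sitting among nodes of the other type stands out visually.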
Interestingly, cluster 7R, which contains genes with early (Stage 12) onset epidermal expression, clearly separates from clusters 5R and 6R, which contain genes with late epidermal expression (Stages 13–16). Genes with early epidermal expression are associated with GO terms typical for genes expressed ubiquitously (cluster 6B), as well as with GO terms for tissue-specific functions such as membrane trafficking, cell polarity, motility and adhesion, which makes them similar to genes found in the early blastoderm patterning gene cluster (cluster 26R). In contrast, the late epidermal clusters (5R and 6R) associate clearly with cuticle formation in terminally differentiated tissues. This is the best example in our dataset of separation between regulatory developmental genes and effector genes (ref Garcia-Bellido 1973) of the terminal cell fates.
Genes in cluster 24R are expressed in yolk, mesoderm, dorsal ectoderm and anterior and posterior endoderm anlagen at blastoderm stage. Consistent with this early expression, these genes are expressed later in differentiated midgut, yolk, fat body and plasmatocytes, but also, unexpectedly, in the central nervous system. The force-directed layout suggests that these genes are functionally related to clusters 1-4R, which contain genes expressed in yolk, fat body and blood and involved in metabolite transport. Cluster 24R clearly separates from other blastoderm stage clusters, especially 23R, which has a similar core structure composition, suggesting that for these particular tissues, specific effector genes are required early in and throughout embryonic development. The late CNS expression is probably due to the shared requirement for transporter activities in mature nerve cells. Numerous connections between late CNS cluster 13R and the yolk, blood, fat group of clusters support this suggestion.
GO terms related to membrane trafficking such as secretory pathway, vesicle transport, Golgi apparatus, and ER assume a central position in the layout with numerous connections to diverse clusters both broad and restricted. This likely reflects the requirement of these core cellular processes in all cells, but also indicates that there are tissue specific differences in the utilization of these pathways. The modulation of these pathways is mediated by GTPase and kinase activities, which exhibit similar connectivity patterns in the force directed layout.
Central nervous system and muscle clusters associate with the expected GO terms for nerve impulse transmission and muscle contraction. Interestingly, both tissues show, despite their clear functional specialization, a common requirement for components of the extracellular matrix.
Another way to uncover relationships between gene expression specificity and gene function is to examine the representation of GO categories in individual tissues using the ‘anatograms’ (Figure 12). For example, transcriptional regulators can be seen to be strongly enriched in the developing and mature nervous system (Figure 12A). Regulation of transcription initiation by sequence-specific transcription factors is the primary mechanism used to generate tissue-specific gene expression. We determined the gene expression pattern for 238 of the 684 Drosophila transcription factors with sequence-specific DNA binding domains; at least one transcription factor is found in every block of tissues recognized by our annotation hierarchy. We used our extensive and unbiased set of expression patterns of transcription factors to determine whether the two most abundant classes of DNA binding domains, the C2H2 zinc finger and the homeobox, associate with specific tissues or gene expression patterns. We found that these domains show very similar overall distributions, suggesting that they are deployed to regulate a similar range of developmental processes.
Cell adhesion molecules are similar to transcription factors in that they are expressed early in development in a number of anlagen, are later abundant in the nervous system, and are moderately enriched in differentiated epidermal derivatives. Cytoskeletal components are enriched in the nervous system and muscles but are rarely expressed in the epidermis, suggesting that the tissue relatedness observed between mesodermal and neural derivatives is dictated by shared functional requirements of these cell types. Interestingly, the tissue distribution of kinases is almost indistinguishable from the genome-wide average of all genes. Occasionally, the association between function and gene expression is very strong and specific, such as detection of stimulus and Bolwig’s organ, chitin metabolism and late epidermal patterns, or helicases and gonads.
Comparison of GO and gene expression data often leads to self-evident observations because many functional GO assignments are based on published gene expression patterns. We therefore used Uniprot domains to relate gene expression regulation to truly independent sequence features. This analysis shows several domains expressed highly specifically in differentiated epidermal derivatives. For example, zona pellucida (ZP) proteins are transmembrane glycoproteins that were recently shown to be critical for trachea morphogenesis. These and numerous other ZP genes that we examined are expressed in the 5R/6R epidermal pattern (Figure 12K). A novel domain, DUF243, that apparently exists only in flies (X of 34 in our dataset), is found almost exclusively in the coding products of genes expressed in the late 5R pattern (Figure 12L). These tight associations between functional sequence properties and patterns of gene expression provide useful insights into gene function.
