Relationship between expression and function
Determining a gene’s pattern of expression is a key step towards
understanding its function during development. One way to utilize genome-wide
gene expression data is to relate each gene’s expression specificity to its
function. The functions of many genes have been determined either by direct
experimental analysis or on the basis of sequence homology. The Gene Ontology
(GO) consortium (Ashburner, Ball et al. 2000) has assembled the available data on gene function and
localization. Additionally, the Uniprot database provides conserved domain
compositions for annotated genes in sequenced genomes (2007).
We
extracted the GO designation and the Uniprot domains for each gene in our
dataset and in Figure 10 plotted on a number line the percentage of genes in
each category that are found in broad or restricted clusters. We highlight
categories that are at least five times more abundant in either broad (1-20%)
or restricted clusters (80-100%). As discussed before, broad clusters are
heavily enriched for genes involved in core cellular processes such as
translation, protein degradation, cell division and energy metabolism.
Interestingly, RNA binding proteins show a strong bias for broad clusters. The
majority of transcripts for RNA binding proteins are deposited maternally into
the early embryo, highlighting the necessity for mRNA regulation prior to the
onset of zygotic transcription. Restricted clusters are enriched in genes with
sequence-specific DNA-binding domains and signaling molecules and also contain
a large number of the genes involved in cuticle formation.
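The binning behind this comparison can be sketched in a few lines of Python. This is only an illustration, not the code used for the analysis: the gene names, the two input dictionaries and the function names are hypothetical, and the sketch assumes that the number line reports, for each category, the percentage of its clustered genes that fall in restricted clusters, with the 20% and 80% cutoffs taken from the text.

```python
from collections import defaultdict

def restricted_fraction(gene_categories, gene_cluster_type):
    """Return {category: fraction of its clustered genes found in restricted clusters}."""
    counts = defaultdict(lambda: [0, 0])  # category -> [restricted, total]
    for gene, categories in gene_categories.items():
        cluster_type = gene_cluster_type.get(gene)
        if cluster_type not in ("broad", "restricted"):
            continue  # gene is not assigned to any expression cluster
        for category in categories:
            counts[category][1] += 1
            if cluster_type == "restricted":
                counts[category][0] += 1
    return {category: r / t for category, (r, t) in counts.items()}

def classify_bias(fraction, low=0.20, high=0.80):
    """Label a category by where it falls on the broad-to-restricted number line."""
    if fraction <= low:
        return "broad-biased"
    if fraction >= high:
        return "restricted-biased"
    return "unbiased"

# Hypothetical toy annotations (GO terms or Uniprot domains per gene).
gene_categories = {
    "g1": ["translation", "kinase"],
    "g2": ["translation"],
    "g3": ["translation"],
    "g4": ["cuticle", "kinase"],
    "g5": ["cuticle"],
    "g6": ["cuticle"],  # no cluster assignment, so it is skipped
}
gene_cluster_type = {
    "g1": "broad", "g2": "broad", "g3": "broad",
    "g4": "restricted", "g5": "restricted",
}

fractions = restricted_fraction(gene_categories, gene_cluster_type)
bias = {category: classify_bias(f) for category, f in fractions.items()}
```

In this toy input, "translation" lands at the broad end of the line, "cuticle" at the restricted end, and "kinase" in the middle.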
To examine the enrichment of GO and Uniprot categories in
individual gene expression clusters, we performed exhaustive pair-wise
comparisons. We used the binomial z-score to evaluate the statistical
significance of overlaps between pairs of gene lists derived from the different
data sources. Because we tested significance for all functional categories
against all gene expression categories, we effectively tested multiple
hypotheses; we have applied a correction to significance estimates to
compensate for this fact. Because many GO categories are related to one
another, the standard Bonferroni adjustment (ref) provides a relatively poor
correction factor. We determined the empirical chance distribution by
performing a large number of random permutations of gene functional assignments
and determining the rate at which we attained particular p-values. We
interpolated these results using a log-linear regression function fitting the
empirical distribution (Materials and methods).
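The logic of this test can be illustrated with a minimal Python sketch. It is one possible reading of the procedure, not the implementation described in Materials and methods: the z-score uses the normal approximation to the binomial, the correction is a max-statistic permutation test (the best score over all cluster and category pairs is recorded for each shuffle), and the log-linear interpolation step is not reproduced. All gene names, gene assignments and function names are hypothetical; only the cluster labels are borrowed from the text.

```python
import math
import random

def binomial_z(overlap, cluster_size, category_size, n_genes):
    """z-score of an observed overlap under a binomial null (normal approximation)."""
    p = category_size / n_genes  # chance that a random gene belongs to the category
    expected = cluster_size * p
    sd = math.sqrt(cluster_size * p * (1.0 - p))
    return (overlap - expected) / sd if sd > 0 else 0.0

def all_pair_z(clusters, categories, n_genes):
    """z-scores for every (cluster, category) pair."""
    return {
        (c, g): binomial_z(len(c_genes & g_genes), len(c_genes), len(g_genes), n_genes)
        for c, c_genes in clusters.items()
        for g, g_genes in categories.items()
    }

def permuted_max_z(clusters, categories, genes, n_perm=200, seed=0):
    """Shuffle gene labels of the functional assignments; keep the best z per shuffle."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_perm):
        shuffled = list(genes)
        rng.shuffle(shuffled)
        relabel = dict(zip(genes, shuffled))
        permuted = {g: {relabel[x] for x in members} for g, members in categories.items()}
        maxima.append(max(all_pair_z(clusters, permuted, len(genes)).values()))
    return maxima

def corrected_p(z, maxima):
    """Rate at which chance alone yields a z-score at least this large in any pair."""
    return (1 + sum(m >= z for m in maxima)) / (1 + len(maxima))

# Hypothetical toy data: 40 genes, two clusters, two functional categories.
genes = [f"g{i}" for i in range(40)]
clusters = {"5R": set(genes[0:10]), "6B": set(genes[10:30])}
categories = {"cuticle": set(genes[0:8]), "translation": set(genes[10:16])}

observed = all_pair_z(clusters, categories, len(genes))
null_maxima = permuted_max_z(clusters, categories, genes)
p_cuticle = corrected_p(observed[("5R", "cuticle")], null_maxima)
```

Relabeling genes, rather than redrawing each category independently, keeps the category sizes and the overlaps between related categories intact, so the chance distribution reflects the dependence among GO terms that makes a Bonferroni adjustment a poor fit.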
The results of this
analysis are shown in Supplementary Table 3, which lists all GO and Uniprot
categories significantly enriched in gene expression clusters. To
summarize the functional associations of gene expression clusters, we used a
force directed layout that brings into close proximity clusters and GO
categories that share a significant number of genes. In the force
directed layout, restricted and broad clusters separate robustly, with the
notable exception of germ cell cluster 22R, which associates strongly with
functions typical of broad maternal genes. This connection may be due to the
fact that restriction of transcripts to the germ line lineage is often a
consequence of protection of maternal message from degradation in early forming
pole cells. Another cluster that violates the broad versus restricted
separation is cluster 8B, which is enriched for genes involved in cuticle
metabolism. Since formation of the cuticle effectively prevents RNA in situ
hybridization, we propose that the genes in cluster 8B are likely expressed
during late embryogenesis in a pattern resembling epidermal expression (similar
to clusters 5R and 6R), but that this pattern cannot be visualized by the
standard in situ protocol. The late spike in the average array profile of
cluster 8B genes supports this notion.
Interestingly, cluster 7R,
which contains genes with early (Stage 12) onset of epidermal expression, clearly
separates from 5R and 6R, which contain genes with late epidermal expression
(Stages 13–16). Early epidermal expressing genes are associated with GO terms
typical for genes expressed ubiquitously (cluster 6B), as well as with GO terms
for tissue specific functions such as membrane trafficking, cell polarity,
motility and adhesion, which makes them similar to genes found in the early
blastoderm patterning gene cluster (cluster 26R). In contrast, late epidermal
clusters (cluster 5R, cluster 6R) associate clearly with cuticle formation in
terminally differentiated tissues. This is the best example in our dataset of a separation
between regulatory developmental genes and effector genes (Garcia-Bellido
1973) of the terminal cell fates.
Genes in cluster 24R are
expressed in yolk, mesoderm, dorsal ectoderm and anterior and posterior
endoderm anlagen at blastoderm stage. Consistent with this early expression, these
genes are expressed later in differentiated midgut, yolk, fat body and
plasmatocytes, but also—and unexpectedly—in the central nervous system. The
force directed layout suggests that these genes are functionally related to
clusters 1-4R, which contain genes expressed in yolk, fat body and blood and
involved in metabolite transport. Cluster 24R clearly separates from other
blastoderm stage clusters, especially 23R, which has a similar core structure
composition, suggesting that for these particular tissues, specific effector
genes are required early in and throughout embryonic development. The late CNS
expression is probably due to the shared requirement for transporter activities
in mature nerve cells. Numerous connections between late CNS cluster 13R and
the yolk, blood, fat group of clusters support this suggestion.
GO terms related to
membrane trafficking such as secretory pathway, vesicle transport, Golgi
apparatus, and ER assume a central position in the layout with numerous
connections to diverse clusters both broad and restricted. This likely reflects
the requirement of these core cellular processes in all cells, but also
indicates that there are tissue specific differences in the utilization of
these pathways. The modulation of these pathways is mediated by GTPase and
kinase activities, which exhibit similar connectivity patterns in the force
directed layout.
Central nervous system and
muscle clusters associate with the expected GO terms for nerve impulse
transmission and muscle contraction. Interestingly, both tissues show, despite
their clear functional specialization, a common requirement for components of
the extracellular matrix.
Another way to
uncover relationships between gene expression specificity and gene function is
to examine the representation of GO categories in individual tissues using the
‘anatograms’ (Figure 12). For example, transcriptional regulators can be seen
to be strongly enriched in the developing and mature nervous system (Figure 12A). Regulation of transcription initiation by sequence-specific
transcription factors is the primary mechanism used to generate tissue-specific
gene expression. We determined the gene expression pattern for 238 of the 684
Drosophila transcription factors with sequence-specific DNA binding domains; at
least one transcription factor is found in every block of tissues recognized by
our annotation hierarchy. We used our extensive and unbiased set of expression
patterns of transcription factors to determine whether the two most abundant
classes of DNA binding domains, the C2H2 zinc finger and the
homeobox, associate with specific tissues or gene expression
patterns. We found that these domains show very similar overall distributions,
suggesting that they are deployed to regulate a similar range of developmental
processes.
Cell adhesion molecules
are similar to transcription factors in that they are expressed early in
development in a number of anlagen, are later abundant in the nervous system,
and are moderately enriched in differentiated epidermal derivatives. Cytoskeletal components are enriched in the nervous system and muscles
but are rarely expressed in the epidermis, suggesting that the tissue
relatedness observed between mesodermal and neural derivatives is dictated by
shared functional requirements of these cell types. Interestingly,
the tissue distribution of kinases is almost indistinguishable from the
genome-wide average of all genes. Occasionally, the association between function and
gene expression is very strong and specific, such as detection of
stimulus and Bolwig’s organ, chitin metabolism and late epidermal patterns, or
helicases and gonads.
Comparison of GO and
gene expression data often leads to self-evident observations because many
functional GO assignments are based on published gene expression patterns. We
used Uniprot domains to relate gene expression regulation to truly independent
sequence features. This analysis shows several
domains expressed highly specifically in differentiated epidermal derivatives. For example, zona pellucida (ZP) proteins are transmembrane glycoproteins
that were recently shown to be critical for trachea morphogenesis. These and
numerous other ZP genes that we examined are expressed in the 5R/6R epidermal
pattern (Figure 12K). A novel domain, DUF243, which apparently exists only in
flies (X of 34 in our dataset), is found almost
exclusively in the coding products of genes expressed in the late 5R pattern
(Figure 12L). These tight associations of functional sequence properties and
patterns of gene expression provide useful insights into gene function.