Organization of gene expression data by hybrid clustering approach
To further elucidate the recurring
expression patterns, we developed a gene clustering approach that allowed us to
incorporate the quantitative temporal expression data obtained from the
microarray experiments together with the qualitative, but spatially-rich, data
on expression patterns from the CV annotations. We implemented this approach
within the framework of fuzzy c-means clustering (Bezdek 1981; deGruijter 1988) and developed a gene
similarity metric that assigns different weights to the contribution of the
microarray and annotation data (Materials and Methods).
In order to understand the
complementary characteristics of the in
situ expression data versus the microarray expression data, we tried a
number of clustering conditions. When we clustered genes using microarray data
alone and ignored the annotations, genes with extremely disparate spatial
patterns clustered together. For example, tissue-specific genes for many
terminally differentiated tissues begin a steep rise in expression level around
hour 10 as organ formation begins, and array profiles for unrelated
sets of tissues such as differentiated muscle and epidermis were
indistinguishable. For these genes, the in situ data were more informative.
Conversely, when we clustered genes
using the annotations alone (data not shown), genes with tightly restricted
spatial expression clustered well, but genes with ubiquitous or relatively
unrestricted expression patterns clustered poorly, forming many small groups
often co-clustered with tightly restricted genes. For example, DNA replication enzymes and mitochondrial ribosomal proteins are expressed in
all cells but vary in expression levels between tissues and developmental
stages. This resulted in such “ubiquitous” genes being assigned annotation
terms corresponding to the tissues in which they are most highly expressed;
that is, because of the limited dynamic range of the in situ hybridization method, quantitative differences can be
mistaken for qualitative differences. Thus when we relied solely on the
annotation terms, we found that broadly expressed genes often became
co-clustered with genes having tightly restricted spatial patterns of
expression. In contrast, microarray data have a very broad dynamic range and
accurately reflected profiles of expression independent of absolute levels; for
this reason, clusters of broadly expressed genes involved in basic processes,
such as DNA replication, were remarkably tight even though the genes appear to
be expressed at widely varying levels.
Our goal was to find
a proper balance between the contributions of annotation similarity versus
microarray similarity to the overall similarity score. We desired a score that
would minimize the contribution of microarray similarity for genes
that have almost identical array profiles but very diverse
in situ annotation profiles. On the
other hand, we wanted a score that would use array similarity to distinguish
the similar annotation profiles of broadly expressed genes. We therefore used
an asymmetric mixture function that varied the contribution of microarray data based
on the similarity of the annotation data. Similarity for microarray profiles
was calculated using a simple correlation metric, while similarity for in situ annotation profiles was
calculated using a custom metric that independently weighted the contribution
of each developmental stage (Materials and Methods).
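The idea of an asymmetric mixture can be illustrated with a minimal sketch. This is not the metric defined in Materials and Methods; it is one hypothetical function with the two properties described above. It assumes that annotation similarity is a stage-weighted Jaccard overlap of CV term sets, that array similarity is a Pearson correlation rescaled to the range 0 to 1, and that the weight given to the array term grows in proportion to annotation similarity. The function names, the `alpha` parameter, and the stage labels are all illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile is treated as uncorrelated
    return cov / (sd_x * sd_y)


def annotation_similarity(terms_a, terms_b, stage_weights):
    """Stage-weighted Jaccard overlap of annotation term sets (assumed form).

    terms_a, terms_b: dict mapping stage label -> set of CV terms.
    stage_weights: dict mapping stage label -> non-negative weight.
    """
    total = sum(stage_weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for stage, weight in stage_weights.items():
        a = terms_a.get(stage, set())
        b = terms_b.get(stage, set())
        union = a | b
        if union:  # stages with no terms in either gene contribute nothing
            score += weight * len(a & b) / len(union)
    return score / total


def hybrid_similarity(array_a, array_b, terms_a, terms_b,
                      stage_weights, alpha=0.5):
    """Hypothetical asymmetric mixture of annotation and array similarity.

    The array weight is alpha * s_ann, so array data cannot pull together
    genes whose annotations disagree, but it can separate genes whose
    annotations look alike.
    """
    s_ann = annotation_similarity(terms_a, terms_b, stage_weights)
    s_arr = (pearson(array_a, array_b) + 1.0) / 2.0  # rescale to [0, 1]
    w_arr = alpha * s_ann
    return (1.0 - w_arr) * s_ann + w_arr * s_arr


# Illustrative inputs only; stage labels and terms are made up.
weights = {"stage 13-16": 1.0}
muscle = {"stage 13-16": {"muscle"}}
epidermis = {"stage 13-16": {"epidermis"}}
late_rise = [1.0, 2.0, 4.0, 8.0]

# Identical array profiles, disjoint annotations: the array term gets no weight.
print(hybrid_similarity(late_rise, late_rise, muscle, epidermis, weights))
```

Under these assumptions, two genes with identical array profiles but non-overlapping annotations score 0, whereas two genes with identical annotations are ranked by how well their array profiles correlate.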
The fuzzy c-means
algorithm is fuzzy in the sense that each gene can be assigned to one or more
clusters (Bezdek 1981; Gasch and Eisen 2002). As multiple
independent transcriptional control elements can drive the expression of a
single gene in different tissues or at different times in development, this is
a desirable property for this particular clustering problem. However, despite
extensive experimentation with different parameters and similarity metrics, and
trying various other clustering algorithms (refs Ben),
the large range of expression patterns generated clusters with ambiguous
boundaries. Replication experiments using random initialization variables
resulted in clusters that were qualitatively similar except that some genes
were redistributed (data not shown). Results were similar regardless of the
number of clusters chosen or the similarity metric used. To overcome this, each
gene was assigned a score for each cluster, and this score was used to rank the
most prototypical members of the cluster first and the most ambiguous ones
last. We defined a cutoff that restricted the genes in each cluster to include
only the core genes that belong
exclusively to that cluster (Materials and Methods).
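The ranking and cutoff step can be sketched as follows. The membership function is the standard fuzzy c-means update (Bezdek 1981); the score and cutoff used in this study are defined in Materials and Methods and are not reproduced here. The sketch assumes that a gene's score is simply its highest membership value and that a gene counts as a core member when that value reaches a hypothetical cutoff of 0.7. Gene names and distances are invented for illustration.

```python
def fcm_memberships(distances, m=2.0):
    """Standard fuzzy c-means memberships for one gene.

    distances: distance from the gene to each cluster centre.
    m: fuzzifier (> 1); larger values give softer assignments.
    """
    # A gene lying exactly on one or more centres belongs fully to them.
    zeros = [i for i, d in enumerate(distances) if d == 0]
    if zeros:
        share = 1.0 / len(zeros)
        return [share if i in zeros else 0.0 for i in range(len(distances))]
    power = 2.0 / (m - 1.0)
    inverse = [d ** -power for d in distances]
    total = sum(inverse)
    return [value / total for value in inverse]


def core_genes(memberships, cutoff=0.7):
    """Rank genes within their best cluster and keep only core members.

    memberships: dict mapping gene -> list of cluster memberships.
    cutoff: assumed threshold on the top membership (hypothetical value).
    Returns dict mapping cluster index -> [(gene, score), ...] with the
    most prototypical genes first.
    """
    clusters = {}
    for gene, u in memberships.items():
        best = max(range(len(u)), key=u.__getitem__)
        if u[best] >= cutoff:  # ambiguous genes are left out
            clusters.setdefault(best, []).append((gene, u[best]))
    for members in clusters.values():
        members.sort(key=lambda pair: pair[1], reverse=True)
    return clusters


# Invented distances from three genes to two cluster centres.
distances = {
    "geneA": [0.1, 1.0],  # close to cluster 0
    "geneB": [0.5, 0.5],  # equidistant, hence ambiguous
    "geneC": [1.0, 0.2],  # close to cluster 1
}
memberships = {g: fcm_memberships(d) for g, d in distances.items()}
print(core_genes(memberships))
```

In this toy example, geneA and geneC are retained as core members of clusters 0 and 1, while geneB, which splits its membership evenly, is excluded from both.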
Because our
clustering algorithm explicitly makes use of the microarray data, we limited
the analysis to those genes present in our microarray study. Of 4,759 genes
expressed in the embryo, we had valid microarray expression data for 4,496. The
best fuzzy c-means run grouped these genes into 39 clusters, and each cluster
was designated as either broad or restricted. Clusters containing a
significant number of genes annotated as “ubiquitous” were designated as broad, as were clusters containing
primarily genes with unrestricted maternal-only expression (Materials and Methods).
We also decided to include as broad
those clusters of genes exhibiting maternal expression early and midgut-only
expression late. Many genes annotated in this way encode the
mitochondrial ribosomal proteins and other presumably ubiquitous mitochondrial
proteins, suggesting that the absence of staining in early to middle embryonic
stages may be a limitation of the staining method. Using these criteria, 10 of
the 39 clusters were designated broad, and 2,549 (56.7%)
genes were assigned to these clusters. The remaining 1,947 (43.3%) genes exhibited
highly restricted patterns and were assigned to 29 clusters.
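The counts reported above are internally consistent, as a short arithmetic check shows. The variable names are illustrative; the numbers are taken directly from the text.

```python
# Gene and cluster counts as reported in the text.
genes_with_array_data = 4496
broad_genes, restricted_genes = 2549, 1947
broad_clusters, restricted_clusters, total_clusters = 10, 29, 39

# Broad and restricted genes together account for every clustered gene.
assert broad_genes + restricted_genes == genes_with_array_data
assert broad_clusters + restricted_clusters == total_clusters

broad_pct = round(100 * broad_genes / genes_with_array_data, 1)
restricted_pct = round(100 * restricted_genes / genes_with_array_data, 1)
print(broad_pct, restricted_pct)  # 56.7 43.3
```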
Comparison of any
two expression patterns based on relatively few sample images from an RNA in situ hybridization assay is inherently
a subjective judgment call. The individual examples of the broad expression
patterns look superficially similar and are clearly recognized as a coherent
group. Yet when scrutinized carefully and independently of each other, they resulted
in slightly different annotations. The approach we took relaxes the requirement
for precise and consistent annotation and leverages the quantitative aspect of
microarray time-course data. Supplementing in
situ data with microarray time-course data therefore not only serves as an
independent measurement of the expression profile but also facilitates the
grouping of similar profiles. Similar difficulties with precise and consistent
description of quantitative expression patterns are to be expected in any
developmental model system, and our experience suggests that a parallel
microarray time-course analysis should be an integral part of any such project.