Organization of gene expression data by hybrid clustering approach

To further elucidate the recurring expression patterns, we developed a gene clustering approach that allowed us to incorporate the quantitative temporal expression data obtained from the microarray experiments together with the qualitative, but spatially-rich, data on expression patterns from the CV annotations. We implemented this approach within the framework of fuzzy c-means clustering (Bezdek 1981; deGruijter 1988) and developed a gene similarity metric that assigns different weights to the contribution of the microarray and annotation data (Materials and Methods).
In order to understand the complementary characteristics of the in situ expression data versus the microarray expression data, we tried a number of clustering conditions. When we clustered genes using microarray data alone and ignored the annotations, genes with extremely disparate spatial patterns clustered together. For example, tissue-specific genes for many terminally differentiated tissues begin a steep rise in expression level around hour 10 as organ formation begins, and array profiles for unrelated sets of tissues such as differentiated muscle and epidermis were indistinguishable. For these genes, the in situ data was more informative.
Conversely, when we clustered genes using the annotations alone (data not shown), genes with tightly restricted spatial expression clustered well, but genes with ubiquitous or relatively unrestricted expression patterns clustered poorly, forming many small groups often co-clustered with tightly restricted genes. For example, DNA replication enzymes and mitochondrial ribosomal proteins are expressed in all cells but vary in expression levels between tissues and developmental stages. This resulted in such “ubiquitous” genes being assigned annotation terms corresponding to the tissues in which they are most highly expressed; that is, because of the limited dynamic range of the in situ hybridization method, quantitative differences can be mistaken for qualitative differences. Thus, when we relied solely on the annotation terms, we found that broadly expressed genes often became co-clustered with genes having tightly restricted spatial patterns of expression. In contrast, microarray data have a very broad dynamic range and accurately reflected profiles of expression independent of absolute levels; for this reason, clusters of broadly expressed genes involved in basic processes, such as DNA replication, were remarkably tight even though the genes appear to be expressed at widely varying levels.
Our goal was to find a proper balance between the contributions of annotation similarity and microarray similarity to the overall similarity score. We desired a score that would minimize the contribution of microarray similarity for genes that have almost identical array profiles but very diverse in situ annotation profiles. On the other hand, we wanted a score that would use array similarity to distinguish among broadly expressed genes with similar annotation profiles. We therefore used an asymmetric mixture function that varied the contribution of microarray data based on the similarity of the annotation data. Similarity for microarray profiles was calculated using a simple correlation metric, while similarity for in situ annotation profiles was calculated using a custom metric that independently weighted the contribution of each developmental stage (Materials and Methods).
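The weighting idea described above can be illustrated with a minimal sketch. The actual metric is defined in Materials and Methods and is not reproduced here; everything below is an assumption made for illustration: the function names, the use of per-stage Jaccard overlap as the annotation metric, the rescaling of the correlation to [0, 1], and the choice of the annotation similarity itself as the weight given to the array similarity.

```python
import numpy as np

def array_similarity(x, y):
    """Pearson correlation of two time courses, rescaled from [-1, 1] to [0, 1]."""
    r = np.corrcoef(x, y)[0, 1]
    return (r + 1.0) / 2.0

def annotation_similarity(a, b, stage_weights):
    """Weighted mean of per-stage Jaccard overlap between annotation term sets.

    a, b: lists of sets, one set of annotation terms per developmental stage.
    Jaccard overlap is an assumed stand-in for the custom metric in the text.
    """
    scores = []
    for terms_a, terms_b in zip(a, b):
        union = terms_a | terms_b
        scores.append(len(terms_a & terms_b) / len(union) if union else 0.0)
    return float(np.average(scores, weights=stage_weights))

def combined_similarity(s_ann, s_arr):
    """Asymmetric mixture (illustrative): the weight on the array similarity
    grows with the annotation similarity, so the array data matters only
    when the annotations already agree."""
    w = s_ann  # assumed weighting; not the published function
    return (1.0 - w) * s_ann + w * s_arr
```

Under these assumptions, two genes with identical array profiles but disjoint annotations (for instance, muscle versus epidermis) receive a combined score of 0, whereas two genes with identical annotations are scored entirely by their array similarity.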
The fuzzy c-means algorithm is fuzzy in the sense that each gene is assigned to one or more clusters (Bezdek 1981; Gasch and Eisen 2002). As multiple independent transcriptional control elements can drive the expression of a single gene in different tissues or at different times in development, this is a desirable property for this particular clustering problem. However, despite extensive experimentation with different parameters and similarity metrics, and with various other clustering algorithms (refs Ben), the large range of expression patterns generated clusters with ambiguous boundaries. Replicate runs using random initialization values resulted in clusters that were qualitatively similar except that some genes were redistributed (data not shown). Results were similar regardless of the number of clusters chosen or the similarity metric used. To overcome this ambiguity, each gene was assigned a score for each cluster, and this score was used to rank the most prototypical members of the cluster first and the most ambiguous ones last. We defined a cutoff that restricted the genes in each cluster to include only the core genes that belong exclusively to that cluster (Materials and Methods).
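The membership scores and the core-gene cutoff can be sketched as follows. This is a generic textbook fuzzy c-means on Euclidean distance, not the implementation used in this study, which relies on the custom similarity metric described in Materials and Methods. The function names, the fuzzifier `m = 2.0`, and the cutoff value of 0.8 are assumptions chosen for illustration.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, seed=0):
    """Generic fuzzy c-means on Euclidean distance (illustrative only).

    Returns U, an (n_genes, n_clusters) membership matrix whose rows sum
    to 1, and the cluster centers.
    """
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], n_clusters))
    U /= U.sum(axis=1, keepdims=True)  # random initial memberships
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the data points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)  # avoid division by zero at a center
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return U, centers

def core_members(U, cluster, cutoff=0.8):
    """Indices of genes whose membership in `cluster` meets the cutoff,
    ordered from most prototypical to most ambiguous."""
    idx = np.where(U[:, cluster] >= cutoff)[0]
    return idx[np.argsort(-U[idx, cluster])]
```

Because the memberships in each row sum to 1, a cutoff above 0.5 guarantees that a gene can be a core member of at most one cluster, which matches the idea of keeping only genes that belong exclusively to a cluster.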
Because our clustering algorithm explicitly makes use of the microarray data, we limited the analysis to those genes present in our microarray study. Of 4,759 genes expressed in the embryo, we had valid microarray expression data for 4,496. The best fuzzy c-means run grouped these genes into 39 clusters, and each cluster was designated as either broad or restricted. Clusters containing a significant number of genes annotated as “ubiquitous” were designated as broad, as were clusters containing primarily genes with unrestricted maternal-only expression (Materials and Methods). We also decided to include as broad those clusters of genes exhibiting maternal expression early and midgut-only expression late. Many genes annotated in this way encode the mitochondrial ribosomal proteins and other presumably ubiquitous mitochondrial proteins, suggesting that the absence of staining in early to middle embryonic stages may be a limitation of the staining method. Using these criteria, 10 of the 39 clusters were designated broad, and 2,549 (56.7%) genes were assigned to these clusters. The remaining 1,947 (43.3%) genes exhibited highly restricted patterns and were assigned to 29 clusters.
Comparison of any two expression patterns based on relatively few sample images from an RNA in situ hybridization assay is inherently a subjective judgment call. The individual examples of the broad expression patterns look superficially similar and are clearly recognized as a coherent group. Yet when scrutinized carefully and independently of each other, they resulted in slightly different annotations. The approach we took relaxes the requirement for precise and consistent annotation and leverages the quantitative aspect of microarray time-course data. Supplementing in situ data with microarray time-course data therefore not only serves as an independent measurement of the expression profile but also facilitates the grouping of similar profiles. Similar difficulties with precise and consistent description of quantitative expression patterns are to be expected in any developmental model system, and our experience suggests that a parallel microarray time-course analysis should be an integral part of any such project.
