Vera Pancaldi, Falk Schubert and Jürg Bähler*
Department of Genetics, Evolution and Environment and UCL Cancer Institute, University College London, Darwin Building, Gower Street, London, UK WC1E 6BT. E-mail: j.bahler@ucl.ac.uk
First published on 15th December 2009
Genome-wide gene expression is re-programmed in response to external or internal factors such as environmental stress or genetic mutation, respectively, or as a function of endogenous processes such as cell proliferation or differentiation. Here we integrate expression profiling data that have been collected by our laboratory since 2001 and that interrogate more than 900 different experimental conditions. We take advantage of this large data set to rank all genes based on their variability in gene expression across the different conditions. The most variable genes were enriched for functions such as stress response, carbohydrate metabolism and trans-membrane transport, and these genes were underrepresented for introns and tended to be close to telomeres. We then compared how overall gene regulation and variability of gene expression across conditions is affected by environmental or genetic perturbations, and by endogenous programmes. Meiotic differentiation and environmental perturbations led to substantially greater gene expression variability and overall regulation than did genetic perturbations and the transcriptional programme accompanying cell proliferation. We also used the integrated data to identify gene regulation modules using two different clustering approaches. Two major clusters, containing growth- and metabolism-related genes on one hand and stress- and differentiation-related genes on the other, were reciprocally regulated across conditions. We discuss these findings with respect to other recent reports on the regulation and evolution of gene expression.
We define gene regulation as the amount by which a gene is up- or down-regulated averaged over different conditions. With the word variability we indicate the width of the distribution of these values, expressed by the standard deviation. The two measures provide different information as a gene could have a high regulation, in the case where it is found to be always differentially expressed, while having low variability if the differential expression is always in the same direction and of similar amount. Little is known from a global perspective whether and how the overall regulation and variability of gene expression differ in response to environmental or genetic perturbations and during endogenous programmes.
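The two definitions above can be illustrated with a minimal sketch. The function name and toy values are hypothetical, and the use of the sample standard deviation is an assumption; the text only states that variability is expressed by the standard deviation.

```python
import statistics

def summarise_gene(log2_ratios):
    """Return (mean regulation, mean absolute regulation, variability) for one gene.

    log2_ratios: log2 expression ratios across conditions; None marks a missing value.
    """
    values = [v for v in log2_ratios if v is not None]
    mean_regulation = statistics.fmean(values)
    mean_abs_regulation = statistics.fmean([abs(v) for v in values])
    # Assumption: sample standard deviation (n - 1 in the denominator).
    variability = statistics.stdev(values)
    return mean_regulation, mean_abs_regulation, variability

# Hypothetical gene that is always induced about 4-fold:
# high regulation, low variability.
always_up = [2.0, 2.1, 1.9, 2.0]

# Hypothetical gene that swings between induction and repression:
# mean regulation near zero, high variability.
swinging = [2.0, -2.0, 2.0, -2.0]

for name, profile in (("always_up", always_up), ("swinging", swinging)):
    print(name, summarise_gene(profile))
```

The first toy gene shows why the two measures carry different information: it is strongly regulated in every condition, yet its standard deviation is small.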
In a pioneering study, Hughes et al.1 determined expression signatures in budding yeast in response to 300 mutations or chemical treatments, and this data compendium in turn allowed the prediction of the roles of uncharacterized perturbations and drug targets. These authors did not make a systematic comparison of any global differences in gene regulation between environmental or genetic perturbations. Luscombe et al. uncovered intriguing differences in transcriptional networks in budding yeast from two types of experiments: internal cell-cycle regulation, an endogenous programme, and response to different environmental perturbations, reflecting exogenous conditions.2 No data from genetic perturbations were used for this analysis. Recent studies indicate that genes up-regulated during stress show more variable expression, which reflects their promoter structure and could be advantageous under changing environments and in turn promote evolvability of gene regulation.3,4–8
Noisy gene expression can thus provide a driving force for phenotypic variation and evolutionary innovation and is itself subject to natural selection.3,9–16
Meta-analyses of genome-wide expression data can help to get the most from the huge amount of information available. Various clustering methods have been popular to tease out and visualize functional gene groups within microarray data sets.17–20
Simple correlation of gene expression has been applied to extract biologically meaningful information from gene expression data21 as well as to reverse-engineer genetic networks.22,23
Here we take advantage of microarray data sets from the fission yeast Schizosaccharomyces pombe that have been collected by our laboratory over the past eight years using a wide range of biological conditions. We explore how different types of perturbations (environmental vs. genetic) and endogenous programmes affect the variability and overall regulation of gene expression on a global scale. We also describe major regulatory clusters emerging from these data.
Other experiments have interrogated intrinsic gene regulation programmes during cell proliferation and meiotic differentiation.35–38 Most of these data have been sampled using time courses to study the dynamic changes of gene expression during the different biological processes. In addition, many of these gene expression programmes have also been studied after different genetic perturbations such as deletion, mutation, or overexpression of genes. Gene expression signatures of numerous genetic perturbations have also been analyzed in steady-state conditions compared to wild-type cells.14,39–55 Furthermore, in the analyses below we also include several unpublished microarray data sets. This large collection of data is unique in that all the experiments were performed in the same laboratory using standardized conditions. In total, the data set encompasses 1272 microarray hybridizations (including replicate hybridizations), which provide 188 different steady-state experiments and 153 different timecourse experiments, encompassing 778 time points (Fig. 1). The raw microarray data are normalized as described in ref. 56 and gene expression levels are relative to wild-type reference samples.
Fig. 1 Microarray experiment summary showing the number of conditions in different experimental categories.
The transcripts showing the most variable expression across conditions included a sequence orphan (SPAC23H3.15c), small heat-shock protein genes (hsp16 and hsp9), the metallothionein gene zym1,57 the uracil-regulatable gene urg1 (ref. 31) and the thiamine-regulatable gene nmt1.58 Two non-coding transcripts, prl65 and prl44, also showed highly variable expression levels, although these transcripts were only present in a later version of our microarrays and were therefore measured in fewer conditions. The range of relative expression levels for these highly variable transcripts extended from 3200-fold for urg1 to >64 400-fold for SPAC23H3.15c. The latter value was found for a few late time points of the same experiment and corresponds to the dynamic range of the microarray scanner, suggesting that the relative values recorded for this gene in those conditions are even lower than the sensitivity of the instrument. The second most variable transcript was that of nmt1 (>53 000-fold). In comparison, the least variable transcripts only ranged in relative expression levels from 3- to 6-fold across all conditions. Among the least variable genes were ppb1, encoding the calcineurin phosphatase catalytic subunit,59 ssr4, encoding a SWI/SNF and RSC complex subunit,53 usp106 and usp107, encoding U1 snRNP-associated proteins, and vps45 and snx3, with likely functions in protein sorting and secretion.60
We next extracted lists of the 500 most and 500 least variable genes and examined them for Gene Ontology (GO) enrichments.61 The most variable genes were mainly enriched for terms relating to response to stress or stimulus (p = 10⁻³⁸), carbohydrate catabolic processes (p = 10⁻⁹) and transmembrane transport (p = 10⁻⁶). The least variable genes were enriched for terms relating to mRNA metabolic processes (p = 10⁻¹¹), mitochondrial translation and organization (p = 10⁻⁶), intracellular protein transport (p = 10⁻⁵), vesicle-mediated transport and protein localization (p = 10⁻³) (Supplementary Tables S2 and S3, ESI).‡
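The ranking step can be sketched as follows. The gene names and values are hypothetical. The enrichment function uses a one-sided hypergeometric test, which is a common choice for GO enrichment but is an assumption here, since the text does not state which test was applied.

```python
import math
import statistics

def rank_by_variability(expression, n):
    """Return the n most and n least variable genes.

    expression: dict mapping gene name -> list of log2 ratios across conditions.
    """
    sd = {gene: statistics.stdev(values) for gene, values in expression.items()}
    ranked = sorted(sd, key=sd.get, reverse=True)  # most variable first
    most_variable = ranked[:n]
    least_variable = ranked[-n:][::-1]             # least variable first
    return most_variable, least_variable

def hypergeom_enrichment(total, annotated, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes from `total` genes,
    of which `annotated` carry the term of interest.

    Assumption: one-sided hypergeometric test without multiple-testing correction.
    """
    upper = min(annotated, selected)
    favourable = sum(
        math.comb(annotated, i) * math.comb(total - annotated, selected - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / math.comb(total, selected)

# Hypothetical toy profiles (log2 ratios over four conditions).
toy = {
    "geneA": [3.0, -3.0, 2.5, -2.0],   # strongly variable
    "geneB": [0.1, -0.1, 0.0, 0.05],   # nearly constant
    "geneC": [1.0, -1.0, 0.5, -0.5],   # intermediate
}
most, least = rank_by_variability(toy, 1)
print(most, least)
```

In the analysis described above, `n` would be 500 and the gene universe would be the 4939 genes that passed filtering.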
The top-500 genes were also analysed for enrichments with gene lists produced in different microarray experiments as well as for different properties of genes or proteins. Notably, the most variable genes were significantly closer to the telomeres (two-sided Wilcoxon rank sum test, p < 10⁻⁸) than would be expected by chance. This bias does not reflect any general correlation across the chromosomes between expression variability and distance from telomeres, but rather a specific enrichment of variable genes close to chromosome ends (data not shown). Gene clusters close to chromosome ends are induced in environmental conditions such as nitrogen starvation and may be regulated by chromatin remodelling.35,40
The telomeric regions might thus be ‘hotspots’ of variability in gene expression, which could promote cell survival under changing conditions. Moreover, the most variable genes were significantly depleted for introns (two-sided Wilcoxon rank sum test, p < 10⁻⁸). This observation is consistent with the finding that highly regulated genes are intron-poor, possibly reflecting selection against introns in genes whose expression levels need rapid adjustment to external or internal challenges.62 As expected, the Core Environmental Stress Response (CESR) genes were highly enriched amongst the most variable genes (p < 10⁻⁸), with approximately 35% of the 500 genes being up-regulated and 10% being down-regulated as part of the CESR. Both environmental and genetic perturbations frequently lead to activation of the CESR (e.g. refs. 25, 40, 63).
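For readers unfamiliar with the test used for the telomere-distance and intron comparisons, the following is a minimal sketch of a two-sided Wilcoxon rank sum test. It uses the normal approximation without a tie correction in the variance, which is a simplification and not necessarily what the statistical software used for the analysis does; the distances are invented for illustration.

```python
import math

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (z, p). Ties receive average ranks; the variance is NOT
    tie-corrected, so p-values are approximate when many ties occur.
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    w = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    mean_w = n1 * (n1 + n2 + 1) / 2
    sd_w = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mean_w) / sd_w
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p

# Hypothetical distances to the nearest chromosome end (kb).
variable_genes = list(range(1, 21))      # close to telomeres
other_genes = list(range(101, 121))      # far from telomeres
z, p = rank_sum_test(variable_genes, other_genes)
print(z, p)
```

A negative `z` here means that the first group has the smaller ranks, i.e. the shorter distances.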
Fig. 2 Comparison of gene regulation in response to environmental or genetic perturbations. (a) Plot of the mean regulation of genes (log2 of expression ratios) across all genetic and environmental perturbations (Spearman correlation = 0.69). Red: 121 genes that were manipulated for genetic perturbations; yellow: 11 genes that were manipulated for simultaneous environmental and genetic perturbations. (b) Plot of the variability of gene expression across conditions in response to environmental or genetic perturbations (Spearman correlation = 0.83). Coloured genes as in (a). Experiments including simultaneous environmental and genetic perturbations were excluded from the analysis in (a) and (b). (c) Comparison of standard deviation of all genes within one type of perturbation: genetic, environmental, or both. 50% of all conditions are included in the box (interquartile range), and whiskers extend to 1.5 times the interquartile range. (d) Same analysis as (c) but including only a single time point per time course experiment.
A possible bias in assessing the variability of gene expression under genetic perturbations is represented by the limited number of genes that have been perturbed in total (121 genes). These genes have been chosen for their known or suggested roles in the biological processes of interest. Accordingly, they are enriched for GO terms related to regulatory mechanisms, chromatin modification, and transcriptional control. The mutated genes do not behave atypically in the scatter plots of Fig. 2, indicating that the total number of experiments in which these genes are not perturbed is sufficiently large so that the perturbed genes do not create any bias. Of the 121 perturbed genes, 11 genes were also used as genetic perturbations in combination with environmental perturbations. As a control, we have repeated parts of the analysis eliminating the conditions where mutants of these 11 genes are exposed to environmental perturbations and found no difference in the results (Supplementary Fig. S2, ESI).‡
Fig. 2c and 2d show gene variability within conditions belonging to the three perturbation types (genetic, non-genetic or both), using the distribution of the standard deviation of regulation over all genes for each perturbation. Whereas in Fig. 2c all the data from time courses was used in the analysis, Fig. 2d shows the results of including in the analysis only a single time point per experiment. The aim was to avoid possible biases due to the correlation between the different time points within each experiment. This analysis confirms that environmental perturbations cause a larger gene expression response than genetic ones.
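The per-condition measure used in Fig. 2c and 2d can be sketched as below. All names and values are hypothetical. Keeping the latest time point of each time course is an assumption made for illustration; the text does not say which time point was retained.

```python
import statistics

def spread_per_condition(conditions):
    """Standard deviation of log2 ratios over all genes, one value per condition.

    conditions: dict mapping condition name -> list of log2 ratios (one per gene).
    """
    return {name: statistics.stdev(values) for name, values in conditions.items()}

def one_time_point_per_experiment(time_points):
    """Keep a single condition per experiment.

    time_points: list of (experiment_id, time, condition_name) tuples.
    Assumption: the latest time point of each experiment is kept.
    """
    latest = {}
    for experiment_id, time, name in time_points:
        if experiment_id not in latest or time > latest[experiment_id][0]:
            latest[experiment_id] = (time, name)
    return [name for _, name in latest.values()]

# Hypothetical data: one environmental time course and one steady-state mutant.
conditions = {
    "heat_15": [1.0, -1.0, 0.0],
    "heat_60": [2.0, -2.0, 0.0],
    "mutantX": [0.1, -0.1, 0.0],
}
time_points = [("heat", 15, "heat_15"), ("heat", 60, "heat_60"), ("mutantX", 0, "mutantX")]

kept = one_time_point_per_experiment(time_points)
spread = spread_per_condition({name: conditions[name] for name in kept})
print(spread)
```

Restricting each time course to one condition removes the correlation between successive time points, at the cost of discarding most of the time-course data.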
The distribution of the log-ratios of transcript levels averaged over all conditions for each gene was approximately normal and centred around zero (Fig. 3a), the latter being a consequence of the normalization procedure applied to the data.56 This property allowed us to use the standard deviation of this distribution as a measure for the variability of gene expression under the different conditions. Environmental perturbations led to overall more consistent regulation, i.e. in the same direction, in gene expression across conditions than genetic perturbations (Fig. 3a), meaning that genetic perturbations either triggered less regulation or led to more random changes between conditions, resulting in an average regulation closer to zero for most genes. The differing standard deviations show that environmental conditions tended to affect many genes consistently across conditions (either up- or down-regulation), whereas genetic modifications produced less consistent regulation. Combining genetic and environmental perturbations reinforced this bias, leading to a wider distribution. This effect was even more pronounced in meiotic differentiation, which led to effects similar to a combination of genetic and environmental perturbations. Cell cycle progression, however, induced little overall regulation.
Fig. 3 Comparison of different sets of conditions: genetic perturbations (red), environmental perturbations (blue), both environmental and genetic perturbations (black), cell cycle (yellow) and meiotic differentiation (green). (a) Distribution of average regulation over all conditions. (b) Distribution of average of absolute value of regulation over all conditions. (c) Distribution of the standard deviation of regulation over all conditions (gene variability).
Fig. 3b shows the distribution of the average of the absolute value of regulation, revealing that the low regulation for genetic perturbations and cell cycle was due to generally low regulation and not to alternating large positive and negative values compensating each other. Environmental perturbations produced more regulation than genetic ones, while cell cycle progression produced little regulation and was similar to the genetic conditions. Combining genetic and environmental perturbations produced slightly more regulation than environmental perturbations alone. Finally, meiotic conditions produced the most regulation.
Fig. 3c shows the distribution of the standard deviation of gene regulation across conditions, reflecting gene expression variability. The environmental perturbations produced a greater variability than the genetic ones and meiotic differentiation produced the largest variability. Combining genetic and environmental perturbations produced a variability comparable to the environmental perturbations alone. When only the cell cycle progression was taken into account, the variability in regulation was slightly less than for genetic perturbations. The high variability induced by environmental perturbations was due to both an increase of single-gene variability (Fig. 3c) and to an increase in the number of genes that were regulated (Fig. 2a). Notably, the experiments interrogating the cell cycle, which reflect endogenous regulatory programmes, showed a similar low expression variability as did the genetic perturbations, and accordingly showed much less variability than the environmental perturbations. We conclude that environmental perturbations lead to stronger gene regulation and greater variability in gene expression across conditions than genetic perturbations or endogenous programmes. Meiosis, as an internal programme induced by external conditions, behaves more like a strong environmental perturbation.
Fig. 4 shows a hierarchical clustering applied to the top-500 most variable genes, using all the conditions tested. The conditions were grouped by the main types of experiments. A GO term enrichment analysis was performed based on major gene clusters. Depending on the size of the clusters considered, different enrichments were obtained. A division between two main clusters was evident: one was enriched for stress (p < 10⁻⁵) and meiotic differentiation (p < 10⁻³), generally up-regulated genes, and the other for biosynthesis (p < 10⁻⁸) and metabolism (p < 10⁻²), generally down-regulated genes. We also notice how these two sets of genes are broadly regulated in opposite directions, reinforcing the idea that the stress response and maximal growth programmes are mutually exclusive in the cell.3 In the conditions tested, the perturbations are likely to stimulate stress response and limit cell growth, consistent with what is observed. More detailed cluster descriptions are provided in Fig. 4 and in Supplementary Table S5 (ESI).‡
Fig. 4 Hierarchical clustering of top-500 most variable genes including all microarray experiments. The colour legend shows the entire range of observed regulation (log2 of expression ratios). Selected GO categories that were enriched in clusters are highlighted as follows. Cluster 1 (magenta): iron related functions, cluster 2 (dark green): cytokinesis, cluster 3 (orange): translation, cluster 4 (blue): vitamins and thiamine, cluster 5 (cyan): metabolism, cluster 6 (yellow): stress response, cluster 7 (grey): protein folding, cluster 8 (green): conjugation, and cluster 9 (red): meiosis. The experimental conditions are divided in major groups as indicated on top: starvation, treatment with drugs, cell-cycle, meiosis and stress. The unassigned conditions are from various other experiments, mainly addressing chromatin modification, transcription, and mRNA decay. See Supplementary Table S5 (ESI)‡ for more details. The clusters with no number did not present any biologically significant enrichment.
The same procedure was carried out using only the genetic perturbations, only the environmental perturbations, or combined genetic and environmental perturbations (Supplementary Fig. S3, ESI).‡ Clusters enriched for similar GO terms were evident in all three cases, partially overlapping with what was found with the combined analysis of Fig. 4. Similarly, we observed two well separated, major clusters: genes involved in stress response and meiotic differentiation on one hand, and genes involved in metabolism and transport on the other. The different perturbations provided different information on the regulatory systems. In general, a higher number of conditions helped to produce richer and more detailed regulatory modules.
A second approach for the identification of biological modules involves clustering applied to the gene correlation matrix. For each gene pair among the top-500 most variable genes, the Pearson correlation over all the 956 experimental conditions was calculated. This value was taken as a measure of the ‘regulatory relatedness’ between different gene pairs. When clustering the matrix of all the gene-to-gene correlations, we grouped together the genes that showed similar correlation to all the other genes in the list, thus identifying gene clusters that showed similar correlation profiles (Fig. 5). By definition, all elements on the diagonal of the matrix are equal to unity, as they represent the correlation of a set of values with itself. Similarly, the appearance of bright yellow squares along the diagonal indicates clusters of genes with similar profiles across the experimental conditions. The dark red areas, on the other hand, show clusters of genes that are negatively correlated with each other, that is, they tend to be regulated in opposite directions. As before, we included a GO enrichment analysis (Fig. 5). Again, we see a distinction between genes related to stress (p < 10⁻²) and meiotic differentiation (p < 10⁻⁴) opposed to biosynthesis (p < 10⁻⁷) and metabolism (p < 10⁻³). More detailed cluster descriptions are provided in Fig. 5 and in Supplementary Table S6 (ESI).‡
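The gene-to-gene correlation matrix described above can be sketched with the standard library alone. The gene names and profiles are hypothetical, and the sketch assumes complete profiles without missing values; how missing measurements were handled is not stated in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long profiles without missing values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(profiles):
    """Return (gene order, symmetric matrix of pairwise Pearson correlations).

    profiles: dict mapping gene name -> list of log2 ratios across conditions.
    """
    genes = list(profiles)
    matrix = [[pearson(profiles[g], profiles[h]) for h in genes] for g in genes]
    return genes, matrix

# Hypothetical profiles: a stress-like gene and a growth-like gene
# regulated in opposite directions, plus a second stress-like gene.
profiles = {
    "stress1": [1.0, 2.0, 3.0, 4.0],
    "stress2": [0.5, 1.5, 2.0, 3.5],
    "growth1": [-1.0, -2.0, -3.0, -4.0],
}
genes, matrix = correlation_matrix(profiles)
for gene, row in zip(genes, matrix):
    print(gene, [round(r, 2) for r in row])
```

Each row of the matrix is a gene's correlation profile against all other genes; these rows, rather than the raw expression values, are what a hierarchical clustering routine would receive in this second approach.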
Fig. 5 Hierarchical clustering of the correlation matrix of the top-500 most variable genes including all experimental data. Selected GO categories that were enriched in clusters are highlighted as follows. Cluster 1 (grey): protein folding, cluster 2 (purple): amino acid biosynthesis and nitrogen-related terms, cluster 3 (orange): translation, primary metabolism and biosynthesis, cluster 4 (green): conjugation, cluster 5 (yellow): response to stress, cluster 6 (red): meiosis and cell cycle, cluster 7 (pink): cell differentiation and sporulation. Note that the figure is symmetric about the diagonal, and the bright yellow squares reflect regulatory modules. The colour legend shows the correlation values from inverse (dark red) to positive (bright yellow) along with the distribution of the matrix values. The clusters with no number did not present any biologically significant enrichment. See Supplementary Table S6 (ESI)‡ for more details.
We also clustered genes separately using only genetic or environmental perturbations (Supplementary Fig. S4, ESI).‡
Combining all the available conditions in the calculation of the gene-to-gene correlations produced more clusters with clear enrichments for GO terms. The clusters obtained were similar to the ones obtained for hierarchical gene clustering. Many apparent regulatory modules were not enriched for GO terms, which could reflect connections between sets of genes that are not accurately covered by the GO ontology or heterogeneous in function. It would be interesting to investigate some of these unknown regulatory modules to further tease out the biological meaning of the structure of the correlation matrix.
Whereas in experiments involving environmental perturbations the cells are monitored as they are being exposed to the threat, in the case of a genetic perturbation the newly created strains have time to adapt to the new genetic condition over a few generations. A further assessment should be undertaken as soon as cells have undergone a genetic modification, to capture the transient gene expression response as was done for the environmental perturbations. We predict that this transient response would be much stronger, similar to responses to environmental perturbations. Hence, the large difference in gene regulation observed between genetic and environmental perturbations could reflect the difference in the experimental timing (transient response vs. steady-state after multiple generations) rather than the nature of the perturbations themselves.
However, we can explain these observations in the light of previous studies. Although genetic perturbations are clearly more deeply imposed on the organism, as they are inheritable, they seem to disrupt the cellular expression programme less than external factors. The sub-division between genetic and environmental perturbations analyzed here may be related to the distinction between endogenous and exogenous conditions introduced in ref. 2. These authors observe that two types of conditions elicit the activation of different parts of the regulatory network. When dealing with endogenous conditions, the response is based on a highly combinatorial control of multiple transcription factors that regulate few targets, creating a sub-network with high in-degrees, long path lengths and high cluster coefficients. In contrast, the response to exogenous conditions involves a sub-network with high out-degrees, short path lengths and low levels of clustering. Biologically, this difference might represent a rapid large-scale response to external perturbations opposed to a carefully coordinated rearrangement for internal programmes.
If we assume a parallel between endogenous conditions and genetic perturbations on one side with exogenous conditions and environmental perturbations on the other, we can compare the topological network features and the variability and regulation of gene expression measured here. The higher levels of variability in gene expression observed for environmental perturbations and meiotic differentiation partially reflect an increase in the number of genes being regulated. A faster and more extensive gene expression response, propagating through a less tightly controlled network, could explain such patterns. Meiotic differentiation leads to gene expression patterns more similar to a strong environmental perturbation than to an endogenous programme. Just like the conditions where genetic and environmental perturbations are combined, meiotic differentiation may combine both exogenous and endogenous aspects of the regulatory network, thus leading to strong changes in the expression programmes.
We speculate that there are two ways in which cells respond to challenges and threats to their survival. If the threat comes from an internal genetic perturbation that will endure over generations, the cells compensate and prepare a permanent adjustment of the regulatory network, helping them cope with the disruption in the long term. This adaptation, which may follow stronger short-term gene expression responses similar to those triggered by environmental perturbations, is probably optimized to involve only the minimal necessary changes, as the endogenous sub-network of the cell is bound to be tightly regulated. A drastic and permanent change in the expression programme would possibly jeopardize the state of dynamic equilibrium inside the cell, leading to compromised growth or even death. In the case of external challenges, for example in the form of potentially damaging changes to the cells’ environment, the response is immediate but transient to deal with the emergency. If the stress persists, however, global gene regulation will also adjust to new steady-state levels that are closer to, but distinct from the situation in unstressed cells.25,66 This steady-state condition may then be similar to cells living with a genetic perturbation.
Environmental challenges are likely to stimulate gene expression variability between single cells within the population through noisy gene regulation, which can promote survival of some cells that “get it right”.3,10,12,14 An interesting question is how these large rearrangements in gene expression are compatible with the observed high levels of robustness and cell survival. An analysis of the possible origins of this robustness from an evolutionary point of view is presented in ref. 15, where robustness is defined as how likely a system is to undergo random changes without impairment in its function. Two types of robustness are distinguished: robustness to genetic change such as random mutations in the genome, and robustness to non-genetic changes such as noise in cellular processes and changes in the environment.67 Ref. 15 argues that robustness to genetic change is not an adaptation to genetic mutation but a secondary effect of acquired robustness to non-genetic changes, which are more ubiquitous and have stronger effects on the variation of phenotypes.68 Robustness against non-genetic change also increases fitness against genetic change and, importantly, it seems to be inheritable. It is therefore possible that cellular adaptation to internal noise and changing environmental conditions is at the origin of cells’ robustness to genetic modifications. Although it has been suggested that noise has been selected against, possibly constraining the evolvability of gene expression for dosage-sensitive genes,69 noise in development is seen as a fundamental factor in explaining the increase of organisms’ robustness through evolution.70
We ranked genes based on their expression variability across conditions. The most variable genes were related to stress response, meiotic differentiation, and metabolism. Two clustering approaches were used to identify biological modules within the most variable genes. In both cases, the GO term enrichment analysis highlights a general sub-division of the genome into two large gene clusters that are reciprocally regulated: genes related to stress response and meiotic differentiation on the one hand, and genes related to biosynthesis, metabolism, and translation on the other. This finding reflects the bipolar transcriptome (growth vs. stress response) that needs to balance between rapid cell proliferation, but relative stress sensitivity, or maximal stress resistance, but slow growth or quiescence.3–6,63,71,72
Ideally, the clusters obtained should be analyzed in a broader framework that is less restrictive than the GO categories. Recently, an alternative classification scheme was suggested where identified gene modules were found to be biologically highly relevant, although they were not sharing GO annotation.73 Some clusters, especially among the ones obtained by clustering the gene correlation matrix, remain biologically unexplained. This could either point to new biology which is not adequately covered by the current GO annotation, or to functionally heterogeneous regulatory modules.
One shortcoming of the presented approach is the assumption that time points within time course experiments can be treated as independent conditions. A more rigorous approach would entail measuring gene correlation through the time course in different experiments. Current work is being devoted to the development of an adaptation of the algorithm presented in ref. 74, which could be used to measure gene-to-gene correlation. Preliminary results show that this technique could identify targets of the same transcription factor.
To conclude, the method presented can easily be used to mine large sets of gene expression data, including RNA-seq data as these become increasingly available,75 and is immediately applicable to other organisms as well as to entire ecosystems.76
For most time courses, all data are normalized to time point 0 of the isogenic reference; these time point 0 measurements were discarded because they were normalized to themselves, yielding a high number of values equal to 1, which would bias expression variability in the subdivision between genetic and non-genetic conditions. All the available biological replicates and dye swaps (a total of 1272 hybridizations) were averaged. The logarithm of the normalized ratios was taken and no further scaling was performed. The data were then filtered to leave only experiments where more than 85% of the genes were measured and only genes that were measured in at least 80 conditions, eliminating zero time points in time courses, leaving 4939 genes and 956 conditions.
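The filtering described above can be sketched as follows. The function name and toy data are hypothetical. Two details are assumptions: experiments are filtered before genes (the text does not give the order), and the logarithm is taken to base 2, as suggested by the figure legends.

```python
import math

def filter_and_log(ratios, min_gene_fraction=0.85, min_conditions=80):
    """Filter experiments and genes, then log2-transform the normalized ratios.

    ratios: dict mapping gene name -> list of normalized ratios, one per
            experiment, with None for genes not measured in that experiment.
    Returns (filtered dict of log2 ratios, indices of the kept experiments).
    """
    genes = list(ratios)
    n_experiments = len(ratios[genes[0]])

    # Keep experiments in which more than min_gene_fraction of genes were measured.
    kept = [
        c for c in range(n_experiments)
        if sum(ratios[g][c] is not None for g in genes) / len(genes) > min_gene_fraction
    ]

    # Keep genes measured in at least min_conditions of the remaining experiments.
    filtered = {}
    for g in genes:
        row = [ratios[g][c] for c in kept]
        if sum(v is not None for v in row) >= min_conditions:
            filtered[g] = [None if v is None else math.log2(v) for v in row]
    return filtered, kept

# Hypothetical toy matrix; thresholds are lowered to suit its small size.
toy_ratios = {
    "geneA": [1.0, 2.0, None],
    "geneB": [4.0, 0.5, None],
    "geneC": [None, 8.0, None],
}
log_ratios, kept_experiments = filter_and_log(
    toy_ratios, min_gene_fraction=0.6, min_conditions=2
)
print(kept_experiments, log_ratios)
```

With the default thresholds, this corresponds to the 85% and 80-condition cut-offs quoted in the text.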
Although there are more advanced methods to treat this type of data,80 we decided for now to make this simplifying assumption, and we plan to address the issue in further work.
Footnotes
† This article is part of a Molecular BioSystems themed issue on Computational and Systems Biology.
‡ Electronic supplementary information (ESI) available: Supplementary Figures S1–S4 and Tables S1–S6. See DOI: 10.1039/b913876p
This journal is © The Royal Society of Chemistry 2010