Angela
Re‡
a,
Davide
Corá‡
bd,
Daniela
Taverna
cd and
Michele
Caselle
*bd
aCIBIO-Centre for Integrative Biology, University of Trento, Via delle Regole 101, I-38100 Trento, Italy. E-mail: re@science.unitn.it
bDepartment of Theoretical Physics, University of Torino and INFN, Via Pietro Giuria 1, I-10125 Torino, Italy. E-mail: cora@to.infn.it; caselle@to.infn.it; Fax: +39 011-6707214; Tel: +39 011-6707205
cDepartment of Oncological Sciences, University of Torino and Molecular Biotechnology Center, Via Nizza 52, I-10126 Torino, Italy. E-mail: daniela.taverna@unito.it
dCenter for Complex Systems in Molecular Biology and Medicine, University of Torino, Via Accademia Albertina 13, I-10100 Torino, Italy
First published on 19th June 2009
In this work, we describe a computational framework for the genome-wide identification and characterization of mixed transcriptional/post-transcriptional regulatory circuits in humans. We concentrated in particular on feed-forward loops (FFL), in which a master transcription factor regulates a microRNA, and together with it, a set of joint target protein coding genes. The circuits were assembled with a two step procedure. We first constructed separately the transcriptional and post-transcriptional components of the human regulatory network by looking for conserved over-represented motifs in human and mouse promoters, and 3′-UTRs. Then, we combined the two subnetworks looking for mixed feed-forward regulatory interactions, finding a total of 638 putative (merged) FFLs. In order to investigate their biological relevance, we filtered these circuits using three selection criteria: (I) GeneOntology enrichment among the joint targets of the FFL, (II) independent computational evidence for the regulatory interactions of the FFL, extracted from external databases, and (III) relevance of the FFL in cancer. Most of the selected FFLs seem to be involved in various aspects of organism development and differentiation. We finally discuss a few of the most interesting cases in detail.
Among the various important consequences of this approach, a prominent role is played by the notion of “network motifs”. The idea is that a complex network (say a regulatory network) can be divided into simpler, distinct regulatory patterns called network motifs, typically composed of three or four interacting components that are able to perform elementary signal processing functions. Network motifs can be thought of as the smallest functional modules of the network and, by suitably combining them, the whole complexity of the original network can be recovered.
In this paper we shall be interested in “mixed” network motifs involving both transcriptional (T) and post-transcriptional (PT) regulatory interactions, and in particular we shall especially focus our attention on the mixed feed-forward loops. Feed-forward loops (FFLs) have been shown to be one of the most important classes of transcriptional network motifs.1,2 The major goal of our work is to extend them to those also including post-transcriptional regulatory interactions.
Indeed, in the last few years it has become more and more evident that post-transcriptional processes play a much more important role than previously expected in the regulation of gene expression.
Among the various mechanisms of post-transcriptional regulation, a prominent role is played by a class of small RNAs called microRNAs (miRNAs), reviewed in refs. 3 and 4. miRNAs are a family of ∼22 nt small non-coding RNAs, which negatively regulate gene expression at the post-transcriptional level in a wide range of organisms. They are involved in different biological functions, including developmental timing, pattern formation, embryogenesis, differentiation, organogenesis, growth control and cell death. They certainly play a major role in human diseases as well.5,6
Mature miRNAs are produced from longer precursors, which in some cases cluster together in so-called miRNA “transcriptional units” (TU),7 and their expression is regulated by the same molecular mechanisms that control protein-coding gene expression. Even though the precise mechanism of action of the miRNAs is not well understood, the current paradigm is that in animals, miRNAs are able to repress the translation of target genes by binding, in general, in a Watson–Crick complementary manner to 7 nucleotides (nts) long sequences present at the 3′-untranslated region (3′-UTR) of the regulated genes. The binding usually involves nts 2–8 of the miRNA, the so-called “seed”. Often, the miRNA binding sites at the 3′-UTR of the target genes are over-represented.8–14
All these findings, in addition to the large amount of work related to the discovery of transcription factor binding sites (for a recent review, see for instance ref. 15), suggest that both transcriptional and post-transcriptional regulatory interactions could be predicted in silico by searching over-represented short sequences of nts present in promoters or 3′-UTRs, and by filtering the results with suitable evolutionary or functional constraints.
Stemming from these observations, the aim of our work was to use computational tools to generate a list of feed-forward loops in which a master transcription factor (TF) regulated a miRNA, and together with it, a set of target genes (see Fig. 1a). We performed a genome wide “ab initio” search, and we found in this way a total of 638 putative (merged) FFLs. In order to investigate their biological relevance, we then filtered these circuits using three selection criteria: (I) GeneOntology enrichment among the joint targets of the FFL, (II) independent computational evidence for the regolatory interactions of the FFL, extracted from the ECRbase, miRBase, PicTar end TargetScan databases, and (III) relevance to cancer of the FFL as deduced from their intersection with the Oncomir and Cancer gene census databases.
![]() | ||
Fig. 1
Feed-forward loops. (a) Representation of a typical mixed feed-forward loop (FFL) analyzed in this work. In the square box, TF is the master transcription factor; in the diamond-shaped box miR represents the microRNA involved in the circuit, while in the round box, the Joint Target is the joint protein-coding target gene (JT). Inside each circuit, –• indicates transcriptional activation/repression, whilst ![]() |
In a few cases some (or all) of the regulatory interactions that composed the feed-forward loop were found to be already known in the literature, with their interplay in a closed regulatory circuit not noticed, thus representing an important validation of our approach. However, for several loops we predicted new regulatory interactions, which represent reliable targets for experimental validation.
Let us finally notice that in this work we only discuss the simplest non-trivial regulatory circuits (feed-forward loops). However, our raw data could be easily used to construct more complex network motifs. For this reason, we make them accessible to the interested investigators as ESI.†
![]() | ||
Fig. 2 Flow-chart of our pipeline for the identification of the mixed feed-forward regulatory loops. We built two independent but symmetrical pipelines for the construction of a transcriptional and, separately, a post-transcriptional regulatory network in humans. On the left: we defined a catalogue of core promoter regions around the transcription start sites (TSS) for protein-coding and miRNA genes in the human genome. We then applied a genome-wide sequence analysis strategy in order to identify a catalogue of human putative transcriptional regulatory motifs and the corresponding regulated genes. In so doing, the key ingredients used were statistical properties of short DNA words (oligo analysis) and conservation to mouse, implemented in an alignment-free manner (conserved over-representation). On the right: a similar strategy was used, starting from a catalogue of 3′-UTRs in humans, to obtain a catalogue of human post-transcriptional regulated genes, with a focus for miRNA-mediated interactions. We fixed 0.1 as the false discovery rate (FDR) level for both the two motifs discovery pipelines. At the end, the two regulatory networks were merged to extract the complete dataset of closed mixed feed-forward loops (FFLs), as defined in Fig. 1a, and the results were filtered according to three different procedures: by looking for (I) significant functional (Gene Ontology) annotations between the joint targets of the FFLs, (II) independent computational evidences for the regulatory interactions of the FFLs, and (III) relevance to cancer. See Materials and Methods for details. |
We then identified, separately for humans and mice, sets of genes (protein-coding plus miRNAs) sharing over-represented oligonucleotides (oligos), 6–9 nts long, in their associated promoter regions. Next, we selected the oligos for which the human and mouse sets contained a statistically significant fraction of orthologous genes. In doing so, we used a binomial model for the assessment of over-representation and an alignment-free evolutionary methodology for the identification of conserved oligos, as previously used in refs. 16 and 17. This approach was also extended to the putative promoters of miRNA genes. All the sequences were repeat-masked, and we took into account either redundancy due to superposition of the same genomic areas or protein-coding exons, or correction for CG content of the sequences themselves. As a final result, we ended up with a catalogue of cis-regulatory motifs conserved in the core promoter regions of human and mouse protein-coding or miRNA genes, each endowed with a score (the p-value of the evolutionary conservation test, described in the Materials and Methods). We then applied corrections for multiple testing and ranking, setting 0.1 as the false discovery rate (FDR).
The last step was the association of the serving motifs with known transcription factor binding sequences (TFBSs), where possible, to obtain a list of putative TF–target gene interactions. To this end we used the TRANSFAC18 database and the list of consensus motifs reported in ref. 13.
Fixing 0.1 as the FDR level, we obtained a catalogue of 2031 oligos that could be associated to known TFBSs for a total of 115 different TFs. These 2031 oligos targeted a total of 21159 genes (20
972 protein-coding and 187 miRNAs), and almost every gene in the Ensembl19 database was present at least once in our network. In parallel to that, our motif discovery procedure further identified 20
216 significant motifs but for which we were not able to make any strong association with known TFBSs consensus.
The dataset of associations between motifs and genes represents our transcriptional regulatory network and was the starting point for the circuits identification (see the ESI, supplementary files S3 and S4†). A relevant role in the following will be played by the subnetwork describing the transcriptional regulation of miRNAs. This subnetwork involves 110 TFs (out of 115 of the whole network) targeting a total of 187 miRNAs (see the ESI, supplementary file S4†).
We were able to obtain a list of 5030 different “single target circuits”, each of them defined by a single TF as master regulator, a single mature miRNA and a single protein-coding joint target. We then grouped together all the single target circuits sharing the same pair of TF and miRNA and obtained as final result, 638 “merged” circuits, each composed by a known TF acting as master regulator, a mature miRNA and a list of protein-coding joint targets (see Fig. 1a). These 638 circuits involved a total of 2625 joint target genes, 101 transcription factors and 133 miRNAs. The number of joint targets in these circuits ranged from 1–38 and 74% of the circuits targeted up to 10 genes.
The raw data relative to these circuits can be found in the ESI, supplementary file S6.†
Besides the motifs used to build the above described circuits, we have several other cis-regulatory upstream motifs in our transcriptional networks that could not be related to a known TFBS. These motifs can be considered as new, putative, regulatory sequences16 and, even if we are not able to associate a precise TF (or any other kind of regulatory mechanism) to them, we decided to extend the above construction to these sequences as well. In these cases it would be too difficult to reconstruct the variability of the binding site for the corresponding putative unknown TF, so we decided to construct only the FFL in which the exact same unidentified and fixed motif was present in the upstream region of both the target protein-coding gene and the co-regulating miRNA and, as above, closed the loop only if the target gene was also a target of the considered miRNA.
In this way, we obtained 4035 different circuits, which included various motifs with different sizes on the promoter regions: 170, 6 nts long; 128, 7 nts long; 440, 8 nts long; 3297, 9 nts long. The number of joint targets in these circuits, after merging on the same cis-regulatory motifs, ranged from 1–5 and 79% of the circuits targeted one single gene.
All the raw data concerning these fixed-motif circuits can be found in the ESI, supplementary file S7.†
As a final result of this analysis, we end with a list of 32 merged mixed feed-forward loops (corresponding to 380 single-target FFLs). These circuits involve a total of 344 joint target protein-coding genes, 24 TFs and 25 mature miRNAs. We report in Table 1 a selected list of such loops with a subset of the most representative Gene Ontology enriched annotations; the complete list of results is available in the ESI, supplementary file S8.†
FFL id | JTs | Fisher test p-value | Gene Ontology characterization |
---|---|---|---|
AP-4|hsa-miR-133b | ADORA1 AP1GBP1 | 7.42e-5 | endocytosis (P) |
AREB6|hsa-miR-126 | STRBP HERPUD1 CARD14 TRIM4 NP_995324.1 | 4.01e-6 | cellular developmental process (P) |
EGFL7 PIK3R1 WFDC12 CDKN2A KLF10 C17orf70 | 3.63e-5 | regulation of osteoclast differentiation (P) | |
RORB FBXL2 PPP3CB | 6.20e-5 | leukocyte differentiation (P) | |
AREB6|hsa-miR-375 | PCSK6 LRP5 HABP2 USP6 GUF1 CNN3 PTPN4 | 1.94e-5 | anterior/posterior pattern formation (P) |
XR_017284.1 ATPAF1 LCN1L1 NLGN3 LRFN1 AQP4 | 7.86e-5 | regionalization (P) | |
TCF2 | |||
C-REL|hsa-miR-126 | ARHGAP22 DSCR1 EGFR PIK3R2 Q96N05_HUMAN | 2.64e-6 | regulation of cell migration (P) |
TOX2 PIK3R1 PARP16 ADAMTS9 EGFL7 | 2.97e-6 | phosphoinositide 3-kinase regulator activity (F) | |
4.00e-6 | regulation of cell motility (P) | ||
4.74e-6 | regulation of locomotion(P) | ||
C-REL|hsa-miR-199a | ENO3 DDR1 SP2 CCNL1 PALLD | 9.10e-5 | transmembrane receptor protein tyrosine kinase activity(F) |
ELF-1|hsa-miR-342 | C22orf15 ADAMTS5 CCDC32 IBRDC2 C5orf24 UBE4B CCR2 RPE PHB Q6PK04_HUMAN | 2.97e-6 | protein ubiquitination during ubiquitin-dependent protein catabolic process (P) |
ER|hsa-miR-135b | GBE1 HCN2 CD99L2 TTC21A BSN RNASE11 NP_787078.1 PRLR | 4.11e-5 | cellular protein complex assembly(P) |
ANGPT2 Q49AQ9_HUMAN | |||
ZNF69 FAM129A FMOD IL11 ISCA1 PR285_HUMAN | |||
CITED1 TGM2 MUSK DEFB123 MFSD3 C17orf28 | |||
NP_057628.1 LZTS2 | |||
HMGIY|hsa-miR-152 | EDG1 Q86V52_HUMAN DMRTA2 SLC25A32 FGF1 ITGA5 MEOX2 EPAS1 ZNF33A ADAM17 MAPK6 RNF182 | 6.48e-5 | angiogenesis (P) |
ICSBP|hsa-miR-223 | ADM GAST PRL GTDC1 FOXO3A | 1.40e-6 | hormone activity (F) reproductive process (P) multicellular organism reproduction (P) |
2.18e-5 | |||
7.49e-5 | |||
IRF1|hsa-miR-126 | EGFR EGFL7 GOLPH3 BDH2 ZADH2 | 8.01e-5 | regulation of cell migration (P) |
IRF-7|hsa-miR-26a | VAX1 GALNT10 CA3 EIF2S1 NDUFA4 | 8.01e-5 | regulation of cell migration (P) |
ARP19_HUMAN FBXO42 RPIA FBXL19 ALS2CR2 | 6.25e-5 | cellular response to stress (P) | |
XR_017723.1 GSK3B DBR1 TTC13 NT5DC1 | |||
MYC|hsa-miR-17-5p | BICC1 STK33 VSX1 EDD1 SLC24A4 NFAT5 E2F1 | 9.40e-0 | cellular metabolic process (P) primary metabolic process (P) |
C21orf25 C9orf117 MYNN MAPK1 | 9.56e-5 | ||
MYOD|hsa-miR-140 | ANK2 TSSK2 EIF2AK1 HMX2 THY1 ALAS2 UROC1 | 7.20e-6 | hemoglobin metabolic process (P) organ development (P) |
CDKL4 PPARA CYBB PPL CDS2 ZIC3 | 6.61e-5 | ||
SRY|hsa-miR-26a | FANCA GSK3B RPIA Q6ZQV3_HUMAN ALS2CR2 | 2.68e-5 | protein export from nucleus (P) |
KIF1C RG9MTD2 CDS1 BAG4 PPP2R3C | 5.64e-5 | anti-apoptosis (P) |
• The Evolutionary Conserved Regions database (ECRbase24) is a collection of evolutionary conserved regions, promoters and TFBSs in vertebrate genomes, based on genome-wide alignments created mainly with the Blastz program. Even if both our pipeline and ECRbase are based on evolutionary conservation, this ingredient is implemented in a very different way in the two approaches. ECRbase looks for conserved blocks identified via whole-genome alignments, while we implemented evolutionary conservation using an alignment-free approach. In this way, we were able to validate 216 TF–target gene links and 98 TF–miRNA links.
•Ref. 25 is a computational study of miRNA biogenesis. The regulatory interactions reported in ref. 25 are of particular interest for our assessment procedure since their pipeline is very different from ours. With this tool, we were able to validate 64 TF–miRNA links. It is interesting to notice that these 64 miRNAs were controlled by only nine transcription factors (the important role of these “hub” TFs was already noticed in ref. 25)
• The miRBase,26 PicTar11 and TargetScan9 databases are by now an accepted standard in the miRNA literature. They are based on strategies that are definitely different from our pipeline and are somehow complementary in their approaches. In this way, we were able to validate the miRNA–target gene link for 607 circuits (503 by miRBase, 343 by PicTar and 560 by TargetScan).
The results of these comparisons are summarized in Table 2a, while Table 3 reports the top ten TFs ranked by out-degree and the top ten miRNAs scored by in-degree.
(a) | General view: | |
---|---|---|
Number of annotated links | Number of circuits | |
3 | 75 | |
2 | 207 | |
1 | 334 |
Detailed view: | ||
---|---|---|
Link type | Number of circuits | |
TF -> miR: | 150 | |
ECRbase: 98 | ||
PMID 17447837: 64 | ||
TF -> JT: | 216 | |
ECRbase: 216 | ||
miR -> JT: | 607 | |
miRBase: 503 | ||
PicTar: 343 | ||
TargetScan: 560 | ||
(b) | FFL id | JTs |
---|---|---|
AML1|has-miR-223 | RHOB DNAJB13 NDUFA3 TBC1D17 NP_001007596.1 IGSF21 SPTLC2 WNT2B RIPK3 ELF5 SLC2A11 C13orf31 FOXO3A | |
LEF1|hsa-miR-138 | MYO3A NP_775790.1 RNMTL1 ZNF704 GPR124 NOTUM KRT83 FGF6 ITK | |
MAZ|hsa-miR-34a | CA9 AKTIP SLC6A3 | |
MEF-2|has-miR-133a | BRUNOL4 PLCL2 | |
SMAD-3|hsa-miR-200b | MAP3K3 MAGED4 MAGED4B BAZ2A EBAG9 ZNF323 SCN5A WBP1 | |
SOX5|hsa-miR-302d,c,c*,b,b* | TSPAN6 E2F2 WWC3 SIDT1 NFX1 C20orf7 GSPT1 ACO1 CHD6 GLT25D1 C19orf40 CLEC10A TNS3 PI15 ZNF291 NP_060887.1 ATP6V0D2 HTR3B LATS2 MAT1A FAM128B CDCP2 GNPDA2 SRGAP2 MON1A C10orf28 HNRPUL2 PBK NP_001034885.1 ZFP42 C9orf31 LRRIQ2 FAM22A TMCO2 HLA-DOA C4A C4B C6orf15 | |
YY1|hsa-miR-101 | Q9HCM6_HUMAN NACA3P PRKD3 PFDN6 RAB15 ARID1A LRRC4 RAB5A FGD6 ARHGAP1 C17orf39 RBM25 NP_060164.3 STC1 FAM114A1 RNF213 Q96NB8_HUMAN |
TF | Out-degree | miRNA | In-degree |
---|---|---|---|
MEIS1 | 31 | hsa-mir-148b | 15 |
ER | 30 | hsa-mir-203 | 14 |
SRY | 29 | hsa-mir-181d | 13 |
HNF-1 | 27 | hsa-mir-99a | 12 |
SOX-5 | 27 | hsa-mir-125b-2 | 12 |
LEF1 | 23 | hsa-mir-423 | 11 |
AREB6 | 22 | hsa-mir-129-2 | 11 |
NCX | 18 | hsa-mir-149 | 11 |
SRF | 18 | hsa-mir-214 | 11 |
C-REL | 17 | hsa-mir-296 | 11 |
In Table 2b we report a selection of a few circuits which turned out to be validated by the above tests. In the ESI, supplementary file S8,† one can find the complete list of results.
We filtered our results looking for circuits containing at least one cancer-related miRNA or target gene. To identify cancer related genes, we used the list of oncomiRs reported in ref. 27 and 28, while for the protein-coding target genes we compiled a list of genes showing mutations in cancer based on the Cancer Gene Census catalogue.
In particular we found 24 circuits in which at least two cancer-related genes (e.g. an oncomiR and a target or a TF and an oncomiR) were present (see Table 4). The full list of cancer-related circuits is available in the ESI, supplementary files S9 and S10.†
FFL id | TF | miRNA | JTs |
---|---|---|---|
AP-1|hsa-miR-142-3p | hsa-miR-142-3p | DDIT3 | |
ATF-1|hsa-miR-199a* | hsa-miR-199a* | MTCP1 | |
ATF6|hsa-miR-199a* | hsa-miR-199a* | MTCP1 | |
ER|hsa-miR-375 | TPR, USP6 | ||
HIF-1|hsa-miR-199a* | hsa-miR-199a* | MTCP1 | |
HNF-3|hsa-let-7a | hsa-let-7a | CCND2 | |
HNF-3|hsa-let-7f | hsa-let-7f | CCND2 | |
HNF-3|hsa-miR-30a-5p | MYH11, BCL9 | ||
HNF-3|hsa-miR-30c | MYH11, BCL9 | ||
HSF2|hsa-let-7a | hsa-let-7a | MYCN | |
HSF2|hsa-let-7f | hsa-let-7f | MYCN | |
HSF2|hsa-miR-199a* | hsa-miR-199a* | MYCN | |
IRF|hsa-miR-125b | hsa-miR-125b | BCL2 | |
IY|hsa-miR-296 | RPL22, BCL2 | ||
MYC|hsa-miR-17-5p | MYC | hsa-miR-17-5p | |
MYC|hsa-miR-19a | MYC | hsa-miR-19a | |
MYC|hsa-miR-20a | MYC | hsa-miR-20a | |
NF-Y|hsa-miR-223 | APC, ATF1 | ||
OCTAMER|hsa-miR-125b | hsa-miR-125b | IRF4 | |
PAX-4|hsa-miR-125b | hsa-miR-125b | IRF4 | |
SOX-5|hsa-miR-125b | hsa-miR-125b | SS18 | |
SOX-5|hsa-miR-29a | EXT1,COL1A1 | ||
SRY|hsa-miR-221 | hsa-miR-221 | CCND2 | |
SRY|hsa-miR-412 | BRAF, ATIC |
CAGACAATG|hsa-miR-125b | hsa-miR-125b | IRF4 | |
GGACTGCAA|hsa-miR-200c | hsa-miR-200c | MTCP1 | |
GCCAACTGA|hsa-miR-199a* | hsa-miR-199a* | MTCP1 | |
GCCCCCC|hsa-miR-200a | hsa-miR-200a | TFRC | |
ACTTCACCC|hsa-miR-125b | hsa-miR-125b | BRD4 | |
CGGGAAAAG|hsa-miR-125b | hsa-miR-125b | BRD4 | |
GGCAATTTA|hsa-miR-19a | hsa-miR-19a | CCND1 | |
AGAACTAAT|hsa-miR-19a | hsa-miR-19a | CCND1 | |
CAGGTTGCA|hsa-miR-200c | hsa-miR-200c | MTCP1 | |
AATTAGTTC|hsa-miR-19a | hsa-miR-19a | CCND1 | |
ATCATTTTA|hsa-miR-125b | hsa-miR-125b | IRF4 | |
AACCAGACA|hsa-let-7e | hsa-let-7e | SDHC | |
GGATCTTAA|hsa-let-7a | hsa-let-7a | CCND2 |
![]() | ||
Fig. 3
Graphical representation of Type I and Type II circuits.
TF is the master transcription factor, miR represents the microRNA involved in the circuit and Joint Target is the joint target gene. Inside each circuit, → indicates transcription activation, whilst ![]() |
![]() | ||
Fig. 4
Graphical representation of the c-Myc|E2F1|hsa-miR-20a circuit, with its extension to E2F2. The c-Myc|E2F1|hsa-miR-20a is the only feed-forward circuit already validated experimentally, as stated in the literature. Its components are embedded in a more sophisticated network, in particular, when mining our database we recognized the interplay with E2F2. E2F2 is down-regulated by hsa-miR-20a at the post-transcriptional level, and it is a direct transcriptional target of E2F1 itself. –• indicates transcriptional activation/repression, whilst ![]() |
In order to quantify the over-representation we performed a set of randomization tests. The results are reported in the ESI, Fig. S1 and details are available in the supporting text.† Briefly, we carried out three types of randomizations:
•Random reshuffling of miRNA promoters and seeds. We rebuilt the entire database of mixed feed-forward circuits (i.e. the entire pipeline designed in Fig. 2), but using randomly shuffled versions of the miRNA promoters and random sets of 7-mers as miRNA seeds. The principle of this procedure was to perform the same analysis of correlation between transcriptional and post-transcriptional regulatory networks, but considering the connection between the two regulatory layers a randomized version of the real known miRNAs, in terms of their in-degree (the miRNA promoter) and out-degree (the miRNA seed).
•Edge switching. We applied a randomization strategy on the real transcriptional and post-transcriptional regulatory network obtained with our pipeline, similar to the one used in ref. 32. The edge switching strategy is able to randomize the real network, preserving the individual degree of each node in the network.
•Complete node replacement. We applied a second, more drastic, randomization strategy on the real transcriptional and post-transcriptional regulatory network obtained with our pipeline, in this case with no constraint on the randomization procedure.32
The results reported in the ESI, Fig. S1 (panel A)† show that for the three randomization strategies, the number of circuits recognized in the real regulatory network is statistically higher than the one found in the random versions (random reshuffling of miRNA promoters and seeds: Z = 3.5; edge switching: Z = 8.3; complete node replacement: Z = 8.9). However, it is important to notice that the actual number of mixed feed-forward loops identified in the randomized versions of the regulatory network is always rather large. Thus, even if the over-representation is statistically significant, it would be very inefficient (i.e. it would lead to a large number of false positive identifications) to use it as the only tool to identify functionally relevant mixed FFLs. Interestingly, our results are in good agreement with a similar analysis reported in ref. 32. This is particularly significant since our approach and that of ref. 32 for the identification of TF and miRNA regulatory interactions are totally different. In ref. 32, the authors presented the first genome-scale Caenorhabditis elegansmiRNA regulatory network that contains experimentally mapped transcriptional TF → miRNA interactions, as well as computationally predicted post-transcriptional miRNA → TF connections. They then looked at the properties of mixed feedback loops, comparing their findings with network randomizations: the average number of loops in randomized networks was always about half the number of real loops they identified.
We observe over-representation of GO terms describing several aspects of organism development such as differentiation, proliferation, apoptosis, programmed cell death and cellular migration. These results are in good agreement with the predictions about the biological meanings of the FFLs reported in ref. 20. Specifically, our data provide evidence for functions of several circuits in the cardiac and skeletal, neural and hematopoietic cell lineages.
A similar pattern emerges if we look at the single-gene enrichment analysis. Multi-cellular organisms development, cell differentiation, cell proliferation and apoptosis directly annotate, respectively, 108, 56 and 48 target genes included in the annotated circuits.
Finally it is interesting to notice that several circuits seem to be involved, according to the GO analysis, in basal mechanisms of post-translational regulation such as protein amino acid phosphorylation and in the ubiquitin cycle (with as much as 57 annotated genes).
All these observations agree with the idea that the mixed (T-PT) motifs and in particular the feed-forward loops that we discuss in this paper play a fundamental role in all those processes (like tissue development and cell differentiation), which are characterized by a high degree of complexity and require the simultaneous fine tuning of several different players. Strikingly, it is worth noting that this result was obtained here with a completely ab initio bioinformatics sequence analysis strategy.
The choice of the c-Myc TF is not random. Besides being a very interesting TF, it is present in several of our FFLs and as such, it plays a central role in the transcriptional side of our regulatory networks. In particular, the first example that we shall discuss below contains c-Myc as master TF.
Looking at the intersection between the 2088 targets of ref. 33, and the 1979 predicted by our analysis, we found 253 targets in common, corresponding to a p-value of 1.1 × 10−6 (Fisher test). This result is even more impressive if compared with the number of intersections of the Zeller dataset with the list of c-Myc targets reported in the TRANSFAC database.18 Out of 235 TRANSFAC targets, only 27 were present in Zeller’s dataset, corresponding to a p value of 0.21.
As a further test, we performed the same comparison for the transcriptional network obtained choosing as promoter the (−500/+100) region around the TSS as promoter. In this case we found 1612 putative c-Myc targets, in which 203 were in common with the dataset of ref. 33, corresponding to a slightly higher p-value p = 8.4 × 10−5.
In order to complete this analysis we also performed the randomization tests and the comparison with the c-Myc database discussed above for the (−500/+100) FFLs. We found comparable results with those obtained in the (−900/+100) case: the number of circuits of the real regulatory network turned out to be statistically higher than the ones found in the random simulations. In particular, for the first two tests, we found an improvement of the Z values, while for the third one, we found slightly worse values of Z. All these results are reported in the second panel of ESI, Fig. S1.† Also for the c-Myc analysis, we found results comparable with those obtained in the (−900/+100) case, with a slight worsening of the p-value of the intersection. More precisely the c-Myc targets in the (−500/+100) transcriptional network turned out to be 1612, of which 203 were in common with the Zeller c-Myc dataset, corresponding to a p-value of 8 × 10−5.
We consider all these findings as an indication of the overall robustness of our results.
In ref. 21, the authors studied various types of feed-forward and feedback loops involving miRNAs, their target genes and transcriptional regulators as a tool to explain the (anti-)correlations between the expression levels of miRNAs and of their target genes. This study relied on a predicted miRNA-mediated network and did not use the transcriptional regulatory network of miRNAs that was unavailable at that time. Hence, to the best of our knowledge, no actual explicit loops were identified (see also ref. 32).
In ref. 22, the authors used pre-compiled TF- and miRNA-mediated networks, and studied global and local properties of the two networks separately. Additionally, they provided a catalogue of network designs in the co-regulated network, including feed-forward loops. Both the TF- and the miRNA-mediated networks in ref. 22 were obtained from sequence-based identification of regulatory features in promoters and 3′-UTRs. This makes the study in ref. 22 more comparable to ours than that in ref. 21. For this reason, we decided to perform a more detailed comparison with our results. Unfortunately, this study did not report explicitly the circuits (including joint target genes) but only provided a list of 16 pairs of co-regulating TFs and miRNAs involved in feed-forward loop. We obtained these pairs using as input the PSSMs (position specific scoring matrices) and microRNAs listed in the supplementary Table S2 of ref. 22 and then mapping the PSSMs to the corresponding transcription factors. We compared this list with our results. It turns out that none of these predictions are contained in our dataset. A detailed comparison of the two pipelines shows that there are a few important reasons behind this disagreement:
• Different annotation for mature miRNA identifiers due to the older miRBase release used in ref. 22 (8.2 vs. 9.2): e.g. pairs involve miR-10 in ref. 22, while miRBase 9.2 reports miR-10a and miR-10b; similarly for miR-142 and miR-142-5p,-3p.
• Different assignment of mature miRNAs to pre-miRNAs: e.g. in ref. 22 the authors assign miR-7 to mir-7-1, while miRBase 9.2 assigns miR-7 to mir-7-3.
• Different organization of pre-miRNAs in transcriptional units: in ref. 22, miRNAs are clustered in precursors according to physical proximity, while we relied on human/mouse conserved transcriptional units reported in ref. 7.
• Different definition of miRNA promoters: ref. 22 uses 10 kb upstream of the 5′-most pre-miRNA for each cluster, while we used 1 kb upstream of the 5′-most pre-miRNA for each transcriptional unit.
• Different solutions for predicted transcription factor binding sites: ref. 22 uses PSSMs from TRANSFAC release 8.3, and using pre-compiled lists of interactions available in the UCSC hg17 genome assembly, while we mainly mapped ab initio conserved and over-represented motifs to transcription factor binding sites.
• Different solutions for predicted mature miRNA binding sites: ref. 22 uses TargetScan (release 3.0) and PicTar (picTarMiRNA4Way track in the UCSC genome browser) while we mapped conserved and over-represented motifs in 3′-UTRs to mature miRNAs by means of miRBase release 9.2.
As a final comment on this comparison, let us stress that probably one of the major novelties of the present analysis with respect to existing works is the particular attention we paid to the definition of miRNA promoters and in the search of their putative binding sequences. Accordingly, besides the final list of FFLs, we consider as one of our most interesting results the subset of our transcriptional regulatory network involving miRNAs as targets. This subnetwork includes a total of 110 TFs targeting 187 miRNAs and is reported in the ESI, supplementary file S4.†
We first present a case in which our pipeline is able to predict circuits already known in the literature and for which all the links are experimentally validated: this is the case of the circuits involving c-Myc as master TF, and hsa-miR-17-5p and hsa-miR-20a as post-transcriptional regulators. In particular, one of the predicted joint target genes results in being the E2F1 gene, in this way closing the circuit exactly on the target gene experimentally assessed and used as a major example in the discussion of ref. 20.
In the remaining examples some (or all) of the genes embedded in the circuits were already annotated to related functions in the literature but their combination in a closed FFL was not noticed. We consider these cases as further successful validations of our approach.
•The c-Myc, hsa-miR-20a/miR-17-5p circuit
In this circuit, c-Myc is the master TF and hsa-miR-20a the post-transcriptional regulator. This circuit contains eleven joint targets, among which is E2F1. The complete list of joint targets is reported in Table 1. The FFL involving E2F1 is well known in the literature. It was discussed for the first time in ref. 29 and is expected to play a role in the control of cell proliferation, growth and apoptosis. With our analysis, we could identify several other genes sharing the same regulatory pattern of E2F1 and we expect that at least some of them could be involved in the same biological processes. In this respect, it is interesting to find among the other targets NFAT5, which is known to play a critical role in heart, vasculature, muscle and nervous tissue development. Similarly, it seems interesting to find MAPK1, which, like E2F1, is an anti-apoptotic gene. These observations could suggest a similar functional role also for the remaining joint targets.
This circuit also allows us to discuss how our data could be used to obtain more complex regulatory motifs. Combining different entries of our databases, it is easy to find a circuit involving, besides c-Myc, hsa-miR-20a and E2F1, also E2F2, which turns out to be simultaneously targeted by E2F1 and by hsa-miR-20a (see Fig. 3). This is a rather non-trivial result, since it is well known that different TFs of the E2F family tend to act together in a concerted way. We see in this example a simple network motif in which this cooperative action is present and is tightly regulated.
•The AREB6, hsa-miR-375 circuit
One of the most interesting entries of Table 1 is the feed-forward loop that involves the transcriptional repressor zinc-finger E-box binding homeobox 1 AREB6 (also known as ZEB1), hsa-miR-375 and a set of 14 joint target genes. Owing to the following observations, we surmise its function in embryonic development and the physiology of the pancreas. ZEB1 is a crucial inducer of the embryonic program ‘epithelial-mesenchymal transition’ (EMT) that facilitates tissue remodelling during embryonic development. miR-375 is essential for embryonic pancreatic islet development, as well as for endocrine pancreas function, where it was demonstrated to regulate the process of exocytosis of insulin during glucose-stimulated insulin release.34 Notably GO analysis globally annotates the set of target genes to patterning in embryonic development, which is consistent with the regulatory roles of ZEB-1 and miR-375. Moreover, the hypothesis of a function in insulin secretion is strengthened by the following observations:35 reports of strong evidence that EMT can provide cells for replacement therapy in diabetes; among the target genes, HNF1β (also known as TCF2) is responsible for MODY,36 a form of diabetes characterized by defective insulin secretion of pancreatic β-cells.
•The MEF-2, hsa-miR-133a circuit
This is one of the entries of Table 2. It contains only two joint targets: BRUNOL4 and PLP2, but the presence of BRUNOL4 turns out to be highly non-trivial. In fact, the myocyte enhancing factor-2 (MEF-2), hsa-miR-133a and the RNA-binding protein BRUNOL4 have been shown to altogether control cardiomyocyte hypertrophy. In this case, it is also possible to envisage a feedback effect, because cardiac repression of BRUONOL4 activity disrupts alternative splicing of MEF-2 and leads to cardiac hypertrophy.37 Finally, it is important to stress that the regulatory interaction between MEF-2 and hsa-mir-133a, which we predicted with our in silico analysis, was indeed observed experimentally in ref. 38.
•The C-REL, hsa-mir-199a circuit
Another interesting circuit relates C-REL, a member of the NFKB family, and miR-199a. MiR-199a has been identified as a miRNA signature in human ovarian cancer. miR-199a down-modulation in epithelial ovarian cells is reported in ref. 39 and, interestingly, miR-199a has lately been shown to affect NFKB activity in ovarian cancer cells.40 Among the joint targets for this circuit, let us mention: DDR1, a receptortyrosine kinase, whose expression is restricted to epithelial cells and significantly high in epithelial ovarian cells, and Sp2, which is a transcriptional repressor of the tumor suppressor gene CEACAM1 in epithelial cells.41
•The HSF2, hsa-let-7f circuit
Looking at the cancer-related list, one of the most interesting entries is the one which relates the transcription factor, HSF2, and the hsa-let-7f miRNA. The DNA-binding protein heat shock factor-2 (HSF2) and hsa-let-7f jointly regulate a number of target genes such as MYCN, ESPL1, PLSCR3, PDCD4, MTO1 and FMO2. Several observations point to an involvement of this circuit in cell cycle progression with relevant implications in cancer. HSF2’s role in cancer is being elucidated42 by the observation that it functions as a bookmarking factor, not only for heat shock responsive genes, but also for genes that are involved in the regulation of cell apoptosis and proliferation (such as Hsp90, Hsp27 and c-Fos). Among the target genes, the MYCN oncogene is crucial in neuronal development, and its amplification is currently the only molecular marker adopted in neuroblastoma clinical treatments. The MYC family oncogenes are known to deregulate cell cycle progression, apoptosis and genomic instability. In neuroblastoma cell lines, N-Myc can induce genomic instability by centrosome amplification. Interestingly, HSF2 and hsa-let-7f regulate the extra spindle poles like-1 (ESPL1) that mediates mitotic sister chromatid segregation. The programmed cell death-4 (PDCD4) is also linked to progression through the cell cycle by mediating MAPK kinase activity and JNK activity. The phospholipid scramblase-3 (PLSCR3) is a mitochondrial integrator of apoptotic signals. Interestingly, also the mitochondrial translation optimization-1 homolog (MTO1) and the flavin containing monooxygenase-2 (FMO2) promote local effects on mitochondria. Finally, MYCN has recently been reported as a direct target of miR-34a. Here we add that let-7f targets MYCN. Notably let-7f belongs to the let-7 family of oncomiRs and, in particular, let-7f has been found to be involved in cell aging.43
As a final remark, we would like to stress that interesting convergence of cooperative biological functions can also be observed in circuits in which we were not able to identify a putative master TF, and therefore were not processed with our assessment pipeline. As an example let us mention the UST gene (Ensembl id: ENSG00000111962), which is involved in heparan sulfate-dependent growth factor signaling during myogenesis and in ion buffering; UST linked to hsa-miR-1 (see the ESI, supplementary file S7†), which in turn promotes skeletal muscle proliferation and differentiation, and is involved in heart electrical conductions as well.44
As a concluding remark it is important to stress that we consider the present work only as a first step along this research line. For both technical and biological reasons, it is likely that we missed several regulatory circuits in our network. We discussed in detail the technical issues and the related problems. Let us comment here on one of the main biological issues, which should certainly be addressed in future works. One of our main assumptions is that we can associate a well defined promoter to a well defined gene. However several recent studies on the widespread presence of alternative splicing and transcription start sites (TSS) (see for instance ref. 45) show that this is probably a restrictive choice. Moreover, alternatively spliced isoforms of the same gene may have completely different functions and play different roles in the regulatory network. More generally the notion of “gene” by itself is experiencing a deep redefinition in the last few years.46 Notwithstanding this, the good agreement that we found with some existing experimental data suggests that our approach may represent a reliable step toward a better understanding of gene regulatory networks, and in particular, it could give some useful insight on the complex interplay of their transcriptional and post-transcriptional layers.
An important role in our analysis is played by the notion of “transcriptional units” (TU), which are clusters of miRNA hairpins located in nearby positions along the DNA, and supposed to be transcribed together in a single poly-miRNA precursor.7 Both cDNA and EST expression data7,48 support the idea that miRNAs belonging to the same TU are co-transcribed. For this reason, we shall treat them as a unique (miRNA) gene and associate the same promoter (the one corresponding to the transcriptional start site (TSS) of the transcriptional unit) to all the miRNAs belonging to the TU.
Taking together isolated miRNAs and TUs, we were able to identify a total of 130 miRNA precursors for the human genome and the corresponding 130 orthologues for the mouse genome. 68 out of 130 were non-genic and 62 were located within a KNOWN gene. A direct inspection showed that 53 of these genic pre-miRNAs shared the same orientation with the host gene, while the remaining nine had the opposite orientation. These 130 precursors corresponded to a total of 193 mature miRNAs. These mature miRNAs and their “seeds” represented the list of input motifs for the target search algorithms and the bases of our discussions.
The list of TUs, their most 5′-upstream members, their genomic coordinates, their locations relative to protein-coding genes and additional orthology annotations can be found in the ESI, supplementary file S1,† for humans and mice. We then provide the corresponding mature miRNAs used in this study in supplementary file S2,† for humans and mice.
• If the pre-miRNA was non-genic, we selected the region ranging from nt − 900 upstream to nt + 100 downstream of the 5′-start of the pre-miRNA.
• If the pre-miRNA was genic, with the same orientation of the host gene, we used the promoter region selected for the host gene.
• If the pre-miRNA was genic, but with opposite orientation with respect to the host gene, we again selected the region ranging from nt − 900 upstream to nt + 100 downstream of the 5′-start of the pre-miRNA.
In all these cases, we then repeat-masked and exon-masked the sequences as we did for the protein-coding genes discussed above. Repeat-masking was performed with the default values provided by Ensembl.
Merging together protein-coding and miRNA promoters, we ended up with a collection of 21446 human and 21
944 mouse regulatory sequences.
It is worth noticing that, differently from the promoter case, the 3′-UTR sequences have different sizes. The average length of human or mouse 3′-UTR regions was ∼1157 nts or ∼982 nts, respectively.
We mapped the predicted TFBSs stored in those databases onto our promoter regions according to genomic coordinates, for protein-coding and miRNA genes. To avoid mismatches due to different masking and/or misannotations, we assigned the binding of ECRbase TF to our gene only if the complete sequence contained in the ECRbase was present in our promoter sequence.
Footnotes |
† Electronic supplementary information (ESI) available: Description of oligo analysis and randomizations for network motifs analysis, randomization results for the network motifs analysis of mixed feed-forward loops, and supplementary files S1–S11. See DOI: 10.1039/b900177h |
‡ Equal contribution |
This journal is © The Royal Society of Chemistry 2009 |