Karin S. Kassahn†, Nic Waddell† and Sean M. Grimmond*
Queensland Centre for Medical Genomics, Institute for Molecular Bioscience, Brisbane, Australia. E-mail: s.grimmond@uq.edu.au
First published on 4th February 2011
The development of next-generation sequencing technologies has enabled the transcriptome to be measured and characterized at a level that was previously unattainable. Shotgun sequencing of RNAs, or RNA-Seq as it is known, provides the means to simultaneously survey locus activity, transcript-specific expression, sequence content of transcripts and transcriptome discovery. This article discusses the current state of RNA-Seq, its potential for redefining transcriptomics and some of the challenges associated with this revolutionary technology.
Karin S. Kassahn: Karin Kassahn's research interests span a broad range of questions in genetics and evolutionary biology. She has pioneered the use of microarray technology to study coral reef ecosystems and has recently joined the Australian cancer genome sequencing initiative as part of the International Cancer Genome Consortium. Her current work focuses on mutation detection in DNA and RNA next-generation sequencing data and the integration of high-throughput sequencing datasets.
Nic Waddell: Nic Waddell is a cancer researcher at the Queensland Centre for Medical Genomics. Her research has focused on the identification of genomic aberrations in cancer cells and the classification of tumours based on gene expression and genomic sequence. Her current research is focused on the analysis of next generation sequencing data to identify biomarkers and alternative treatment strategies in cancer.
Sean M. Grimmond: Sean Grimmond is Director of the Queensland Centre for Medical Genomics, at the Institute for Molecular Bioscience, The University of Queensland. His research focuses on defining the underlying genetic networks controlling tumorigenesis and mammalian development through whole genome, transcriptome and epigenome sequencing.
Insight, innovation, integration
The transcriptome describes the actively expressed genes in a given sample. A survey of the transcriptome is thus crucial for understanding the functional output of the genome in biological systems. Next-generation sequencing of the transcriptome, or RNA-Seq, enables the detection of transcription at an unprecedented scale allowing single nucleotide resolution. It enables identification of novel transcripts and splice variants and can be used to detect allele specific expression, the expression of simple mutations and fusion genes. RNA-Seq approaches are being applied in the study of human diseases including cancer and complement many recent large-scale genome sequencing initiatives. This review describes the current state-of-the-art in transcriptome sequencing and discusses opportunities and challenges when applying this technology.
Fig. 1 A diagram showing a UCSC genome browser view with gene model, splicing events and wiggle plot. RNA-Seq can be used to quantify gene expression and to identify known and novel splicing events. For example, PSMA4, a gene encoding a subunit of the proteasome complex, is represented by three isoforms in the current RefSeq gene models and by one in the Ensembl gene models. All junctions describing the Ensembl isoform were identified by individual reads mapping across these known junctions (track: observed junctions). If paired-end data is available, the mapping positions of the forward and reverse reads in conjunction with the expected library insert size can be used to predict junctions (track: predicted junctions). The same strategy can also predict novel exon combinations (track: novel junctions). In this example, many of the novel exon combinations are supported by AceView gene models; e.g. the AceView gene models predict an isoform PSMA4.eApr07 which skips exon 5 of the Ensembl transcript ENST00000044462 and the RNA-Seq paired-end data provides evidence that this isoform is expressed. The wiggle plot (positive strand hits) shows the read density across the gene and, in combination with the junction information discussed above, informs about exon usage. In this particular example, expression was largely restricted to known exons, but in other instances RNA-Seq has become a powerful tool to identify novel transcription. For simplicity, only single reads (or read pairs) are shown to represent each junction.
More recently still, our understanding of transcriptional complexity has been further expanded by the discovery of many transcripts which lack an open reading frame.5,6 The function of non-coding RNAs and their role in cis- and trans-regulation of gene activity is now well established, with examples including sense-antisense expression, the expression of long non-coding RNAs, and the generation of microRNAs (or miRNAs). The latter are now widely reported in the literature and have been shown to interfere with translation of specific targets and potentially play roles in defining locus boundaries and modifying chromatin status.
Given these recent advances in our understanding of transcriptional complexity in mammals, it is commonly acknowledged that the reference transcriptomes in public databases such as Ensembl or UCSC represent only a fraction of the actual transcript diversity of an organism or even that of a given cell. Based on the amount of transcription from unannotated regions that is commonly observed in RNA-Seq experiments,7–9 the full transcript repertoire may be 25% greater than what is represented by current gene models. In this context, RNA-Seq offers exciting opportunities to quantify gene expression as well as to discover novel, previously undescribed splice variants. RNA-Seq of mRNA was first used to map transcribed sequences in the yeast genome10 and has since been applied to a variety of systems.
With the advent of massively parallel next-generation sequencing, it is now possible to assay transcription at a level not previously practicable. For example, RNA quantification based on RNA-Seq is thought to have a greater dynamic range compared to array-based approaches because read counts do not suffer from the same saturation and sensitivity limitations as array fluorescence signals.11,12 Furthermore, compared to array-based approaches, RNA-Seq has the advantage that novel mRNAs, alternative start and polyadenylation sites and splicing events such as exon skipping, alternative 5′ and 3′ splice sites and novel exon usage can be detected.8,10,13 In this review we will describe the options available for RNA-Seq, show how RNA-Seq is being utilized and discuss the advantages and challenges of the technology.
Fig. 2 A schematic overview of the steps involved in RNA-Seq. RNA is extracted and incorporated into a library prior to purification and sequencing. The sequence reads undergo quality control and are then mapped to a reference genome and to a library of exon junctions. The expression level of transcripts is determined using tag counts and analyses are performed to determine known and novel splicing events. Paired-end sequencing can be used to validate and refine novel expression events detected with RNA fragment libraries.
A single RNA-Seq run generates a large volume of data, and significant computing infrastructure is necessary to store, map and analyze these data, although future advances in computing technology are likely to meet the demands of RNA-Seq experiments very soon. Nevertheless, at the present time, many laboratories do not have access to the computing infrastructure and bioinformatic resources necessary to perform next-generation sequencing. As an alternative to local compute clusters, analysis pipelines that use cloud computing are being developed. Crossbow (http://bowtie-bio.sourceforge.net/crossbow/index.shtml) is a pipeline that runs the read aligner Bowtie14 and the SNP calling software SoapSNP15 on the Amazon EC2 compute cluster. As next-generation sequencing technologies become more readily available and more widely utilized, we expect that there will be an increase in the use and development of such virtual compute environments. Data security for computing in cloud environments will become of paramount importance, especially for studies handling sensitive patient data.
Several factors have been shown to affect quantification, such as the depth of sequencing and the length of the transcript to be quantified. The depth of sequencing will determine the ability to detect and quantify rare transcripts,7 while the length of the transcript affects the probability with which tags are detected and hence the statistical power to detect differential expression.20 Specifically, longer transcripts are over-represented among differentially expressed transcripts compared to shorter transcripts.20 Given that some gene classes tend to be composed of longer genes, this transcript length bias may affect downstream interpretation of the types of pathways dysregulated in comparisons of different experimental treatments. Development of more sophisticated statistical analysis approaches for RNA-Seq data will thus be of paramount importance.
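The length effect described above can be illustrated with a small calculation. The following is a minimal sketch, not the method of the cited study: it assumes reads are sampled uniformly along a transcript (so the expected count scales with length times abundance) and uses a simple Poisson approximation for the test statistic. The function names, the `DEPTH` constant and the transcript lengths are hypothetical values chosen for illustration.

```python
import math

def expected_count(length_bp: int, abundance: float, depth_factor: float) -> float:
    """Expected read count, assuming reads are sampled uniformly along the
    transcript so that the count scales with length * abundance."""
    return length_bp * abundance * depth_factor

def poisson_z(count_a: float, count_b: float) -> float:
    """Approximate z-statistic for the difference between two Poisson counts."""
    return (count_b - count_a) / math.sqrt(count_a + count_b)

DEPTH = 0.01        # hypothetical: reads per base per unit abundance
FOLD_CHANGE = 2.0   # the same two-fold change is applied to both transcripts

for name, length in [("short", 500), ("long", 5000)]:
    control = expected_count(length, 1.0, DEPTH)
    treated = expected_count(length, FOLD_CHANGE, DEPTH)
    print(f"{name}: z = {poisson_z(control, treated):.2f}")
```

Under these assumptions the two transcripts undergo an identical two-fold change, yet the longer one yields the larger statistic simply because it collects more reads.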
These challenges aside, RNA expression quantification from RNA-Seq is typically performed in the following analysis steps. Initially, RNA-Seq reads are mapped to the genome as well as to a library of exon junctions. To this end, several software applications are available, with RNA-Mate having become a popular choice for mapping SOLiD sequencing reads,21 while BWA22 and Bowtie23 are popular choices for mapping Illumina sequencing reads. Novel exons may be identified by the presence of a cluster of tags that map outside known exons.
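The idea of flagging novel exons from clusters of tags outside known exons can be sketched as follows. This is an illustrative toy and not the procedure of any tool named above: it assumes reads on a single chromosome and strand, half-open coordinates, and hypothetical `max_gap` and `min_reads` thresholds.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome/strand

def overlaps_any(read: Interval, exons: List[Interval]) -> bool:
    """True if the read overlaps at least one annotated exon."""
    return any(read[0] < e_end and e_start < read[1] for e_start, e_end in exons)

def novel_exon_candidates(reads: List[Interval], exons: List[Interval],
                          max_gap: int = 50, min_reads: int = 3):
    """Cluster reads that fall outside known exons; return (start, end, n_reads)
    for clusters supported by at least min_reads reads."""
    outside = sorted(r for r in reads if not overlaps_any(r, exons))
    clusters: List[Tuple[int, int, int]] = []
    for start, end in outside:
        if clusters and start - clusters[-1][1] <= max_gap:
            c_start, c_end, n = clusters[-1]
            clusters[-1] = (c_start, max(c_end, end), n + 1)  # extend cluster
        else:
            clusters.append((start, end, 1))                  # open new cluster
    return [c for c in clusters if c[2] >= min_reads]

# Hypothetical annotation and mapped reads
EXONS = [(100, 200), (500, 600)]
READS = [(120, 170), (300, 350), (310, 360), (340, 390), (520, 570), (800, 850)]
print(novel_exon_candidates(READS, EXONS))  # [(300, 390, 3)]
```

The isolated read at 800 to 850 is discarded because a single tag is weak evidence; the threshold would need tuning to sequencing depth in practice.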
Given our incomplete knowledge of eukaryotic transcriptome diversity, the library of exon junctions against which sequence reads are mapped in the first instance commonly includes all theoretically possible exon combinations for a given gene locus, so that novel combinations of known exons can be identified in the sample. Nevertheless, mapping will still fail if the read spans a junction that is not represented in the junction library or one that involves one or two novel exons. To overcome these challenges, the software QPALMA was developed for de novo junction mapping.24 This software uses information from known splice sites, including intron length models, to train a support vector machine that can then be applied to identify novel junctions in a test sample. However, the software requires a set of known exon junctions for training and has long computational run times. More recently, TopHat, an ab initio method for the detection of splice sites, was developed; it does not rely on a training set of known splice sites and has favorable computational run times.25 TopHat first maps reads to the genome and assembles sequence reads into putative exons based on read coverage in a stretch of contiguous sequence. Exon junctions are then identified by first comparing against known splice junctions and then considering all pairings of known splice donor and acceptor sites within a region. Non-canonical donor and acceptor sites are not considered at present. In this context, the availability of paired-end and mate pair data can help reduce the number of possible splice combinations that need to be considered and thus significantly speed up analysis as well as potentially provide more accurate exon junction predictions.
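The exhaustive junction library described at the start of this paragraph can be sketched in a few lines. This is a simplified illustration under stated assumptions: exons are given as sequences in transcriptional order for a single locus, and the hypothetical `flank` parameter sets how many bases are taken from each side of the junction (in practice it would be tied to read length). The function name and the toy sequences are invented for the example.

```python
from itertools import combinations
from typing import Dict, List, Tuple

def build_junction_library(exon_seqs: List[str], flank: int) -> Dict[Tuple[int, int], str]:
    """Return a junction sequence for every ordered pair of exons (i < j):
    the last `flank` bases of exon i joined to the first `flank` bases of exon j."""
    library = {}
    for (i, upstream), (j, downstream) in combinations(enumerate(exon_seqs, start=1), 2):
        library[(i, j)] = upstream[-flank:] + downstream[:flank]
    return library

# Hypothetical three-exon locus
EXON_SEQS = ["AAAACCCC", "GGGGTTTT", "ACGTACGT"]
LIBRARY = build_junction_library(EXON_SEQS, flank=4)
for pair, seq in LIBRARY.items():
    print(pair, seq)
```

A locus with n exons yields n(n-1)/2 junctions, so the library includes the exon-skipping junction (1, 3) even if no annotated transcript uses it. Junctions involving unannotated exons are absent, which is the limitation noted above.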
Finally, a comprehensive RNA-Seq analysis package, Alexa-Seq, has been recently published that allows identification and quantification of alternative transcripts.26 Alexa-Seq uses an internal database to store ‘expression features’ that can be defined by any gene model. RNA-Seq reads are then mapped against these features to identify known and predicted transcript isoforms. Differentially expressed features can also be determined using this software; this feature-based approach can be used to determine the relative use of different transcript isoforms in a given system.
An alternative approach to transcriptome discovery is based on the de novo assembly of transcriptomes and would appear particularly useful for the identification of exon skipping, intron retention and novel, alternative splicing events. Recent advances in this area use de Bruijn graphs and overlapping k-mers to assemble short reads into contigs.27,28 However, sequencing errors significantly complicate the analysis and also result in very long computational analysis times. A recent software application, ABySS, tries to reduce the computational load by using a parallel implementation of the algorithm.27 The approach was able to assemble approximately 30 million bases of human transcriptome sequence into some 800 000 contigs, representing some 1 percent of the genome.27 Despite these advances, the de novo assembly of mammalian transcriptomes remains challenging and significant advances in this area may have to await a new generation of next-gen sequencing technologies capable of producing longer sequence reads.
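To make the de Bruijn graph idea concrete, the following minimal sketch builds a graph from overlapping k-mers and walks an unambiguous path to recover a contig. It is a toy under strong assumptions (error-free reads, a single strand, extension decided by out-degree only) and is not the ABySS implementation; the function names and reads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

def de_bruijn_graph(reads: List[str], k: int) -> Dict[str, Set[str]]:
    """Nodes are (k-1)-mers; each k-mer in a read adds an edge prefix -> suffix."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph: Dict[str, Set[str]], start: str) -> str:
    """Extend from `start` while the current node has exactly one successor,
    stopping at branches, dead ends or cycles."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Hypothetical overlapping short reads
READS = ["ATGGCG", "GGCGTG", "CGTGCA"]
GRAPH = de_bruijn_graph(READS, k=4)
print(extend_contig(GRAPH, "ATG"))  # ATGGCGTGCA
```

A single sequencing error would add spurious k-mers and create branches that halt this naive walk, which illustrates why errors complicate assembly.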
A large proportion of the published RNA-Seq studies are from sequencing small RNA cDNA libraries derived from the size-selected small RNA fraction of total RNA. MicroRNA and other small non-coding RNAs are involved in transcriptional regulation.36 RNA-Seq has been used to characterize novel and known short RNAs in a variety of systems37–39 and tools for discovering new miRNAs in deep sequencing data have been developed.40,41 RNA-Seq of the short RNAs has also led to the identification of a new class of small RNAs called tiRNAs or transcription initiation RNAs.42 These tiRNAs are approximately 18 nt in length and are associated with highly expressed RNA transcripts and RNA polymerase II binding sites.
RNA-Seq has also been applied to gain a better insight into the processes of transcription and translation. Ribosome profiling, whereby sequencing is performed on ribosome-protected transcripts that are actively being translated, has the potential to quantify the proteins that a cell produces and provides a unique opportunity to assay protein translation.43 Global run-on sequencing (GRO-seq) is able to map and quantify nascent RNA that is associated with transcriptionally engaged polymerases.44 This assay highlighted the complexity of transcription by revealing the presence of transcriptionally active polymerases situated upstream of, and in opposite orientation to, the gene. RNAs produced by these divergent polymerases were identified in low concentrations in the small RNA fraction and likely have a role in transcriptional regulation.
With regard to quantifying the expression levels of individual transcripts, diagnostic junctions are only indicative of the presence of a specific isoform, but on their own cannot provide accurate isoform quantification. For example, when diagnostic events for more than one variant transcript are identified, it is impossible to be certain that they have not both arisen from a novel transcript that has not been previously observed. Furthermore, because some alternative exons are found in multiple transcripts, reads mapping to such exons cannot be unambiguously assigned to a single RNA.
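The ambiguity of reads on shared exons can be shown with a small sketch. It assumes hypothetical isoform models given as sets of exon labels and simply separates exons unique to one isoform (diagnostic) from those shared by several; the names `tx1` to `tx3` and `e1` to `e4` are invented for the example.

```python
from typing import Dict, List, Set

def isoforms_per_exon(isoforms: Dict[str, Set[str]]) -> Dict[str, List[str]]:
    """Map each exon to the sorted list of isoforms that contain it."""
    support: Dict[str, List[str]] = {}
    for name in sorted(isoforms):
        for exon in isoforms[name]:
            support.setdefault(exon, []).append(name)
    return support

def diagnostic_exons(isoforms: Dict[str, Set[str]]) -> Dict[str, str]:
    """Exons found in exactly one isoform, mapped to that isoform."""
    return {exon: names[0]
            for exon, names in isoforms_per_exon(isoforms).items()
            if len(names) == 1}

# Hypothetical gene models for one locus
MODELS = {
    "tx1": {"e1", "e2", "e3"},
    "tx2": {"e1", "e3"},
    "tx3": {"e1", "e3", "e4"},
}
print(diagnostic_exons(MODELS))
print(isoforms_per_exon(MODELS)["e1"])  # ['tx1', 'tx2', 'tx3']
```

In this toy locus a read on `e1` or `e3` is compatible with all three isoforms, and `tx2` has no diagnostic exon at all, so its abundance cannot be read off directly from exon-level counts.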
Technical challenges associated with the library generation method also need to be resolved. It is well recognized that biases and sequence errors can be introduced during RNA fragmentation and cDNA synthesis. Another area requiring improvement is the amount of starting material required for RNA-Seq experiments. Recent studies involving direct sequencing of RNA47 and single blastocyst transcriptome sequencing48 have shown the potential of single molecule sequencing and new library approaches to improve input requirements.
Footnote
† These authors contributed equally to the work.
This journal is © The Royal Society of Chemistry 2011