Open Access Article
Dan Pu (a,b) and Pengfeng Xiao* (b)
(a) School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing, 400065, China
(b) State Key Laboratory of Bioelectronics, School of Biological Science and Medical Engineering, Southeast University, Nanjing, 210096, China. E-mail: xiaopf@seu.edu.cn; Fax: +86-25-83793310; Tel: +86-25-83793310
First published on 16th August 2017
A real-time decoding sequencing technology developed by our group offers long read length, high sequencing accuracy, and high compatibility, giving it great potential for high-throughput sequencing (HTS) platforms. Here, we first discuss its potential advantages in HTS in terms of read length, sequencing accuracy, and turnaround time. We then discuss its disadvantages, namely homopolymers and chain decoding mistakes, and how to handle them in resequencing and de novo sequencing. We also describe the characteristics of this technology for HTS in terms of error correction and discrimination of SNPs, deletions, and insertions. Finally, the existing sequencing platforms with which this technology is compatible are discussed. This technology is compatible not only with the first-generation sequencing platform, but also with the second-generation and even the third-generation sequencing platforms. It will further improve the advantages of existing sequencing platforms (read length of the PGM and 454 systems) and compensate for some disadvantages of other next-generation sequencing (NGS) platforms (sequencing accuracy of the PGM sequencer). We hope it will provide a promising new technology for researchers and customers to extend the applications of current and upcoming platforms in almost every area of the life and biomedical sciences.
000
000 to $1000 because of extraordinary advancements in DNA sequencing technologies.1,2 The road to this milestone involved many sequencing technologies and these sequencing technologies have profoundly altered our understanding of biology, human diversity, and disease. Emerging sequencing technologies can be grouped into first-, second-, and third-generation sequencing. The automated Sanger method is considered as a ‘first-generation’ technology.3 Although the first human genome sequence was interrogated by this technology, limitations of low throughput and high cost of this technology showed a need for new and improved technologies to sequence large numbers of human genomes in parallel at lower cost.
To ameliorate these limitations, second-generation sequencing (SGS) was developed. It can be categorized as either sequencing-by-synthesis (SBS) based or sequencing-by-ligation (SBL) based.3,4 SBS-based strategies, which use DNA polymerase to extend many DNA strands in parallel, may be real-time or synchronous-controlled.5 The real-time SBS strategy identifies each newly incorporated nucleotide ‘on the fly’ without interrupting the synthesis process (e.g., Roche/454,6 Ion PGM7). Synchronous-controlled approaches use nucleotide substrates that are reversibly blocked, or simply add only one kind of nucleotide at a time. These strategies include Illumina8 and Fluorogenic DNA sequencing in PDMS microreactors.9 SBL-based strategies use ligase enzymes instead of DNA polymerases to extend DNA strands. These strategies include AB SOLiD10 and Polonator.11 Currently, owing to the longer read length and higher throughput of SBS-based platforms, the sequencing market is dominated by SBS-based platforms (e.g., Illumina HiSeq 2000 (Illumina), 454 GS FLX (Roche), and Ion PGM™ sequencer (Ion Torrent)) rather than SBL-based platforms (e.g., SOLiD 2.0 (Applied Biosystems)). One of the hallmark features of SGS technologies is their massive throughput at a modest cost, with hundreds of gigabases of sequencing now possible in a single run for several thousand dollars.3 However, the complexity of DNA library preparation and the biases introduced by polymerase chain reaction (PCR) amplification may limit broad application to human genome resequencing.
Single-molecule sequencing (SMS) approaches, which are referred to as third-generation sequencing (TGS), were then proposed to ameliorate these limitations.12 These technologies can roughly be binned into three categories: (i) SBS technologies in which single molecules of DNA polymerase are observed as they synthesize a single molecule of DNA (SMRT sequencing (Pacific Biosciences) and HeliScope (Helicos BioSciences)); (ii) nanopore-sequencing technologies (MinION (Oxford Nanopore Technologies)); and (iii) direct imaging of individual DNA molecules using advanced microscopy techniques (sequencing using fluorescence resonance energy transfer (FRET) (VisiGen Biotechnologies)).13 These technologies have advantages over current SGS in terms of throughput, turnaround time, and read length. However, while cost and time have been greatly reduced, the error rates of TGS reach up to 4–11%13–15 and are higher than those of SGS. Furthermore, signal acquisition is still a challenge in TGS.
In addition to improvements in current technologies, other promising technologies that offer higher throughput, longer read length, and lower error rate are eagerly awaited. Not long ago, our group proposed a real-time SBS-based decoding sequencing technology in which a template is determined not by directly measuring the base sequence, but by decoding two sets of encodings obtained from two parallel sequencing runs. Although both this decoding sequencing and the SOLiD system decode codes (or encodings in the decoding sequencing) into a base sequence instead of obtaining the base sequence directly, decoding sequencing differs from the SOLiD system in its sequencing chemistry and decoding procedure. The SOLiD system is based on sequencing-by-ligation (SBL): 8-mer probes, each containing a ligation site (the first base), a cleavage site (the fifth base), and one of 4 different fluorescent dyes (linked to the last base), are applied to interrogate the libraries. The SOLiD cycle of hybridization and ligation, imaging, and probe cleavage is repeated ten times to yield ten color calls spaced in five-base intervals. The extended primer is then stripped from the immobilized templates and another ligation round is performed with an ‘n − 1’ primer. After 5 rounds of sequencing, color codes from the five ligation rounds are aligned to a reference genome to decode the inquired sequence.3 Decoding is achieved by prepending the leading base to the k-mer color codes, from which the base sequence can be reconstructed from the first base to the last.
Decoding sequencing, however, is based on real-time SBS instead of SBL. In addition, natural nucleotides are used instead of fluorescent dye-labeled probes; thus, there is no need to image and cleave probes, and far fewer steps are required in each cycle. This technology applies any two of the three sets of dual mononucleotide additions AG/CT, AC/GT, and AT/CG to interrogate a fragment in two parallel runs, yielding two sets of encodings that contain information about the possible types and numbers of incorporated base(s) in each cycle. The template sequence can be reconstructed by decoding the two sets of encodings with a decoding scheme in which all the encodings and their degraded encodings from the two sets are compared with each other from the first to the last. The nucleotide shared by the two compared encodings is exactly the incorporated nucleotide. This strategy can increase read length and reduce error rate.16 To date, this technology has been successfully applied to genotype several SNPs simultaneously in a single run.17 Additionally, when it is applied to quantitatively analyze SNPs, it allows the detection of alleles with frequencies as low as 3%, which is more sensitive than the ∼5% to ∼7% level detected by conventional pyrosequencing.18
Here, we first discuss the potential advantages of this technology for HTS in terms of read length, sequencing time, and accuracy. We then discuss two major disadvantages, namely chain decoding mistakes and homopolymers, and propose how to handle them in resequencing and de novo sequencing. Finally, we discuss the existing platforms with which this technology is compatible. We hope it will provide a promising new technology for researchers and customers.
:
1. In the 3rd and the 4th cycles, a mixture of dATPαS and dGTP (di-base AG) and a mixture of dCTP and dTTP (di-base CT) are added. That is to say, sequencing is carried out by sequentially adding a mixture of dATPαS and dGTP (di-base AG) and a mixture of dCTP and dTTP (di-base CT). After two sequencing runs, the two sets of encodings are decoded using a scheme that first degrades every encoding in which more than one base was incorporated into single-base encodings, and then compares the two sets of encodings from the first to the last. The nucleotide shared by the two compared encodings is exactly the incorporated nucleotide. The decoding procedures are shown in Fig. 1B.
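The degrade-and-compare scheme can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' software: the names `SCHEMES`, `encode`, `degrade`, and `decode` are hypothetical, the input is treated as the strand being synthesized, signals are assumed error-free, and cycles with no incorporation are omitted from the encodings.

```python
from itertools import groupby

# The three dual mononucleotide addition schemes named in the text.
SCHEMES = {"AG/CT": ("AG", "CT"), "AC/GT": ("AC", "GT"), "AT/CG": ("AT", "CG")}


def encode(strand, scheme):
    """Simulate one error-free run (hypothetical helper).

    Consecutive bases drawn from the same di-base mix collapse into one
    encoding (mix, count); cycles with no incorporation are not recorded.
    """
    if set(strand) - set("ACGT"):
        raise ValueError("strand must contain only A, C, G, T")
    first, second = SCHEMES[scheme]
    label = lambda base: first if base in first else second
    return [(mix, len(list(run))) for mix, run in groupby(strand, key=label)]


def degrade(encoding):
    """Expand each (mix, count) into `count` single-base encodings."""
    return [mix for mix, count in encoding for _ in range(count)]


def decode(encoding_1, encoding_2):
    """Compare two degraded encodings position by position.

    Two different schemes always share exactly one base per position,
    and that shared base is taken as the incorporated nucleotide.
    """
    d1, d2 = degrade(encoding_1), degrade(encoding_2)
    if len(d1) != len(d2):
        raise ValueError("the two runs report different numbers of bases")
    bases = []
    for mix_1, mix_2 in zip(d1, d2):
        shared = set(mix_1) & set(mix_2)
        if len(shared) != 1:
            raise ValueError("encodings must come from two different schemes")
        bases.append(shared.pop())
    return "".join(bases)


strand = "CATTGGAC"
run_1 = encode(strand, "AG/CT")
run_2 = encode(strand, "AC/GT")
print(run_1)
print(run_2)
print(decode(run_1, run_2))  # CATTGGAC
```

A miscounted encoding in either run changes the length of its degraded list, which is why the sketch rejects runs of unequal length instead of decoding them.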
A two-color code matrix is also used to encode sequencing signals. Two-color codes, which carry information about the possible type(s) of incorporated base(s), can also be used to represent the sequencing signals in each cycle.17 Four color codes (blue, green, yellow, and red) and twelve two-color codes are applied to encode the sixteen possible di-bases (Fig. 1C). The color codes blue, green, yellow, and red represent the 4 monodibases AA, CC, GG, and TT, respectively. The remaining two-color codes are used to encode the other twelve di-bases. Additionally, the number of two-color codes represents the number of nucleotides incorporated in each cycle. The two-color codes should satisfy the following requirements. First, two different di-bases that have the same first (or the same second) base get different two-color codes. For example, two-color codes (AC) ≠ (AT) and (AC) ≠ (TC). Second, monodibases get different color codes. For example, two-color codes (AA) ≠ (CC). Third, a di-base and its reverse get the same color code. For example, two-color codes (AC) = (CA). All the identical two-color codes are shown in Fig. 1C. When a template is sequenced by both one-base addition and di-base addition, the corresponding color codes are shown in Fig. 1D and E. The base sequence can be reconstructed with a scheme in which the color shared by the two compared two-color codes identifies the incorporated nucleotide (see Fig. 1E). For example, the first two-color code in the 1st run, which is half green and half red, is compared with the first one in the 2nd run, which is half blue and half green; the color shared by the two codes is green. Thus, the incorporated nucleotide in this cycle is base C. Continuing in this way, all bases including the last one can be decoded.
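The color rules can be checked with a small Python sketch. It assumes, following the text, that blue, green, yellow, and red stand for A, C, G, and T; representing a two-color code as an unordered set of colors is an illustrative choice, and the names `COLOR`, `two_color_code`, and `shared_base` are hypothetical.

```python
from itertools import product

# Color assignment taken from the text: AA, CC, GG, TT are blue, green, yellow, red.
COLOR = {"A": "blue", "C": "green", "G": "yellow", "T": "red"}
BASE = {color: base for base, color in COLOR.items()}


def two_color_code(dibase):
    """Return the code of a di-base as an unordered set of colors.

    A monodibase such as 'AA' yields a single color; a di-base and its
    reverse ('AC' and 'CA') yield the same set, as the third rule requires.
    """
    return frozenset(COLOR[base] for base in dibase)


def shared_base(code_1, code_2):
    """Decode one position: the single color common to both codes."""
    common = code_1 & code_2
    if len(common) != 1:
        raise ValueError("codes must share exactly one color")
    (color,) = common
    return BASE[color]


# Example from the text: green/red in run 1 versus blue/green in run 2.
print(shared_base(two_color_code("CT"), two_color_code("AC")))  # C

# The sixteen di-bases collapse to ten distinct codes under this representation.
all_codes = {two_color_code(a + b) for a, b in product("ACGT", repeat=2)}
print(len(all_codes))  # 10
```

Because a di-base and its reverse share a code, the twelve mixed di-bases map to six distinct two-color codes, which together with the four single colors gives the ten codes counted above.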
In one-base addition technology, the average read length is about 0.5 bp per sequencing reaction when homopolymers are taken into account.16 When the first base is measured, three sequencing reactions are needed to ensure that at least one base is measured in each template. In decoding sequencing, however, the average read length is about 1.5 bp per sequencing reaction when homopolymers are considered.16 Since two nucleotides are added simultaneously to each reaction, two sequencing cycles in a single run ensure that at least one base is determined in each template. Therefore, if we assume that the reaction efficiency and time are identical in the two technologies, the shortest read length of the decoding technology would be three times that of one-base addition technology. This is corroborated by our previously published work.16 In Fig. S1,† when a sequence is interrogated by decoding sequencing, the minimum read length per cycle is 1.583 bp (57/36 = 1.583) while the maximum read length per cycle is 2.111 bp (57/27 = 2.111). Both the maximum and the minimum read lengths exceed 1.5 bp per cycle. In contrast, in the one-base addition technology, the average read length is 0.474 bp (27/57 = 0.474) per cycle. For a larger number of cycles (e.g., 500 cycles) in both technologies, the read length obtained by decoding sequencing (750 bp) is much longer than that obtained by one-base addition technology (250 bp). Thus, decoding sequencing significantly increases the potential read length.
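The difference in bases read per dispensation can be illustrated by counting dispensations for a toy strand. The sketch below is an idealized model, not a reproduction of the Fig. S1 data: the function name `count_dispensations`, the ACGT dispensation order, and the example strand are assumptions, and every dispensation is assumed to extend through all consecutive matching bases.

```python
def count_dispensations(strand, mixes):
    """Count dispensations needed to read `strand` completely.

    `mixes` is the cyclic dispensation order, e.g. ("A", "C", "G", "T") for
    one-base addition or ("AG", "CT") for di-base addition. Each dispensation
    extends through every consecutive base contained in the dispensed mix.
    """
    if set(strand) - set("".join(mixes)):
        raise ValueError("strand contains a base that no mix can incorporate")
    position = 0
    dispensations = 0
    while position < len(strand):
        mix = mixes[dispensations % len(mixes)]
        dispensations += 1
        while position < len(strand) and strand[position] in mix:
            position += 1
    return dispensations


strand = "CATTGGAC"
one_base = count_dispensations(strand, ("A", "C", "G", "T"))
di_base = count_dispensations(strand, ("AG", "CT"))
print(one_base, len(strand) / one_base)  # 14 dispensations
print(di_base, len(strand) / di_base)    # 6 dispensations
```

For this short strand the di-base run needs fewer than half the dispensations of the one-base run; the exact ratio depends on the sequence and on the dispensation order.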
First, we assume that the inquired sequence length is a constant N. Suppose that the four nucleotides A, G, C, and T appear at random in a sequencing reaction; we compare the expected number of sequencing cycles required in one-base addition technology and in decoding sequencing. In the one-base addition technology, nucleotides A, G, C, and T are added cyclically. When the first base has been measured and the downstream bases are independent, the test probability of the second base is 1/3 (without considering homopolymers). Thus, for one-base addition technology, the expected number of required cycles is E1 = 1 + 3(N − 1), where N is the length of the segment to be tested. In decoding sequencing, two nucleotides are added simultaneously. When the first base has been interrogated and the downstream bases are independent, the test probability of the second base is 1. Thus, the expected number of required cycles is E2 = 1 + (N − 1) in a single run. Since decoding sequencing consists of two sequencing runs, the expected number of required cycles over the two runs is E3 = (1 + (N − 1)) × 2. For a fragment of length N, the difference in required cycles between the two technologies is E1 − E3 = N − 2. Thus, the longer the inquired sequence is, the more cycles decoding sequencing saves relative to one-base addition. The time required for decoding sequencing is therefore much less than that for the one-base addition technology when a long sequence is interrogated. Moreover, the two sequencing runs in decoding sequencing can be carried out in parallel, so the time required for two sequencing runs is the same as that for a single run. When the fragment length is constant, decoding sequencing needs fewer cycles (and thus less operation time) than the one-base addition technology.
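The cycle counts above are simple enough to tabulate. The following sketch only evaluates the expressions E1 and E3 as stated in the text, under the same assumptions (random bases, no homopolymers); the function names are hypothetical.

```python
def expected_cycles_one_base(n):
    """E1 = 1 + 3(N - 1): expected cycles for one-base addition."""
    if n < 1:
        raise ValueError("fragment length must be at least 1")
    return 1 + 3 * (n - 1)


def expected_cycles_decoding(n, runs=2):
    """E3 = runs * (1 + (N - 1)): expected cycles summed over all runs."""
    if n < 1:
        raise ValueError("fragment length must be at least 1")
    return runs * (1 + (n - 1))


for n in (10, 100, 500):
    e1 = expected_cycles_one_base(n)
    e3 = expected_cycles_decoding(n)
    print(n, e1, e3, e1 - e3)  # the last column equals N - 2
```

If the two runs are performed in parallel, the elapsed number of cycles is that of a single run, E2 = N, which corresponds to `expected_cycles_decoding(n, runs=1)`.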
Second, we assume that the number of cycles is constant in a single run. When decoding sequencing is carried out in two non-parallel runs, the operation time, which includes degeneration and primer extension, is twice that of the one-base addition technology. However, this time cost is worthwhile given that the read length per cycle is longer than that of the one-base addition technology.
| | The first set of encodings | The second set of encodings | Decoded bases |
|---|---|---|---|
| Reference | ![]() | AC2GT1AC1GT1AC2GT1AC1 | ![]() |
| Original data | ![]() | AC2GT1AC1GT1AC2GT1AC1 | ![]() |

a Underlined segments and the bases in the boxes indicate differences between the original data and the reference.
For re-sequencing, reference sequences exist, and follow-up data analysis depends on them. In decoding sequencing, each template is interrogated in two sequencing runs, and two sets of two-color codes are obtained. These two-color codes are compared with the two-color codes of the reference instead of the reference base sequence. The procedure is as follows: (i) the reference base sequences are translated by software into two-color code sequences under the different dual mononucleotide additions (AG/CT, AC/GT, and AT/CG). (ii) The two-color codes of the reference sequences are compared with those of the original data, using a newly developed mapping algorithm, to obtain mapping information. Two situations arise when the two-color codes of the original data do not completely align with those of the reference. First, one set of two-color code sequences perfectly matches the corresponding two-color code sequence of the reference, but the other has one mismatched two-color code. The mismatched two-color code is regarded as a sequencing error. Second, both sets of two-color code sequences have several mismatched two-color codes together with segments of consecutive matched two-color codes of a certain length (such as 18 consecutive two-color codes). In this case, the mismatched fragments are removed, and the segments of consecutive matched codes are retained and used for assembly.
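Steps (i) and (ii) can be illustrated with a toy comparison in Python. The text does not specify the mapping algorithm, so the position-by-position comparison below is only a stand-in for it; `encode` and `mismatched_codes` are hypothetical names, encodings such as `AC2` are used in place of colors, and the reference and read are assumed to start at the same position.

```python
from itertools import groupby, zip_longest


def encode(sequence, mixes):
    """Translate a base sequence into encodings such as 'AC2' for one scheme."""
    first, second = mixes
    if set(sequence) - set(first + second):
        raise ValueError("sequence contains a base outside the scheme")
    label = lambda base: first if base in first else second
    return [f"{mix}{len(list(run))}" for mix, run in groupby(sequence, key=label)]


def mismatched_codes(reference_codes, read_codes):
    """Return the indices at which the read encodings differ from the reference."""
    pairs = zip_longest(reference_codes, read_codes)
    return [i for i, (ref, obs) in enumerate(pairs) if ref != obs]


reference = "ACCGTAAC"
ref_run_1 = encode(reference, ("AC", "GT"))
ref_run_2 = encode(reference, ("AG", "CT"))

# Toy read: run 1 agrees with the reference, run 2 has one miscounted cycle.
read_run_1 = ["AC3", "GT2", "AC3"]
read_run_2 = ["AG1", "CT2", "AG1", "CT1", "AG1", "CT1"]

print(mismatched_codes(ref_run_1, read_run_1))  # []
print(mismatched_codes(ref_run_2, read_run_2))  # [4]
```

This reproduces the first situation described in the text: one set matches the reference perfectly and the other has a single mismatched code, which would be treated as a sequencing error.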
For de novo sequencing, no reference is available, and thus the original sequencing data cannot be aligned to a reference sequence. In general, a template can be interrogated by any two of the dual mononucleotide additions AG/CT, AC/GT, and AT/CG, and the sequence can be reconstructed from any two sets of two-color codes (or encodings). However, chain decoding mistakes may occur in any set of two-color codes. In such cases, a template should be interrogated by all three di-base combinations AG/CT, AC/GT, and AT/CG in three sequencing runs, yielding three sets of two-color codes (or encodings). The three sets provide more information for decoding the original sequence, reducing chain decoding errors.
When raw data from high-throughput DNA sequencing are used to assemble a de novo genome sequence, a sufficient depth of coverage is required. Since a template is interrogated by at least two parallel sequencing runs in this decoding technology, each base is independently detected at least twice. This may dramatically reduce or even eliminate chain decoding errors, similar to the error correction in the SOLiD system.26,27
| Title | Sequence (5′–3′) |
|---|---|
| T1 | ![]() |
| T2 | ![]() |
| T3 | ![]() |
| T4 | ![]() |
| SP | GCCAGGCGGATGTACGGTACG |

a Templates T1–T3 represent the templates used. SP represents sequencing primer. The underlined segments are the hybridization regions with the sequencing primer SP. The segments in the bracket are SNP/deletion/insertion.
For re-sequencing, a reference sequence is available for assembly. The reference base sequence is first converted by software into a two-color code sequence under each di-base addition (AG/CT, AC/GT, and AT/CG). Given a set of two-color codes from any di-base addition, the two-color codes of the reference are compared with those of the original sequencing reads. Of the two sets of two-color codes obtained in two sequencing runs, when one has an ambiguous number of homopolymers in a given fragment, the other must have clear encodings in this fragment. Thus, when the two sets of two-color codes are compared with the corresponding two-color codes of the reference, one will match perfectly, while the other will fail to match only in that fragment. The solution is to use ambiguous alignment, in which a dynamic range of base numbers is designed in advance, to align the two-color codes of the original sequences to those of the reference. Take template T1 as an example: when the template is sequenced by AC/GT and AG/CT, respectively, two sets of two-color codes (or encodings), S1 and S2, are obtained (Fig. 2A). S2 has an ambiguous number of homopolymers in a given fragment while S1 has clear encodings in this fragment. With the ‘ignore’ strategy, the homopolymer segments are discarded and the remaining parts are used to decode the base sequence.
For de novo sequencing, no reference sequence is available. Thus, the original encodings cannot be aligned with reference encoding sequences, making the ambiguous alignment used for re-sequencing impossible. In such cases, the template needs to be sequenced using three different di-base additions (AG/CT, AC/GT, and AT/CG) in three sequencing runs, yielding three sets of encodings. Among the three sets of encodings, if homopolymers exist in one set, the other two sets must be free of homopolymers in this region. Therefore, the correct sequence can be decoded from the two sets of encodings without homopolymers. In Fig. 2B, three sets of encodings (S1, S2, and S3) are obtained. S2 contains homopolymers, but no homopolymer exists in S1 and S3. Thus, the correct sequence can be decoded using S1 and S3.
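A toy example shows how a region that produces one long, hard-to-count signal under one di-base addition is resolved into short signals under the other two. The selection rule below (keep the two sets whose longest run is shortest) is an assumption made for illustration, not the authors' procedure; `encode` and `clearest_two` are hypothetical names, and the strand is not one of the templates T1 to T4.

```python
from itertools import groupby

SCHEMES = {"AG/CT": ("AG", "CT"), "AC/GT": ("AC", "GT"), "AT/CG": ("AT", "CG")}


def encode(strand, scheme):
    """Simulate one error-free run as a list of (mix, count) encodings."""
    if set(strand) - set("ACGT"):
        raise ValueError("strand must contain only A, C, G, T")
    first, second = SCHEMES[scheme]
    label = lambda base: first if base in first else second
    return [(mix, len(list(run))) for mix, run in groupby(strand, key=label)]


def clearest_two(encodings):
    """Pick the two schemes whose longest single-cycle count is smallest.

    Assumed heuristic: long counts are the ones most likely to be misread,
    so the set containing them is left out of the decoding.
    """
    longest = {name: max(count for _, count in enc) for name, enc in encodings.items()}
    return sorted(longest, key=longest.get)[:2]


strand = "CAGAGAGAGT"
encodings = {name: encode(strand, name) for name in SCHEMES}
for name, enc in encodings.items():
    print(name, enc)
print(clearest_two(encodings))
```

Here the AG/CT run reports a single cycle with eight incorporated bases, while the AC/GT and AT/CG runs never exceed two bases per cycle, so the latter two sets would be used for decoding.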
Conventional pyrosequencing is a real-time SBS technology based on adding a limiting amount of dNTPs one at a time to control DNA synthesis; it yields detectable light through a cascade of enzymatic reactions.30 Since this technique is SBS-based and uses natural nucleotides during synthesis, it is compatible with decoding sequencing technology. We validated this decoding sequencing technology on a conventional pyrosequencing platform (PSQ 96MA system) in our previous work, where we called it pyrosequencing with di-base addition.16 We found that it was perfectly compatible with a conventional pyrosequencing platform and that the generated visible light was proportional to the number of incorporated nucleotides when the number of identical nucleotides was less than 7 bp.16 Moreover, the read length of pyrosequencing with di-base addition was nearly 1.5 times that of conventional pyrosequencing, a further improvement in read length. Increasing the read length will reduce the cost of DNA analysis and extend its applications. For example, several SNPs that could not be examined together by conventional pyrosequencing were examined simultaneously in a single run.17
The Roche/454 system uses emulsion PCR31 and pyrosequencing technology.29,32 Therefore, decoding sequencing is compatible with Roche 454. The key advantage of the Roche/454 system is its longer sequence reads. If it is combined with di-base addition, the read length will be further improved, reducing its cost and making it better suited to de novo sequencing of new genomes. However, as with conventional pyrosequencing, a major limitation of the 454 system relates to homopolymers, owing to the lack of a terminating moiety. If decoding sequencing is used on this platform, this limitation can be remedied by the solutions we have proposed.
The Ion Personal Genome Machine (PGM) uses semiconductor sequencing technology. When a nucleotide is incorporated into the DNA molecule by the polymerase, a pyrophosphate is released and hydrolyzed to yield a proton. Instead of detecting light as in pyrosequencing, this technology detects the change in pH to recognize whether a nucleotide has been added. As in pyrosequencing, the voltage detected is proportional to the number of incorporated nucleotides.7,33 Both the decoding technology and the semiconductor sequencing technology use unlabeled nucleotides and are based on real-time SBS. Thus, decoding sequencing is compatible with the PGM. When the dispensation orders AG/CT, AC/GT, and AT/CG are applied, the chip is flooded with one di-base mixture after another. If the mixture does not contain the correct nucleotide, no voltage is detected; if 2 nucleotides are incorporated, double the voltage is detected. Based on the detected voltage, the encoding(s) in each reaction cycle can be obtained. Since decoding sequencing has the potential to increase read length, combining the PGM with decoding sequencing will further improve the read length, making it much more useful for clinical applications and small labs.
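Converting per-dispensation voltages into encodings can be sketched as follows. This is a hypothetical illustration, not an Ion Torrent API: the function name `signals_to_encodings`, the `unit` parameter (the signal produced by one incorporated nucleotide), and the example values are all assumptions, and real signal processing is considerably more involved than rounding.

```python
def signals_to_encodings(signals, mixes, unit=1.0):
    """Turn one signal per dispensation into encodings such as 'AG3'.

    `mixes` is the cyclic dispensation order, e.g. ("AG", "CT"). The signal is
    assumed proportional to the number of incorporated nucleotides, so the
    count is the signal divided by `unit`, rounded to the nearest integer.
    Dispensations with no incorporation are skipped.
    """
    if unit <= 0:
        raise ValueError("unit must be positive")
    encodings = []
    for cycle, signal in enumerate(signals):
        count = round(signal / unit)
        if count > 0:
            encodings.append(f"{mixes[cycle % len(mixes)]}{count}")
    return encodings


# Invented voltages for six dispensations in the order AG, CT, AG, CT, AG, CT.
voltages = [0.05, 1.02, 0.97, 2.10, 2.90, 1.10]
print(signals_to_encodings(voltages, ("AG", "CT")))
# ['CT1', 'AG1', 'CT2', 'AG3', 'CT1']
```

Rounding is the step at which long runs become error-prone: as the count grows, the signal gap between neighbouring counts shrinks relative to the signal itself.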
Pyrosequencing and semiconductor sequencing use natural nucleotides and do not need chemical cleavage. However, they use instantaneous light emission or detection of the electrochemical signals which need constant monitoring, and each individual reaction requires a separate detector, thus affecting their throughput. In 2011, Sims et al. developed Fluorogenic DNA sequencing in PDMS microreactors.9 This technology combined the advantages of pyrosequencing and semiconductor sequencing with fluorescent labeled nucleotide monomers (TPLFNs). When a TPLFN is incorporated into the template immobilized in the micro-reactor by a DNA polymerase, a recessive fluorescently labeled pyrophosphate will be released. The pyrophosphate is degraded to yield fluorescent light. Based on the principle of the decoding sequencing, this technology is compatible with the decoding sequencing. It was confirmed that the original accuracy rate of this technology was about 99% and it not only provided the benefits of pyrosequencing, such as fast turnaround and one-color detection, but also had high throughput. When it is combined with this decoding sequencing, accuracy rate and throughput will be further improved.
The HeliScope from Helicos Biosciences is based on true single molecule sequencing technology and relies on the cyclic interrogation of a dense array of sequencing features.34 It applies a highly sensitive fluorescence detection system to directly interrogate single DNA molecules via sequencing by synthesis. Besides, no terminating moiety is present on the labeled nucleotides. Thus, our technology is compatible with this platform. If it is combined with the decoding sequencing, the read length will be further improved and thus reduce the cost of this platform. In addition, sequencing accuracy is the major problem in TGS. A two-pass strategy was developed to improve raw sequencing accuracy.35 However, if it is combined with the decoding sequencing, at least two sequencing runs are carried out, and the sequencing accuracy will be improved without performing two-pass strategy.
Another attractive advantage of decoding sequencing is its compatibility. In principle, it is compatible not only with the first-generation sequencing platform, but also with NGS and even TGS platforms. Owing to limited conditions in our laboratory, we have validated its feasibility only on a conventional pyrosequencing platform and not on NGS or TGS platforms. Decoding technology has the potential to further improve the advantages of sequencing platforms, such as the read length of the PGM and 454 systems. Moreover, it can also compensate for some disadvantages of other NGS platforms. For example, it could compensate for the low sequencing accuracy of the HeliScope Genetic Analysis System. Although this technology is compatible with only a few platforms so far, it should also be compatible with others as new techniques emerge. We expect it will broadly expand the applications of current and upcoming sequencing platforms.
Decoding sequencing has shown advantages in some unique niches. However, it has two major disadvantages: chain decoding mistakes and homopolymers. As with other SBS-based technologies, homopolymers have prevented it from being applied more broadly, since repetitive DNA sequences are abundant in bacterial and mammalian genomes.38 These regions are difficult to sequence even in relatively small genomes. Here, corresponding solutions have been provided to remedy these limitations in re-sequencing and de novo sequencing.
Current HTS platforms provide a huge variety of sequencing applications to many researchers and projects, and they make it possible for research groups to generate longer read lengths very rapidly at substantially lower costs. When decoding sequencing is applied with compatible platforms, its applications may be further expanded. We hope it will increase the applicability of current and upcoming platforms in almost every area of the life and biomedical sciences.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c7ra06202h
This journal is © The Royal Society of Chemistry 2017