Xutong Liu, Enyang Yu, Qixuan Zhao, Haobo Han* and Quanshun Li*
Key Laboratory for Molecular Enzymology and Engineering of Ministry of Education, School of Life Sciences, Jilin University, Changchun 130012, China. E-mail: quanshun@jlu.edu.cn; hanhaobo@jlu.edu.cn; Fax: +86-431-85155200; Tel: +86-431-85155200
First published on 14th January 2025
DNA is considered an ideal supramolecular material for information storage owing to its high storage density and long-term stability. Enzymes, as green and sustainable tools, offer several unique advantages for DNA-based information storage. These advantages include low cost and reduced generation of hazardous waste during DNA synthesis, as well as improvements in data reading speed and data recovery accuracy. Moreover, enzymes can enable scalable data steganography. In this review, we introduce the application strategies of enzymatic tools in each step of DNA information storage (writing, storing, retrieval and reading). We further address the challenges and opportunities associated with enzymatic tools for DNA information storage, with the aim of developing new techniques to overcome these obstacles.
Conventional data storage devices, such as magnetic, optical and solid-state devices, have several drawbacks that limit their future applications.18 Notably, the rapid growth of global digital data production is driven by the increasing shift of industrial networking demands to the cloud and the proliferation of internet-connected smart devices.19 As data grow exponentially, conventional storage devices approach their physical limitations and struggle to keep pace with digital storage requirements. Moreover, the storage of digital data on a zettabyte scale requires significant physical space. In addition, datacenters for centralized physical storage consume large amounts of electricity and require additional energy for thermal balance and cooling, which in turn drives up the long-term maintenance costs.20 Furthermore, silicon-based information storage systems have a limited data retention time, which is also a great challenge for existing storage devices.21 Thus, there is an urgent need to develop alternative storage media to bridge this ever-widening gap. This great demand for information storage has facilitated the integration of biotechnology and information technology, driving the advancement of DNA-based information storage.
The process of DNA information storage involves four key steps. Digital information is converted into DNA sequences by DNA synthesis (writing). DNA sequences are either encapsulated in silica for long-term storage in vitro or integrated into bacterial genomes or plasmids in vivo to ensure their stability (storing). To extract data from the stored DNA, the target DNA sequences are selectively accessed through PCR and thereby separated from the oligonucleotide pools (retrieval). Subsequently, the DNA molecules are sequenced to convert them back into digital data (reading). A major concern is chemical DNA synthesis, which is often used for writing digital information. The process is tedious and time-consuming, generates toxic by-products, and its cost increases exponentially as the storage capacity scales up.22–25 In contrast, enzymatic DNA synthesis provides a sustainable, ecofriendly and cost-effective strategy.26 In addition, enzymes have gained traction in several areas of DNA information storage, including writing,27,28 rewriting,28–30 retrieval,31,32 and data steganography.33–35 With the development of molecular biology, various enzymes have been utilized as tools to ensure efficient biological catalysis for DNA processing. For instance, engineered DNA polymerases can be used in the synthesis of unnatural nucleic acids for long-term data storage and steganography.36,37 In in vivo DNA data storage, enzymes play a critical role in DNA manipulation at the molecular level, which drives the advancement of DNA-based data storage towards further commercialization.
In this feature article, the key role of enzyme-mediated processes in DNA-based information storage systems is highlighted, especially for the four key steps (writing, storing, retrieval and reading). The critical challenges and opportunities in using enzymes as tools for the scalability and sustainability of DNA information storage are also discussed.
The technological process of a synthetic DNA-based storage system consists of four steps: writing, storing, retrieval and reading. An overview of the steps involved in DNA information storage is illustrated in Fig. 1. Digital data are first converted from binary information (bits) into DNA sequences using an encoding scheme combined with error-correction codes,40,41 for example by mapping 00 to A, 01 to C, 10 to G and 11 to T.
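The two-bit mapping quoted above can be sketched in a few lines of Python. This is only an illustration of the mapping itself: the function names `encode_bytes` and `decode_dna` are hypothetical, and the sketch deliberately omits the error-correction layer and any sequence constraints (such as homopolymer or GC-content limits) that a practical codec would add.

```python
# Illustrative 2-bit-per-nucleotide mapping quoted in the text (no error correction).
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode_bytes(data: bytes) -> str:
    """Convert bytes to a DNA string, two bits per nucleotide."""
    bitstring = "".join(f"{byte:08b}" for byte in data)
    return "".join(
        BITS_TO_BASE[bitstring[i:i + 2]] for i in range(0, len(bitstring), 2)
    )


def decode_dna(sequence: str) -> bytes:
    """Convert a DNA string produced by encode_bytes back to bytes."""
    bitstring = "".join(BASE_TO_BITS[base] for base in sequence)
    return bytes(
        int(bitstring[i:i + 8], 2) for i in range(0, len(bitstring), 8)
    )
```

Under this mapping each byte becomes exactly four nucleotides, so the two-byte message `b"Hi"` is written as the eight-nucleotide strand `CAGACGGC`.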
The encoding process introduces physical redundancy, which serves to prevent data loss during storage. Following encoding, oligonucleotides are synthesized by solid-phase phosphoramidite chemical synthesis or enzymatic synthesis.42,43 Nevertheless, the production of toxic waste is inherent in chemical DNA synthesis, making enzymatic DNA synthesis a sustainable alternative.44 Moreover, the stability of DNA is another critical feature in DNA information storage. The DNA strands can be shielded from environmental degradation by freezing DNA molecules in solution, drying the DNA samples, or encapsulating DNA molecules in silica nanoparticles.45 In addition, the insertion of synthetic DNA into plasmids or its integration into the genome of living cells could potentially be used for intracellular DNA data storage.46,47 Furthermore, selectively accessing the target DNA sequence is referred to as random access, which is more challenging in DNA-based data storage than in digital storage media. Random access in DNA data storage mainly relies on PCR.48 This process can be supported by selective methods such as magnetic bead extraction with probes mapped to data blocks or PCR using primers associated with data blocks.49,50 Thus, the target DNA is selectively retrieved and amplified. Subsequently, DNA molecules are read to recover digital data by sequencing techniques, such as Illumina and nanopore sequencing.51 The success of this step depends on the sequencing coverage and the error rate experienced throughout the decoding process.52 Taken together, emerging technologies in synthetic biology that involve the above processes will be pivotal in transforming DNA data storage into a commercially available technology. This transformation will be facilitated by the advent of enzymatic DNA synthesis, digital microfluidics-based random access, and high-throughput sequencing.
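The dependence of data recovery on sequencing coverage and error rate can be illustrated with a simple per-position majority vote over several reads of the same strand. This is a minimal sketch under strong assumptions: the reads are already aligned, have equal length and contain only substitution errors. The function name `consensus` is hypothetical, and real decoders must also handle insertions and deletions.

```python
from collections import Counter


def consensus(reads: list[str]) -> str:
    """Return the per-position majority base across aligned, equal-length reads."""
    if not reads:
        raise ValueError("at least one read is required")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("this sketch assumes aligned reads of equal length")
    # zip(*reads) yields one column of bases per strand position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*reads)
    )
```

With three reads `"ACGT"`, `"ACGA"` and `"TCGT"`, each carrying at most one substitution at a different position, the vote recovers `"ACGT"`. Higher coverage tolerates a higher per-read error rate, at the expense of more sequencing.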
The typical solid-phase synthesis of oligonucleotides via phosphoramidite chemistry is described in Fig. 2. Current technologies employ a solid support to build up a sequence, nucleotide by nucleotide, through a four-step synthesis cycle.56–59 The first step is deprotection: the dimethoxytrityl (DMT) protecting group of the nucleotide that has been pre-attached to the solid-phase carrier, controlled pore glass (CPG), is removed in the presence of trichloroacetic acid. This exposes the 5′-hydroxyl group, which is used in the subsequent coupling step. In the coupling step, the raw material for DNA synthesis is the phosphoramidite-protected nucleotide monomer. The monomer is mixed with the activator tetrazole, leading to the formation of an activated nucleoside phosphoramidite intermediate. The intermediate is activated at the 3′-end, while its 5′-hydroxyl group remains DMT-protected. Subsequently, the intermediate undergoes a condensation reaction with the free 5′-hydroxyl group of the nucleotide attached to CPG. In the third step, known as capping, any unreacted 5′-hydroxyl groups attached to the CPG are blocked to prevent their extension in subsequent cycles. Acetylation is commonly used to block these hydroxyl groups once the coupling reaction is complete. In the fourth step, oxidation, the phosphite triester linkage is converted into a more stable phosphate triester. This conversion takes place in the presence of iodine, dissolved in tetrahydrofuran, which acts as the oxidant.
Fig. 2 DNA synthesis using a solid-phase phosphoramidite chemistry method. The synthetic cycle involves four steps: deprotection, base coupling, capping and oxidation.
The major drawbacks of chemical DNA synthesis are the use of hazardous chemicals and the inability to synthesize oligonucleotide sequences longer than 200 nt.60,61 Meanwhile, the accuracy of the synthesized oligonucleotides decreases as the DNA chains are elongated. The length limitation of chemical DNA synthesis affects the storage of large-scale DNA information.62 Therefore, it is necessary either to divide the encoded digital information into blocks that are written as short DNA fragments, or to use DNA assembly and ligation to form long DNA fragments. These practical issues may hinder further industrialization of chemical DNA synthesis for data storage.
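The loss of quality with chain length can be made concrete with the standard stepwise-yield estimate: if every coupling succeeds with the same probability, the fraction of full-length strands falls geometrically with the number of couplings. The sketch below assumes a uniform per-cycle coupling efficiency, and the value 0.995 used in the note is an illustrative assumption rather than a figure from the cited works. The function name `full_length_yield` is hypothetical.

```python
def full_length_yield(length_nt: int, coupling_efficiency: float) -> float:
    """Fraction of strands that completed every coupling step.

    The first nucleotide is pre-attached to the support, so a strand of
    length_nt nucleotides requires length_nt - 1 successful couplings.
    """
    if length_nt < 1:
        raise ValueError("length_nt must be at least 1")
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling_efficiency must lie between 0 and 1")
    return coupling_efficiency ** (length_nt - 1)
```

With an assumed efficiency of 99.5% per cycle, roughly 61% of 100-nt strands but only about 37% of 200-nt strands would be full length, which illustrates why long oligonucleotides are difficult to obtain by stepwise chemical synthesis.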
Terminal deoxynucleotidyl transferase (TdT) is a low-fidelity DNA polymerase of the X family,64 which has been explored for applications in DNA information storage and gene synthesis. TdT is considered a natural candidate for enzymatic DNA synthesis because it can catalyze the template-free addition of nucleotides to the 3′-end of a DNA strand. The synthesis of novel DNA strands is controlled by the composition of the nucleotide substrates. In 1959, Bollum first described that TdT could be used as a ssDNA polymerase for template-free de novo DNA synthesis.65 Before this research, DNA polymerases were always used in the amplification of dsDNA, which requires templates and primers. Afterwards, Bollum revealed the enzymatic mechanism of TdT-mediated DNA synthesis.66 Reversible 3′-protecting groups are required to prevent further addition of nucleotides and control strand elongation. However, natural TdT poorly recognizes 3′-blocked dNTPs as substrates. To overcome this limitation of TdT with 3′-blocked dNTPs, Palluk et al. designed a strategy based on the use of TdT–dNTP conjugates to control the synthesis of oligonucleotides.67 As shown in Fig. 3A, TdT was conjugated to a dNTP molecule via a cleavable linker, and the tethered dNTP was incorporated onto the ssDNA primer, where the attached enzyme blocked further elongation through steric hindrance. The linker can then be cleaved to deprotect the strand for subsequent extension. This reversible termination of chain extension by TdT–dNTP conjugates enabled the enzymatic de novo synthesis of a 10-mer oligonucleotide. In addition, Lu et al. constructed an engineered TdT from Zonotrichia albicollis (ZaTdT), which addressed the incompatibility between 3′-ONH2-dNTPs and the catalytic cavity of ZaTdT.68 The engineered ZaTdT showed a 1000-fold higher catalytic activity with 3′-ONH2-dNTPs than the commonly used MmTdT.
To overcome the natural promiscuity of TdT, Lee et al. controlled the extension activity of TdT by using apyrase to degrade free nucleotides,69 as shown in Fig. 3B. This strategy could be used in the synthesis of short homopolymeric blocks, which then encode digital information in the transitions between nucleotides. To further improve the efficiency of TdT-based enzymatic de novo DNA synthesis, Lee et al. designed a multiplexed synthesis method using photolithography to selectively control the extension activity of TdT in an array (Fig. 3C).70 Co2+, the essential metal cofactor of TdT, was trapped by a photocleavable caging molecule and released when the cage was cleaved by UV light. Thus, the catalytic activity of TdT in the multiplexed oligonucleotide synthesis can be regulated across the array by a mask pattern of UV light. This method can individually synthesize 12 unique oligonucleotides, which could simultaneously encode 110 bits of digital data on the array surface. TdT has proven to be a far better candidate for enzymatic oligonucleotide synthesis than its predecessors. Nevertheless, TdT-based enzymatic DNA synthesis still requires optimization to accurately achieve the on-demand synthesis of oligonucleotides and reduce the generation of failure strands.
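Encoding information in transitions between nucleotides, rather than in the exact length of each homopolymeric block, can be sketched as follows. The ternary mapping used here (each digit selects one of the three nucleotides that differ from the previous one) is an illustrative assumption and is not claimed to be the exact scheme of Lee et al. The function names and the starting base are hypothetical, and the strand passed to the decoder is assumed to contain only the synthesized portion.

```python
from itertools import groupby

BASES = "ACGT"


def encode_trits(trits: list[int], start_base: str = "A") -> str:
    """Map each trit (0, 1 or 2) to a transition to a different nucleotide."""
    sequence = []
    previous = start_base
    for trit in trits:
        choices = [b for b in BASES if b != previous]  # three possible next bases
        previous = choices[trit]
        sequence.append(previous)
    return "".join(sequence)


def collapse_runs(strand: str) -> str:
    """Collapse homopolymer runs of any length, e.g. 'CCCGGT' -> 'CGT'."""
    return "".join(base for base, _ in groupby(strand))


def decode_trits(strand: str, start_base: str = "A") -> list[int]:
    """Recover trits from the transitions, ignoring the length of each run."""
    trits = []
    previous = start_base
    for base in collapse_runs(strand):
        choices = [b for b in BASES if b != previous]
        trits.append(choices.index(base))
        previous = base
    return trits
```

Because the decoder collapses runs before reading transitions, a strand such as `CCCTTCAAAA` decodes to the same digits as `CTCA`. Uncontrolled variation in the number of nucleotides added per step therefore does not corrupt the data in this scheme.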
Fig. 4 Enzymatic synthesis of modified DNA and XNA by engineered DNA polymerases. (A) Primer extension of engineered DNA polymerase on DNA, RNA, and 2′-O-methyl templates. Reproduced from ref. 72 with permission from the American Chemical Society, copyright 2021. (B) TNA oligonucleotides amplified from DNA templates for data storage against the nuclease exposure. Reproduced from ref. 73 with permission from the American Chemical Society, copyright 2020. (C) One kilobase LNA synthesis by engineered DNA polymerase. Reproduced from ref. 74 with permission from the American Chemical Society, copyright 2020. (D) Mirror-image DNA information storage for chiral steganography. Reproduced from ref. 76 with permission from Springer Nature, copyright 2021.
Furthermore, due to the degradation resistance of nucleic acid analogues, engineered DNA polymerases can copy DNA templates into XNAs with the potential for long-term DNA data storage. Yang et al. developed an engineered family B DNA polymerase with the ability to synthesize α-L-threofuranosyl nucleic acid (TNA), resulting in the faithful reading of information encoded in TNA,73 as shown in Fig. 4B. The 22 349 bytes of digital information were encoded in 7451 unique DNA oligonucleotides of 83 nt and written into TNA. The engineered DNA polymerase transferred the information between DNA and TNA in a write–store–read cycle to meet the sequencing demands. Through the information writing in TNA, storage files could be completely recovered from the nuclease exposure, revealing that TNA is a biologically stable system for long-term information storage. Meanwhile, engineered DNA polymerase could produce TNA strands of up to 200 nt, providing a viable option to improve the storage capacity. To further increase the capability of XNA synthesis, Hoshino et al. developed the variants of KOD DNA polymerase with high efficiency and fidelity allowing the synthesis of 1000-nt locked nucleic acid (LNA),74 as shown in Fig. 4C.
L-DNA, the enantiomer of natural D-DNA, is an ideal nucleic acid analogue because it is completely resistant to nuclease, but it has identical kinetic and thermodynamic properties to D-DNA.75 Furthermore, L-DNA is unable to form contiguous Watson–Crick base pairs with D-DNA. To overcome the inability of natural DNA polymerases to recognize L-DNA, Fan et al. developed an effective mirror-image DNA information system by chemically synthesizing a mirror-image Pfu DNA polymerase,76 as shown in Fig. 4D. The mirror-image DNA polymerase could achieve the synthesis of a 1500-nt L-DNA strand. The information stored in L-DNA is more resistant to the biodegradation in the natural environment than that in D-DNA. Moreover, chimeric D-DNA/L-DNA molecules were designed to transmit false and secret messages. Using mirror-image Pfu DNA polymerase, the L-DNA sequence in the chimeric DNA key could be successfully amplified. Thus, mirror-image DNA polymerase provides a better solution for data steganography in DNA information storage to encrypt key data.
Furthermore, physical encapsulation has been proposed to improve the stability of DNA data storage. Koch et al. designed a DNA-of-things (DoT) storage architecture in which DNA was encapsulated in silica beads that were then used as a material for 3D printing,81 as shown in Fig. 5A. A 3D-printed Stanford Bunny contained a 45-kb digital DNA blueprint, and 0.3% by weight of the DNA from one generation was amplified and encapsulated into the next generation for multi-round replication. The potential for long-term storage was demonstrated by creating five successive generations of rabbits without loss of information. Additionally, encoded DNA was loaded in polymethyl methacrylate, which was cast in the shape of a lens. This approach provided a novel strategy of physical steganography, which could secretly store digital information in everyday objects. Antkowiak et al. developed an approach for long-term data storage, in which dehydrated DNA was encapsulated in silica nanoparticles and deposited on a glass surface to protect it from degradation (Fig. 5B).82 The accelerated aging experiment revealed that the encapsulated DNA was more stable than unprotected DNA, from which the information could not be recovered. Although these in vitro strategies improve the stability of DNA information storage, the rewriting and retrieval of large-scale DNA information storage remain challenges that need to be addressed.
Fig. 5 Strategies of in vitro long-term DNA data storage. (A) Schematic architecture of the 3D-printed Stanford Bunny, which contained the encapsulated DNA library. Reproduced from ref. 81 with permission from Springer Nature, copyright 2020. (B) Overview of digital microfluidics (DMF)-compatible silica nanoparticles encapsulating the DNA library for DNA data storage. Reproduced from ref. 82 with permission from Wiley, copyright 2022.
Serine integrases are capable of catalyzing the recombination between att sites on both linear and circular DNA substrates.84,85 The outcome of this process depends on the specific position and orientation of the att sites, which can result in the integration, excision or inversion of DNA. Sun et al. developed a recombinase-based site-specific genome engineering toolbox, which can assemble the synthesized 3–8 kb DNA fragments into 51-kb DNA fragments and integrate into the bacterial genomes,86 as shown in Fig. 6A. After a continuous passage of 2000 generations, complete DNA information could still be successfully retrieved. This strategy revealed that the integration of stored information into the bacterial genome environment was stable and error-proof for replication. In contrast to the storage of oligonucleotide pools, the cell growth could automatically regenerate data, thereby avoiding the data loss caused by the long-term storage and frequent retrieval.
Fig. 6 The in vivo DNA information storage. (A) Schematic integration of information DNA into bacterial genomes. (B) Schematic illustration of DNA information storage and rewriting within living cells. (C) The process of randomly rewriting the digital data from the cell pool. Reproduced from ref. 87 with permission from Wiley, copyright 2024. (D) Schematic view of writing digital information in the form of precise DNA sequence edits on pre-made DNA molecules.
Other strategies for in vivo DNA data storage mainly focus on the CRISPR-Cas system.87–91 As an adaptive immune system of bacteria, the CRISPR-Cas system protects the bacteria by acquiring invader DNA and integrating it into the CRISPR array. The system has had a significant impact on genetic engineering and has been successfully used in the data rewriting process. Farzadfard et al. designed a dual-plasmid system based on CRISPR-Cas12a-λRed to rewrite information in vivo,90 as shown in Fig. 6B. The info-plasmid included the crRNA and the encoded information sequence, and the helper plasmid was used to express Cas12a and λRed. Cas12a guided by the crRNA selectively cleaved the target DNA sequence, and λRed then replaced the target DNA fragment by recombination with the info-plasmid via homologous arms to rewrite the information. The ratio of rewritten cells and the accuracy of the rewritten information could reach 94%. Cas9 is another Cas protein that has been used in the intracellular rewriting of DNA information storage.87 The CRISPR/Cas9 system was developed to rewrite an image with the same gRNA in yeast cells (Fig. 6C). The targeted cells could be selectively removed by a counterselection operation. The Trp1 and Ura3 genes of untargeted cells were eliminated, preventing their growth on the synthetic medium without tryptophan. Moreover, the gRNA caused the targeted cells to acquire the complete Ura3 gene, which rendered them non-viable in medium containing 5-fluoroorotic acid (5-FOA) and thereby enabled the counterselection. Thus, the targeted cells were completely removed from the cell pools. Finally, cells containing new information were added into the cell pools to achieve the rewriting of specific information. dCas9, a variant of Cas9, still recognizes and binds to specific DNA targets but has no cleavage activity. This property of dCas9 could be exploited in cooperation with the mutagenic protein APOBEC3A to achieve a programmable system of information rewriting,91 as shown in Fig. 6D. When dCas9 bound the targeted DNA to form an R-loop structure, APOBEC3A efficiently facilitated the mutation of dC to dT in the displaced strand of the R-loop. Subsequently, the rewritten DNA sequence could be read via sequencing. These enzymatic tools provide a useful strategy for rewriting or editing data rather than chemically resynthesizing DNA, thus reducing the cost of storing information in DNA.
Designing unique primers for every data file is a widely used method for random access by DNA polymerase-based PCR amplification.92 The challenge of this method is designing primers that do not conflict with the payloads. As shown in Fig. 7A, 35 files were encoded and segmented into more than 13 million oligonucleotides, with a unique file ID serving as the primer site.93 Each file could be retrieved and recovered with no errors. Furthermore, nested PCR has been proposed to increase the number of file addresses in storage systems. Because the probability of off-target molecular interactions increases with the capacity of the system, the addresses must be sufficiently different from each other in sequence, which limits the number of addresses and thus restricts the total capacity of the system. Purification with magnetic beads was integrated with PCR primers to address the challenge of random access in large-scale DNA information storage,94 as shown in Fig. 7B. A 9-kb file was selectively retrieved from a 5-TB database through a unique PCR primer. The primer was chemically modified and bound to the functionalized magnetic beads and then physically separated from unbound oligonucleotides by emulsion PCR for future reuse. In addition, the nested PCR primer was designed to expand the number of possible addresses and combined with functionalized magnetic beads to further improve the capacity of the database. This strategy could be used to store and access individual files containing at least GBs of data.
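The logic of primer-based addressing can be sketched in software by treating each strand as a payload flanked by a file-specific primer pair and selecting only the strands that carry both sites. This is a toy model under stated assumptions: exact string matching stands in for primer annealing, and the function names and example sequences are hypothetical. Real PCR selectivity depends on hybridization thermodynamics, which is why addresses must differ sufficiently from each other and from the payloads.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def retrieve(pool: list[str], forward_primer: str, reverse_primer: str) -> list[str]:
    """Return the payloads of strands flanked by the given primer pair.

    A strand is 'amplified' if it starts with the forward primer and ends
    with the reverse complement of the reverse primer (exact match only).
    """
    tail = reverse_complement(reverse_primer)
    hits = []
    for strand in pool:
        if strand.startswith(forward_primer) and strand.endswith(tail):
            hits.append(strand[len(forward_primer):len(strand) - len(tail)])
    return hits
```

For example, in a pool holding `"ACGTAC" + "AAAA" + "TGACTG"` and `"TTGGCC" + "CCCC" + "TCAGCT"`, the primer pair (`"ACGTAC"`, `"CAGTCA"`) selects only the first strand and returns its payload `"AAAA"`.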
Fig. 7 Random access in DNA data storage. (A) Design a primer library for PCR-based random access. (B) Using nested, hierarchical primer for addresses. (C) Silica capsules with dye-labelled orthogonal barcodes used to select the specific data by fluorescence-activated sorting. Reproduced from ref. 95 with permission from Springer Nature, copyright 2021. (D) Thermoresponsive proteinosomes using fluorescence-assisted sorting for repeated access in DNA data storage.
To improve the throughput of file retrieval, fluorescence-activated sorting is also a potential approach for data access.95 The plasmid DNA was encapsulated by positively charged silica via electrostatic interaction (Fig. 7C). Orthogonal ssDNA barcodes, describing key features of the image for file selection, were attached to the surface of the silica capsules. Fluorescently labelled oligonucleotide probes were bound to the barcodes on the surface of the silica particles by annealing, enabling the sorting of the target file from a pool of 10⁶ files. To further avoid PCR crosstalk, a strategy based on temperature-dependent semipermeable microcompartments was designed for repeated PCR-based access from complex file pools,96 as shown in Fig. 7D. Semipermeable microcompartments were constructed using protein–polymer conjugates, enabling the localization of biotinylated DNA files in proteinosomes. The proteinosomes were thermoresponsive with reversible temperature-controlled membrane permeability, releasing amplified dsDNA at PCR temperature, thereby significantly reducing the PCR crosstalk. Magnetic particles were incorporated into proteinosomes to retrieve target DNA files by magnetic separation, allowing reliable repeated access to DNA-encoded data.
Erasing is another function that allows different sets of data to be stored using DNA polymerase-based PCR.97,98 Specific files were addressed by PCR and deleted by restriction endonuclease cleavage. Thus, specific files were successfully erased, and new files were repeatedly loaded into the data sets. However, off-target interactions are most likely to occur between files whose address sequences are highly similar to that of the desired file. Orthogonal barcode design and chemical modification of probes could help to achieve a higher capability of data access in the future.
Alpha-hemolysin (α-HL) and Mycobacterium smegmatis porin A (MspA) have been extensively utilized as biological nanopores.106 α-HL is a membrane channel protein that forms β-barrel transmembrane pores with an internal diameter of 1.4 nm.107,108 It has been widely used in the detection of single-stranded nucleic acid molecules owing to its ability to form rigid nanopores with consistent diameters. The transmembrane β-barrel of an engineered α-HL pore contains three recognition sites that can be used to identify all four DNA bases in an immobilized single-stranded DNA molecule,109 as shown in Fig. 8A. Stoddart et al. initially considered that two recognition sites (R1 and R2) in the transmembrane region might be favorable for sequencing, because each base is read twice, first at R1 and then at R2.109,110 This built-in proof-reading mechanism could improve the overall sequencing quality. However, more than two recognition sites may not be practical, as it is difficult to distinguish the ionic current signals generated by three recognition sites from electrical noise. To address this challenge, Stoddart et al. proposed modifying the R1 recognition site in α-HL by mutagenesis to enhance nucleotide discrimination at this site.110 Hydrophobic and bulky side chains provide steric barriers to ion flow, which improve the discrimination of nucleobases at R1 and yield accurate signal recording. Ayub et al. attempted to reduce the number of recognition sites in the α-HL pore by using truncated pores.111 By deleting and mutating amino acids in the β-barrel, a pore with just two recognition sites could be created. Compared to wild-type α-HL with its 5-nm long β-barrel, the pores with shortened β-barrels proved to be more suitable for high-resolution nucleotide sensing. They could bind the positively charged β-cyclodextrin, permitting the continuous recognition of individual nucleoside monophosphates.
Moreover, Stranges et al. constructed a nanopore-based sequencing-by-synthesis (Nanopore-SBS) approach, using a set of nucleotides with polymer tags to allow the discrimination of nucleotides in a biological nanopore,112 as shown in Fig. 8B. A high-throughput sequencing platform was built using nanopore sensors, enabling parallel sequencing of multiple DNA templates at the single molecular level. This approach provided real-time single-molecule electronic DNA sequencing data with single-base resolution.
α-HL is a structurally stable nano-detection device, but its limited pore size (∼1.4 nm) has restricted its application to the analysis of ssDNA, RNA or small molecules. Additionally, the 5-nm long cylindrical β-barrel of α-HL presents a structural limitation for accurate sequencing, because it dilutes the ion current specific to individual nucleotides and yields only minor differences between nucleotides.113
Although MspA has a better recognition performance than α-HL, a common challenge in nanopore sequencing is rapid DNA translocation,116 with speeds exceeding 1 nucleotide per microsecond in both α-HL and MspA. Phi29 DNA polymerase was demonstrated to possess superior performance in ratcheting DNA through the nanopore to slow the rate of translocation.117 An engineered MspA mutant combined with phi29 DNA polymerase was shown to resolve the ionic currents of single-stranded DNA molecules into individual nucleotide signals.118 In contrast to previous DNA translocation tests that were poorly controlled, the addition of a motor protein reduced the fluctuation in translocation kinetics, thus improving the data quality.105 Using the above methods, MspA nanopores can accurately sequence the phiX174 genome in reads of up to 4500 bases in length,119 as shown in Fig. 9A.
One of the biggest advantages of nanopores over Illumina in terms of data output is single-molecule sequencing of an extended alphabet, that is, the ability to sequence not only natural nucleotides but also chemically modified nucleotides.120 Ledbetter et al. used a nanopore sequencing system based on MspA and Hel308 DNA helicase with a time-varying voltage to evaluate the replication fidelity of a six-letter alphabet consisting of the four natural letters, dTPT3 and dNaM,121 as shown in Fig. 9B. The nanopore moved DNA in two steps per nucleotide to produce two distinct ion-current segments. Thus, this nanopore sequencing is sensitive to nucleotide modifications and the unique structure of unnatural nucleotides. Moreover, Thomas et al. achieved extended nanopore sequencing of four synthetic nucleotides using MspA nanopores to evaluate the signal range (Fig. 9C).122 The nanopore system of MspA combined with Hel308 DNA helicase could detect and accurately differentiate all eight different nucleotides, and the conductance signals of the four synthetic nucleotides occupied a larger dynamic range than those of the standard letters. Thus, the application of an extended alphabet could further improve the density of DNA data storage.
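The density gain from an extended alphabet follows from the fact that a position drawn from an alphabet of n letters can carry at most log2(n) bits. The sketch below computes this upper bound. It ignores sequence constraints, error correction and addressing, all of which lower the practical density, and the function name `bits_per_nucleotide` is hypothetical.

```python
import math


def bits_per_nucleotide(alphabet_size: int) -> float:
    """Upper bound on the information carried per position, in bits."""
    if alphabet_size < 2:
        raise ValueError("an alphabet needs at least two letters")
    return math.log2(alphabet_size)
```

Under this bound, the four natural letters carry at most 2 bits per nucleotide, a six-letter alphabet about 2.58 bits, and an eight-letter alphabet 3 bits, a 50% increase over natural DNA.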
Despite its great potential, DNA information storage still faces many challenges owing to the limitations of the physical technology platform. Storing 1 TB of data requires the synthesis of billions of oligonucleotides. However, the current DNA synthesis throughput cannot meet the demands of DNA data storage. When oligonucleotides longer than 100 nt are synthesized, their purity and accuracy gradually decrease, thus limiting the commercialization of DNA information storage. Another major limitation of DNA data storage is the inability to synthesize long DNA sequences by chemical routes. Recent research mainly focuses on the synthesis of oligonucleotides of <200 nt, which could be equally cleaved from long DNA fragments for massively parallel DNA synthesis. Additionally, each oligonucleotide must possess addressing information to ensure data reconstruction. Thus, the shorter the oligonucleotides, the more of them are required, resulting in a higher proportion of addressing information and a reduction in information density.
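The trade-off between oligonucleotide length and addressing overhead can be estimated with a back-of-the-envelope calculation. The sketch below assumes an idealized 2 bits per nucleotide, a base-4 index just long enough to give every oligonucleotide a unique address, and no primers, redundancy or error correction. It also takes 1 TB as 8 × 10^12 bits. All function names are hypothetical.

```python
def address_length_nt(num_oligos: int) -> int:
    """Smallest number of nucleotides giving each oligo a unique base-4 index."""
    length = 0
    while 4 ** length < num_oligos:
        length += 1
    return length


def oligos_needed(data_bits: int, oligo_nt: int) -> tuple[int, int]:
    """Return (number of oligos, address length in nt) for a given oligo length.

    The address grows with the number of oligos, which in turn grows as the
    address takes space from the payload, so the two are solved iteratively.
    """
    address_nt = 0
    while True:
        payload_nt = oligo_nt - address_nt
        if payload_nt <= 0:
            raise ValueError("oligo too short to hold an address and a payload")
        count = -(-data_bits // (2 * payload_nt))  # integer ceiling division
        required = address_length_nt(count)
        if required <= address_nt:
            return count, address_nt
        address_nt = required
```

Under these assumptions, storing 1 TB in 150-nt oligonucleotides takes about 3.0 × 10^10 strands with an 18-nt index (12% of each strand), whereas 50-nt oligonucleotides require about 1.3 × 10^11 strands with a 19-nt index (38% of each strand).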
Enzymes provide effective tools for addressing these issues in DNA information storage. For instance, TdT is a particularly promising enzyme for de novo DNA synthesis, avoiding the contamination by organic reagents and the length limitation of phosphoramidite chemical synthesis. To date, DNA chains of ∼8 kb have been successfully synthesized via the TdT-based strategy. Such synthesis length and speed are far beyond the reach of phosphoramidite chemical synthesis. The ability of DNA polymerases to be engineered to recognize unnatural nucleic acid substrates provides new tools for DNA steganography and cryptography. Additionally, the exploration of key proteins in nanopores has the potential to build sequencing technology for natural or unnatural nucleic acids for information decoding. Meanwhile, DNA polymerase selectively amplifies target DNA fragments through designed primers, and thus DNA polymerase-based PCR technology facilitates data retrieval from large-scale data pools. Besides, integrases and CRISPR-Cas nucleases enable the in vivo rewriting and erasing of stored data.
Although these enzymes show great potential in DNA information storage, there are still some issues to be solved. The main challenge in applying TdT to programmed DNA synthesis is to control the ordered polymerization of nucleotides. Controllable strategies for polymerization should be established via the management of reaction steps such as the degradation of substrates and the release of divalent ion cofactors. In addition, DNA polymerase with high fidelity is an efficient low-error tool to expand the data storage capacity and preservation stability in natural or novel DNA storage architectures. Moreover, single-nucleobase resolution and the stability of long reads are important directions for future nanopore sequencing technology. Improving the reading accuracy of base sequencing technologies is also an important direction to facilitate data steganography in DNA information storage. The challenges faced by each step in DNA data storage are interconnected. Thus, it is crucial to integrate the enzymes that play key roles in the different processes of DNA information storage. The aim is to ensure that these enzymes operate with suitable efficiency, boost their synergistic effects, and prevent any adverse impacts on their activity. Directed evolution and de novo design of proteins have been demonstrated to be effective strategies for engineering enzymes with desired properties or improved functionality. Deep learning and artificial intelligence methods trained on large-scale sequence and structure datasets provide researchers with powerful assistance in “writing” proteins from scratch and creating proteins with entirely new shapes and molecular functions. We believe that enzymes could be tailored to promote high-throughput automated DNA data storage, thereby providing a robust and scalable platform to fulfill industrial requirements.
This journal is © The Royal Society of Chemistry 2025