Open Access Article
Shuntaro Takahashi *ab, Michiaki Hamada cd, Hisae Tateishi-Karimata ab and Naoki Sugimoto *a
a FIBER (Frontier Institute for Biomolecular Engineering Research), Konan University, 7-1-20 Minatojima-Minamimachi, Chuo-ku, Kobe 650-0047, Japan. E-mail: shtakaha@konan-u.ac.jp
b FIRST (Graduate School of Frontiers of Innovative Research in Science and Technology), Konan University, 7-1-20 Minatojima-Minamimachi, Chuo-ku, Kobe 650-0047, Japan
c Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1 Okubo, 169-8555, Shinjuku-ku, Tokyo, Japan
d Cellular and Molecular Biotechnology Research Institute (CMB), National Institute of Advanced Industrial Science and Technology, 2-3-26 Aomi, Koto-ku, Tokyo 135-0064, Japan
First published on 21st August 2025
Nucleic acids (NA), namely DNA and RNA, dynamically fold and unfold to perform their functions in cells. Functional NAs include NA enzymes, such as ribozymes and DNAzymes. Their folding and target binding are governed by interactions between nucleobases, including base pairings, which follow thermodynamic principles. To elucidate biological mechanisms and enable diverse technical applications, it is essential to clarify the relationship between the primary sequence and the catalytic activity of NA enzymes. Unlike methods for predicting the stability of NA duplexes, which have been widely used for over half a century, predictive approaches for the catalytic activity of NA enzymes remain limited due to the low throughput of activity assays. However, recent advances in genome analysis and computational data science have significantly improved our understanding of the sequence–function relationship in NA enzymes. This article reviews the contributions of data-driven chemistry to understanding the reaction mechanisms of NA enzymes at the nucleotide level and predicting novel NA enzymes with catalytic activity from sequence information. Furthermore, we discuss potential databases for predicting NA enzyme activity under various solution conditions and their integration with artificial intelligence for future applications.
Proteins are the primary functional molecules in living organisms, but NAs themselves also exhibit functional activity. Recent advances, such as the Nobel Prize-winning methods for protein structure prediction, have made the design of functional proteins more feasible.5 However, even with empirical knowledge of protein synthesis and folding, practical challenges persist in deploying designed proteins for nanotechnology and medical applications.6 On the other hand, NAs are easier to prepare chemically and biosynthetically than proteins, owing to their simple chemical composition and high water solubility. Therefore, research on their functional application in technology and medicine is highly active. Functional NAs were first discovered in self-splicing RNA for introns.7,8 Such catalytic RNA sequences called ribozymes are widespread across genomes and participate in biological processes such as tRNA processing,9 rolling circle viral genome replication,10 and peptide bond synthesis.11 Various ribozyme families composed of different sequences and structures, such as hammerhead (HH),12 hepatitis delta virus (HDV),13 Varkud satellite (VS),14 and hairpin,15 have been isolated from cells and viruses. Moreover, bioinformatics approaches have contributed to the identification of other ribozyme families, including twister,16 twister-sister,17 hatchet,17 and pistol.17 HH, HDV, and twister family members have relatively small molecular sizes and high catalytic activities. Thus, these family members have frequently been used for in vitro and in vivo applications to generate RNAs with precise termini.16,18–20
Each ribozyme has its own primary sequence, which folds into the secondary and tertiary structures that form its active site (Fig. 1a). Natural ribozymes mainly catalyse the cleavage of RNA strands (Fig. 1b). The catalytic activity of RNA led to the RNA world hypothesis,21 which postulates that in prebiotic life, RNA not only replicated and transmitted genetic information, but also catalysed a number of metabolic reactions. In vitro selection techniques have generated ribozymes exhibiting various catalytic activities22,23 including phosphorylation, ligation, replication, aminoacylation, and Diels–Alder reactions.24–27 Besides RNA, DNA has also been found to exhibit similar catalytic activity in the form of deoxyribozymes (DNAzymes), which act as catalysts for the Pb2+-dependent cleavage of RNA phosphodiester bonds.28 RNA-cleaving DNAzymes offer an attractive modality for targeting undruggable regions of the human genome,29,30 since the solid-phase chemical synthesis of DNA makes it a less expensive and more scalable material than RNA. Moreover, their targeting sequences can be designed for any mRNA, enabling the identification of highly specific, low off-target candidates for safer nucleic acid therapeutics.31,32 NA enzymes combined with aptamers (having molecular recognition ability) and complementary sequences (for DNA/RNA sequence recognition) have been actively investigated for applications such as biosensors and gene switches.33–35 Thus, the development and use of NA enzymes, including ribozymes and DNAzymes, are of great interest in biotechnology.
NA enzymes exhibit enzymatic activity through the formation of secondary and tertiary structures from primary nucleotide sequences via intramolecular base pairing and other interactions. Therefore, understanding the correlation between the sequence and function of NA enzymes is of great importance from both fundamental and applied perspectives.36 However, NAs are subject to greater solution effects than proteins because they are polyanions.37 For example, the activity of NA enzymes is determined by their tertiary structure, by the binding of target molecules, which is dominated by cations, and by interactions with cofactors such as divalent metal ions.38 Notable targets of NA enzymes include DNA and RNA in cells, where the solution conditions differ from those in in vitro test tubes. The intracellular environment is both diverse and densely packed with biomolecules at concentrations ranging from 50 to 400 g L−1.39 This molecular crowding profoundly affects nucleic acid conformation and stability.40 As the effect of solution conditions on the structural stability and enzymatic activity of NA enzymes is highly complex, a systematic examination of sequence- and environment-dependent reactions is required to determine which parts of the sequence are important for the function of an NA enzyme. The accumulation of enzymatic data creates a database of the properties of NA enzymes, which can provide parameters for predicting how the sequences of these enzymes correlate with their catalytic activity. Therefore, progress in the analysis of large RNA and DNA datasets, together with computational approaches using machine learning and artificial intelligence (AI), is expected to advance this field of science and technology considerably. In this review, we introduce such recent advances in the development of NA enzymes using data-driven analysis.
Additionally, we analyse a novel dataset for determining the activities of these enzymes and discuss their utility for future predictions.
The evolutionary process is propelled by mutations in genetic information. Consequently, the generated phenotype may fit more desired (or undesired) characteristics than the original. Thus, evolution toward the desired functions of NA enzymes relies on the “fitness” of the sequence and the efficiency of catalytic activity during each evolutionary step.45 NGS technology facilitates the visualization of these processes, aiding the understanding of how NA enzyme sequences achieve catalytic function from all possible sequences across the entire fitness landscape (Fig. 2).46 The concept of a fitness landscape supports our understanding of evolutionary dynamics, describing a topography in which state variables such as genotype and phenotype are used as coordinates, and the height at each coordinate is the degree of adaptation.47 When applied to NA enzymes, the fitness landscape contributes significantly to our understanding of the improvements in catalytic activity, as knowledge of this adaptive topography makes it possible to predict what evolutionary processes a sequence population will undergo.
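The landscape picture can be made concrete with a very small sketch. The code below assumes a toy landscape in which fitness is simply the number of positions matching a hypothetical optimum sequence (`OPTIMUM` and `toy_fitness` are illustrative inventions, not data from any study cited here), and shows how single-mutant neighbours define uphill steps and local peaks.

```python
from itertools import product

ALPHABET = "ACGU"


def single_mutants(seq):
    """Return all sequences that differ from seq at exactly one position."""
    neighbours = []
    for i, base in enumerate(seq):
        for alt in ALPHABET:
            if alt != base:
                neighbours.append(seq[:i] + alt + seq[i + 1:])
    return neighbours


def is_local_peak(seq, fitness):
    """True if no single mutant has a higher fitness than seq."""
    return all(fitness.get(n, 0.0) <= fitness[seq] for n in single_mutants(seq))


def adaptive_walk(start, fitness):
    """Greedy uphill walk: step to the fittest single mutant until none is better."""
    path = [start]
    current = start
    while True:
        best = max(single_mutants(current), key=lambda s: fitness.get(s, 0.0))
        if fitness.get(best, 0.0) <= fitness[current]:
            return path
        current = best
        path.append(current)


# Toy landscape (assumption): fitness = number of matches to a made-up optimum.
OPTIMUM = "GAC"
toy_fitness = {
    "".join(p): sum(a == b for a, b in zip(p, OPTIMUM))
    for p in product(ALPHABET, repeat=len(OPTIMUM))
}
```

On this smooth toy landscape every walk ends at the single peak; experimentally measured landscapes are generally more rugged, so a greedy walk of this kind may stop at a local peak instead.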
The first study to create a fitness landscape for NA enzymes tested the class II ligase ribozyme.48 The mutated ribozyme pool, containing >10^13 sequences generated via error-prone replication, was incubated with the immobilized substrate strand to select the active ribozymes. After a single round of selection, NGS data revealed the enrichment of sequences with a small Hamming distance from the original sequence (Fig. 3); the Hamming distance is a measure of the difference between two strings of the same length. Furthermore, the kinetically trapped incubation affected the enrichment of sequence reads of active ribozymes, which correlated with the observed catalytic rate constant (kobs) obtained experimentally.48 The large dataset of mutated ribozyme sequences and activities allowed the creation of a fitness landscape with >10^7 genotypes and phenotypes. It demonstrated the importance of sequences in the central bulge of the RNA and the distal end of paired region (helix) 3 (P3), along with other key residues characterised previously, in achieving maximal activity (Fig. 3).49,50 Thus, single or double mutations introduced through error-prone polymerase chain reaction or doped solid-phase synthesis enable the examination of fitness landscapes to identify the key sequence and structural determinants of NAzyme catalytic activity. This approach has been applied to various self-cleaving ribozymes owing to their small molecular sizes, which facilitate the efficient generation of mutants and direct assessment of mutational effects by reading enzyme activity.51–55 For example, analysis of every single and double mutant of the Osa-1-4 twister ribozyme from Oryza sativa (Fig. 4a) demonstrated its unexpected resilience to mutations, despite its compact and intricate structure.
Notably, different structural components showed distinct levels of mutational sensitivity.53 A recent comprehensive mutational study analysed five self-cleaving ribozymes, including CPEB3, HDV, hairpin, and hammerhead (Fig. 4b–e), in addition to twister ribozymes, providing structural information about the ribozymes, including their paired regions, unpaired loops, non-canonical structures, and tertiary structural contacts.55 Additionally, NGS technology was used to study ribozyme evolution from random sequence and random structure space. In the case of Diels–Alderase ribozymes, increasing the selection pressure and analysing the secondary structure through MFold prediction provided insights into how mutations can be rationally introduced to improve catalytic activity.56 The NGS approach can be adopted not only for ribozymes but also for DNAzymes.57 In a high-throughput kinetic analysis, 4096 DNAzyme reactions were assayed simultaneously at multiple time points to determine the observed rate constants (kobs) of 533 active mutants. These values were then used to calculate activation energies (Ea), offering detailed insights into the mutational landscape of the DNAzymes. Deep sequencing enabled this quantitative view of the sequence–function relationship, which would not have been achievable with traditional assays.
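To illustrate how kobs and Ea values of this kind relate to raw measurements, the sketch below makes two simplifying assumptions that are not taken from the cited study: each reaction follows first-order kinetics that run to completion, so that −ln(1 − f) = kobs·t for the cleaved fraction f, and the rate constants follow the Arrhenius equation, so that the slope of ln k against 1/T equals −Ea/R. The function names are illustrative.

```python
import math

R = 8.314462618  # gas constant in J mol^-1 K^-1


def fit_kobs(times, fraction_cleaved):
    """Estimate kobs as the least-squares slope (through the origin)
    of -ln(1 - f) against time, assuming complete first-order cleavage."""
    num = sum(t * -math.log(1.0 - f) for t, f in zip(times, fraction_cleaved))
    den = sum(t * t for t in times)
    return num / den


def activation_energy(temps_kelvin, rate_constants):
    """Estimate Ea (J/mol) from an Arrhenius plot: ln k = ln A - Ea/(R*T)."""
    xs = [1.0 / t for t in temps_kelvin]
    ys = [math.log(k) for k in rate_constants]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return -slope * R
```

In a sequencing-based assay the cleaved fraction of each variant would come from read counts at each time point, and rate constants measured at two or more temperatures are needed before `activation_energy` can be applied.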
Fig. 3 Schematic illustration of population structure before and after one round of in vitro selection of ref. 48. The experimentally constructed fitness landscape clarified that the distal end of paired region (helix) 3 (P3) is a key residue for the class II ligase ribozyme, which had not been identified.
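The Hamming distance used to describe the population around the original sequence is simple to compute. The sketch below counts how many reads fall at each distance from a reference; the function names and the idea of passing reads as a plain list of equal-length strings are assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_profile(reads, reference):
    """Count reads at each Hamming distance from the reference sequence."""
    return Counter(hamming(read, reference) for read in reads)
```

Comparing the profile before and after a selection round shows whether reads close to the reference have been enriched.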
The massive kinetic data on genotype and phenotype are also powerful tools for the analysis and development of NA enzymes triggered by ligand binding. One approach is to rationally develop an aptazyme, which is a combination of an aptamer and a self-cleaving ribozyme, to regulate translation by mRNA cleavage.51,58 In a previous study, all pairwise mutations in the glmS ribozyme triggered by glucosamine 6-phosphate (GlcN6P) were analysed using a custom-built fluorescent RNA array.54 This array was derived using a combined approach involving ribozyme transcription on a sequencing tip and direct measurement of single-molecule fluorescence (detected using a total internal reflection fluorescence microscope). The advantage of this approach is its ability to monitor self-cleavage over short and long timescales, which enables the differentiation of both slow and fast self-cleaving variants. More recently, a kinetic sequencing (k-seq) technique was developed to perform a more accurate kinetic analysis of ribozymes using NGS.59,60 This technique provided the rate constants and maximum amplitude of the reaction without specialised instrumentation. The k-seq technique has been used to study the fitness landscape of self-aminoacylating and glmS ribozymes.60,61 By leveraging this approach, it is possible to systematically explore how biochemical factors, such as catalytic efficiency and the Michaelis constant, influence sequence conservation, even among partially active or inactive NA variants. As mentioned earlier, massive amounts of sequence data can be linked to each kinetic parameter through recent technological advances. The activity parameters describe the chemical mechanisms of the reactions, depending on the sequence and structure of the NA enzymes, with nucleotide-level resolution. 
Thus, if a sequence of NA enzymes responds well, comparing it (without additional experimentation) with the response of a mutant can provide insights into the mechanism of the NA enzyme and indicate the presence of novel structural rearrangements.
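As a rough illustration of how a rate constant and a maximum amplitude can both be extracted from reacted fractions, the sketch below assumes a pseudo-first-order model f(t) = A(1 − e^(−kt)); the model actually used in k-seq may differ, and the grid bounds are arbitrary. For each trial k the best amplitude has a closed-form least-squares solution, so only k needs to be scanned.

```python
import math


def fit_rate_and_amplitude(times, fractions, k_min=1e-4, k_max=10.0, steps=2000):
    """Fit f(t) = A * (1 - exp(-k t)) by scanning k on a log-spaced grid.

    For a fixed k the model is linear in A, so the optimal A is
    sum(f * g) / sum(g * g) with g = 1 - exp(-k t).
    Returns the (k, A) pair with the smallest sum of squared errors.
    """
    best = None
    for i in range(steps + 1):
        k = k_min * (k_max / k_min) ** (i / steps)
        g = [1.0 - math.exp(-k * t) for t in times]
        gg = sum(x * x for x in g)
        if gg == 0.0:
            continue  # all time points are zero; k cannot be determined
        amp = sum(f * x for f, x in zip(fractions, g)) / gg
        sse = sum((f - amp * x) ** 2 for f, x in zip(fractions, g))
        if best is None or sse < best[0]:
            best = (sse, k, amp)
    _, k, amp = best
    return k, amp
```

Fitting the amplitude separately from the rate is what allows slow but fully reactive variants to be distinguished from fast variants that plateau at a low reacted fraction.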
Beyond the in vitro evolution of NA enzymes, naturally evolved NA enzymes, whose many classes display the full structural diversity found in nature, have also been targeted. Using massively parallel oligonucleotide synthesis, a diverse RNA pool was generated, enabling the direct functional testing of potential twister ribozyme sequences. This included over 1600 previously reported putative twisters and approximately 1000 new candidates derived from over a thousand different organisms.62,63 The Cleavage High-Throughput Assay, an NGS-based method for evaluating the activity of each potential sequence, revealed a broad structural tolerance to mutations.64 These data on the relationship between the sequence diversity and activity of twister ribozymes advanced the computational search for active twister ribozymes, which identified the first intrinsically active twister ribozyme in mammals.64 Although current studies are primarily focused on NA enzymes involved in cleavage, high-throughput analysis has enabled the creation of large datasets for designing NA enzymes with other activities based on sequence information.
Fig. 5 Schematic illustration of MFold prediction.40 The energy dot plot displays the predicted optimal structure and one sub-optimal structure for the example sequence shown in the figure. Reproduced from ref. 40 with permission from the Royal Society of Chemistry.
Loops and bulges contribute destabilising or stabilising factors that influence overall structural stability. Accounting for these energy contributions enhances the reliability of prediction systems.88 Despite its widespread use, the average accuracy of secondary structure prediction is 73% (±9%) for known canonical base pairs, indicating that the MFE approach alone has limited effectiveness in revealing true structures under different conditions. This can be attributed to the inability of the method to fully account for solution conditions, tertiary interactions, and protein binding effects. Therefore, the RNAstructure tool was developed for more reliable secondary structure predictions,99,100 integrating prediction constraints derived from experimental data—including selective 2′-hydroxyl acylation analysed by primer extension (SHAPE), enzymatic cleavage, and chemical modification accessibility.101
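The dynamic programming idea behind secondary structure prediction can be shown with a deliberately simplified stand-in. The sketch below implements the classic Nussinov algorithm, which maximises the number of canonical and G–U base pairs rather than minimising free energy; it is not the MFE algorithm used by MFold or RNAstructure, it ignores stacking, loop and bulge energies, and the minimum hairpin loop size of three is an assumption.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov(seq, min_loop=3):
    """Return (max_pairs, dot_bracket) for seq using base-pair maximisation."""
    n = len(seq)
    if n == 0:
        return 0, ""
    # dp[i][j] = maximum number of pairs within seq[i..j]
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case: j is unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best

    # Traceback to recover one optimal structure
    structure = ["."] * n
    stack = [(0, n - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= min_loop:
            continue
        if dp[i][j] == dp[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - min_loop):
            if (seq[k], seq[j]) in PAIRS:
                left = dp[i][k - 1] if k > i else 0
                if dp[i][j] == left + 1 + dp[k + 1][j - 1]:
                    structure[k], structure[j] = "(", ")"
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return dp[0][n - 1], "".join(structure)
```

MFE methods keep the same recursion structure but replace the pair count with nearest-neighbour free energies and explicit loop terms, which is where the parameter sets discussed here enter.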
Although the MFE approach is based on a simple concept, various structural combinations can be generated from a single DNA or RNA sequence, resulting in the formation of several structural motifs. Thus, predicting the secondary structure of relatively long chains is difficult. In particular, NA enzymes require the formation of a tertiary structure, which makes prediction complex. Machine learning and artificial intelligence (AI) tools have performed well in addressing these issues.102 Besides MFE-based methods, non-MFE approaches relying on the centroid or maximum expected accuracy have also been developed.103–105 With over 100 000 RNA sequences now available in databases,106 SPOT-RNA, a representative predictor based on an advanced deep-learning technique, was created to predict RNA secondary structures with exceptional accuracy.107 E2Efold is an end-to-end deep learning model that directly predicts the RNA base-pairing matrix for RNA secondary structure prediction.108 UFold represents base matching on RNA sequences as “pseudo-image information” in a two-dimensional matrix.109 However, non-MFE approaches are often prone to overfitting because of their rich parameterisation.110 These disadvantages have been minimised using a combination of MFE and non-MFE approaches to achieve greater prediction accuracy. LinearFold uses a beam search technique that can be applied to both machine-learned and Turner's thermodynamic models, resulting in fast and accurate prediction of the secondary structure of a long RNA strand.111,112 MXfold and MXfold2 generate folding scores, which are calculated using a deep neural network incorporating Turner's NN free energy parameters.113,114 Therefore, these studies strongly suggest that incorporating thermodynamic information can enhance the robustness of deep learning-based RNA secondary structure predictions. The representative methods for prediction of RNA secondary structure are listed in Table 1.
| Method | Concept | Feature | Ref. |
|---|---|---|---|
| NN model based prediction | |||
| MFold/UNAFold | MFE-based thermodynamics | The most commonly used prediction tool; MFold has now been replaced by UNAFold. | 96–98 |
| RNAfold | MFE-based thermodynamics | Computes equilibrium partition functions and base-pairing probabilities. | 174 and 175 |
| Sfold | MFE-based thermodynamics | Sampling all possible structures in the Boltzmann ensemble of secondary structures. | 120 |
| RNAstructure | MFE-based thermodynamics | Database using alternative set of thermodynamic parameters compared to MFold and SHAPE data. | 99 and 100 |
| LinearX tool | MFE-based thermodynamics | Fast prediction of the secondary structure from a long RNA strand using the beam search technique. | 111 and 112 |
| NUPACK | MFE-based thermodynamics | Applicable to the prediction of pseudoknot structure. | 176 |
| PKNOTS | MFE-based thermodynamics | Applicable to the prediction of pseudoknot structure. | 177 |
| CONTRAfold | Machine learning with conditional log-linear models | Trained by the nearest neighbor models (without NN parameters) for solved RNA secondary structures with parameters corresponding to free energy. | 104 |
| ContextFold | Machine learning with max-margin framework | Trained by the nearest neighbor models (without NN parameters) using fine-grained RNA structures. | 178 |
| CentroidFold | Non MFE-based thermodynamics | The maximum expected accuracy approach | 105 |
| SimFold | MFE-based thermodynamics & machine learning with constraint generation and Boltzmann likelihood | Trained by large sets of structural as well as NN parameters for predicting the secondary structure of RNA with thermodynamic parameters. | 179 |
| MXfold | MFE-based thermodynamics & machine learning with max-margin framework | Combination of NN parameters with the structural data of RNAs, trained by a structured support vector machine for precise prediction of substructures. | 113 |
| MXfold2 | MFE-based thermodynamics & machine learning with max-margin framework and deep learning | Application of deep learning to learn RNA folding (helix stacking, helix opening, helix closing, and unpaired region) scores based on NN parameters | 114 |
| HotKnots | MFE-based thermodynamics & machine learning with constraint generation | Applicable to the prediction of pseudoknot structure. | 180 |
| EternaFold | Multi task machine learning methods | Trained by NN model based on different data types, SHAPE structures, and riboswitch-ligand binding affinity data for the accurate prediction of RNA structures. | 181 |
| Non-NN based model | |||
| Pfold | Probabilistic generative model using stochastic context-free grammars | Utilizing simple context-free grammars. | 182 |
| CONUS | Probabilistic generative model using stochastic context-free grammars | Comparing nine lightweight grammars for RNA secondary structure prediction. | 183 |
| TORNADO | Probabilistic generative model using stochastic context-free grammars | Describing various RNA grammars including NN models | 110 |
| SPOT-RNA | Deep learning based model | Using an ensemble of ultra-deep hybrid networks and pre-trained with a large set of non-redundant RNAs. | 107 |
| E2Efold | Deep learning based model | Predicting the probability of each nucleotide match by machine learning without any NN parameters | 108 |
| UFold | Deep learning based model | Representing base matching on RNA sequences as “pseudo-image information” in a two-dimensional matrix. | 109 |
| KnotFold | Deep learning based model | Applicable to the prediction of pseudoknot structure. | 184 |
| RiNALMo | Deep learning based model | Utilizing the 650 million parameters RNA language model | 185 |
| RNAformer | Deep learning based model | Facilitating the application of axial attention like AlphaFold protein prediction. | 186 |
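Several of the deep learning methods in Table 1 work with a base-pairing matrix rather than a bracket string. The sketch below shows the relationship between the two representations for pseudoknot-free structures; it is a generic conversion written for illustration and is not taken from E2Efold, UFold or any other listed tool.

```python
def dot_bracket_to_pairs(structure):
    """Return sorted (i, j) index pairs for a pseudoknot-free dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected symbol {ch!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def pairing_matrix(structure):
    """Symmetric 0/1 matrix M with M[i][j] = 1 when positions i and j pair."""
    n = len(structure)
    matrix = [[0] * n for _ in range(n)]
    for i, j in dot_bracket_to_pairs(structure):
        matrix[i][j] = matrix[j][i] = 1
    return matrix
```

A matrix of this shape can also hold pairing probabilities or pseudoknotted contacts, which plain dot-bracket notation cannot express.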

| Method | Design | Feature | Ref. |
|---|---|---|---|
| Sfold | RNA-cleaving ribozymes | Scoring of the complex formation of ribozyme and target RNA based on ΔG values obtained from the MFE-based secondary structure prediction. | 118 |
| Aladdin | HH ribozyme | Optimization of the stem stability of the ribozyme-target complex obtained from the MFE-based secondary structure prediction. | 119 |
| RiboSoft | RNA-cleaving ribozymes | Outputs potential sequences while automatically minimizing the off-target effect. | 121 and 122 |
| DNAzymeBuilder | RNA-cleaving DNAzyme | The sequence design of DNAzymes based on NN parameters and | 123 |
| NAR Genom Bioinform 2023 | 10–23 DNAzyme | The sequence design of DNAzymes using NN parameters with machine learning based on the stem stability and the internal structure of the DNAzyme. | 125 |
| SequenceCraft | RNA-cleaving DNAzyme | Machine learning algorithms capable of predicting DNAzyme sequence and the potential rate constants based on various sequence-, cofactor-, and buffer-related factors. | 127 |
| Nat. Commun. 2024 | Ligase ribozyme | Providing the fitness landscape to rationally design the novel ribozymes. | 128 |
| RNAiFold | Various types of NA enzymes (HH ribozyme was demonstrated.) | Generation of ribozyme sequences under the concept of inverse folding based on the MFE-based secondary structure prediction. | 130 |
| RfamGen | Various types of NA enzymes (glmS ribozyme was demonstrated.) | Generation of novel ribozyme sequences by deep learning of characteristics of a group of RNAs with specific functional and structural features. | 132 |
In such applications, a trans-acting DNAzyme is also attractive owing to its higher chemical stability and easier chemical synthesis compared with RNA. Like ribozymes, DNAzymes bind to the target RNA by forming stems, which are governed by thermodynamics, depending on the base composition. DNAmoreDB and DNAzymeBuilder are pioneering web tools that can be used to design DNAzymes for any target sequence.123,124 DNAmoreDB is a comprehensive resource for DNAzymes that organises information such as sequences, selection conditions, catalysed reactions, kinetic parameters, substrates, cofactors, structural data (when available), and literature. The DNAzymeBuilder database includes the details of 44 RNA-cleaving and 93 DNA-cleaving DNAzymes, including those with RNA-like rA at the cleavage site, all of which function in a trans-cleavage manner and cleave intermolecular substrates. This internal database compiles extensive data on DNAzymes, including optimal reaction conditions, kinetic properties, types of catalysed reactions, sequence recognition, cleavage sites, and the necessary design elements to ensure optimal DNAzyme performance. Thus, predicted information on the target site, DNAzyme sequence, and catalytic activity can be obtained. To further enhance the prediction of DNAzyme activity, a machine learning approach was employed to efficiently triage thousands of potential DNAzymes specific to a target RNA.125 Based on logistic regression, the developers trained the model on published and newly generated 10–23 DNAzyme activity data incorporating (1) the energetic parameters of the enzyme/target stems and (2) the DNAzyme secondary structure derived from NN parameters of RNA/DNA hybrids,75 obtained using the secondary structure prediction tool.101,126 The analysis revealed that the binding free energy between the DNAzyme and its RNA target is the key factor influencing efficiency.
However, other elements, such as the internal structure of the DNAzyme, also play a crucial role in determining its catalytic activity.125 Another machine learning platform, SequenceCraft, was trained on the established database from DNAmoreDB.127 It was trained with the kobs data from 178 RNA-cleaving DNAzymes together with varying experimental conditions, including cofactor type and concentration, pH, and temperature. In this platform, a dot-bracket notation of secondary structures calculated using MFold was used to generate a numerical vector, which supports the good prediction accuracy of the kobs values.
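How a dot-bracket string and reaction conditions might be combined into one numerical input can be sketched as follows. The symbol codes, the zero padding, the fixed length of 40 and the order of the condition features are all assumptions chosen for illustration; the encoding actually used by SequenceCraft is not described here and may be quite different.

```python
# Assumed symbol codes; padding also uses 0.0, so padded and unpaired
# positions are indistinguishable in this simple scheme.
SYMBOL_CODES = {".": 0.0, "(": 1.0, ")": -1.0}


def encode_structure(dot_bracket, length):
    """Map a dot-bracket string to a fixed-length list of floats."""
    if len(dot_bracket) > length:
        raise ValueError("structure is longer than the requested vector")
    codes = [SYMBOL_CODES[ch] for ch in dot_bracket]
    return codes + [0.0] * (length - len(dot_bracket))


def build_features(dot_bracket, ph, temperature_c, cofactor_mm, length=40):
    """Concatenate the structure encoding with hypothetical condition features."""
    return encode_structure(dot_bracket, length) + [ph, temperature_c, cofactor_mm]
```

A vector of this kind could then be passed to any regression model that predicts kobs; a categorical variable such as cofactor type would need its own encoding, for example one-hot.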
As shown earlier, machine learning and AI technologies have advanced and have been applied to the prediction of RNA secondary structures.102 Moreover, creating a large dataset of NA enzymes can provide sufficient training data for AI to output accurate predictions (Fig. 2). One approach is to study the NGS data obtained from massive mutational analyses of NA enzymes to generate a fitness landscape. The fitness landscape can provide valuable information not only for elucidating the evolutionary process of ribozymes, but also for the rational design of novel enzymes. For instance, an AI technique involving NGS-based high-throughput data enabled the understanding of the F1*U ribozyme neutral network.128 In this study, experimental evaluation of over 120 000 ribozyme sequences provided valuable empirical evidence that neutral networks can enhance the accessibility and predictability of the fitness landscape. In another study, the effects of higher-order mutations on the CPEB3 ribozyme were also reported.129
Inverse folding has been studied for over a decade in the design of NA enzymes. For example, RNAiFold was used to design ten artificial cis-cleaving HH ribozymes by identifying RNA sequences whose MFE secondary structure corresponded to a user-defined target.130 Each of the ribozymes demonstrated functionality in a cleavage assay. However, this method faces challenges in accuracy and versatility because of the difficulty in predicting tertiary interactions of nucleotides. Therefore, advanced computational approaches are required. One such approach, the deep generative model, which has already been applied to protein design,131 is an attractive pipeline for generating novel designs for NA enzymes. Recently, the world's first deep generative model for NA enzymes, RfamGen, was developed to support the design of artificial RNAs with desired functions and structures.132 RfamGen combines a variational autoencoder, a method widely used in deep generative modelling, and a covariance model that can classify functional RNAs from information on RNA sequences and secondary structures, so that these features can be learned and artificial sequences generated. Computer analysis and biochemical experiments confirmed that RfamGen could stably generate RNA sequences with a structure and function homologous to the learned RNA population. The performance of RfamGen was also attributed to the application of a covariance model to a deep generative model. RfamGen was employed to generate 1000 new sequences based on the glmS ribozyme, which cleaves its own RNA sequence upon binding to small molecules. A comprehensive analysis of the generated RNA sequences was conducted on a large scale.132 Interestingly, RfamGen showed a greater tendency to generate high-activity enzyme sequences than native sequences.132 Therefore, recent computational approaches using both machine learning and AI have demonstrated predictive power for the development and generation of novel NA enzymes.
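The first step of inverse folding can be sketched in a few lines. The code below only checks, and randomly samples, sequences whose paired positions can form canonical or G–U pairs in a target dot-bracket structure. This compatibility is a necessary condition only: tools such as RNAiFold additionally require that the target be the MFE structure of the designed sequence, which this sketch does not test. All names are illustrative.

```python
import random

PAIRS = [("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")]


def parse_pairs(structure):
    """Return (i, j) index pairs of a pseudoknot-free dot-bracket string."""
    stack, pairs = [], []
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.append((stack.pop(), i))
    return pairs


def is_compatible(seq, structure):
    """True if every paired position in structure can base-pair in seq."""
    if len(seq) != len(structure):
        return False
    return all((seq[i], seq[j]) in PAIRS for i, j in parse_pairs(structure))


def sample_compatible(structure, rng):
    """Draw a random sequence that is compatible with the target structure."""
    seq = [rng.choice("ACGU") for _ in structure]
    for i, j in parse_pairs(structure):
        seq[i], seq[j] = rng.choice(PAIRS)
    return "".join(seq)
```

In a full design loop, each sampled candidate would be folded with a structure prediction tool and kept only if its predicted structure matches the target.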
Recent AI-based predictions target not only RNA–RNA interactions but also RNA–protein interactions.133 Moreover, predictive tools for the tertiary structures of NA enzymes have also been developed recently. Similar to how AlphaFold predicts protein structure from sequence information,134 machine learning and deep learning approaches such as RhoFold+, RNA-Puzzles and trRosettaRNA can be used to develop prediction tools for NA structures, including NA enzymes.135–137 One issue is that the number of solved tertiary structures of NAs is small for machine learning and deep learning approaches, in contrast with the extensive datasets available for proteins. However, the simpler chemical characteristics of NAs compared to proteins may overcome this issue, allowing computational approaches to provide the necessary structural information.
To predict NA enzyme activity under various conditions, the effect of the environment on the energetic contributions to the NA enzyme is also fundamental. As shown in Fig. 6, the targeting function via duplex formation is driven by cation concentration, hydration and molecular crowding, whereas the catalysis function is affected by molecular crowding, which influences the formation of the tertiary structure, and by changes in the dielectric constant, which influence the binding of the metal ion cofactor. In the classical secondary structure prediction of NAs based on the MFE approach, cation concentration corrections have been widely applied using improved NN parameters.73,138–140 For example, these corrections enabled MFold to predict structures at arbitrary NaCl concentrations.96 However, current tools rely on Turner's NN parameters with such corrections, so the effects of hydration and crowding on NA stability have not yet been considered. Solutions under cellular conditions are densely packed with biomacromolecules such as proteins and NAs. Large numbers of small molecules, such as metabolites and metal ions, are also present. Hence, incorporating the new NN parameter datasets can expand the feasibility of the MFE approach for predicting NA stability and structure, especially under cellular conditions. The total concentration of macromolecules has been estimated to reach 400 mg mL−1, occupying approximately 40% of the intracellular space.141 These in vivo crowded conditions are extremely different from the dilute conditions of standard in vitro systems. Therefore, it is important to understand the physicochemical properties of NAs under molecular crowding. NA folding and unfolding occur in equilibrium and are accompanied by structural changes and water interactions (Fig. 8a).
The biophysical effects of molecular crowding are mostly based on the physicochemical properties of the crowders, which affect the volume and hydration effects of NA folding processes (Fig. 8b and c). The formation of a duplex makes the strand volume compact; a large cosolute tends to stabilise the duplex, whereas a small cosolute destabilises it by effectively decreasing the water activity of the solution (as duplex formation is accompanied by hydration).142 For NA enzyme reactions, the effect of crowders on the dielectric constant and viscosity plays an important role in the reaction kinetics.143 Moreover, crowders can interact directly with NAs to stabilise the structure via CH–π interactions.144 Therefore, the behaviour of NA structures is influenced by these biophysical factors under molecular crowding conditions.145,146
To consider the effect of molecular crowding on duplex stability, the NN parameters for DNA duplexes (including self- and non-self-complementary strands) have been determined for buffers containing 100 mM NaCl with 40 wt% polyethylene glycol 200 (PEG200) (Table 3).147 Compared with the NN parameters under non-crowding conditions, molecular crowding exhibited different effects on each NN parameter. Moreover, the relative destabilisation of NN consisting only of GC pairs, namely d(CG/GC), d(GC/CG), and d(GG/CC), was considerably larger than that of other NN pairs. This may be attributed to the low water activity caused by PEG200, as GC pairs require more water molecules for stabilisation compared to AT pairs.148 The most remarkable difference was found among the initiation factors. The $\Delta H^{\circ}$ and $\Delta S^{\circ}$ corresponding to duplex initiation differed drastically under crowding conditions compared to those in the solution without cosolutes. This was because of the preferential hydration of terminal oligonucleotide pairs induced by the cosolute in the crowded environment.147 Although this is the first report of NN parameters under crowding conditions, their application should not be limited to specific environments. For example, NN parameters under crowding have only been determined under the 40% PEG condition with 100 mM NaCl for the DNA/DNA duplex149 and under the 20% PEG200 condition with 1 M NaCl for the RNA/RNA duplex.150 Parameters for RNA/RNA duplexes in the Eco80 artificial cytoplasm, which contains 80% of Escherichia coli metabolites and biological concentrations of metal ions, have also been reported.151 To generalise the available NN parameters for various cation and crowding conditions, each value of $\Delta G^{\circ}_{37}$ for a duplex was considered as the sum of contributions from the bulk structure, cations, and crowders ($\Delta G^{\circ}_{37} = \Delta G^{\circ}_{37,\mathrm{bulk}} + \Delta G^{\circ}_{37,\mathrm{cation}} + \Delta G^{\circ}_{37,\mathrm{crowder}}$). The $\Delta G^{\circ}_{37,\mathrm{cation}}$ parameters at arbitrary concentrations of NaCl can be calculated from those measured at 1 M NaCl by applying the known dependence on [Na+] for each NN base pair and regarding $\Delta G^{\circ}_{37,\mathrm{bulk}}$ as the value at 0 M [Na+].152 Fig. 9 illustrates the scheme for obtaining NN parameters under the desired solution conditions. The $\Delta G^{\circ}_{37,\mathrm{crowder}}$ parameters can be determined from the linear function of changes in water activity $\Delta a_{\mathrm{w}}$, as duplex (de)stabilisation in the presence of crowders correlated linearly with changes in the excluded volume of cosolutes and water activity.153 Based on this strategy, the improved NN parameters for any molecular environment can be obtained.147 This approach of adjusting NN parameters according to cation and crowder conditions has been successfully applied to RNA/RNA and RNA/DNA duplexes.154,155 Table 3 shows the NN parameters of DNA/DNA, RNA/RNA, and RNA/DNA duplexes under 100 mM NaCl and 40 wt% PEG200 conditions. The parameters generally applied to different solutions for RNA/RNA duplexes are listed in Table 4. Thus, the latest NN parameters can be regarded as universal for predicting duplex stability under arbitrary solution conditions. Examples of some predictions under different salt and crowding conditions are listed in Table 5.
| Sequence | $\Delta H^{\circ}$ (kcal mol−1) | $\Delta S^{\circ}$ (cal mol−1 K−1) | $\Delta G^{\circ}_{37}$ (kcal mol−1) |
|---|---|---|---|
| a Experiments were conducted in 10 mM Na2HPO4, 1 mM Na2EDTA, 100 mM NaCl, and 40 wt% PEG200 at pH 7.0. | |||
| DNA/DNA | |||
| d(AA/TT) | −6.5 ± 0.3 | −19.2 ± 0.8 | −0.55 ± 0.07 |
| d(AT/TA) | −9.4 ± 0.3 | −29.4 ± 0.8 | −0.28 ± 0.05 |
| d(TA/AT) | −4.3 ± 0.5 | −13.3 ± 1.3 | −0.16 ± 0.14 |
| d(CA/GT) | −13.1 ± 0.1 | −38.8 ± 0.1 | −1.00 ± 0.05 |
| d(GT/CA) | −9.2 ± 0.1 | −26.8 ± 0.1 | −0.89 ± 0.01 |
| d(CT/GA) | −3.4 ± 0.6 | −7.9 ± 1.6 | −0.91 ± 0.11 |
| d(GA/CT) | −4.9 ± 0.7 | −13.0 ± 2.1 | −0.87 ± 0.06 |
| d(CG/GC) | −6.4 ± 0.7 | −16.1 ± 2.0 | −1.38 ± 0.12 |
| d(GC/CG) | −4.2 ± 0.7 | −9.3 ± 2.0 | −1.31 ± 0.06 |
| d(GG/CC) | −4.0 ± 0.6 | −8.9 ± 2.0 | −1.25 ± 0.03 |
| Initiation per GC | −10.1 ± 0.2 | −35.1 ± 0.5 | 0.76 ± 0.06 |
| Initiation per AT | −2.9 ± 0.3 | −12.7 ± 0.9 | 1.00 ± 0.07 |
| Self-complementary | 0 | −1.4 | 0.40 |
| Non-self-complementary | 0 | 0 | 0 |
| RNA/RNA | |||
| r(AA/UU) | −10.0 ± 0.1 | −30.4 ± 0.2 | −0.57 ± 0.05 |
| r(AU/UA) | −10.1 ± 0.1 | −30.8 ± 0.1 | −0.55 ± 0.03 |
| r(UA/AU) | −11.1 ± 0.4 | −31.5 ± 0.8 | −1.33 ± 0.09 |
| r(CA/GU) | −12.1 ± 0.3 | −32.1 ± 0.6 | −2.14 ± 0.08 |
| r(GU/CA) | −10.7 ± 0.2 | −28.7 ± 0.1 | −1.80 ± 0.16 |
| r(CU/GA) | −11.2 ± 0.3 | −30.4 ± 0.3 | −1.77 ± 0.17 |
| r(GA/CU) | −11.7 ± 0.2 | −30.5 ± 0.3 | −2.24 ± 0.07 |
| r(CG/GC) | −11.1 ± 0.5 | −28.8 ± 1.1 | −2.16 ± 0.16 |
| r(GC/CG) | −13.8 ± 0.1 | −34.6 ± 0.1 | −3.07 ± 0.01 |
| r(GG/CC) | −14.8 ± 0.1 | −38.4 ± 0.1 | −2.89 ± 0.06 |
| Initiation | 4.6 ± 2.0 | −2.9 ± 6.1 | 5.50 ± 0.11 |
| Per terminal AU | 6.5 ± 0.1 | 18.2 ± 0.1 | 0.85 ± 0.92 |
| Self-complementary | 0 | −1.4 | 0.43 |
| Non-self-complementary | 0 | 0 | 0 |
| RNA/DNA | |||
| rAA/dTT | −6.6 ± 0.4 | −19.8 ± 0.2 | −0.46 ± 0.05 |
| rAC/dGT | −8.3 ± 0.3 | −23.0 ± 0.2 | −1.17 ± 0.06 |
| rAG/dCT | −7.7 ± 0.0 | −20.7 ± 0.1 | −1.28 ± 0.01 |
| rAU/dAT | −7.6 ± 0.5 | −24.0 ± 0.1 | −0.16 ± 0.03 |
| rCA/dTG | −9.0 ± 0.3 | −27.2 ± 0.3 | −0.56 ± 0.09 |
| rCC/dGG | −8.1 ± 0.1 | −20.5 ± 0.0 | −1.74 ± 0.01 |
| rCG/dCG | −7.8 ± 0.2 | −21.5 ± 0.1 | −1.13 ± 0.02 |
| rCU/dAG | −5.3 ± 0.1 | −16.5 ± 0.1 | −0.18 ± 0.02 |
| rGA/dTC | −6.8 ± 0.2 | −19.2 ± 0.1 | −0.85 ± 0.02 |
| rGC/dGC | −8.6 ± 0.0 | −21.7 ± 0.1 | −1.87 ± 0.03 |
| rGG/dCC | −11.5 ± 0.3 | −30.9 ± 1.4 | −1.92 ± 0.00 |
| rGU/dAC | −7.1 ± 0.4 | −20.2 ± 0.6 | −0.83 ± 0.03 |
| rUA/dTA | −7.3 ± 0.5 | −22.9 ± 0.1 | −0.20 ± 0.03 |
| rUC/dGA | −5.7 ± 0.6 | −13.9 ± 0.2 | −1.39 ± 0.05 |
| rUG/dCA | −8.0 ± 0.5 | −21.5 ± 0.1 | −1.33 ± 0.04 |
| rUU/dAA | −7.2 ± 0.1 | −22.7 ± 0.1 | −0.16 ± 0.04 |
| init. per rG-dC or rC-dG | −5.0 ± 0.3 | −19.7 ± 0.3 | 1.11 ± 0.16 |
| init. per rA-dT or rU-dA | −3.0 ± 0.1 | −13.9 ± 0.4 | 1.31 ± 0.12 |
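
As a rough illustration of how the Table 3 values can be used, the sketch below sums the DNA/DNA $\Delta G^{\circ}_{37}$ terms for a short duplex in 100 mM NaCl with 40 wt% PEG200. The function and variable names are hypothetical, and the bookkeeping (one initiation term per terminal base pair, plus the self-complementary correction) is an assumption modelled on the usual NN convention rather than a procedure spelled out in the text.

```python
# DNA/DNA nearest-neighbour (NN) free energies at 37 °C in kcal/mol, copied from
# Table 3 (100 mM NaCl, 40 wt% PEG200). Keys are the 5'->3' top-strand step.
NN_DG37 = {
    "AA": -0.55, "AT": -0.28, "TA": -0.16, "CA": -1.00, "GT": -0.89,
    "CT": -0.91, "GA": -0.87, "CG": -1.38, "GC": -1.31, "GG": -1.25,
}
INIT_DG37 = {"GC": 0.76, "AT": 1.00}  # "Initiation per GC" / "per AT" rows
SELF_COMPLEMENTARY_DG37 = 0.40

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def dg_from_dh_ds(dh_kcal: float, ds_cal: float, temp_k: float = 310.15) -> float:
    """Gibbs relation dG = dH - T*dS, with dS given in cal mol-1 K-1."""
    return dh_kcal - temp_k * ds_cal / 1000.0


def duplex_dg37(seq: str) -> float:
    """Sum NN terms for a fully paired DNA duplex (top strand given 5'->3').

    ASSUMPTION: one initiation term is added per terminal base pair.
    """
    seq = seq.upper()
    total = 0.0
    for i in range(len(seq) - 1):
        step = seq[i:i + 2]
        if step not in NN_DG37:  # e.g. "CC" is tabulated as d(GG/CC)
            step = reverse_complement(step)
        total += NN_DG37[step]
    for end in (seq[0], seq[-1]):
        total += INIT_DG37["GC"] if end in "GC" else INIT_DG37["AT"]
    if seq == reverse_complement(seq):
        total += SELF_COMPLEMENTARY_DG37
    return total


print(round(duplex_dg37("CCGTACGG"), 2))     # -5.28 under these assumptions
print(round(dg_from_dh_ds(-6.5, -19.2), 2))  # d(AA/TT): -0.55, as tabulated
```

The second call only checks that the tabulated $\Delta H^{\circ}$, $\Delta S^{\circ}$ and $\Delta G^{\circ}_{37}$ columns are mutually consistent at 310.15 K; it is not part of the duplex prediction itself.
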
Fig. 9 Schematic representation of the contributions of bulk structure, cations, and crowders to NN parameters. (Left) Dependence on the Na+ concentration, based on data from Weber's report.188 (Mid) Plot showing the excluded volume effect for RNA–PEG interactions against base pair length. (Right) Plot of the contribution of water activity to the stability of r(GAUUACGCCUG) against $\Delta a_{\mathrm{w}}$.
Table 4 NN parameters in 100 mM NaCl, with prefactors ($m_{\mathrm{cs}}$) for different cosolutes a

| DNA/DNA | $\Delta G^{\circ}_{37}$ at 100 mM Na+ (kcal mol−1) | Crowder contribution at 40 wt% PEG200 (kcal mol−1) | $m_{\mathrm{cs}}$, PEG/1,2 DME (kcal mol−1) | $m_{\mathrm{cs}}$, EG/GLY (kcal mol−1) | $m_{\mathrm{cs}}$, 1,3 PDO/2-ME (kcal mol−1) |
|---|---|---|---|---|---|
| d(AA/TT) | −0.65 | 0.10 | 2.0 | 0.7 | 1.3 |
| d(AT/TA) | −0.60 | 0.32 | 6.4 | 2.2 | 4.2 |
| d(TA/AT) | −0.36 | 0.20 | 4.0 | 1.4 | 2.6 |
| d(CA/GT) | −1.23 | 0.23 | 4.6 | 1.6 | 3.0 |
| d(GT/CA) | −1.20 | 0.31 | 6.2 | 2.2 | 4.1 |
| d(CT/GA) | −1.11 | 0.20 | 4.0 | 1.4 | 2.6 |
| d(GA/CT) | −0.93 | 0.06 | 1.2 | 0.4 | 0.8 |
| d(CG/GC) | −1.85 | 0.47 | 9.4 | 3.3 | 6.2 |
| d(GC/CG) | −2.05 | 0.72 | 14.4 | 5.0 | 9.5 |
| d(GG/CC) | −1.69 | 0.44 | 8.8 | 3.0 | 5.8 |
| Initiation per GC | 0.98 | −0.22 | −4.4 | −1.5 | −2.9 |
| Initiation per AT | 1.03 | −0.03 | −0.6 | −0.2 | −0.4 |

| RNA/RNA | $\Delta G^{\circ}_{37}$ at 100 mM Na+ (kcal mol−1) | Excluded volume contribution (kcal mol−1) | Water activity contribution at 40 wt% PEG200 (kcal mol−1) | $m_{\mathrm{cs}}$, PEG/2-ME/1,2 DME (kcal mol−1) | $m_{\mathrm{cs}}$, EG/GLY/1,3 PDO (kcal mol−1) |
|---|---|---|---|---|---|
| r(AA/UU) | −0.77 | −0.22 | 0.35 | 7.1 | 2.9 |
| r(AU/UA) | −0.52 | −0.22 | 0.19 | 3.9 | 1.6 |
| r(UA/AU) | −1.25 | −0.22 | 0.19 | 3.9 | 1.6 |
| r(CA/GU) | −1.77 | −0.22 | −0.14 | −2.9 | −1.2 |
| r(GU/CA) | −2.08 | −0.22 | 0.56 | 11.4 | 4.7 |
| r(CU/GA) | −1.76 | −0.22 | 0.25 | 5.1 | 2.1 |
| r(GA/CU) | −2.20 | −0.22 | 0.20 | 4.1 | 1.7 |
| r(CG/GC) | −2.16 | −0.22 | 0.24 | 4.9 | 2.0 |
| r(GC/CG) | −3.24 | −0.22 | 0.45 | 9.2 | 3.8 |
| r(GG/CC) | −3.08 | −0.22 | 0.43 | 8.8 | 3.6 |
| initiation | −0.77 | −0.22 | 1.63 | 33.3 | 13.7 |
| per terminal AU | −0.52 | NAf | 0.40 | 8.2 | 3.4 |

a Correction factor for self-complementary sequences is 0.4 kcal mol−1 for all cosolutes, as it is independent of crowding environments. b Cation concentration is 100 mM Na+. c Crowder condition is 40 wt% PEG200. d Different prefactors were used for each crowder: polyethylene glycol (PEG), 2-methoxy ethanol (2-ME), 1,2-dimethoxyethane (1,2 DME), ethylene glycol (EG), glycerol (GLY), and 1,3-propanediol (1,3 PDO). e In the case of RNA/RNA, the excluded volume effect and water activity contribution should be considered separately for accurate prediction. f Excluded volume effect for terminal AU pairs was not considered to avoid overestimation, as it had already been considered for initiation.
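
The linear dependence on water activity can be sketched numerically with the DNA/DNA rows above. In each of those rows, the second value (the contribution at 40 wt% PEG200, footnote c) divided by the first prefactor (PEG/1,2 DME group) is 0.05, so the code below treats 0.05 as the effective decrease in water activity at 40 wt% PEG200. That reading is inferred from the tabulated numbers and is not stated in the text; the names `nn_dg37` and `aw_drop`, and the sign convention for `aw_drop`, are likewise assumptions.

```python
# DNA/DNA rows of Table 4: (dG37 at 100 mM Na+, m_cs for the PEG/1,2 DME group),
# both in kcal/mol. Keys are the 5'->3' top-strand dinucleotide step.
TABLE4_DNA = {
    "AA": (-0.65, 2.0), "AT": (-0.60, 6.4), "TA": (-0.36, 4.0),
    "CA": (-1.23, 4.6), "GT": (-1.20, 6.2), "CT": (-1.11, 4.0),
    "GA": (-0.93, 1.2), "CG": (-1.85, 9.4), "GC": (-2.05, 14.4),
    "GG": (-1.69, 8.8),
}

# ASSUMPTION: ratio of the two tabulated columns, read as the drop in water activity.
AW_DROP_40_PEG200 = 0.05


def nn_dg37(step: str, aw_drop: float) -> float:
    """NN free energy = salt-only value + m_cs * (decrease in water activity).

    aw_drop is a positive number (ASSUMED sign convention: 1 - a_w).
    """
    dg_salt, m_cs = TABLE4_DNA[step]
    return dg_salt + m_cs * aw_drop


for step in ("AT", "GC"):
    print(step, round(nn_dg37(step, AW_DROP_40_PEG200), 2))
```

Under these assumptions, d(AT/TA) comes out at −0.28 kcal mol−1, matching its Table 3 entry, and d(GC/CG) at −1.33 kcal mol−1 against −1.31 kcal mol−1 in Table 3. Other PEG200 concentrations would need their own water-activity values, which are not given here.
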
| Sequence | Solution a | Measured $\Delta G^{\circ}_{37}$ (kcal mol−1) | Ref. | Predicted $\Delta G^{\circ}_{37}$ (kcal mol−1) |
|---|---|---|---|---|
| d(GAGGTCGT) | 10 wt% PEG200 at 1 M NaCl | −8.4 ± 0.1 | 153 | −8.3 |
| | 20 wt% PEG200 at 1 M NaCl | −7.7 ± 0.1 | | −7.9 |
| | 30 wt% PEG200 at 1 M NaCl | −7.0 ± 0.3 | | −7.0 |
| d(ATGCGCAT) | 20 wt% PEG1000 at 1 M NaCl | −8.2 ± 0.3 | 187 | −8.1 |
| | 20 wt% PEG6000 at 1 M NaCl | −8.6 ± 0.6 | | −8.4 |
| d(CCGTACGG) | 20 wt% EG at 100 mM NaCl | −7.2 ± 0.8 | 147 | −6.7 |
| | 20 wt% 1,3 PDO at 100 mM NaCl | −6.6 ± 0.3 | | −6.2 |
| d(CCGTAACGTTGG) | 20 wt% EG at 100 mM NaCl | −10.9 ± 0.8 | 147 | −10.5 |
| | 20 wt% 1,3 PDO at 100 mM NaCl | −10.8 ± 0.9 | | −9.9 |
| r(GGCUCAAUUGAC) | 10 wt% PEG200 at 100 mM NaCl | −15.1 ± 0.8 | 154 | −15.4 |
| | 20 wt% PEG200 at 100 mM NaCl | −14.8 ± 0.6 | | −14.7 |
| | 30 wt% PEG200 at 100 mM NaCl | −14.0 ± 0.7 | | −14.0 |
| | 40 wt% PEG200 at 100 mM NaCl | −13.8 ± 0.6 | | −13.7 |
| | 20 wt% EG at 100 mM NaCl | −14.7 ± 0.6 | | −14.7 |
| | 20 wt% PEG2000 at 100 mM NaCl | −16.9 ± 0.8 | | −16.3 |
| | 20 wt% PEG8000 at 100 mM NaCl | −16.7 ± 0.4 | | −16.5 |
| | 10 wt% PEG200 at 1 M NaCl | −18.9 ± 0.8 | | −17.9 |
| | 20 wt% PEG200 at 1 M NaCl | −18.3 ± 0.6 | | −17.2 |
| r(GGAUCGAUCC) | 20 wt% EG at 100 mM NaCl | −12.7 ± 0.7 | 154 | −13.1 |
| r(AUCAGCUGAU) | 20 wt% EG at 100 mM NaCl | −9.9 ± 0.6 | 154 | −9.9 |
| r(GGCUCAAUUGAC) | 20 wt% EG at 100 mM NaCl | −14.4 ± 0.6 | 154 | −15.0 |
| r(GAUCCGGAUC) | 20 wt% 1,3 PDO at 100 mM NaCl | −14.5 ± 0.7 | 154 | −12.9 |
| r(GGCUCAAUUGAC) | 20 wt% 1,3 PDO at 100 mM NaCl | −14.7 ± 0.6 | 154 | −14.7 |
| r(GCUAUG) | 20 vol% PEG200 at 1 M NaCl | −5.2 ± 0.2 | 150 | −5.2 |
| r(AGAUAUCU) | 20 vol% PEG200 at 1 M NaCl | −5.7 ± 0.1 | 150 | −5.6 |
| r(UUAUCGAUAA) | 20 vol% PEG200 at 1 M NaCl | −6.9 ± 0.0 | 150 | −6.8 |

a Experiments were performed in a buffer containing 10 mM Na2HPO4 (pH 7.0), 1 mM Na2EDTA, and NaCl.
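
To put a number on the agreement in the table above, the following sketch computes the mean absolute deviation between the measured and predicted values. The numbers are transcribed from the table, the helper name `abs_errors` is hypothetical, and the measurement uncertainties are ignored for simplicity.

```python
# (sequence, solution, measured, predicted) transcribed from the table above;
# free energies in kcal/mol. Uncertainties on the measured values are ignored.
TABLE5 = [
    ("d(GAGGTCGT)", "10 wt% PEG200, 1 M NaCl", -8.4, -8.3),
    ("d(GAGGTCGT)", "20 wt% PEG200, 1 M NaCl", -7.7, -7.9),
    ("d(GAGGTCGT)", "30 wt% PEG200, 1 M NaCl", -7.0, -7.0),
    ("d(ATGCGCAT)", "20 wt% PEG1000, 1 M NaCl", -8.2, -8.1),
    ("d(ATGCGCAT)", "20 wt% PEG6000, 1 M NaCl", -8.6, -8.4),
    ("d(CCGTACGG)", "20 wt% EG, 100 mM NaCl", -7.2, -6.7),
    ("d(CCGTACGG)", "20 wt% 1,3 PDO, 100 mM NaCl", -6.6, -6.2),
    ("d(CCGTAACGTTGG)", "20 wt% EG, 100 mM NaCl", -10.9, -10.5),
    ("d(CCGTAACGTTGG)", "20 wt% 1,3 PDO, 100 mM NaCl", -10.8, -9.9),
    ("r(GGCUCAAUUGAC)", "10 wt% PEG200, 100 mM NaCl", -15.1, -15.4),
    ("r(GGCUCAAUUGAC)", "20 wt% PEG200, 100 mM NaCl", -14.8, -14.7),
    ("r(GGCUCAAUUGAC)", "30 wt% PEG200, 100 mM NaCl", -14.0, -14.0),
    ("r(GGCUCAAUUGAC)", "40 wt% PEG200, 100 mM NaCl", -13.8, -13.7),
    ("r(GGCUCAAUUGAC)", "20 wt% EG, 100 mM NaCl", -14.7, -14.7),
    ("r(GGCUCAAUUGAC)", "20 wt% PEG2000, 100 mM NaCl", -16.9, -16.3),
    ("r(GGCUCAAUUGAC)", "20 wt% PEG8000, 100 mM NaCl", -16.7, -16.5),
    ("r(GGCUCAAUUGAC)", "10 wt% PEG200, 1 M NaCl", -18.9, -17.9),
    ("r(GGCUCAAUUGAC)", "20 wt% PEG200, 1 M NaCl", -18.3, -17.2),
    ("r(GGAUCGAUCC)", "20 wt% EG, 100 mM NaCl", -12.7, -13.1),
    ("r(AUCAGCUGAU)", "20 wt% EG, 100 mM NaCl", -9.9, -9.9),
    ("r(GGCUCAAUUGAC)", "20 wt% EG, 100 mM NaCl", -14.4, -15.0),
    ("r(GAUCCGGAUC)", "20 wt% 1,3 PDO, 100 mM NaCl", -14.5, -12.9),
    ("r(GGCUCAAUUGAC)", "20 wt% 1,3 PDO, 100 mM NaCl", -14.7, -14.7),
    ("r(GCUAUG)", "20 vol% PEG200, 1 M NaCl", -5.2, -5.2),
    ("r(AGAUAUCU)", "20 vol% PEG200, 1 M NaCl", -5.7, -5.6),
    ("r(UUAUCGAUAA)", "20 vol% PEG200, 1 M NaCl", -6.9, -6.8),
]


def abs_errors(rows):
    """Absolute difference between measured and predicted value for each row."""
    return [abs(measured - predicted) for _, _, measured, predicted in rows]


errors = abs_errors(TABLE5)
mae = sum(errors) / len(errors)
worst = max(TABLE5, key=lambda row: abs(row[2] - row[3]))

print(f"n = {len(errors)}, mean absolute error = {mae:.2f} kcal/mol")
print("largest deviation:", worst[0], "in", worst[1])
```

With the values as transcribed, the mean absolute deviation is about 0.35 kcal mol−1 over 26 entries, and the largest deviation (1.6 kcal mol−1) is for r(GAUCCGGAUC) in 20 wt% 1,3 PDO.
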
Considering that the nucleolar environment is similar to that of PEG200,156 the stability of the DNA duplex in Ddx4 liquid–liquid phase separation (LLPS)157 was successfully predicted using the universal parameters derived from 50% PEG200 with 0.1 M NaCl conditions.147 This approach clarifies that the nucleolar condition can be mimicked by the crowding conditions, allowing the investigation of various NA behaviours in the nucleolus using controlled solution conditions. These parameters could also accurately predict the stability of the RNA hairpin in the nucleus and cytosol,154 and the efficiency of gene editing by CRISPR/Cas9.155 Moreover, improvements in predicting the stability of GC- and AT-biased DNA duplexes enabled the prediction of the efficiency of G-quadruplex formation from GC-rich sequences, as well as the identification of the replication initiation region in genomic DNAs.158 These approaches, which mimic local cellular environments, can be a novel platform to assess the behaviour of NAs in such localised areas. For example, the mitochondrial environment in human cells induces G4 formation owing to the highly crowded conditions compared to the nucleus,159 which can be regarded as a 60 wt% 1,3-propanediol (1,3 PDO) solution.160 These findings indicate that the classical NN parameters obtained using 1 M NaCl solutions are not suitable for the prediction of either duplex stability or NA function based on duplex formation. Therefore, the newly obtained NN parameters can be key components for predicting and developing functional NA enzymes in specific environments, particularly under intracellular conditions.
In addition to the solution environment, compartmentalization also affects the solution conditions and NA behaviour. Since cellular environments are composed of lipid compartments, their effect on the structural stability of NAs and activity of NA enzymes should be critical from an evolutionary viewpoint, especially as a protocell model. The effects of the compartments on NA behaviour have been studied using reverse micelles and liposomes. Reverse micelles can create a nano-confinement space of variable size by changing the ratio of surfactants such as sodium bis(2-ethylhexyl)sulfosuccinate. These conditions in reverse micelles efficiently decrease DNA duplex stability.161 Interestingly, non-duplex structures such as G-quadruplex and i-motif formations are promoted in the nano-confinement space by the reverse micelles.162,163 These findings indicate that solution compartments within the nanometre size range alter the solution environment, similar to molecular crowding, and cause the destabilisation of duplexes and stabilisation of non-duplexes. For NA enzyme activity, the excluded volume effect caused by compartmentalisation significantly promotes reaction activity. In one case, the reaction kinetics and conformational folding of hairpin ribozymes within a liposome were investigated.164 The conditions inside the liposome (100 nm in diameter), prepared from a 1:1 mixture of oleic acid and 1-palmitoyl-2-oleoyl-glycero-3-phosphocholine, enhanced both intermolecular and intramolecular RNA interactions. Simultaneously, it promoted the proper folding of tertiary structures, including the docked conformation of the active hairpin ribozyme and its characteristic triplex arrangement. Moreover, the misfolding rate of the active structure was reduced, contributing to the promotion of ribozyme activity. A similar phenomenon was observed in the case of a self-aminoacylating ribozyme.165
Membrane-less compartments formed by LLPS also provide a confinement environment for NA enzymes. The Ddx4 LLPS decreases the stability of DNA and RNA duplexes, similar to PEG200-based in vitro crowding conditions.157 Cationic polymers or peptides are used to form LLPS with the HH, hairpin, and R3C ligase ribozymes and the 10–23 DNAzyme, all of which are activated, unlike in bulk solutions.166,167 The unique LLPS conditions can simulate the ionic conditions required for ribozyme activity by concentrating ions such as Mg2+.168 Although the physicochemical properties of LLPS are not clearly defined, ribozyme activation has been recreated under molecular crowding conditions.169–171 Thus, future parameterization of solution properties mimicked by in vitro crowding conditions can provide a useful index to predict the stability and function of NA enzymes in intracellular membrane-bound or membrane-less organelles, including the nucleus, mitochondria, nucleolus, and stress granules. For the application of NA enzymes in cells, their chemical modification is necessary to avoid degradation. As chemical modifications can affect both the stability and tertiary structure of NAs,172,173 a comprehensive analysis of the effects of chemical modifications will also impact the development and functional prediction of NA enzymes.
As computational structural predictions advance in the field of NAs, an accurate prediction model for NAs, similar to AlphaFold, may be established in the near future. However, two key issues, which the AlphaFold algorithm does not account for, remain: (1) prediction of the de novo structure from primary sequence information and, more importantly, (2) the effect of the molecular environment on structure formation. Since NAs fold dynamically to form tertiary structures from single strands and are more sensitive to the environment than proteins, basic energetic information on how the environment affects their base pairing is required for accurately predicting the structure and activity of NA enzymes. While computational approaches, including AI and machine learning techniques, have progressed rapidly, the necessary experimentally obtained fundamental databases have not been adequately collected. However, as reviewed in this article, the key factors determining the folding and activity of NA enzymes at the sequence level have now been identified, based on the accumulation of chemical properties of NAs affecting their thermodynamics (Fig. 10). The elucidation of the underlying chemistry, driven by a large dataset collected under various critical situations, would be useful for designing AI and machine learning techniques to solve the structure and activity of NA enzymes within specific environments. Moreover, the de novo generation of NA enzymes that function actively in targeted environments could be realized without requiring a massive database of experimentally solved tertiary nucleic acid structures.
Fig. 10 Required elements for the rational design of active NA enzymes in cells from NA sequence and environment information.
This journal is © The Royal Society of Chemistry 2025