Nina Sheng Li‡ a, Adriana Coll De Peña‡ b, Matei Vaduva c, Somdatta Goswami d and Anubhav Tripathi* b
a The Warren Alpert Medical School, Brown University, Providence, RI 02906, USA
b Center for Biomedical Engineering, School of Engineering, Brown University, 182 Hope Street, Providence, RI, USA
c Department of Molecular Biology, Cell Biology, and Biochemistry, Division of Biology and Medicine, Brown University, Providence, RI, USA
d Department of Civil and Systems Engineering, Johns Hopkins University, Baltimore, MD, USA
First published on 15th July 2025
RNA-based therapeutics are currently at the forefront of the biopharmaceutical industry because of their safety, efficacy, and shortened time from disease discovery to therapy development. Microfluidic electrophoresis provides a powerful analytical platform for analyzing nucleic acids in unprecedented detail. However, while DNA has been studied extensively within microfluidic systems, limited data are available for RNA, particularly for chemically modified molecules, such as those used in the COVID-19 mRNA vaccines, and for long double-stranded RNA molecules, which may accompany RNA therapeutics either intentionally or as a by-product. To this end, this study focused on the empirical microfluidic electrophoretic analysis of double- and single-stranded RNA, non-modified and pseudouridine-modified, at varying gel concentrations. It then compared the findings to the electrophoretic mobility models in the literature. This work was complemented with data-driven and physics-informed neural networks that successfully predicted the migration time and length of different RNA molecules with an average error of 12.34% for the data-driven model and 0.77% for the physics-informed model. The low error of the physics-informed neural networks opens the door to the electrophoretic characterization of molecules, even beyond RNA, without the need for extensive experimental data.
The current models describing nucleic acid electrophoretic mobility are differentiated according to the relative sizing between the pore size of the semi-dilute polymeric network and the nucleic acid size, as defined by the radius of gyration (Rg).12 The main models include the Ogston model, describing DNA molecules with Rg smaller than the pore size, and the Biased Reptation with Fluctuation (BRF), describing DNA molecules with Rg greater than the pore size. The BRF model is further differentiated into two scenarios: reptation without orientation and reptation with orientation. The Ogston model assumes the DNA is a spherical object moving through a sieve driven by the electric field.12,13 In this model, mobility is proportional to the exponential of the negative concentration of the polymer solution.12,13 According to the BRF, mobility scales as 1/N for short chains and levels off for large sizes and/or high electric fields.12,14 Each model has respective limitations and alternative modifications have also been made to better describe nucleic acid movement in capillary electrophoresis.15–18 While there has been abundant research in the electrophoretic separation of DNA, much less work has focused on the separation and mobility models of RNA, especially long nucleoside-modified mRNA and double-stranded RNA (dsRNA).19–21
This study aims to explain the electrophoretic mobility of different RNA molecules in microfluidic systems and understand the underlying physical principles that govern the migratory patterns of differently sized RNA in varying concentrations of semi-dilute polymer solutions. Given their clinical relevance in mRNA vaccines and therapies, a focus is placed on the mobility of both single- and double-stranded RNA and the potential impact of nucleoside modifications.22–24 To the authors’ knowledge, this is the first time the electrokinetics of dsRNA fragments, especially those of longer length, and of nucleoside-modified RNA have been studied in microfluidic capillary electrophoresis. With the rise in RNA research, the mobility of RNA with pseudouridine modifications, which enhance RNA stability and decrease its immunogenicity,23,25 will be important to characterize. Additionally, immunogenic dsRNA, intentional or residual, will be a critical component in future vaccine and alternative therapy research and development.
This paper also aims to provide predictive modeling of the electrophoretic mobility of single- and double-stranded RNA of varying lengths under different conditions using artificial neural networks (ANNs), a class of machine learning tools,26–29 to provide guidelines for future assay development, diagnostic protocols, quality control platforms, and genetic material differentiation. Both data-driven30,31 and physics-driven32,33 ANNs were trained and achieved low margins of error. This work highlights how ANNs, particularly physics-informed neural networks (PINNs), can be used to increase the understanding of the physical behavior of biological samples, such as their electrophoretic mobility, and to support the development of analytical methods.
Once the samples were diluted to the desired concentration, 10–15 μL were loaded onto a 384-well plate, and the well plate and chip were loaded onto the platform. Independent of gel concentration, a script containing the same loading, injecting, and separation voltages, which have been described in our previous study, was used for all experiments.4 However, the separation time was increased depending on the gel concentration to ensure all peaks were captured. After the script was run, the LabChip Reviewer software (Revvity) was used to visualize the electropherograms.
Unless otherwise specified, all experiments were conducted in 2–3 experimental repeats with 2–3 instrumental repeats, yielding 6–9 data points per condition and sample tested. The statistical analyses were conducted using GraphPad Prism 9.4.1 (681), and significance was assessed by a Tukey post hoc test with a confidence interval of 95%; *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001. GraphPad Prism 9.4.1 was also used to generate the non-linear regression fits reported across the study.
![]() | (1) |
![]() | (2) |
When analyzing mobility, it is important to examine the relationship between the pore size of the sieving matrix and the radius of gyration of the nucleic acid. Defined as the root-mean-square distance between the parts of an object and its center of mass, this measure provides information about the average shape of the nucleic acid that the end-to-end distance does not. This is essential to characterize because electric fields can induce different molecular conformations.36 The radius of gyration of nucleic acids can be approximated by:37
![]() | (3) |
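For orientation, one standard worm-like-chain expression for the radius of gyration is the Benoit-Doty formula, which is not necessarily the approximation used in eqn (3). The sketch below evaluates it for a given contour length and persistence length. The function name, the contour length of 1000, and the use of a single shared length unit are assumptions made for illustration; the persistence-length values 2 and 64 are the lp values listed for the test samples.

```python
import math


def radius_of_gyration(contour_length: float, persistence_length: float) -> float:
    """Worm-like-chain radius of gyration (Benoit-Doty expression).

    Both arguments must share one length unit; the result is in that unit.
    """
    L, p = contour_length, persistence_length
    # 1 - exp(-L/p), computed with expm1 for accuracy when L << p
    one_minus_exp = -math.expm1(-L / p)
    rg_squared = (
        (p * L / 3.0)
        - p**2
        + (2.0 * p**3 / L)
        - (2.0 * p**4 / L**2) * one_minus_exp
    )
    return math.sqrt(rg_squared)


# Hypothetical example: two chains of equal contour length (1000 units),
# one flexible (lp = 2) and one stiff (lp = 64).
flexible = radius_of_gyration(1000.0, 2.0)
stiff = radius_of_gyration(1000.0, 64.0)
```

For equal contour length the stiffer chain has the larger radius of gyration, which is why stiffness matters when comparing a molecule to the pore size. In the limits, the expression reduces to a random coil (Rg² ≈ L·lp/3) for L much greater than lp and to a rigid rod (Rg² ≈ L²/12) for L much smaller than lp.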
• Task 1: to predict length of the base sequence/number of base pairs (nb), given the migration time (mt), gel concentration (gc), information about the type of RNA (ssRNA or dsRNA) (tp), and the corresponding persistence length (lp).
• Task 2: to predict the migration time (mt), given the length of the base sequence/number of base pairs (nb), gel concentration (gc), information about the type of RNA (ssRNA or dsRNA) (tp), its persistence length (lp), and the molecular weight of the sequence (Mw).
In these two tasks, all the variables except tp take multiple discrete values; tp is used as an identifier with a value of 1 (for ssRNA) or 2 (for dsRNA). The additional parameter of sequence molecular weight (Mw) was included in task 2 to improve model robustness and to allow the model to capture trends that may deviate from theoretical expectations. Additionally, given varying base composition, modifications, and sequence-specific variations, the inclusion of Mw may help capture additional variability not accounted for by the other input parameters. Mw is directly tied to the unknown output of task 1 and was therefore not included as an input parameter for that task. To efficiently solve the tasks, we developed data-driven and physics-informed neural network frameworks employing deep neural networks. More information regarding deep neural networks, as well as the basis of both data-driven and physics-driven neural networks, can be found in our ESI.†
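The input and output layout of the two tasks can be sketched as follows. This is an illustrative arrangement only: the `Sample` class, the function names, the ordering of the inputs, and the example values are assumptions, while the variable names and the tp encoding (1 for ssRNA, 2 for dsRNA) follow the text.

```python
from dataclasses import dataclass

SSRNA, DSRNA = 1, 2  # tp identifier values stated in the text


@dataclass
class Sample:
    nb: float  # number of bases / base pairs
    gc: float  # gel concentration
    mw: float  # molecular weight of the sequence
    tp: int    # RNA type identifier
    lp: float  # persistence length
    mt: float  # migration time


def task1_example(s: Sample):
    """Task 1: inputs (mt, gc, tp, lp) -> target nb. Mw is deliberately excluded."""
    return [s.mt, s.gc, float(s.tp), s.lp], s.nb


def task2_example(s: Sample):
    """Task 2: inputs (nb, gc, tp, lp, Mw) -> target mt."""
    return [s.nb, s.gc, float(s.tp), s.lp, s.mw], s.mt


# Hypothetical sample; the numbers are illustrative, not measured values.
sample = Sample(nb=1000, gc=0.02, mw=3.2e5, tp=SSRNA, lp=2.0, mt=30.0)
x1, y1 = task1_example(sample)
x2, y2 = task2_example(sample)
```

Keeping Mw out of the task 1 inputs mirrors the reasoning above: it is tied to the quantity being predicted, so including it would leak the answer.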
One primary bottleneck of data-driven neural networks (both deep and shallow) is that a considerable amount of training data is required. In this work, we employ data-augmentation schemes to generate additional labeled datasets from the datasets obtained in the experiments. To that end, we plot the nb vs. mt curves for every gc obtained from the experimental data and obtain the equation of each curve using logarithmic regression. For every gc, we generate an additional 20 sample points, with nb values uniformly distributed between 100 and 4000 bases. The original experimental data as well as the data obtained from the data-augmentation schemes were used for training the ANNs. We describe the anatomy of the deep neural networks and the components of the physics-driven approach below.
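A minimal sketch of this augmentation step for a single gel concentration is given below. It assumes the logarithmic regression takes the form mt = a + b·ln(nb) and that the 20 nb values are evenly spaced between 100 and 4000; neither detail is specified above. The function names and the synthetic observations are hypothetical.

```python
import math


def fit_log_curve(nb_values, mt_values):
    """Least-squares fit of mt = a + b * ln(nb); returns (a, b)."""
    xs = [math.log(nb) for nb in nb_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(mt_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, mt_values))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def augment(a, b, n_points=20, nb_min=100.0, nb_max=4000.0):
    """Generate n_points evenly spaced nb values and their fitted mt."""
    step = (nb_max - nb_min) / (n_points - 1)
    nbs = [nb_min + i * step for i in range(n_points)]
    return [(nb, a + b * math.log(nb)) for nb in nbs]


# Synthetic observations for one gel concentration (not experimental data).
nb_obs = [100, 500, 1000, 2000, 4000]
mt_obs = [20.0 + 3.0 * math.log(nb) for nb in nb_obs]

a, b = fit_log_curve(nb_obs, mt_obs)
extra_points = augment(a, b)  # 20 additional (nb, mt) pairs
```

In practice the fit would be repeated for every gc, and the generated pairs appended to the experimental points before training.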
For task 1, we have considered the migration time as an input parameter. However, this quantity might not be available a priori for an unseen case. Therefore, in task 2, we aim to learn the migration time given the other parameters. Hence, we design the network such that mt = f(nb, gc, tp, lp, Mw). Similar to the previous task, we design a framework with one deep neural network consisting of two hidden layers with 64 neurons each. The network takes the five quantities in the input space as input and outputs a scalar quantity denoting the solution, mt. A schematic representation of the framework is shown in Fig. 1. In this configuration, a single network takes the varying conditions as inputs and predicts the desired solution field, with the loss function consisting solely of the data loss. Notably, the data-driven architecture does not include the second network that outputs α, nor does it incorporate a residual loss.
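The task 2 architecture described above (five inputs, two hidden layers of 64 neurons, one scalar output) can be sketched in plain Python as follows. The tanh activation, the uniform weight initialization, the input scaling, and all function names are assumptions, since these details are not given here; only the layer sizes come from the text.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Weights (n_out x n_in) and zero biases for one dense layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer, activation=None):
    """Apply one dense layer, optionally followed by an activation."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def build_network(n_inputs=5, hidden=64, seed=0):
    """Three dense layers: n_inputs -> hidden -> hidden -> 1."""
    rng = random.Random(seed)
    sizes = [n_inputs, hidden, hidden, 1]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict(network, x):
    h = dense(x, network[0], math.tanh)
    h = dense(h, network[1], math.tanh)
    return dense(h, network[2])[0]  # linear scalar output (mt)


def count_parameters(network):
    return sum(len(w) * len(w[0]) + len(b) for w, b in network)


net = build_network()
# Hypothetical, already-scaled inputs in the order (nb, gc, tp, lp, Mw).
mt_pred = predict(net, [0.25, 0.4, 0.0, 0.03, 0.25])
n_params = count_parameters(net)
```

With these layer sizes the network has (5·64 + 64) + (64·64 + 64) + (64 + 1) trainable parameters, which is small relative to typical deep networks but still large relative to a sparse experimental dataset.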
The network parameters are optimized for both tasks using the loss function and the Adam optimizer, with a learning rate of 1 × 10⁻⁴. The primary goal in developing ANN-based surrogate models is to generalize well to new and unseen data, which is a challenging problem. An under-parametrized model with too little capacity cannot learn the problem, whereas an over-parametrized model with too much capacity can learn it too well and overfit the training dataset. In our work, we experienced overfitting due to the sparse representation of the input space caused by the limited availability of labeled data. One popular approach to improving the generalization of deep neural networks is to use regularization during training, which keeps the weights of the model small. These techniques not only reduce overfitting but can also lead to faster optimization of the model and better overall performance. In the end, we employed weight regularization with a regularization coefficient of 9 × 10⁻³.
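The training ingredients named above can be illustrated with a short sketch. The learning rate and regularization coefficient are the values stated in the text; the mean-squared-error data loss, the L2 form of the weight penalty, and the Adam hyperparameters beta1, beta2, and eps are assumptions (commonly used defaults), and the function names are hypothetical.

```python
import math

LEARNING_RATE = 1e-4  # stated in the text
REG_COEFF = 9e-3      # stated in the text


def regularized_loss(predictions, targets, weights, reg_coeff=REG_COEFF):
    """Mean squared error plus an L2 penalty on the weights (both assumed forms)."""
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)
    penalty = reg_coeff * sum(w * w for w in weights)
    return mse + penalty


def adam_step(w, grad, m, v, t, lr=LEARNING_RATE, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; returns (w, m, v)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


# Hypothetical numbers for illustration only.
loss = regularized_loss([1.0, 2.0], [1.0, 4.0], weights=[0.5, -0.5])
w_new, m1, v1 = adam_step(w=0.5, grad=2.0, m=0.0, v=0.0, t=1)
```

The penalty term grows with the squared weights, so minimizing the total loss discourages large weights, which is the mechanism by which regularization limits overfitting on sparse data.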
![]() | (4) |
For this work, our goal was to obtain the scalar parameter α along with nb for task 1 and mt for task 2. For this task, we consider the biased reptation model defined as:
![]() | (5) |
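The general structure of a physics-informed loss, in which a data term is combined with the residual of a governing equation, can be sketched as below. This is a generic illustration rather than the loss used in this work: the residual values would come from evaluating the governing equation (which depends on α) at the network outputs, and the function names, the equal weighting of the two terms, and the example numbers are assumptions.

```python
def mean_square(values):
    """Mean of the squared entries."""
    return sum(v * v for v in values) / len(values)


def pinn_loss(predictions, targets, residuals, residual_weight=1.0):
    """Composite loss: data misfit plus governing-equation residual.

    residuals holds the governing equation evaluated at the network outputs;
    each entry is zero when the physics is satisfied exactly.
    """
    data_loss = mean_square([p - t for p, t in zip(predictions, targets)])
    physics_loss = mean_square(residuals)
    return data_loss + residual_weight * physics_loss


# Hypothetical numbers for illustration only.
total = pinn_loss(predictions=[1.0, 2.0], targets=[1.0, 3.0], residuals=[0.5, -0.5])
```

Because the residual term depends on α, minimizing the composite loss adjusts α together with the network weights, which is what allows the parameter to be recovered from the data.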
The electrophoretic mobilities of RNA of varying size and type were calculated and are shown in Fig. 2. The overall sigmoidal curve in the double-logarithmic mobility vs. size plots agrees with previous DNA and RNA capillary electrophoresis separation studies.34,42 However, this shape is much less evident for ssRNA, especially at higher gel concentrations (Fig. 2b). Transitions between Ogston-like sieving (regime I), reptation without orientation (regime III), and reptation with orientation (regime IV) were marked with solid gray lines upon visual inspection, following patterns outlined by previous studies.43,44 Regime II, which was only prominent in dsRNA, marks the regime described by Heller,34 where Rg may be greater than the pore size, but the dsRNA is still too stiff to reptate.
According to the Ogston model,
![]() | (6) |
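For orientation, the Ogston relation is often written in the following textbook form, which may differ in notation from eqn (6). The symbols μ0 (free-solution mobility), K_r (retardation coefficient), and C (polymer concentration) are assumed here for illustration:

$$
\mu = \mu_0 \exp(-K_r C) \quad\Longleftrightarrow\quad \ln \mu = \ln \mu_0 - K_r C
$$

Under this form, a plot of ln μ against C is linear with slope −K_r for a fragment in the Ogston regime, so a poor linear fit suggests that Ogston-like sieving does not describe the fragment under those conditions.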
Three distinct regions can be seen for dsRNA in Fig. 2a. Given our pore size analysis and the poor fit of the Ogston model, the first region of the data points may be better described as what Heller identifies as a transition region, where Rg is greater than the pore size but the molecules are too stiff to reptate.34 As gel concentration increases, this distinction becomes less prominent, and there appears to be a linear decrease in mobility in Fig. 2a until a plateau is reached in regime IV.
Unlike Ogston-like motion, the reptation model was developed for larger DNA due to the assumption that the spherical coil would be too large to fit through the pores of the matrix undeformed and would instead migrate head-first in a “snake-like” motion through “tubes” formed by the polymeric pore networks.45 Later improved by researchers such as Slater and Viovy, the model, still challenging to utilize mathematically, was modified to account for larger nucleic acid sizes, high electric fields, and the dynamic nature of uncrosslinked polymer networks (eqn (4)).46
Regime III can be correlated with regions of reptation without orientation, as seen by the linear decrease in the double logarithm of mobility as a function of fragment length (Fig. 2a and b). Another representation of this delineated region can be found in the ESI (Fig. S3†). As eqn (4) demonstrates, the first term dominates for molecules below a critical size (N* = Nk), and mobility is inversely proportional to fragment size. However, as Nk becomes larger than N*, a plateau mobility is reached. The “reptation with orientation” regime that describes separation failure is thought to be partly due to the electric forces leading long fragments to choose a tube of consecutive pores that does not follow a random walk in space, so that the resulting mobility is independent of size.37 The reptation regime has also been explained by other authors as the effect of an increased number of charged residues being countered by increased solvent friction due to large nucleic acid size, the latter of which can also be attributed to collisions and subsequent transient dragging of polymer chains.15,16 This plateau can be seen around 1000 bp for dsRNA (Fig. 2a) and seems to be reached slightly later for ssRNA/mRNA under the same conditions, as shown in the region marked regime IV in Fig. 2b. The higher critical sizes of single-stranded nucleic acids agree with previous findings.41 Similar to what was previously noted for ssDNA and dsDNA, this critical size seems to increase with decreasing polymer concentration for ssRNA but to remain constant for dsRNA,34,42 although this is hard to confirm due to gaps in the data arising from sample limitations.
However, other studies have found the transition between regimes to be dependent on solution concentration only for RNA and not for ssDNA, which may be explained by the short-lived secondary structures that ssRNA can form, resulting in increased stiffness.41 As expected, the resolving power for larger fragment sizes is poor at very low gel concentrations, but separation also fails earlier as gel concentration is increased, as shown mainly by ssRNA. However, there was a consistent increase in peak width as gel concentration increased for both dsRNA and ssRNA.
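The crossover between a 1/N regime and a size-independent plateau can be illustrated numerically. The sketch below uses a simple two-term interpolation chosen for illustration, not eqn (4) itself: a reduced mobility whose first term scales as 1/(3N) and whose second term is a constant plateau. The function names and the plateau value are hypothetical.

```python
import math


def crossover_mobility(n_segments: float, plateau: float) -> float:
    """Reduced mobility interpolating between 1/(3N) and a size-independent plateau."""
    return math.sqrt((1.0 / (3.0 * n_segments)) ** 2 + plateau ** 2)


def critical_size(plateau: float) -> float:
    """Size at which the 1/(3N) term equals the plateau term."""
    return 1.0 / (3.0 * plateau)


PLATEAU = 0.01  # hypothetical reduced plateau mobility
n_star = critical_size(PLATEAU)

# Well below the critical size the 1/(3N) term dominates.
short_chain = crossover_mobility(n_star / 10.0, PLATEAU)
# Well above the critical size the mobility is close to the plateau.
long_chain = crossover_mobility(n_star * 10.0, PLATEAU)
```

In this toy form, a higher plateau lowers the critical size, so size-based separation fails at shorter fragment lengths.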
Fig. 2a and b demonstrate that the difference in mobility for a single fragment between different gel percentages is relatively constant for dsRNA but not for ssRNA, which is consistent with the findings of Heller for dsDNA and ssDNA.34 With extraction and transformation of some data points, a semilogarithmic plot (Fig. 3) was constructed to directly assess the dependence of mobility on pore size for fragment sizes 500 and 4001, which should fall under reptation without orientation and reptation with orientation, respectively. The dependence of dsRNA mobility on pore size was similar for both fragment sizes, with slopes of approximately 0.47. However, a considerable difference was seen for the ssRNA counterpart, suggesting that future models or analytical methods may be able to use a similar mobility dependence for dsRNA fragments in this size range, but not for ssRNA, to predict elution time or design experimental parameters.
Fig. 3 Dependence of RNA electrophoretic mobility on pore size. dsRNA of 500 bp and 4001 bp, in blue and pink, respectively. ssRNA of 500 nt and 4001 nt, in purple and green, respectively.
To analyze the potential difference in mobility between dsRNA and ssRNA in terms of gel concentration, the mobility of both RNA types was plotted against fragment size for each gel condition (Fig. 4). At low gel concentrations (1% and 2%), the mobilities of ssRNA and dsRNA are very similar across all fragment sizes (Fig. 4a), but as gel percentage increases, dsRNA mobility becomes clearly higher than that of ssRNA (Fig. 4b). This difference becomes more pronounced as fragment size increases. The mobilities of dsRNA and ssRNA in 1% gel are approximately equal (<5% difference). For gel concentrations greater than 2%, the difference in mobility between dsRNA and ssRNA increases as fragment size increases above ∼500 bases for each gel percentage (Fig. 4b); below this size, the mobility difference is less than 10%. Additionally, as gel percentage increases from 1% to 5%, the difference in mobility for the largest recorded fragment increases from 2.2% to 60.5%. This trend was also seen when analyzing raw migration time, demonstrating that higher gel percentages could be used to increase the separation between dsRNA and ssRNA. Adjusting gel concentration, and therefore pore size, may be a simple route to high-throughput separation of RNA mixture products that then allows for identification, such as in the context of quality control operations.
The migration (Fig. S1a and S2a†) and effective mobility (Fig. 5a and b) of modified dsRNA are very similar to those of its non-modified counterpart, and the difference in mobility is not statistically significant (Fig. 5a and b). Although a similar direct comparison cannot be made for ssRNA due to sample constraints, the migration profiles (Fig. S1b and S2b†) are similar, and the effective mobility of modified RNA fits well into the trends seen for its non-modified counterpart (Fig. 5c). The type and extent of modification that RNA undergoes will vary depending on the desired application, but these findings suggest that the resulting electrophoretic analysis may not be significantly affected.
While the BRF model allows for some descriptive characterization of mobilities, the actual mathematical equation is very difficult to use in practice for predicting nucleic acid migratory behavior. Therefore, ANNs were employed, including PINNs,33 in which the networks are trained using a modified loss function that includes both the governing equation and the training data. Table 1 shows the results of the developed PINN, where the predicted values of nb obtained for ssRNA and dsRNA test samples are compared against the ground truth, represented by the nb on the axes.
Table 1 PINN-predicted nb for ssRNA and dsRNA test samples. The vertical axis labeled nb represents the ground truth.

nb | 1 | 2 | 3 | 4 | 5 |
---|---|---|---|---|---|
ssRNA | | | | | |
500 | 486.9 | 486.9 | 486.9 | 486.9 | 486.9 |
1000 | 988.7 | 988.7 | 988.7 | 988.7 | 988.7 |
2000 | 1992.3 | 1992.3 | 1992.3 | 1992.3 | 1992.3 |
3000 | 2995.8 | 2995.8 | 2995.8 | 2995.8 | 2995.8 |
4000 | 4002.1 | 4002.1 | 4002.1 | 4002.1 | 4002.1 |
dsRNA | | | | | |
500 | 521.4 | 521.4 | 521.4 | 521.4 | 521.4 |
1000 | 1008.6 | 1008.6 | 1008.6 | 1008.6 | 1008.6 |
2000 | 1984.3 | 1984.3 | 1984.3 | 1984.3 | 1984.3 |
3000 | 2996.8 | 2996.8 | 2996.8 | 2996.8 | 2996.8 |
4000 | 3996.8 | 3996.8 | 3996.8 | 3996.8 | 3996.8 |
Similarly, in Table 2, we highlight the predicted migration time of ssRNA and dsRNA test samples and contrast them to the ground truth.
Table 2 Predicted migration time (mt) of ssRNA and dsRNA test samples. The value in brackets is the ground truth.

nb | ssRNA, 1 | ssRNA, 2 | ssRNA, 3 | ssRNA, 4 | ssRNA, 5 | dsRNA, 1 | dsRNA, 2 | dsRNA, 3 | dsRNA, 4 | dsRNA, 5 |
---|---|---|---|---|---|---|---|---|---|---|
500 | 24.7 (24.3) | 28.6 (27.0) | 38.2 (37.0) | 45.9 (42.5) | 53.6 (49.1) | 23.2 (25.2) | 25.8 (28.6) | 32.7 (33.2) | 38.3 (38.5) | 43.8 (41.3) |
1000 | 27.1 (25.8) | 32.6 (29.7) | 46.3 (43.3) | 57.3 (51.5) | 68.2 (61.7) | 23.8 (26.5) | 27.1 (30.8) | 35.3 (35.9) | 41.8 (42.0) | 48.3 (45.2) |
2000 | 29.8 (27.9) | 36.2 (34.3) | 55.2 (55.4) | 69.8 (67.6) | 84.3 (86.5) | 24.6 (28.1) | 28.3 (32.9) | 37.6 (38.5) | 45.1 (45.5) | 52.5 (49.1) |
3000 | 31.2 (28.8) | 39.4 (36.2) | 60.0 (60.0) | 76.5 (73.9) | 93.0 (95.6) | 24.9 (28.9) | 28.9 (34.2) | 38.9 (40.1) | 46.8 (47.5) | 54.8 (51.4) |
4000 | 32.3 (29.4) | 41.3 (37.5) | 63.8 (63.2) | 81.8 (78.3) | 99.8 (102.1) | 25.2 (29.5) | 29.3 (35.1) | 39.7 (41.2) | 47.9 (49.0) | 56.2 (53.0) |
Results from Tables 1 and 2 demonstrate the feasibility of using artificial neural networks, specifically PINNs, to predict nucleic acid characteristics such as migration time and size from other known parameters with good accuracy. These values could aid method development by, for example, guiding decisions on assay parameters to prevent signal peak overlap, differentiating nucleic acids of different types, characterizing nucleic acid size with limited ladder samples, and automating devices.
Table 3 summarizes the relative error (%) computed for five test samples (details provided in Table 4) for both ANN frameworks and both tasks. Test samples are cases that were not provided to the network during training; they were chosen randomly from the available dataset. For both tasks, the PINNs model performs better than the data-driven model, and both frameworks perform better at task 1 than at task 2.
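As a reminder of how a relative error of this kind can be computed, the sketch below averages the absolute percentage deviation over a set of (predicted, ground truth) pairs. The definition used (absolute difference divided by the ground truth) is an assumption, and the function names are hypothetical. The example pairs are the migration times from the first ssRNA row (500 nt) of Table 2, not the five held-out test samples.

```python
def relative_error_percent(predicted: float, truth: float) -> float:
    """Absolute deviation from the ground truth, as a percentage of the truth."""
    return abs(predicted - truth) / abs(truth) * 100.0


def mean_relative_error(pairs):
    """Average relative error (%) over (predicted, truth) pairs."""
    errors = [relative_error_percent(p, t) for p, t in pairs]
    return sum(errors) / len(errors)


# (predicted, ground truth) migration times from the 500 nt ssRNA row of Table 2.
table2_row = [(24.7, 24.3), (28.6, 27.0), (38.2, 37.0), (45.9, 42.5), (53.6, 49.1)]
row_error = mean_relative_error(table2_row)
```

Averaging per-sample percentages, rather than pooling absolute errors, keeps short and long migration times on an equal footing.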
Table 3 Relative error (%) for the five test samples, for both ANN frameworks and both tasks.

Framework | Task 1 | Task 2 |
---|---|---|
Data-driven | 9.56 | 15.12 |
PINNs | 0.44 | 1.1 |

Table 4 Details of the five test samples.

Samples | nb | gc | Mw | tp | lp | mt |
---|---|---|---|---|---|---|
1 | 1000 | 0.01 | 320… | 1 | 2 | 25.84166667 |
2 | 500 | 0.04 | 160… | 1 | 2 | 42.46833333 |
3 | 1800 | 0.03 | 1… | 2 | 64 | 38.9425 |
4 | 80 | 0.02 | 51… | 2 | 64 | 22.89 |
5 | 700 | 0.02 | 449… | 1 | 2 | 29.855 |
As expected, given the relatively limited data set, the PINNs model significantly outperforms the data-driven model, as it can compensate for the limited data through the physical equations that govern the experiments. This application of PINNs highlights their potential for electrophoretic analysis, which could extend far beyond the analysis of single- and double-stranded RNA molecules. Finally, we computed the value of α from the PINNs framework and report α = 192.77. This is essentially an inverse problem, in which the network predicts the value through minimization of the residual of the governing equation. We have employed the data generated using the data augmentation approach in the PINNs model. In engineering and biomedical problems, data collection using numerical or physical experiments is often expensive and time-consuming. This study demonstrates that ANNs can be a useful tool for determining complex relationships where governing solutions are not clearly known or when there is insufficient information regarding the relationship between inputs and outputs, as seen through our discussion on nucleic acid electrophoretic mobility.
Following a thorough experimental study of RNA mobility in microfluidic electrophoresis, it was clear that a comprehensive model that can provide clear prediction and/or description of RNA mobility under varying conditions is lacking. Real experimental conditions often introduce additional complications that static equations cannot adapt to or account for. Due to the limitations of current models and the importance of predicting RNA size and migration time for assay development, ANNs were developed with the intent of guiding future therapeutic development and analysis with decreased experimental trial and error during method development. It was expected that, by adapting the model through modulation of the parameters and hidden layers of the neural network, adequate predictions could be made. This was supported by the low margin of error of the final predictions of the PINN.
Limitations of the developed model include generalizability only to the ranges and conditions used to train the model. While the tested RNA lengths cover the sizes of most current RNA therapies, the rapidly developing field of RNA therapeutics could soon include far greater sizes. The validation of the model was done with limited samples due to sample constraints, but we believe that the developed model shows that such a methodology is feasible for the prediction of complex relationships seen in microfluidic electrophoresis and in biopharmaceutical development. As with our discussion on chemically modified RNA, we hope that future studies with a greater variety of RNA samples can better validate and strengthen our discussion of predictive models and their applicability.
The findings presented in this study serve as an example of how ANNs, particularly PINNs, can be used to complement limited data sets for assessing the electrophoretic behavior of single- and double-stranded RNA molecules. In this study, using PINNs reduced the relative error from 9.56% with the data-driven model to 0.44% in task 1, and from 15.12% to 1.1% in task 2. The developed PINN has applications in streamlining the development of analytical methods for RNA therapeutics through its ability to determine optimal run conditions for given mRNA construct lengths, prevent overlap of the target RNA strand with potential impurities, and determine contaminant (i.e., dsRNA or truncated ssRNA) length given the run conditions and migration time. This added capability can have significant implications for the biopharmaceutical industry, where machine learning can help streamline the development process, which can involve countless permutations within a given product.
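The idea of an inverse problem, recovering an unknown scalar by minimizing the residual of a governing relation, can be shown with a deliberately simple toy. The linear relation y = α·x used below is a stand-in chosen for clarity, not the biased reptation model, and the function name, learning rate, step count, and data are all hypothetical. In a PINN, α would be updated jointly with the network weights rather than on its own.

```python
def estimate_alpha(xs, ys, alpha_init=0.0, lr=0.05, n_steps=200):
    """Gradient descent on the mean squared residual of y - alpha * x."""
    alpha = alpha_init
    n = len(xs)
    for _ in range(n_steps):
        # d/d(alpha) of mean((alpha * x - y)^2)
        grad = sum(2.0 * (alpha * x - y) * x for x, y in zip(xs, ys)) / n
        alpha -= lr * grad
    return alpha


# Synthetic data generated with a known alpha, to check that it is recovered.
TRUE_ALPHA = 2.5
x_data = [1.0, 2.0, 3.0]
y_data = [TRUE_ALPHA * x for x in x_data]

alpha_hat = estimate_alpha(x_data, y_data)
```

The same principle applies when the relation is nonlinear and the residual is embedded in a network's loss: the parameter value that best reconciles the data with the governing equation is the one the optimizer converges to.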
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5an00381d
‡ These authors contributed equally.
This journal is © The Royal Society of Chemistry 2025