Filipp Gusev ab, Benjamin C. Kline c, Ryan Quinn d, Anqin Xu c, Ben Smith c, Brian Frezza cd and Olexandr Isayev *ab
aDepartment of Chemistry, Mellon College of Science, Carnegie Mellon University, 4400 Fifth Ave, Pittsburgh, PA 15213, USA. E-mail: olexandr@olexandrisayev.com
bRay and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA 15213, USA
cEmerald Cloud Lab, 15500 Wells Port Dr, Austin, TX 78728, USA
dEmerald Therapeutics, 15500 Wells Port Dr, Austin, TX 78728, USA
First published on 29th October 2025
Automation of experiments in cloud laboratories promises to revolutionize scientific research by enabling remote experimentation and improving reproducibility. However, maintaining quality control without constant human oversight remains a critical challenge. Here, we present a novel machine learning framework for automated anomaly detection in High-Performance Liquid Chromatography (HPLC) experiments conducted in a cloud lab. Our system specifically targets air bubble contamination—a common yet challenging issue that typically requires expert analytical chemists to detect and resolve. By leveraging active learning combined with human-in-the-loop annotation, we trained a binary classifier on approximately 25 000 HPLC traces. Prospective validation demonstrated robust performance, with an accuracy of 0.96 and an F1 score of 0.92, suitable for real-world applications. Beyond anomaly detection, we show that the system can serve as a sensitive indicator of instrument health, outperforming traditional periodic qualification tests in identifying systematic issues. The framework is protocol-agnostic, instrument-agnostic, and, in principle, vendor-neutral, making it adaptable to various laboratory settings. This work represents a significant step toward fully autonomous laboratories by enabling continuous quality control, reducing the expertise barrier for complex analytical techniques, and facilitating proactive maintenance of scientific instrumentation. The approach can be extended to detect other types of experimental anomalies, potentially transforming how quality control is implemented in self-driving laboratories (SDLs) across diverse scientific disciplines.
The emergence of cloud laboratories is transforming autonomous experimentation by enabling remote execution of complex biological and chemical research with enhanced reproducibility, scalability, and accessibility. These facilities integrate robotic automation and networked control systems to conduct experiments continuously and in parallel, significantly reducing physical and logistical constraints. The pioneering work by CMU alumni through Emerald Cloud Lab (ECL) provides researchers with a suite of instruments for biological and chemical experimentation at scale. While cloud laboratories hold great promise for democratizing access to sophisticated experimental infrastructure, they also introduce challenges related to remote troubleshooting, real-time experimental adaptability, and standardization across diverse research domains. As these platforms evolve, they present an opportunity to accelerate scientific discovery while necessitating new frameworks for data integrity, automation-driven research methodologies, and integration into traditional experimental workflows.
This study focuses on improving High-Performance Liquid Chromatography (HPLC) in the Cloud Lab. HPLC is an essential analytical technique used across various scientific disciplines, from pharmaceuticals and biotechnology to environmental studies, making it a prime target for automation in a Cloud Lab environment. In a traditional lab, modern HPLC instruments incorporate basic automation (e.g., for liquid handling, sample collection, and executing predefined protocols); however, a scientist is still often required to be present to monitor the data readout and ensure its validity as well as proper functioning of the instrument. This manual oversight enables experts to catch issues such as pressure fluctuations due to air bubbles, clogs, leaks, empty mobile phase bottles, and other unexpected system behaviors. In contrast, in autonomous closed-loop systems like Cloud Lab, such real-time human intervention is impractical given the high-throughput use of instruments and the need for a fully closed data flow in the Design-Make-Test-Analyze (DMTA) cycle.
Many integrated data analysis systems for HPLC instruments, for example those performing peak assignment, operate on the assumption that the recorded signal is valid, which is true in most cases. In the field of data-driven research, it is common to ‘trust but verify’ historically accumulated data,16,17 or to re-generate it de novo to minimize discrepancy among data sources or avoid implicit biases. The classic ‘garbage in, garbage out’18 principle can undermine any data-driven system; closed-loop experiments like Bayesian optimization are among the most vulnerable. A Bayesian optimization algorithm, once misdirected by a false signal, will require several observations (or rounds, in batch execution) of data acquisition to self-correct, at the cost of time and resources at best, and may fully degrade at worst. Currently, verifying correct execution, a routine step in computer science, is overlooked among the target metrics used to evaluate self-driving laboratories.19
HPLC chromatogram peaks, as frequently monitored by absorbance, can be negatively affected by many variables, including column health and age, purity of the sample, and—germane for this investigation—air bubbles. Although most modern instruments have some means to detect common and expected pitfalls (e.g. by qualification/controlled experiments), complications arise when rare, stochastic events occur during large-scale experimental campaigns. Air bubbles—one such pitfall—can disrupt an HPLC experiment: when air enters the buffer tubing, it will eventually reach the column, where the chemical separation occurs. These intermittent pockets of air alter the interactions between analytes and the stationary phase, often leading to unpredictable retention times (Fig. 1A), distorted peak shapes (Fig. 1B), loss of a peak (Fig. 1C), or even a chromatogram that is uninterpretable to the scientist. Moreover, the presence of air bubbles may be especially problematic for preparative HPLC experiments, where the entire source sample is consumed during the experiment and repeating the protocol is not always an option.
Several user behaviors or instrument shortcomings can lead to the introduction of air bubbles and resultant pressure fluctuations in an HPLC run. Air bubbles in HPLC systems are most commonly introduced when mobile phases are not adequately degassed, allowing dissolved gases to come out of solution under pump pressure. Temperature fluctuations between different parts of the system can also reduce gas solubility and trigger bubble formation. In addition, leaks at pump seals, fittings, or inlet lines can draw air into the system, while insufficient priming after solvent changes may leave residual air pockets in the tubing. Together, these factors represent the primary sources of bubble formation in HPLC.20
Several factors can influence peak shape and retention time in HPLC, including column age or identity, mobile phase composition, temperature, sample concentration, and flow rate variations. A major advantage of a cloud laboratory is that all collected data are linked within a central database, enabling rapid root-cause analysis of problems and anomalies. The representative data shown in Fig. 1A–C were selected such that the other variables affecting peak shape were held constant, with the main difference being the pressure trace during the run. Column age or health is the most likely alternative cause; to mitigate this, standards are routinely run on all columns to ensure they are not used beyond their effective lifetime.
We designed an automatic anomaly detection system for HPLC experiments that operates on the fly and without human intervention. The system is based on a binary classifier: the ML model treats HPLC experiments affected by air bubble contamination as the positive class (class 1) and unaffected experiments as the negative class (class 0). Focusing on air bubbles, we analyzed HPLC pressure data—which exhibit a characteristic pattern when air is introduced into the HPLC tubing—and employed active learning combined with a human-in-the-loop approach to develop the model efficiently. The overall workflow (see Fig. 2) comprises three major steps: (i) initialization of training data (Fig. 2A); (ii) ML model building via a human-in-the-loop approach (Fig. 2B); and (iii) deployment and performance measurement of the final ML model (Fig. 2C).
Once the ML model reached optimal performance, it was deployed in the Cloud Lab to autonomously screen HPLC experiments in real time. The purpose of the trained model is to screen and identify affected HPLC traces autonomously. Two prospective validation steps—one at the experiment level and one at the instrument level—were performed to ensure that the model's predictions align with real-world scenarios, thereby confirming its reliability and effectiveness.
A variety of instrument- or software-specific methods currently exist to detect leaks, empty buffer bottles, or pressure fluctuations during experimental HPLC runs (e.g., Shimadzu Nexera-40/LabSolutions, Waters Alliance iS/Empower, Thermo Scientific Vanquish Core/Chromeleon, and Agilent InfinityLab 1290 III/OpenLab).23–26 However, these methods often lack transparency regarding their underlying mechanisms and accuracy. More importantly, users generally cannot modify or improve commercial models to suit their specific needs. Here, we present an open-source anomaly detection approach that is adaptable, retrainable, and can ultimately be tailored to the user's requirements.
We designed our automatic workflow (see Fig. 2) in a data-driven manner rather than relying on rule-based or hardware-based approaches. This strategy makes the system adaptable and generalizable to other rare pitfalls that become observable as the database of HPLC experiments accumulates at scale. Our workflow began with the collection of approximately 25 000 HPLC experiments from a diverse set of chromatographic methods, instruments, and protocols (see Initial dataset for details, Tables S2 and S5), a dataset large enough to capture infrequent events like air bubble contamination. An initial subset of these data was reviewed by a human expert, who observed and annotated anomalous cases, resulting in an initial pool of 93 HPLC experiments affected by air bubble contamination. The initial pool of affected cases is relatively small because of the infrequency of occurrence (a conservative a priori estimate was ≈1%) coupled with the infeasibility of explicitly annotating the whole dataset, given its size. Although air bubble contamination is a common issue in HPLC, the low observed frequency is expected in a well-maintained system. Such an imbalance can be challenging for an ML model and may introduce bias. To address this, we employed Stochastic Negative Addition27 (SNA), which stochastically adds negative (“unaffected”) examples to the training set to ensure balanced representation while minimizing further annotation effort. After preliminary analysis (see Classical ML for details, Fig. S1), we decided to target a 1:10 class ratio (the ratio of positive samples, i.e. those affected by air bubbles, to negative, i.e. normal, samples) for the initial dataset and to maintain it for the rest of the modeling stages. SNA has been successfully applied as a balancing strategy in other data-driven domains.27,28
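As an illustration, SNA sampling at a 1:10 ratio can be sketched in a few lines of Python; the helper and the trace identifiers below are hypothetical, not the authors' code:

```python
import numpy as np

def stochastic_negative_addition(positive_ids, candidate_pool, ratio=10, seed=0):
    """Sample `ratio` presumed-unaffected traces per annotated positive one.

    `candidate_pool` holds IDs of traces the current model scores as very
    likely unaffected; sampling from it maintains the 1:10 class ratio
    without annotating the full dataset.
    """
    rng = np.random.default_rng(seed)
    negatives = rng.choice(candidate_pool, size=ratio * len(positive_ids),
                           replace=False)
    return list(positive_ids), negatives.tolist()

# e.g., 93 annotated positives + 930 sampled negatives -> 1023-trace set
positives, negatives = stochastic_negative_addition(list(range(93)),
                                                    list(range(1000, 26000)))
```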
During the model-building phase (see Fig. 2B), we employed an Active Learning (AL) cycle combined with a human-in-the-loop approach to iteratively refine the model with expert input while minimizing overall annotation effort by focusing on the most informative cases. This phase comprised the following steps. (i) Training set creation: a training set comprising both affected (identified by expert annotation) and unaffected (selected using the SNA27 balancing strategy) HPLC traces was assembled in each annotation round. (ii) ML model building: an ML model was built using the training set, then the model screened the dataset of 25k HPLC experiments to identify traces that were potentially affected (requiring further annotation) as well as those most likely unaffected (which were used for SNA later). (iii) Human expert annotation: the flagged traces were reviewed by a human expert who annotated each as affected or unaffected, further enriching the dataset and improving the model's accuracy. This iterative cycle continued until the ML model achieved satisfactory performance; in total, only three rounds (one initial and two AL) of annotation were needed to sufficiently train our model.
For air bubble contamination, we focused on HPLC pressure traces. Since a pressure trace is by nature a time series, we started our modeling with classical ML approaches for time series data. The classical featurization approach29 (see Methods for details) performed well (Fig. S1); however, its resource demands—in terms of memory footprint and processing delays—made it unsuitable for on-the-fly deployment in the Cloud Lab. We therefore transitioned to an end-to-end Deep Learning approach utilizing a 1D convolutional neural network (CNN) coupled with automatic architecture and hyperparameter optimization.30
With complete pressure traces available during analysis, the 1D CNN minimized model size, memory usage, and response time, while preserving the capacity to generalize to other HPLC anomalies in future developments. This approach also avoids the need for labor-intensive, manual feature engineering required by rule-based methods.
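For concreteness, a minimal 1D CNN of this kind might look as follows in PyTorch; the layer counts, channel widths, and kernel sizes are illustrative placeholders, since the actual architecture was found by automated optimization:

```python
import torch
import torch.nn as nn

class BubbleCNN(nn.Module):
    """Minimal 1D CNN for binary classification of HPLC pressure traces (a sketch)."""
    def __init__(self, n_channels=16, kernel_size=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, n_channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(n_channels, 2 * n_channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pooling to one value makes the net length-agnostic
        )
        self.head = nn.Sequential(nn.Flatten(), nn.Linear(2 * n_channels, 1), nn.Sigmoid())

    def forward(self, x):  # x: (batch, 1, n_time_steps)
        return self.head(self.features(x)).squeeze(-1)

model = BubbleCNN()
prob = model(torch.randn(8, 1, 3000))  # 8 traces -> 8 bubble (class 1) probabilities
```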
The model's perception (Fig. 3A), visualized as a UMAP projection (see Methods for details) of the latent representation from the 1D CNN, reveals the structure of the learned feature space. Two regions of very high artifact probability were sampled through the three rounds of expert annotation (Fig. 3B). Unlike normal experimental cases (Fig. 3C), Fig. 3D–G illustrate varying levels of uncertainty in trace annotation. These traces fall between “clean” samples and those clearly containing air bubbles. As annotation progressed through multiple rounds, the focus shifted from simply identifying air bubbles to investigating potential causes of anomalous behavior, particularly for traces near the ML model's decision boundary. Fig. 3G highlights traces exhibiting pressure-related anomalies likely caused by factors other than air bubbles.
These anomalies can be attributed to several technical issues in the HPLC system. Insufficiently tightened barrel-tubing connections often lead to pressure fluctuations as fluid escapes through minute gaps in the assembly. When HPLC buffers run dry, a characteristic pattern emerges where pressure gradually drops toward zero as the system attempts to pull nonexistent fluid through the lines. Pump malfunctions represent another common source of pressure anomalies, creating irregular patterns in the trace data that differ distinctly from the signature patterns of air bubbles but nonetheless require identification and remediation to ensure experimental validity.
Validating these hypotheses would require either generating controlled error states in the laboratory or further accumulation of historical data. Although the anomalous traces shown in Fig. 3G represent only a minor fraction of the total HPLC experiments—and are not yet a significant concern—the continuous aggregation of data in the Cloud Lab facilitates ongoing model retraining. This will enable future refinements to distinguish among various pressure-related artifacts.
For deployment in the Cloud Lab, the ML model was serialized in the ONNX format. This enabled seamless integration into the Emerald Cloud Lab backend: the model is loaded directly into Wolfram Language to analyze HPLC data for bubble likelihood, expressed as a class 1 probability. Immediately after HPLC data from an experiment are parsed and imported into the Cloud Lab database, the pressure traces undergo brief preprocessing to ensure compatibility with the model and to eliminate false positives due to early retention-time pressure instabilities. Each preprocessed pressure data set is then passed to the model to yield a predicted likelihood (between 0 and 1) that the corresponding experiment was contaminated by air bubbles, and the predictions are added to the experiment's metadata.
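Although the production backend loads the ONNX model into Wolfram Language, the same inference step can be sketched in Python with onnxruntime; the file name, trim length, and scaling below are assumptions, since the exact preprocessing is not specified in the text:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("bubble_cnn.onnx")  # hypothetical model file

def bubble_likelihood(pressure, skip_steps=100):
    """Score one pressure trace: trim the unstable start, normalize, run the model."""
    trace = np.asarray(pressure, dtype=np.float32)[skip_steps:]
    trace = (trace - trace.min()) / (np.ptp(trace) + 1e-8)  # min-max scale (assumed)
    inputs = {session.get_inputs()[0].name: trace[None, None, :]}
    return float(session.run(None, inputs)[0].squeeze())    # class 1 probability
```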
Interestingly, the frequency of air bubble-affected HPLC experiments was higher than expected. This observation prompted us to apply the ML model to instrument validation, assessing whether the fraction of bubble-affected traces for a given instrument deviates from expected norms. Although routine qualification experiments (using known standards) ensure instrument reproducibility, their infrequent scheduling (weekly or monthly) may overlook subtle, stochastic shifts in experimental quality.
A qualification experiment is an in-depth control experiment that tests the performance and health of an instrument. In the Cloud Lab, every qualification generates an automated report that can be easily compared to previous reports to give users confidence in their experiments. Each automated report is assigned a pass/fail grade that is confirmed by a human expert. If an instrument is passing its latest qualification test, it is “qualified” to run experimental samples.
Most qualification tests focus on verifying reproducible experimental outcomes; for HPLC, the tests target the autosampler, fraction collection, lineshape, and so on. For the most part (>90% of the time), air bubbles in the lines are a transient issue and do not cause any appreciable difference in the experimental result, based on what was tested in the qualification runs.
This approach appears to be more sensitive than the qualification experiments in detecting air bubble-associated issues (e.g. pump malfunctions). Incorporating this model into the instrument quality control pipeline will further enhance overall Cloud Lab performance: it enables us to “flag” all HPLC experiments that had bubbles in the lines, not just those with obvious negative impacts on the sample elution data. Overall, this will have two positive impacts for the lab: (1) increased reproducibility of elution times and peak widths, and (2) faster troubleshooting turnaround for HPLC.
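As a sketch of how such instrument-level screening could be implemented (the paper does not specify the statistics used), one could compare each instrument's affected fraction against the fleet-wide baseline with a binomial test; the column names, file, and thresholds below are all hypothetical:

```python
import pandas as pd
from scipy.stats import binomtest

# hypothetical export of per-experiment model predictions
df = pd.read_csv("hplc_predictions.csv")          # columns: instrument_id, bubble_prob
df["affected"] = df["bubble_prob"] > 0.5
baseline = df["affected"].mean()                  # fleet-wide affected fraction

for inst, grp in df.groupby("instrument_id"):
    test = binomtest(int(grp["affected"].sum()), n=len(grp),
                     p=baseline, alternative="greater")
    if test.pvalue < 0.01:                        # flag instruments above the baseline
        print(f"{inst}: affected fraction {grp['affected'].mean():.2%} exceeds baseline")
```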
Implementing the air bubble detection ML model reduces the learning curve for scientists new to HPLC by demystifying one of the major error modes. Researchers will no longer need months or years of experience analyzing various pressure traces alongside experimental outcomes to pinpoint failures. Instead, this model represents a step toward making complex experimentation accessible across disciplines and skill levels. For more experienced users who typically review key experimental parameters (such as column age, and standard and blank traces in a neat window) as part of their troubleshooting workflow, the ML model predictions are surfaced as the average bubble likelihood for an entire batch run—together with the minimum and maximum values—providing a rapid quality assessment.
In this study, we proposed a protocol-agnostic, instrument-agnostic, and, in principle, vendor-neutral framework for on-the-fly detection of common errors in High-Performance Liquid Chromatography (HPLC) experiments. Leveraging a Cloud Lab's rigorous management of all experimental data provides a foundation for adapting and generalizing our end-to-end, data-driven anomaly detection framework to address rarer types of errors as they accumulate over time.
The machine learning model developed in this study demonstrated strong performance in prospective validation across a diverse set of HPLC traces, achieving an accuracy of 0.96 and an F1 score of 0.92 in detecting HPLC traces affected by air bubble contamination, formulated as a binary classification problem. Furthermore, we provided a proof-of-principle demonstration of repurposing the ML model for validation of HPLC instruments based on systematic performance evaluation over a large set of experiments, which appeared to have higher sensitivity compared to individual control experiments.
Future development could enhance the feedback loop in Cloud Lab environments to operate at the level of individual experiments rather than post-batch analysis, enabling automatic retries for affected experiments. The successful application of our framework, using HPLC experiments as a case study, demonstrates its viability for both experiment and instrument validation, addressing key challenges in closed-loop experimental automation.
This study represents a significant advancement in the field of autonomous laboratory operations and has several far-reaching implications for scientific research. At its core, the work addresses a fundamental challenge in automated laboratories: the need for continuous quality control without human oversight. By developing a system that can detect experimental anomalies in real time, we have automated a task that traditionally required experienced human operators.
The implications for democratizing science are particularly noteworthy. The system significantly reduces the learning curve for scientists new to HPLC by automating the detection of common error modes. This means researchers no longer need months or years of experience to identify certain types of experimental failures, making complex experimentation more accessible across disciplines and experience levels.
Our approach to maintenance and quality control represents a significant improvement over traditional methods. It demonstrated higher sensitivity in detecting certain equipment issues compared to conventional qualification tests, thereby enabling proactive maintenance through early identification of systematic problems before they cause major failures. Unlike periodic checks, our method provides continuous quality monitoring.
Looking toward the future, this work opens possibilities for automatic retrial of failed experiments in real-time and represents a crucial step toward fully closed-loop experimental automation. The framework could be expanded to detect other types of experimental anomalies as more data accumulates in the Cloud Lab. From a practical standpoint, this work helps prevent waste of valuable samples, reduces equipment downtime through predictive maintenance, and potentially lowers operational costs by preventing failed experiments.
The present work marks an important milestone in making automated laboratories more reliable and accessible, while potentially reducing costs and enhancing research quality across scientific disciplines. As laboratories increasingly move toward automation and cloud-based operations, models like this will become essential for maintaining rigorous standards of scientific research while democratizing access to advanced analytical instrumentation.
The model development workflow (Fig. 2B) was organized into three iterative rounds of annotation. In Round 1 (the initial round), we obtained the initial dataset to start the Active Learning cycle. In Round 2, a classical ML modeling approach was employed to create an ensemble of models, allowing us to balance the selection of candidates for annotation between cases that were likely positive (specifically, class1_prob_mean > 0.5 and class1_prob_std < 0.1, yielding 287 candidate cases) and uncertain cases (specifically, class1_prob_std ≥ 0.1, yielding 213 candidate cases). This process—after expert evaluation—resulted in a pool of 567 experiments affected by air bubbles. In Round 3, we utilized Deep Learning modeling (see Methods for details), which led—after expert evaluation—to a final accumulation of 700 experiments affected by air bubbles.
At this stage, the pool of traces with high uncertainty (0.1 < class1_prob < 0.9) had decreased significantly compared with previous rounds, to 93, suggesting that no further rounds of annotation were required. The accumulated data were then used to train the final ML model for deployment, with an optimized architecture and hyperparameters (see S1). Model performance was measured through prospective experiment validation—to assess generalizability—and through prospective equipment validation.
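For illustration, the selection rules used in the annotation rounds can be expressed in a few lines of NumPy; here `probs` is a hypothetical array of class 1 probabilities from the model ensemble, one row per model and one column per trace:

```python
import numpy as np

def select_for_annotation(probs, mean_hi=0.5, std_lo=0.1):
    """Apply the two selection rules from the text (threshold values are the paper's)."""
    mean, std = probs.mean(axis=0), probs.std(axis=0)
    likely_positive = np.flatnonzero((mean > mean_hi) & (std < std_lo))
    uncertain = np.flatnonzero(std >= std_lo)
    return likely_positive, uncertain
```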
423). The majority of the data fell into the following three categories: (1) semi-preparative size-exclusion chromatography experiments separating small molecules away from target oligonucleotides; (2) preparative reverse-phase ion-pair chromatography experiments separating a desired oligonucleotide from a mixture of small molecules and undesired oligonucleotides; and (3) reverse-phase chromatography experiments aimed at small-molecule analysis, mainly for the purpose of qualifying and ensuring the health of the HPLC instruments (see Table S2 for detailed counts per chromatography type, instrument model, and manufacturer). Among these, 93 experiments were annotated by a human expert as being affected by air bubble contamination.
000 time steps, shorter than 100 time steps, or longer than 75 minutes were discarded, yielding a final set of 25 036 experiments. Next, we featurized the remaining experiments using the tsfresh29 default set of 783 features. Features that contained undefined values in any experiment were removed, reducing the feature set to 585. We then processed these features using Scikit-learn:31 (1) each feature was scaled using a MinMaxScaler; (2) features with a variance of less than 0.01 were filtered out, resulting in 122 features; and (3) a pairwise Spearman correlation matrix was computed, and for each pair of features with an absolute correlation greater than 0.9, only one was retained. This procedure resulted in a final processed dataset comprising 99 features suitable for classical machine learning modeling.
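A minimal sketch of this featurization pipeline follows, assuming the traces are held in a long-format pandas DataFrame with columns id, time, and pressure (the exact data layout is an assumption):

```python
import numpy as np
import pandas as pd
from tsfresh import extract_features
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold

# long_df: one row per time step, columns id / time / pressure
X = extract_features(long_df, column_id="id", column_sort="time",
                     column_value="pressure")
X = X.dropna(axis=1)                                   # drop features with undefined values

X[:] = MinMaxScaler().fit_transform(X)                 # scale each feature to [0, 1]
vt = VarianceThreshold(0.01)                           # drop near-constant features
X = pd.DataFrame(vt.fit_transform(X), index=X.index,
                 columns=X.columns[vt.get_support()])

corr = X.corr(method="spearman").abs()                 # prune one of each pair with |rho| > 0.9
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])
```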
Notably, before applying classical ML modeling, we tried several simple non-ML approaches, such as pressure-oscillation metrics, the time derivative of pressure, and signal-processing methods. All were deemed unsuitable for the project because of their lack of transferability and extensibility.
Since the classification task is sensitive to class imbalance, and because of the use of the SNA framework, we evaluated three class ratio variants: 1:1, 1:10, and 1:100. The positive class was represented by the 93 experiments initially annotated as affected by air bubble contamination, while negative class examples were randomly sampled from the initial dataset according to the desired class ratio. The Random Forest algorithm was used for classification. The dataset was split using StratifiedKFold into 5 folds, and hyperparameters ({‘max_depth’: [2, 4, 8, 16, 32, 64, None], ‘n_estimators’: [10, 25, 50, 100, 250, 500, 1000], ‘max_features’: [‘auto’, ‘sqrt’, ‘log2’]}) were optimized using GridSearch in an inner cross-validation loop with the F1 score as the objective function. The model with the best hyperparameters was then fitted on the whole training fold of the outer 5-fold cross-validation loop.
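The nested cross-validation described above can be sketched with scikit-learn as follows, with X the 99-feature matrix and y the 0/1 labels (assumed NumPy arrays); shuffling and random seeds are assumptions, and ‘auto’ for max_features is omitted because it was removed in recent scikit-learn releases:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

param_grid = {
    "max_depth": [2, 4, 8, 16, 32, 64, None],
    "n_estimators": [10, 25, 50, 100, 250, 500, 1000],
    "max_features": ["sqrt", "log2"],  # 'auto' removed in scikit-learn >= 1.3
}
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
models = []
for train_idx, test_idx in outer.split(X, y):
    inner = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                         scoring="f1", cv=5)          # inner loop optimizes F1
    inner.fit(X[train_idx], y[train_idx])
    models.append(inner.best_estimator_)              # 5-model ensemble for AL screening
```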
Based on model performance (Fig. S1), a 1:10 class ratio was considered optimal (forming a training set of 1023 traces: 93 affected + 930 SNA-sampled unaffected) and was used for all other modeling stages. For Round 1, an ensemble of 5 models obtained for the 1:10 class ratio was used.
The model was trained using binary cross-entropy loss (BCELoss) with the Adam optimizer. The initial learning rate was treated as a hyperparameter, and a learning rate scheduler (ReduceLROnPlateau: factor 0.5, patience 20) was employed. The model was trained for up to 500 epochs, with early stopping based on validation loss (patience: 150) and a batch size of 100. Performance metrics—including F1 score, accuracy, precision, and recall—were tracked across the training, validation, and test splits.
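A minimal PyTorch training loop matching these settings might look like the sketch below, assuming the BubbleCNN from the earlier sketch and float tensors X_train/X_val with labels y_train/y_val:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

criterion = torch.nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # lr was a tuned hyperparameter
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=20)

best_val, patience, since_best = float("inf"), 150, 0
for epoch in range(500):
    model.train()
    for xb, yb in DataLoader(TensorDataset(X_train, y_train), batch_size=100, shuffle=True):
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(X_val), y_val).item()
    scheduler.step(val_loss)                 # halve lr after 20 stagnant epochs
    since_best = 0 if val_loss < best_val else since_best + 1
    best_val = min(best_val, val_loss)
    if since_best >= patience:               # early stopping on validation loss
        break
```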
The model's hyperparameters (and architecture) were fixed for Round 2 and later optimized for deployment using Optuna30 by maximizing the validation F1 score with an optimization budget of 2000 trials.
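With Optuna, such an optimization can be sketched as below; the search space and the train_and_evaluate helper are hypothetical stand-ins for the full architecture and hyperparameter search:

```python
import optuna

def objective(trial):
    # illustrative search space; the paper also optimized the architecture itself
    lr = trial.suggest_float("lr", 1e-4, 1e-2, log=True)
    n_channels = trial.suggest_categorical("n_channels", [8, 16, 32, 64])
    kernel_size = trial.suggest_categorical("kernel_size", [3, 5, 7, 9])
    model = BubbleCNN(n_channels=n_channels, kernel_size=kernel_size)
    return train_and_evaluate(model, lr)  # hypothetical wrapper returning validation F1

study = optuna.create_study(direction="maximize")  # maximize validation F1
study.optimize(objective, n_trials=2000)           # the paper's 2000-trial budget
print(study.best_params)
```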
For Round 3 annotation using the CNN, the dataset was constructed as follows. From the Round 2 and initial sets, 567 traces annotated as affected by air bubbles were combined with 5670 SNA samples randomly drawn from the initial dataset—preselected using an Upper Confidence Bound (UCB) threshold of < 0.05 (where UCB is defined as class1_prob_mean + class1_prob_std from the Round 1 RandomForest ensemble). The combined dataset—forming the dataset for Round 3—was then split into training, testing, and validation sets in an 80:10:10 ratio with class stratification.
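In code, the UCB preselection reduces to two lines over the ensemble probabilities (continuing the hypothetical probs array from the selection-rule sketch above):

```python
ucb = probs.mean(axis=0) + probs.std(axis=0)   # per-trace UCB over the RF ensemble
sna_pool = np.flatnonzero(ucb < 0.05)          # safe negatives available for SNA sampling
```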
For training the CNN intended for deployment to the Cloud Lab, we used 700 traces annotated as affected by air bubbles (acquired by the end of Round 3), 261 traces annotated as normal, and SNA examples sampled from the initial dataset (to accumulate 7000 normal traces in total, preserving the desired class ratio)—preselecting those with a Round 3 ML model predicted class 1 probability < 0.05—to construct the deployment training dataset (Table S1 and Fig. S2 for further details). This dataset was split into training, testing, and validation sets in an 80:10:10 ratio with class stratification. Finally, the trained model was converted to ONNX to ensure native compatibility with the Wolfram Mathematica-based backend of the Cloud Lab.
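A sketch of the 80:10:10 stratified split and the ONNX export follows; the labels array and the example trace length are assumptions:

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

# labels: 0/1 array over the deployment dataset (assumed to exist)
indices = np.arange(len(labels))
idx_train, idx_rest, y_train, y_rest = train_test_split(
    indices, labels, test_size=0.2, stratify=labels, random_state=0)
idx_val, idx_test, _, _ = train_test_split(         # split the remainder 50/50 -> 10% each
    idx_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# export the trained network for the Wolfram Language-based backend
dummy = torch.randn(1, 1, 3000)                     # example trace length only
torch.onnx.export(model, dummy, "bubble_cnn.onnx",
                  input_names=["pressure"], output_names=["class1_prob"],
                  dynamic_axes={"pressure": {0: "batch", 2: "time"}})
```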
Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00253b.