Sitanan Sartyoungkul,ab Balasubramaniyan Sakthivel,a Pavel Sidorov*a and Yuuya Nagata*abc
aInstitute for Chemical Reaction Design and Discovery (WPI-ICReDD), Hokkaido University, Sapporo, Hokkaido 001-0021, Japan
bJST, ERATO Maeda Artificial Intelligence in Chemical Reaction Design and Discovery Project, Sapporo, Hokkaido 060-0810, Japan
cAutonomous Polymer Design and Discovery Group, Research Center for Macromolecules and Biomaterials, National Institute for Materials Science (NIMS), Tsukuba, Ibaraki 305-0047, Japan
First published on 26th November 2025
The integration of automated synthesis and machine learning (ML) is transforming analytical chemistry by enabling data-driven approaches to method development. Chromatographic column selection, a critical yet time-consuming step in separation science, stands to benefit substantially from such advances. Here, we report a workflow that combines automated synthesis of a structurally diverse amide library with fragment descriptor-based ML for retention time prediction in supercritical fluid chromatography (SFC). Retention data were systematically acquired on the recently developed DCpak® PBT column, providing one of the first structured datasets for this stationary phase. Benchmarking revealed that fragment-count descriptors (ChyLine and CircuS) substantially outperformed conventional molecular fingerprints, delivering higher predictive accuracy and more interpretable relationships between substructures and retention behavior. External validation underscored the role of chemical space coverage, while visualization techniques such as ColorAtom analysis offered mechanistic insight into model decisions. By uniting automated synthesis with chemoinformatics-driven ML, this study demonstrates a scalable approach to generating high-quality training data and predictive models for chromatography. Beyond retention prediction, the framework exemplifies how data-centric strategies can accelerate column characterization, reduce reliance on trial-and-error experimentation, and advance the development of autonomous, high-throughput analytical workflows.
In recent years, artificial intelligence (AI) and machine learning (ML) have attracted considerable attention for their predictive capabilities across various scientific disciplines, including analytical chemistry.4–6 In liquid chromatography (LC), AI and ML have emerged as powerful tools for retention time prediction, enabling faster, more accurate, and more efficient chromatographic method development. Furthermore, supercritical fluid chromatography (SFC) has gained increasing attention due to its ability to provide even faster analyses, and its adoption has been expanding rapidly.7,8
Despite these advancements, the adoption of newly developed chromatography columns remains challenging for analytical chemists, as their separation characteristics are often unknown. Consequently, trial-and-error experimentation with unfamiliar columns can be impractical and time-consuming. To address this issue, we propose a machine learning model capable of predicting retention times based on molecular structures, thereby providing analytical chemists with valuable insights into the separation characteristics of new columns and facilitating their selection and use.
In this study, we employed an automated synthesis robot to rapidly generate a diverse set of amide compounds with varying molecular structures. Retention times were measured using an SFC system, and a machine learning model was developed to predict retention times from molecular structures. Furthermore, the relationship between molecular substructures and retention times was explored through visualization and is discussed below.
Subsequently, the automated synthesis of various amide compounds was carried out through condensation reactions between the selected amines and carboxylic acids (Fig. 1 and Table 1). Tetrahydrofuran (THF) solutions of the eight selected carboxylic acid derivatives and dichloromethane (DCM) solutions of the eight selected amines were prepared. A dichloromethane solution of 4-dimethylaminopyridine (DMAP) and 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide hydrochloride (EDC·HCl) was then added, and the mixtures were stirred at 40 °C for 8 hours to synthesize the amide compounds. After the reaction was complete, 0.1 mol L−1 aqueous HCl was added, and the mixture was shaken. The organic layer was separated using a phase separation filter, collected, and diluted with a heptane/2-propanol (50/50) mixture to prepare the chromatography sample solutions. For the synthesis of compound 7b, a catalytic amount of hydroxybenzotriazole (HOBt) was additionally employed under otherwise identical conditions. Chromatographic analysis was carried out on a Daicel DCpak® PBT column9 (3 µm, 4.6 mm i.d. × 100 mm, fully porous particles) with supercritical CO2 and 2-propanol (90:10, v/v) as the mobile phase at a flow rate of 2.0 mL min−1. The column temperature was maintained at 40 °C. Samples (1 mg mL−1 in n-hexane/2-propanol) were injected at a volume of 5 µL. Detection was performed using a two-dimensional photodiode array detector, and one-dimensional chromatograms were obtained at 220.0 nm. For samples that eluted at retention times close to that of the sample solvent, chromatograms were compared with those of other previously measured samples to identify solvent-derived peaks, and the analyte retention times were determined accordingly. When the reaction did not proceed to completion, chromatograms of the starting materials were measured, and the newly appeared peak not derived from the starting materials was assigned to the amide product.
|   | a | b | c | d | e | f | g | h |
|---|---|---|---|---|---|---|---|---|
| 1 | 148.0 | 97.8 | 130.4 | 201.6 | 149.2 | 81.6 | 174.8 | 88.0 |
| 2 | 132.6 | 105.6 | 131.8 | 240.6 | 150.6 | 77.0 | 180.8 | 86.6 |
| 3 | 76.0 | 61.8 | 66.2 | 103.0 | 78.8 | 56.4 | 88.0 | 59.6 |
| 4 | 220.2 | 133.4 | 179.4 | 315.6 | 221.1 | 102.8 | 224.4 | 107.8 |
| 5 | 132.2 | 104.8 | 130.4 | 228.2 | 149.4 | 98.0 | 168.6 | 110.2 |
| 6 | 381.0 | 238.2 | 300.0 | 654.4 | 408.8 | 158.6 | 488.6 | 183.8 |
| 7 | 504.4 | 262.0 | 371.8 | 719.8 | 503.2 | 179.2 | 534.0 | 193.2 |
| 8 | 161.0 | 107.6 | 168.8 | 260.4 | 185.6 | 80.6 | 182.4 | 87.6 |

a Retention time tR (s), 220 nm, t0 ∼ 39.8 s. Row labels 1–8 and column labels a–h are placeholders for the reactant structures depicted in the original table headers.
Here, we employed the DCpak® PBT column, a silica gel column modified with poly(butylene terephthalate) (PBT). This column was developed relatively recently, and its use remains limited. The retention times of the 64 synthesized amide compounds are summarized in Table 1.
In general, compounds containing aromatic rings tended to exhibit strong retention, whereas those with alkyl chains showed shorter retention times. However, interpreting the column characteristics intuitively based solely on this retention time table is challenging. Therefore, based on these results, we attempted to develop a machine learning model to predict retention times from molecular structures.
The ML model for the prediction of retention time was built following best practices in QSPR modelling.10 In this work, we chose structural descriptors to represent the molecules, as structure is the most relevant information in our dataset. Widely used molecular fingerprints (FP) – binary vectors indicating the absence or presence of certain structural features – were selected for their simplicity.11 We used Morgan FP12 (capturing circular substructures), RDKit FP13 (circular, linear, and branched substructures), AtomPairs14 (pairs of atoms with the topological distance between them), Torsion15 (substructures consisting of 4 connected atoms defining torsion angles) and Avalon16 (various drug-likeness features). The binary nature of FPs, however, limits their expressiveness and may lead to lower model performance. To circumvent this, we also used fragment features that not only account for the presence of a certain substructure but also count its occurrences in each molecule, enriching the information content of the descriptor vector. Two types of fragment descriptors were used – CircuS (Circular Substructures) for circular fragments and ChyLine (Chython Linear) for linear substructures. Both fragment descriptors were calculated using the DOPtools library (ver. 1.2),17 and all fingerprints using RDKit (ver. 2024.9.6). Each descriptor type generates a number of features for the dataset: for fingerprints, the length of the feature vector was set to 1024; for fragment counts, the number varies depending on the fragment topology and size. The calculated descriptor matrices for each setting are available in the SI.
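As an illustration, the fingerprint part of this featurization can be reproduced with a few lines of RDKit. The snippet below is a minimal sketch using a hypothetical amide SMILES; the CircuS/ChyLine fragment counts require DOPtools and are not reproduced here.

```python
# Minimal sketch: 1024-bit fingerprints with RDKit for one (hypothetical) amide.
# The fragment-count descriptors (CircuS, ChyLine) from DOPtools are not shown.
from rdkit import Chem
from rdkit.Chem import AllChem, rdMolDescriptors
from rdkit.Avalon import pyAvalonTools
import numpy as np

smiles = "O=C(Nc1ccccc1)c1ccccc1"  # example amide (N-phenylbenzamide), not from the study
mol = Chem.MolFromSmiles(smiles)

morgan = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=1024)
rdkit_fp = Chem.RDKFingerprint(mol, fpSize=1024)
atom_pairs = rdMolDescriptors.GetHashedAtomPairFingerprintAsBitVect(mol, nBits=1024)
torsion = rdMolDescriptors.GetHashedTopologicalTorsionFingerprintAsBitVect(mol, nBits=1024)
avalon = pyAvalonTools.GetAvalonFP(mol, nBits=1024)

X_row = np.array(morgan)  # 1024-dimensional binary feature vector for one molecule
print(X_row.sum(), "bits set out of", X_row.size)
```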
The best descriptor type was selected in a benchmarking study. It was performed using the DOPtools library, and the following parameters were optimized: (1) descriptor space – only one type of descriptor was used at a time by each model; (2) ML algorithm – Support Vector Machines (SVM),18 Random Forest (RF)19 and XGBoost (XGB)20 were tested in a regression setting; (3) ML hyperparameters, depending on the algorithm. The models were scored on the predictions of a repeated 5-fold cross-validation (CVk=5). The determination coefficient (R2) and root mean squared error (RMSE) were used to quantify model quality: R2 = 1 − Σi(yi − ŷi)2/Σi(yi − ȳ)2 and RMSE = [Σi(yi − ŷi)2/n]1/2, where yi and ŷi are the experimental and predicted values, ȳ is the mean experimental value, and n is the number of data points.
The following Python libraries were used for data processing and calculations: Chython (ver.1.78),21 RDkit (ver.2024.9.6), DOPtools (ver.1.2), Scikit-learn (ver.1.5),22 Optuna (ver.3.6).23 Other libraries were installed as dependencies to the latest available versions.
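For orientation, a minimal sketch of the cross-validation scoring step is given below. The descriptor matrix and target vector are stand-ins, and the hyperparameter values are placeholders; the actual search in the study was carried out with Optuna through the DOPtools pipeline.

```python
# Minimal sketch of the benchmarking step (not the DOPtools pipeline): repeated
# 5-fold cross-validation of SVM, RF and XGBoost regressors scored by R2 and RMSE.
import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(64, 200)).astype(float)  # stand-in for a fragment-count matrix
y = rng.normal(size=64)                               # stand-in for ln k values

models = {
    "SVM": SVR(C=10.0, epsilon=0.1),                               # placeholder hyperparameters
    "RF": RandomForestRegressor(n_estimators=500, random_state=0),
    "XGB": XGBRegressor(n_estimators=500, learning_rate=0.05, random_state=0),
}

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)        # repeated 5-fold CV
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv,
                            scoring=("r2", "neg_root_mean_squared_error"))
    print(f"{name}: R2 = {scores['test_r2'].mean():.2f}, "
          f"RMSE = {-scores['test_neg_root_mean_squared_error'].mean():.2f}")
```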
Yet, the retention time itself depends not only on the chemical structure but also on the experimental setup and conditions. To eliminate the effect of changes in column dimensions and eluent flow rate, we then selected the retention factor as the modelled property. The retention factor (k) is given by (tR − t0)/t0, where tR is the analyte retention time and t0 is the column dead time. Considering the range of the values, we also transformed the retention factor to a logarithmic scale (ln k) to reduce the effect of the value range on the prediction error. As Fig. 2 shows, the fragment descriptors again showed the best performance in cross-validation, although the performance was excellent across the board. For this property, the fingerprints still have difficulties predicting values in the lower and higher ranges. Since the ln k models built on fragments showed the best performance, only these are discussed further.
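As a worked example of this transformation, using the first entry of Table 1 and the reported dead time:

```python
# Worked example of the modelled property for the first entry of Table 1
# (tR = 148.0 s) with the reported dead time t0 ≈ 39.8 s.
import math

t_R, t_0 = 148.0, 39.8           # seconds
k = (t_R - t_0) / t_0            # retention factor, ~2.72
ln_k = math.log(k)               # modelled property, ~1.00
print(f"k = {k:.2f}, ln k = {ln_k:.2f}")
```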
The best models for ln k were then applied to a series of molecules from external sources to verify the chemical space coverage of the models. The compounds used here (1x–12x) included various amide compounds with structures relatively similar to those used in model training, as well as arbitrarily selected compounds with completely dissimilar structures. The results of the predictions are shown in Fig. 3. As the figure shows, both models struggle with this test set, with the RMSE being over 1 compared to 0.12 in cross-validation. However, this high prediction error is due to two main factors.
First, there are several notable outliers for both models, especially compounds 10x and 12x. If these outliers are removed, the statistical scores of the models improve significantly. Moreover, outside of these outliers, the model built on ChyLine descriptors performs quite well across most of the range of ln k values. The CircuS model, on the other hand, shows more restricted coverage.
Second, the chemical space of the test set is quite different from that of the training set. To begin with, not all molecules are amides, although amides are the main target of the model. The special cases are the aforementioned compounds 10x and 12x: the former (9,10-diphenylanthracene) is a polycyclic aromatic compound, and the latter (1,4-bis(trimethylsilyl)benzene) contains trimethylsilyl groups that lie completely outside of the initial chemical space. One can also interpret these errors using the ColorAtom methodology,24 which assigns atomic contributions to predictions and colors the atoms according to their importance. Fig. 4 shows the ColorAtom interpretations of the ChyLine model's predictions for the outliers. Indeed, for compound 10x, the aromatic groups show a positive contribution, i.e., they increase the retention time, as would be expected. However, because of the high number of these groups compared to the training set, the model overestimates ln k, which leads to a high prediction error. On the other hand, the silyl groups in compound 12x are completely ignored by the model, and their contribution cannot be correctly estimated. Similar observations can be made for the other outliers, where the contributions of some groups are over- or underestimated.
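For intuition only, the sketch below shows one generic way to approximate such atomic contributions for a count-based model, by decrementing fragment counts one at a time and redistributing the prediction change over the atoms covered by each fragment. This is an illustrative assumption and not the ColorAtom implementation provided by DOPtools; the fragment-to-atom mapping is assumed to come from the fragmentation step.

```python
# Conceptual sketch of atom-level attribution for a fragment-count model
# (illustrative only; not the ColorAtom algorithm from DOPtools).
import numpy as np

def atom_contributions(model, x, fragment_atoms, n_atoms):
    """x: 1D fragment-count vector for one molecule.
    fragment_atoms: dict mapping descriptor column index -> atom indices it covers."""
    base = model.predict(x.reshape(1, -1))[0]
    contrib = np.zeros(n_atoms)
    for j, atoms in fragment_atoms.items():
        if x[j] == 0:
            continue
        x_mod = x.copy()
        x_mod[j] -= 1  # remove one occurrence of fragment j
        delta = base - model.predict(x_mod.reshape(1, -1))[0]
        for a in atoms:
            contrib[a] += delta / len(atoms)
    return contrib  # positive values -> atoms that increase the predicted ln k
```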
It could also be assumed that the compounds of the test set are outside of the applicability domain (AD)25 of the training set. Indeed, when estimating the Fragment Control (FC)26 AD, which excludes compounds possessing new fragments, and the Bounding Box (BB)27 AD, which excludes compounds whose descriptor values lie outside the range of the training set, all compounds of the test set would be considered outside of the AD, although these are very strict definitions (see details in the SI). To demonstrate that the AD of the model is not extremely restrictive, we performed an additional validation by setting aside a random portion of the training set as an external test set and repeating the optimization and validation process on the remaining data. The predictions for these sets are excellent, which shows that the model works well on external amide data, as expected (all details are presented in the SI).
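The two AD definitions mentioned above are simple enough to express in a few lines; the sketch below is a minimal illustration for count-based descriptor matrices, not the implementation used in the study.

```python
# Minimal sketch of the two applicability-domain checks described above,
# for count-based descriptor matrices (rows = compounds, columns = features).
import numpy as np

def bounding_box_ad(X_train: np.ndarray, X_test: np.ndarray) -> np.ndarray:
    """Bounding Box AD: True if every descriptor value of the test compound
    lies within the [min, max] range observed in the training set."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return np.all((X_test >= lo) & (X_test <= hi), axis=1)

def fragment_control_ad(X_train: np.ndarray, X_test: np.ndarray) -> np.ndarray:
    """Fragment Control AD: True unless the test compound contains a fragment
    (non-zero count) never observed in the training set."""
    unseen = X_train.max(axis=0) == 0
    return ~np.any(X_test[:, unseen] > 0, axis=1)
```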
We benchmarked a range of molecular descriptors and machine learning algorithms, showing that fragment-count-based descriptors (ChyLine and CircuS) substantially outperformed traditional molecular fingerprints in cross-validated prediction of both raw retention times and logarithmic retention factors (ln k). These fragment descriptors provided richer, more quantitative representations of the structural features that correlate with chromatographic behavior, especially for compounds with repeating or aromatic substructures that drive retention on the PBT column.
External validation using structurally diverse test compounds highlighted important limitations of model extrapolation, with notable prediction errors for molecules well outside the training set's chemical space. Nonetheless, interpretation methods such as ColorAtom analysis clarified the origins of prediction errors, confirming that the model's learned relationships remain chemically meaningful within its applicability domain. Moreover, controlled experiments excluding subsets of the training data demonstrated robust predictive performance for amide structures within the expected chemical space.
Overall, our approach shows that machine learning models trained on systematically designed reaction libraries can provide accurate, interpretable predictions of SFC retention times for new columns. This can reduce the need for trial-and-error experimentation, accelerate method development, and improve column selection workflows. Future work will expand the training data to broader chemical classes and columns, refine applicability domain estimation, and integrate these predictive tools into automated analytical workflows for high-throughput chromatography.
Supplementary information (SI): experimental data and code for reproducing the modelling results. See DOI: https://doi.org/10.1039/d5dd00437c.