Open Access Article
Jinglong
Lin†
a,
Longyin
Song†
b,
Yuntian
Chen
c,
Chengchun
Liu
a,
Shufeng
Chen
*b and
Fanyang
Mo
*ade
aSchool of Materials Science and Engineering, Peking University, Beijing, 100871, P. R. China. E-mail: fmo@pku.edu.cn
bSchool of Chemistry and Chemical Engineering, Inner Mongolia University, Hohhot, 010021, P. R. China. E-mail: shufengchen@imu.edu.cn
cNingbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, Zhejiang 315200, P. R. China
dSchool of Advanced Materials, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
eAI for Science (AI4S)-Preferred Program, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
First published on 29th July 2025
Multimodal learning, a key machine learning (ML) approach, has been extensively applied in fields such as medical diagnostics and recommendation systems. The complexity of chemical data offers unique opportunities for multimodal learning, though its application in chemistry remains underexplored. Here, we propose an innovative multimodal framework for gas chromatography (GC) that integrates a geometry-enhanced graph isomorphism network and gated recurrent units. This framework predicts GC retention time across diverse molecular heating profiles with a test set R2 of 0.995, outperforming traditional ML methods. It effectively recommends optimal chromatographic conditions for separating positional isomers and cis/trans isomers, minimizing experimental iterations and significantly improving analytical efficiency. Moreover, the model provides insights into the separation challenges of various isomers, enhancing understanding of the relationship between molecular structure and chromatographic behavior. This approach could pave the way for broader applications of multimodal learning in chemistry.
GC is a vital technique for separating and analyzing complex mixtures,6–8 now extensively used in fields such as chemical analysis,9 pharmaceutical testing,10 and environmental monitoring.11 The RT in GC, referring to the duration a specific component stays within the chromatographic column, is one of the key parameters for compound identification. Researchers have proposed various theories,12 including thermodynamic and kinetic models, to explain analyte behavior in stationary and mobile phases and to account for RT variations under different chromatographic conditions. However, these theories rely on idealized assumptions and simplified conditions, making them challenging to apply in practical research. Therefore, there is an urgent need to develop predictive models for GC-RT to rapidly determine chromatographic conditions and enhance analytical efficiency. Recently, researchers have used ML to predict GC retention indices (GC-RI)13–16 and have combined ML with techniques like Gas Chromatography-Mass Spectrometry (GC-MS)17–19 and Gas Chromatography-Ion Mobility Spectrometry (GC-IMS).20 However, there is currently a lack of predictive models for GC-RT. Our group has primarily focused on predicting RT in thin-layer chromatography (TLC)21 and high-performance liquid chromatography (HPLC).22
The factors influencing GC-RT include molecular properties, temperature conditions, and stationary phases (Fig. 1a). The multimodal learning framework is particularly suited to address these complexities. The multimodal framework developed in this work can be utilized for rapid virtual screening and recommendation under GC temperature programming conditions, identification of internal standard peak positions, and analysis of complex mixtures (Fig. 1b). Our contributions can be summarized as follows:
1. We established a Gas Chromatography Retention Time (GCRT) dataset, covering 250 compounds, 244 types of temperature programs, and 3950 retention time data. We developed a threshold filtering method, GC-PIEA, that can extract target peak information from GC chromatogram data in PDF format in batches automatically.
2. We constructed three conventional ML models and explored model interpretability through correlation and feature importance analysis, thereby validating and uncovering physicochemical principles affecting GC-RT from a statistical perspective.
3. We constructed a multimodal model where a geometry-enhanced graph isomorphism network processes molecular information and a Bidirectional GRU handles temperature curve data. This model can accurately predict the GC-RT of target molecules under different temperature programs, with a test set R2 reaching up to 0.995, surpassing conventional ML models. The multimodal model demonstrates substantial robustness and generalization capabilities. Even with Gaussian noise at 40%, the R2 value on the test set remains 0.906. Furthermore, the model achieves an R2 of 0.900 in predicting entirely novel compounds under nonlinear temperature programs.
4. We introduced the concept of the isomer separation degree, enabling our multimodal model to rapidly perform virtual screening. This approach identifies optimal temperature programs for positional and cis/trans isomers, ensuring the fastest detection peaks while maintaining separation. Search algorithms can be utilized to explore the separation challenges of various isomers, providing new chemical insights and enhancing the understanding of the relationship between molecular structure and chromatographic behavior. This method advances chromatographic research, allowing other scientists to apply it to their isomer studies of interest.
5. We proposed a strategy that utilizes GC-RI and temperature programs to predict GC-RT, and provided an ANN pre-trained model. Other researchers can fine-tune this model on their own tasks, thereby achieving universal RT prediction.
The sample preparation process was precise and meticulous. Initially, 0.5 mmol of the sample was dissolved in 1 mL of ethyl acetate. Then, approximately 0.05 mL of this solution was extracted using a disposable pipette and dropped into a dedicated sample vial, which was subsequently diluted to the mark with 1 mL of ethyl acetate. The sample was injected in a volume of 1 μL, employing a 20
:
1 split injection mode, with the injector temperature set at 250 °C.
The GC chromatogram data obtained using specific devices and methods have been saved in PDF format. To efficiently extract and batch-process key information from these PDF-format chromatograms, we have developed a new algorithm: Gas Chromatography Peak Information Extraction Algorithm (GC-PIEA), as detailed in Section 1 of the SI Materials. This algorithm first converts the text in the chromatograms into strings, then filters data by setting a threshold for peak area, thereby accurately locating the position of the target peak and extracting key parameters such as RT, peak height, and peak area.
The temperature program is originally a time-series data set (T–t). We have processed the temperature programs of all data points into series of 32 time steps each, corresponding to the temperatures for 1 to 32 minutes. Furthermore, to enhance the data's usability, molecular information is stored in SMILES format. Based on these methods, we have constructed the Gas Chromatography Retention Time (GCRT) dataset. This dataset comprises 3950 RT data points for 250 different compounds under various temperature programs, with the distribution of compounds illustrated in Fig. 1c and the dataset construction process shown in Fig. 1d.
:
1 and utilized 10-fold cross-validation on the training set to determine the optimal hyperparameters. For the ANN model, we designed a structure with two hidden layers and incorporated batch normalization and dropout techniques (see Section 2 of the SI Materials for details), setting a random split for the training, validation, and test set ratio of 8
:
1
:
1. The model's input features total ten dimensions (Fig. 2a), six of which are based on the physicochemical properties of molecules, covering aspects such as polarity and hydrophobicity. These features include Molecular Weight (MW), Topological Polar Surface Area (TPSA), Number of Hydrogen Bond Acceptors (HBA), Number of Hydrogen Bond Donors (HBD), Number of Rotatable Bonds (NROTB), and lipid-water partition coefficient (Log
P). These features were selected due to their potential correlation with the compounds' RT in GC. For instance, MW influences boiling points, thereby affecting the migration speed and volatility of compounds within the chromatographic column; Log
P is a key descriptor for assessing the hydrophobicity of compounds, which influences their hydrophobic interactions with non-polar stationary phases. All these descriptors were calculated using the RDKit library in Python. The remaining four dimensions are temperature program features, including initial temperature, final temperature, heating rate, and duration of initial temperature, all of which directly impact the analysis outcome in GC. In this study, the efficacy of feature engineering was corroborated from a modeling perspective, as evidenced by the exemplary performance on the test set of three ML models (Fig. 2b): RF (R2 = 0.950), LGB (R2 = 0.919), and ANN (R2 = 0.925).
To further investigate the predictive performance of conventional ML models on unknown compounds, we employed the compound split method (Fig. S1). The compound split method refers to dividing a dataset based on the SMILES of compounds, ensuring that compounds in the test set are never seen in the training set. This ensures that the model's evaluation is based on entirely novel chemical entities, enhancing the robustness and generalizability of the results. For the applications of RF and LGB, we set the ratio of training to test sets at 9
:
1; for the ANN, the data were divided into training, validation, and test set with respective ratios of 8
:
1
:
1. The models' performance on the test sets can be found in Section 4 of the SI Materials, under Fig. S2, where the R2 values are 0.878 for RF, 0.885 for LGB, and 0.852 for ANN.
In our investigation into model interpretability, we began by computing the feature correlation matrix (Fig. 2c), uncovering significant positive correlations between HBA, HBD, and TPSA, as well as between Log
P and MW. Furthermore, the evident negative correlations between TPSA, HBA, and Log
P, along with near-zero correlation coefficients among other features, highlighted the independence among features. In the following analysis, we focus on the assessment of feature importance. Given the superior performance of the RF model among the three conventional ML approaches, we employed the GCRT dataset and RF model to compute the mean absolute SHapley Additive exPlanations (SHAP) values of various molecular descriptors, thereby quantifying the importance of each feature. The normalized importance ratios of these features are illustrated in Fig. 2d. The concept of SHAP values originates from Shapley values in game theory, which are a cooperative game theory tool designed to fairly allocate the contributions of each participant to a collective outcome. In this study, each molecular descriptor was considered a “player”. The definition of Shapley values is as follows:
![]() | (1) |
P (24.5%), and TPSA (15.7%) are the three features with the most significant impact on the RT. This finding underscores the key role of a molecule's molecular weight, hydrophobicity, and polarity in influencing RT. Subsequently, we conducted an analysis of the correlation between molecular descriptors and RT, specifically using the Spearman rank correlation coefficient (ρ) to measure the relationship between each feature and RT (Fig. 2e). The ρ value ranges from [-1, 1], where ρ > 0 indicates a positive correlation, and ρ < 0 indicates a negative correlation. The analysis reveals that MW and Log
P show a moderate positive correlation with RT (ρ > 0.3), and their p-values are significantly less than 0.001, lending statistical significance to our findings and further validating the reliability of our discoveries. All these findings are derived entirely from a data-driven analysis.
We provided a chemical rationale for these findings: molecules with higher MW typically exhibit higher boiling points due to stronger van der Waals forces among them, requiring more energy to overcome these forces and transition into the gas phase. This implies that compounds with larger MW tend to linger longer in the stationary phase, as they are more challenging to be transported by the carrier gas, thereby resulting in extended RT. The study utilized a 5% phenyl–95% dimethylpolysiloxane chromatography column, offering an analytical environment ranging from nonpolar to medium polarity. Molecules with higher Log
P values, due to their stronger hydrophobicity, are more likely to interact with the stationary phase, leading to an increase in RT. The positive correlation between Log
P and MW (correlation coefficient of 0.62) further reveals the synergistic effect of Log
P and MW in prolonging the RT of molecules.
Molecules are naturally represented as graphs, with atoms as nodes and chemical bonds as edges, making them particularly suited for processing with graph neural networks. Previous researchers have conducted extensive work in the field of molecular graph representation learning.23–26 In the molecular processing branch, we utilize a Geometry-enhanced Graph Isomorphism Network (GeoGIN) framework. Two graphs comprehensively represent molecular information: the atom-bond graph, showing the connections between atoms, and the bond-angle graph, which includes information on the molecule's 3D conformation. The embeddings of nodes and edges for the atom-bond graph and the bond-angle graph are illustrated in Fig. 3a. In the atom-bond graph, node features include formal charge, implicit valence, hybridization, and aromaticity of atoms, among others, while edge features encompass chemical bond types and ring presence, among others. In the bond-angle graph, in addition to calculating bond angles, edge features also embed molecular descriptor information, such as the TPSA and Relative Polar Surface Area (RPSA), merging spatial information with physicochemical properties. Given that chromatographic RT prediction is a structure-sensitive task, we used the Graph Isomorphism Network (GIN) as the molecular encoder to extract features of the graph structure. The key idea of GIN is to distinguish different molecular graph structures more precisely through its message passing and aggregation mechanism, even when these structures appear similar in conventional graph neural networks (like GCN or GAT). In our designed architecture, both atom-bond and bond-angle graphs are processed deeply using five-layer GIN. Through this approach, we have conducted meticulous feature extraction for each graph. Upon completion, sum pooling is employed to aggregate the extracted features, obtaining the embedding representation of the graphs. Subsequently, the embedding representations of the atom-bond and bond-angle graphs are fused to generate a 128-dimensional feature vector. This feature vector accurately captures the structural information and chemical properties of the molecule, laying a solid foundation for subsequent analysis and applications.
In the experimental conditions processing branch, we employ a Bidirectional GRU (Bi-GRU) to handle the temperature programs. Bi-GRU, a lightweight recurrent neural network, is highly effective for extracting series features due to its ability to capture temporal dependencies in data. In our study, by applying average pooling operations along the temporal dimension to the hidden state vectors extracted by Bi-GRU, we successfully obtained a vectorized representation of the temperature program. Subsequently, we employed an attention-weighted fusion strategy that dynamically determines the feature weights of two separate branches: the feature weight for the Bi-GRU branch is denoted as ω1, while the feature weight for the GeoGIN branch is represented as ω2. Through this method, we obtained a 128-dimensional fused feature vector that integrates information from both the temperature program and molecular graphs. Considering the differences in feature distributions extracted by Bi-GRU and GeoGIN, we utilized two residual blocks to further enhance the fusion of these two types of features. The use of residual blocks aims to reduce information loss while increasing the model's adaptability to and integration of feature differences. This step ensures that the fused features more comprehensively represent the attributes of the input data. Finally, these meticulously fused features are fed into a fully connected neural network. The output layer of this network features a single neuron, tasked with delivering the predicted RT. Techniques like Batch Normalization, Layer Normalization, and dropout are applied to prevent overfitting and accelerate convergence. All the aforementioned designs not only improve the accuracy of predictions but also provide richer information for decision-making.
:
1
:
1 ratio. As illustrated in Fig. 4a, the model achieved an R2 value of 0.995 on the test set, significantly surpassing the performance of conventional ML algorithms. An outlier analysis (see Section 5 of the SI Materials for details) was also performed to investigate the 6 samples with prediction errors greater than 1 minute. The results indicate that the model has limitations in predicting high molecular weight (MW > 200) and highly polar compounds. To further evaluate the model's generalization capabilities towards unknown compounds and to prevent data leakage, we implemented a novel dataset division strategy: 80% of the compound data was designated for training, with 10% allocated for testing and another 10% for validation. Following this approach, the model demonstrated an R2 value of 0.931 on the test set (Fig. 4b), exhibiting excellent predictive performance and robust generalization to unknown compounds, significantly surpassing the performance of conventional ML algorithms. Above outcomes indicate that architecture of the proposed multimodal model effectively integrates diverse modal information, demonstrating a robust capability to learn the mappings among different variables, significantly augments the model's predictive accuracy for RT.
Additionally, we investigated the robustness of our model by evaluating its performance across different training set ratios and Gaussian noise levels. The method for adding noise to the data is as follows:
= X + ε·std(X)·N(0,1) | (2) |
represents the noise features of the temperature programs, X denotes the experimental features of the temperature programs, ε is the Gaussian noise ratio, std(X) is the standard deviation of the features, and N(0,1) refers to the normal distribution. As shown in Fig. 4c, the R2 value on the test set increases with the rising proportion of the training set. When the proportion of training data reaches 60%, the R2 value on the test set already exceeds 0.99. Additionally, the multimodal model demonstrates excellent robustness to noise in chromatographic experimental conditions: even with the introduction of 40% Gaussian noise, the R2 value remains above 0.90, indicating robust performance of the model.
Although our model was initially trained on linear temperature program data, to assess its adaptability to nonlinear heating processes, we conducted further tests. We specifically selected eight nonlinear temperature programs and ten compounds that were distinctly not included in the GCRT dataset. This choice underscores the compounds' novelty, ensuring a rigorous evaluation of the model's performance on entirely new chemical entities. As shown in Fig. 4d, the temperature programs were categorized into rapid nonlinear (reaching final temperature within 8 minutes, represented in red) and slow nonlinear (reaching final temperature after 10 minutes, represented in purple). The orange area represents the coverage of linear temperature programs within the training set. Notably, two of the rapid nonlinear curves fall outside this coverage area. Encouragingly, the test results demonstrated that the multimodal model was able to accurately predict the RT of unknown compounds under various nonlinear conditions, achieving an R2 of 0.900. This not only demonstrates the model's robust generalization ability but also indicates its effectiveness in capturing the essential temporal information within the heating process.
000 temperature programs, balancing search precision with efficiency. On a single Nvidia 4090 GPU, this search is completed in just one minute. The procedure for generating 100
000 random temperature programs starts by establishing boundary conditions for the following variables: initial temperature, final temperature, heating rate, and duration of the initial temperature. Following the setting of these parameters, random sampling is utilized to generate the curves. For further details on this method, refer to Section 6 of the SI Materials. The definition of isomer separation degree is as follows:![]() | (3) |
![]() | (4) |
![]() | (5) |
To validate the reliability of the temperature program search algorithm, we selected two pairs of positional isomers (2-bromo-4-methylbenzaldehyde and 2-bromo-5-methylbenzaldehyde; 2-iodo-4-methylbenzeneamine and 3-iodo-4-methylbenzeneamine) and two pairs of cis/trans isomers (cis-1,2-diphenylethylene and trans-1,2-diphenylethylene; cis-1,4-dichloro-2-butene and trans-1,4-dichloro-2-butene) as proof-of-concept subjects. More importantly, these four sets of compounds were not included in the GCRT dataset, hence the model had no prior exposure to these compounds during training.
The Tthreshold for two groups of positional isomers was set at 0.33 (20 seconds), while the default value of 0.5 was used for two groups of cis/trans isomers. This adjustment was necessary because the positional isomers failed to achieve a temperature program satisfying Δi = 1 at Tthreshold of 0.5. Fig. 5a presents the kernel density estimation (KDE) plots of max(RTpredA, RTpredB) values for the four groups of isomers during the search process where Δi = 1, illustrating the probability distribution of different isomers' later RT. The KDE was performed using a Gaussian kernel, as described by the following equation:
![]() | (6) |
(x). The recommended temperature program corresponds to the minimum max(RTpredA, RTpredB). For the four groups of isomers from top to bottom in Fig. 5a, the temperature programs recommended by the multimodal model are accelerated by 42.85%, 27.77%, 19.48%, and 21.41%, respectively, relative to the temperature programs corresponding to the mode of max(RTpredA, RTpredB). Experimental validation demonstrated that under the recommended temperature programs, all four groups of isomers were effectively separated, and the multimodal model successfully predicted the elution order with absolute errors of 0.395 minutes, 0.613 minutes, 0.693 minutes, and 0.395 minutes, respectively, showcasing excellent predictive accuracy. The experimental chromatograms are shown in Section 7 of the SI Materials. Notably, the s values for these four groups of isomers were 0.562, 0.094, 0.217, and 0.870, respectively, indicating a certain level of separation difficulty, especially for the pair of isomers 2-bromo-4-methylbenzaldehyde and 2-bromo-5-methylbenzaldehyde. The results suggest that this multimodal model, in conjunction with the search algorithm, has the potential to ensure effective isomer separation and enhance analytical efficiency. Given its high precision in prediction, this method can also be employed for internal standard peak identification and rapid analysis of complex mixtures.
To further investigate the separation difficulty of different isomers on a 5% phenyl–95% dimethylpolysiloxane stationary phase, we calculated the s values for various positional and cis/trans isomers at different Tthreshold values as a proof of concept: Fig. 5b illustrates three groups of ortho/meta positional isomers, differing only by their substituents. The results show that bromine-substituted isomers are easier to separate than chlorine-substituted ones, while methoxy-substituted isomers are the most difficult to separate. The chemical explanation we provide is that methoxy is a strong electron-donating group with significant polarity, producing a relatively consistent polar effect regardless of its position on the benzene ring, leading to minimal polarity differences between positional isomers, making them hard to separate. Chlorine is less polar than methoxy, and bromine is the least polar but has a higher polarizability, resulting in significant differences in substituent effects at different positions on the benzene ring, making positional isomers easier to separate. Fig. 5c shows three groups of halogenated ethylene cis/trans isomers, with iodine isomers being the easiest to separate, followed by bromine, and chlorine isomers being the hardest. The chemical explanation is that iodine atoms, being larger with high polarizability and weaker polarity, cause significant van der Waals force differences between cis and trans isomers. This substantial difference in atomic size and polarizability makes it easier to separate cis/trans isomers in the stationary phase. Bromine has moderate polarizability and size, whereas chlorine atoms are the smallest with the lowest polarizability, resulting in minimal polarity differences between cis and trans isomers, making separation the most difficult. In summary, different substituents lead to varying polarity differences between isomers, influencing separation difficulty. The data-driven conclusions align with chemical intuition. Other researchers can apply our model to their isomers of interest to explore the relationship between molecular structure and chromatographic behavior, aiding in experimental decision-making without performing experiments.
| RI = f(SP, Molecule) | (7) |
This characteristic renders it a critical parameter for the identification of compounds. Researchers like Dmitriy and Aleksandar have advanced ML models to predict the RI of compounds under various SPs.13–16 Given the applicability of the RI across various GC conditions and instrumentation, integrating RI to enhance the scalability of RT prediction models represents a promising direction.
RT is primarily influenced by the SP, the molecule, and the temperature program, which can be represented as follows:
| RT = g(SP, Molecule, Temperature Program) | (8) |
Therefore, we used RI and the temperature program to predict RT, formulated as follows:
| RT = z(RI, Temperature Program) | (9) |
Linearly temperature-programmed retention index (LTPRI) is the most commonly used type of RI that depends on the RT of adjacent standard alkanes. Its calculation formula is as follows:
![]() | (10) |
We developed an ANN model with two hidden layers, referred to as the RI-RT model (see Section 8 of the SI Materials for details), to predict RT. The model incorporates five input dimensions: LTPRI, final temperature, heating rate, initial temperature, and duration of initial temperature (as illustrated in Fig. 6a). The LTPRI encapsulates information about the molecule and stationary phase, while the remaining four features describe the dynamics of the temperature program. The dataset was randomly divided in an 8
:
1
:
1 ratio for training, validation, and test, achieving a predictive R2 of 0.973 on the test set (shown in Fig. 6b). To assess the model's generalization capability and prevent data leakage, we re-partitioned the dataset by compound type, maintaining the same distribution ratio. Under these conditions, the model's predictive R2 on the test set was 0.966 (as shown in Fig. 6c), surpassing the performance of the multimodal learning model.
The exceptional performance of the RI-RT model validates the feasibility of using RI to predict RT, demonstrating that a single parameter, RI, can encompass information about both the molecule and the stationary phase, significantly broadening the applicability of RT prediction models. We offer a pre-trained RI-RT model. Researchers can first employ existing RI prediction models13–16 to calculate the RI of different molecules on their specific stationary phases. Subsequently, with minimal temperature program testing, they can fine-tune the RI-RT pre-trained model to achieve universal RT prediction.
However, the multimodal model still exhibits limitations. Due to practical constraints, obtaining RT data for compounds across various stationary phases and temperature programs is difficult. Despite being trained on the widely used 5% phenyl–95% dimethylpolysiloxane stationary phase, extending its application to other stationary phases remains challenging. To address this issue, we propose a strategy for predicting RT using RI and provide a corresponding pre-trained model. Since RI inherently contains information about the stationary phase, researchers can apply our pre-trained model to their specific stationary phases, fine-tuning it with minimal data to achieve universal RT prediction. Through this approach, we aim to overcome the limitations posed by different stationary phases and further enhance the model's applicability. Additionally, in future work, the collection of chromatographic data under optimized flow rates and injection conditions would significantly enhance the quality of the dataset.
Supplementary information includes details of model architectures, outlier analysis, and experimental validation. See DOI: https://doi.org/10.1039/d4dd00369a.
Footnote |
| † These authors contributed equally. |
| This journal is © The Royal Society of Chemistry 2025 |