Open Access Article
Barnabas P. Agbodekhe
a,
Montana N. Carlozo
a,
Dinis O. Abranches
b,
Kyla D. Jones
a,
Alexander W. Dowling
*a and
Edward J. Maginn
*a
aDepartment of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, IN 46556, USA. E-mail: adowling@nd.edu; ed@nd.edu
bCICECO - Aveiro Institute of Materials, Department of Chemistry, University of Aveiro, 3810-193 Aveiro, Portugal
First published on 13th October 2025
Group contribution (GC) models are powerful, simple, and popular methods for property prediction. However, the most accessible and computationally efficient GC methods, like the Joback and Reid (JR) GC models, often exhibit severe systematic bias. Furthermore, most GC methods do not have uncertainty estimates associated with their predictions. The present work develops a hybrid method for property prediction that integrates GC models with Gaussian process (GP) regression. Predictions from the JR GC method, along with the molecular weight, are used as input features to the GP models, which learn and correct the systematic biases in the GC predictions, resulting in highly accurate property predictions with reliable uncertainty estimates. The method was applied to six properties: normal boiling temperature (Tb), enthalpy of vaporization at Tb (ΔHvap), normal melting temperature (Tm), critical pressure (Pc), critical molar volume (Vc), and critical temperature (Tc). The CRC Handbook of Chemistry and Physics was used as the primary source of experimental data. The final collected experimental data ranged from 485 molecules for ΔHvap to 5640 for Tm. The proposed GCGP method significantly improved property prediction accuracy compared to the GC-only method. The coefficient of determination (R2) values of the testing set predictions are ≥0.85 for five out of six and ≥0.90 for four out of six properties modeled, and compare favorably with other methods in the literature. Tm was used to demonstrate one way the GCGP method can be tuned for even better predictive accuracy. The GCGP method provides reliable uncertainty estimates and computational efficiency for making new predictions. The GCGP method proved robust to variations in GP model architecture and kernel choice.
Design, System, ApplicationEfficient and reliable thermophysical property prediction sits at the heart of any high-throughput computational molecular discovery and design campaign. Thermophysical property predictions from a simple first-order group contribution (GC) model, along with molecular weight (MW), are used as the only two input features to Gaussian process (GP) regression models for enhanced thermophysical property predictions with reliable uncertainty quantification (UQ). Accurate property predictions are obtained with only two input feature dimensions, instead of the tens or hundreds typically used in the literature. The method, known as the GCGP method, provides a state-of-the-art balance of speed, ease of implementation, predictive accuracy, parsimoniousness, and reliable uncertainty quantification. It is especially suited to systems that can be modeled using GC methods, and its scope of applicability can be extended by incorporating other GC methods and/or input features into the GP models. Potential applications of the GCGP method include efficient and enhanced prediction of thermophysical properties with uncertainty quantification for materials discovery via database screening or computer-aided molecular design campaigns. |
The discovery and development of new materials to meet these challenges requires the reliable prediction of material properties. Experimental exploration of all possible molecules and properties needed for any material discovery problem is often not feasible. Databases17–20 of materials and some of their experimentally measured properties have been assembled for decades. However, these databases contain a small fraction of potentially relevant molecules. Furthermore, assuming that large enough databases of potential molecules are available or developed for the discovery of materials, the properties required to assess the suitability of materials are not always available.21 Predictive computational tools are essential for streamlining the process of molecule discovery. Computer-aided molecular design (CAMD) is a well-established molecular discovery method that integrates and automates considerations from molecular to process scales in the development of new materials and processes.22 It has key advantages over traditional database screening methods, such as the potential to discover new molecules not present in compiled databases. However, one of the persistent challenges is the availability and integration of fast and reliable property prediction methods in CAMD workflows.22
Group contribution (GC) models have long been used to predict the properties of materials within CAMD and other material discovery workflows, particularly to estimate thermophysical properties.23–27 GC models operate by decomposing molecular structures into predefined functional groups and assigning specific contributions or interaction parameters to each group on the basis of experimental data.
Substantial effort has been made to develop GC-based thermodynamic models, including equation of state (EoS) and activity coefficient (AC) models. Examples of GC-based EoS models include the Predictive Soave-Redlich-Kwong (PSRK),28,29 GC-SAFT,30–32 and SAFT-γ-Mie33–36 models, amongst others. An example of a GC-based AC model is the UNIQUAC37 Functional-group Activity Coefficients (UNIFAC) model.38 These GC-based EoS or AC models are of great utility in CAMD, particularly for predicting thermodynamic properties of mixtures across a wide range of temperatures, pressures, and compositions.39–42 However, implementing these models can be cumbersome, and their computational efficiency is often limited due to the need to evaluate complex derivatives.43
An alternative class of GC methods is the class of semi-empirical or correlation-based GC models. These GC methods typically consist of several models or equations—one equation for one property—for direct and efficient computation of properties without the need to evaluate complex derivatives of other properties, as is required in EoS models. Notable examples of such semi-empirical GC models include the Joback and Reid (JR) method,44 the Lydersen method,45 and the Marrero–Gani method46 amongst others.47 These models are particularly useful for material screening tasks that involve pure fluids. Therefore, these methods can be applied, at least in a preliminary stage, to many material screening and CAMD tasks.22 Their simplicity and generalizability make them invaluable tools for screening chemical systems and designing processes without requiring extensive experimental datasets.
Because property predictions using these types of GC models do not rely on calculating the derivatives of other thermodynamic properties, they offer the advantages of speed and ease of implementation compared to other methods. Compared to GC-based thermodynamic models, these types of GC models are also more generalizable for predicting diverse properties, such as environmental48,49 or safety properties50–54 of materials.
However, as highlighted in recent studies,55 limitations in available group parameters and interaction data often restrict the predictive accuracy and scope of GC models. Furthermore, the most accessible types of these GC methods, which are first-order GC models such as the JR GC method,44 are known to have significant systematic bias.56,57 Moreover, common GC models generally do not have uncertainty estimates associated with their predictions, which is essential for material screening.58
The emergence of machine learning (ML) techniques has opened new avenues for addressing some of the limitations of GC approaches. ML methods can predict the properties of molecules with high accuracy by leveraging large datasets and advanced models such as neural networks (NNs),59 support vector machines (SVMs),60,61 Gaussian process (GP) models,62–64 random forests (RFs),65 boosting algorithms,66–69 and so on, enabling rapid virtual screening of chemical candidates.
However, ML models have several drawbacks. They typically require a large amount of data, which is not always available.70,71 Also, unlike traditional thermodynamic models and some GC models, ML models rarely have clear physical interpretability.72,73 Furthermore, uncertainty propagation and estimation from complex ML techniques, such as deep neural networks, can be cumbersome.74 GP ML surrogate models, in contrast, are well-suited for applications with limited data and inherently include uncertainty quantification. The drawbacks of GPs include scalability to large datasets with many observations or many input features, difficulty scaling to multiple outputs, and challenges approximating discontinuous functions.75
Despite these limitations, ML offers powerful tools for identifying patterns and correlations in complex, multidimensional datasets, which can be leveraged to extend the applicability and accuracy of GC models. For instance, matrix completion methods have been used to predict missing group interaction parameters in thermodynamic GC models,55 demonstrating how data-driven approaches can fill gaps in traditional GC model parameterization.
Several studies have explored the synergistic benefits of combining GC and ML for enhanced property predictions. For example, Villazón-León et al. used the number of functional groups along with several properties such as Tc and Pc as inputs to several ML models for predicting triple point temperature.76 Other studies77–94 have explored the combination of GC and ML for property prediction. Ahmadreza and co-workers applied a GC-ML approach to predict the liquid density93 and viscosity92 of deep eutectic solvents (DESs). In both works, Ahmadreza and co-workers used GC fragmentations as inputs to NN and SVM models for the prediction of density or viscosity.92,93 Ma et al. used GC fragmentations of anions, cations, and substituents as inputs to several ML models to predict the viscosity and density of ionic liquid–inorganic solvent–water ternary mixtures.77 They reported high prediction accuracy. Aouichaoui et al. applied GC fragmentation in a graph convolutional NN to enhance the interpretability of molecular property predictions.78 Adhab et al. recently developed a hybrid GC-ANN method. They used a GC model to predict critical properties and acentric factor, which were then fed as inputs to an ANN model to predict DES speed of sound.83
Cao et al.81 used inputs from a third-order GC-based fragmentation as features to train SVM and GP models. This resulted in models with a 424-dimensional input size. They introduced a warping function to address the challenge of high dimensionality in GP inputs.81 More recently, Cao et al.85 developed GC-ML models for seven properties using 231 descriptors consisting of first-order groups, individual atoms, bonds, and special atom-bond groups. They explored the use of several ML models, including RFs, SVMs, and GPs. They found that the GC inputs to the GP proved superior in terms of predictive accuracy compared to the other ML methods.
These previous attempts at combining GC and ML for enhanced property predictions have primarily focused on utilizing GC-based molecular fragmentations as molecular descriptors in ML models for property prediction.76,77,81,85 This leads to high-dimensional, sparse, discrete input feature spaces for training ML models. Such high-dimensional input spaces pose severe challenges for certain ML models, such as GP regression (GPR).75,95 Therefore, most GC-ML methods in the literature have focused on ML models such as SVMs, boosting algorithms, and NNs, which do not provide a convenient route for reliable prediction uncertainty quantification.
To the best of our knowledge, only two works81,85 have applied GC inputs to GP models for property predictions. This is the case despite the several benefits of GP models, such as inherent uncertainty quantification, ease of model training, and high predictive performance on small data sets compared to some other ML models. The works that involve applying GC inputs in GP models used 424-dimensional,81 and 231-dimensional85 inputs to the GP models. A very high-dimensional input size can prove challenging for GP modeling.75,95 Furthermore, several of the GC-ML methods in the literature use higher-order GC-methods81,85 which can be tedious to apply compared to simple first-order GC models. Furthermore, several GC-ML methods in the literature are restricted to certain classes of materials like DESs.83,92,93
The present work aims to enable the efficient, reliable, and parsimonious prediction of thermophysical properties with uncertainty quantification by combining the strengths of simple, first-order, semi-empirical GC methodologies and GP models. One notable contribution of this work is the evaluation of the quality of uncertainty estimates from the proposed GCGP method. To the best of the authors' knowledge, this is the first work within the GC-ML literature to provide and assess the quality of uncertainty estimates for property predictions. We use property predictions from a basic first-order GC method (the JR GC model), along with a readily accessible molecular property (molecular weight), as the only two inputs to the GP. The GC predictions often have significant systematic biases for several properties, which are then corrected by training the GP.
The proposed GCGP method offers a general property prediction method based on the GC-ML framework, and requires significantly fewer input features than other works in the literature. The method exploits the benefits of GP modeling while avoiding the potential curse of dimensionality issues that limit previous attempts to use GC inputs in GPs.81,85
By integrating the systematic framework of GC models with the predictive power of GPR, we propose a hybrid approach that overcomes existing data limitations and improves predictive accuracy compared to GC-only methods. Furthermore, the proposed method provides uncertainty estimates, requires two simple-to-compute input features, provides interpretability, and maintains computational efficiency.
Our study evaluates the performance of this hybrid model approach96,97 in comparison to predictions made using only the GC model. The approach aims to provide a versatile and robust framework for property prediction, enabling the design and optimization of a broad range of chemical systems.
Fig. 1 summarizes the key steps of the proposed method. Subsequent sections describe each step in detail.
Three types of data are collected or computed for each molecule and property to build the complete datasets used in this work: experimental property data, the JR GC property predictions, and the molecular weights (MW).
Table 1 shows the total number of experimental data points used in this work for each property. Tm had the highest number of data points for molecules whose melting temperature could be predicted using the JR GC model. ΔHvap had the fewest. 514 and 416 experimental data points for Tm and Tb, respectively, in the CRC Handbook of Chemistry and Physics were omitted from this study. These are not included in Table 1 as the database indicated that the reported temperatures may not be the true melting or boiling temperatures. At those temperatures, the molecules could instead undergo decomposition or sublimation.
| Property | Total data points | Training set | Testing set |
|---|---|---|---|
| Tb | 4321 | 3457 | 864 |
| ΔHvap | 485 | 388 | 97 |
| Pc | 684 | 547 | 137 |
| Vc | 698 | 558 | 140 |
| Tc | 712 | 570 | 142 |
| Tm | 5640 | 4512 | 1128 |
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
![]() | (5) |
![]() | (6) |
In the above equations, ni is the number of structural units of type i in the molecule.
is the set of groups with parameters in the JR GC model. Tb,i, Hvap,i, …, Tm,i are the JR GC parameters for the structural unit (group) i for each property. These parameters determine how the presence of each structural unit changes or contributes to the properties.
The JR GC method works by dividing the molecule into predefined structural units, for which parameters are available in the JR GC method. The desired property of the molecule is then predicted using the appropriate JR GC equation from eqn (1)–(6). The parameters for these equations are tabulated. Fig. S1 shows an example of how the JR GC method is used to compute properties.
In this work, the JRgui software (first release),103 an open-source Python-based code, was used to automatically compute the JR GC predictions for all properties using the SMILES strings of molecules. SMILES strings that could not be treated using the JR GC method were filtered out. The SMILES strings were obtained by parsing the Chemical Abstracts Service (CAS) registry numbers of the molecules in the CRC Handbook of Chemistry and Physics using PubChemPy (version: 1.0.4).104 PubChemPy is another open source Python-based package for interfacing with the PubChem17 database of compounds. The PubChem database contains over 100 million compounds and contains SMILES strings for all or almost all compounds for which it has an entry. In this work, we assume that all or almost all of the molecules in the CRC Handbook of Chemistry and Physics will have an entry in the PubChem database. Thus, their SMILES strings will be available from PubChem.
The JRgui software also provides the values of 187 molecular descriptors from RDKit (version: 2017.09.1)105 in addition to other output data. The molecular weight (MW) is one of the outputs of the JRgui tool and was used as the source of MW data for this work. Note that MW can be readily computed in the same fashion as some other properties from simple GC equations by simply summing the molecular weights of the structural units in a molecule. Therefore, there is no need to use RDKit, JRgui, or any specialized tool.
We found that the JRgui software failed to correctly parse SMILES of co-crystals, molecular complexes, ring-embedded tertiary amines, and anhydrides, which are not amenable to the JR GC method. For these classes of molecules, the JRgui software gave incorrect properties due to erroneous molecular fragmentation of the molecules. We applied additional filtering to remove data corresponding to such molecules to obtain the final data sets used in this work (see Fig. 1B).
Briefly, for any given property, we compute the standard deviation of the JR GC predictions. Depending on the relationship between GC-predicted and experimental data, we fit either a linear or a power law function to the experimental data as a function of the JR GC predictions. These serve as simple relationships between GC-predictions and experimental data (green lines in Fig. S2). If the GC-predictions for a given property all follow the same general trend in relation to the experimental data, they should all lie closely around the corresponding green line (in Fig. S2). In this case, no outlier will be detected. An outlier, for any property, is defined as any GC-prediction that is more than two standard deviations (computed from the GC-prediction data for the given property) from the corresponding simple relationship. See SI section S1.2 for more details and for the relevant figures.
We observed certain data points that showed significant deviations from the general trends in the JR GC predictions compared to experiments for ΔHvap. These points were flagged as ‘outliers’ with respect to the JR GC model (see Fig. S2 and S3). In further investigation of these points, we identified three experimental ΔHvap values for which the CRC Handbook of Chemistry and Physics had incorrect data entries. These molecules are butyrolactone, 1-methylcyclohexanol, and (+)-2-bornanone with CAS registry numbers 96-48-0, 590-67-0, and 464-49-3, respectively (see Fig. S4). We ascertained that the data entries for these molecules were incorrect by comparing them against data from two additional sources: Yaws' Critical Property Data for Chemical Engineers and Chemists18 as available in the Knovel database and the National Institute of Standards and Technology (NIST)106 WebBook. These two sources agreed with each other, while the CRC Handbook data differed for these three molecules. Furthermore, once the experimental ΔHvap data for these three molecules were replaced with those from the Yaws' Critical Property Data, they ceased to be flagged by our outlier detection procedure (Fig. S5). The other data points that were flagged as ‘outliers’ for ΔHvap were found to be due to limitations in the parameterization of the JR GC method (Fig. S6a–c). This is discussed in more detail in section 3.3.
We note that the Tm data collected from the CRC Handbook of Chemistry and Physics had several entries for which the Tm values were exactly the same. This included molecules with widely differing structural units, functional groups, and molecular weight. We compared some of the Tm data collected from the CRC Handbook of Chemistry and Physics with those from the Yaws' Critical Property Data for Chemical Engineers and Chemists.18 We found that the entries in the CRC Handbook of Chemistry and Physics agree with those from Yaws' Critical Property Data for Chemical Engineers and Chemists. Tm poses an interesting challenge, considering that molecules with seemingly very different functional groups and molecular structures have similar values of Tm. See section 3.1 for further discussion.
Fig. 2 and 3 demonstrate that the JR GC predictions and molecular weight are related to the experimental data for all properties of interest. Fig. 2 shows that the JR GC predictions and the experimental data are fairly linearly correlated for ΔHvap, Pc, and Vc. The JR GC models for Tm, Tb, and Tc are much worse predictors of the experimental data as quantified in sections 3.1 and 3.2. Thus, we observe a clearly nonlinear trend in the discrepancy, as shown in Fig. 3.
![]() | ||
| Fig. 3 3D visualization of JR GC predictions against experimental values and MW. Points a, b, c, and d are as previously discussed. | ||
Fig. 3 shows a relationship between molecular weight and the experimental data and JR GC predictions for Vc, Tb, Tc, and Tm. We observe that the discrepancy in ΔHvap does not have a strong correlation with molecular weight (i.e. there is no clear discrepancy color gradient with changing MW). However, Pc exhibits a strong nonlinear trend. This suggests that molecular weight is in general, an excellent molecular descriptor for Pc and a subpar descriptor for ΔHvap (see Fig. S7 and S8).
The systematic bias in Tm, Tb, and Tc highlights shortcomings of the JR GC method, which assumes that structural units contribute to the value of these properties monotonously. We observe, for example, that the JR GC method predicts that several organic molecules would have values of Tm greater than 1500 K, which is not the case in nature. Molecules—even within the same family—do not monotonously and boundlessly melt at higher temperatures as they get bigger. The systematic bias of the JR GC predictions for ΔHvap and Pc is more nuanced. Other properties for which the JR GC method shows a systematic bias are generally correlated with molecular weight. In contrast, the systematic bias of the JR GC method for ΔHvap and Pc is for specific classes of molecules.
In Fig. 2 there are two points (a and b) with conspicuously low JR GC ΔHvap predictions. These correspond to highly fluorinated molecules with moderate to high MW. The two molecules with this large underestimation in ΔHvap using the JR GC method are shown in Fig. S6a and b. The contribution of the fluorine group to ΔHvap according to the JR GC method is −0.67 kJ mol−1. This represents the only negative value in the parameter set for ΔHvap in the JR GC method; all other groups have positive contributions to ΔHvap in the JR GC method.44 This explains why, for highly fluorinated molecules, the JR GC method predicts very low values of ΔHvap contrary to experimental values. The JR GC method could predict negative ΔHvap values for sufficiently fluorinated molecules, which would be unphysical.
Fig. S6c shows another class of molecules for which the JR GC method has a large systematic bias in its ΔHvap predictions. They are highly nitrated compounds, such as tetranitromethane, shown in Fig. S6c. The JR GC ΔHvap prediction for tetranitromethane is 82.89 kJ mol−1 and can be observed in Fig. 2 as the highest JR GC ΔHvap prediction (point c) in our data. The JR GC method predicts that every –NO2 structural unit in a molecule should contribute 16.738 kJ mol−1 to the ΔHvap of the molecule. This contribution is much higher than those of most other structural units in the JR GC method parameter set for ΔHvap. This leads to an overestimation in ΔHvap for highly nitrated molecules. A similar scenario is observed for JR GC Pc predictions for highly brominated molecules (point d in Fig. 2). The molecules corresponding to points a–d were included in the training set for model development using stratified sampling discussed in section 2.4. In summary, Fig. 3 visualizes the 3D relationship between the input features, MW, and JR GC prediction, with experimental data for all properties.
We start by establishing notation. We define the dataset
p:= {(yGCi, MWi), yexpi}ni=1 for all molecules n and each property p ∈
:= {ΔHvap, Pc, Tc, Tb, Tm, Vc} of interest. We define the vector yexp = [yexpi|∀ i ∈ {1, …, n}] for each property, where p is omitted for convenience. Similarly, we define the input feature vector xi = [yGCi, MWi] which is stacked vertically to form the input feature matrix X ∈
n×d, where d = 2. Our goal is to train GP models to predict yexp based on the inputs X.
d denote an index and f:
d →
denote a random variable that is indexed by x (i.e., the stochastic process). A GP is specified by a mean function
m(x): = [f(x)]
| (7) |
| k(x, x′): = Cov[f(x), f(x′)]. | (8) |
The notation f(x) ∼ ![[capital G, script]](https://www.rsc.org/images/entities/char_e112.gif)
(m(x), k(x, x′)) denotes that f(·) follows a GP distribution with mean function m(·) and covariance function k(·,·). Equivalently, by the definition of a GP, for any finite subset x1, …, xn of random variables, f = (f(x1), …, f(xn))
, follows a multivariate normal distribution. This distribution is defined by a mean vector and covariance matrix governed elementwise by eqn (7) and (8), respectively. That is, f ∼
(μ, K), where μ = (m(x1), …, m(xn))
and
![]() | (9) |
In Bayesian nonparametric statistics, a GP is used as a prior for a random variable indexed by an infinite set. Upon observing a finite subset of these random variables, the posterior is another GP. This is commonly applied in regression settings to recover latent functions. See relevant texts75,107 for a more complete introduction to GPs.
A kernel refers to a function that defines a similarity measure between pairs of points. In the context of GPs, a kernel is a positive-definite function that defines the covariance structure. For example, the squared exponential (SE) (i.e., Gaussian) kernel is given by
![]() | (10) |
A more general form of eqn (10) is the rational quadratic (RQ) kernel given by
![]() | (11) |
Finally, we review the Matérn kernel defined by
![]() | (12) |
In principled inference, the structure of the length scale matrix Λ is used to model (lack of) prior information about the function. In an isotropic GP, a single length scale is used for all input dimensions. Mathematically, this means the length scale matrix is written as Λ = λ2I. This modeling choice enforces that all input dimensions are equally important and have the same effect on the output. Alternatively, if one wanted to use separate length scales for each input dimension, one could select kernels (eqn (10)–(12)) with automatic relevance detection (ARD). This allows the kernel to capture the varying relevance of different dimensions, meaning that some dimensions can be more influential than others in predicting the output. Mathematically, this means the length scale matrix is written as Λ = diag(λ12, …, λd2).
= {(xi, yi)}ni=1 composed of n pairs of regressors xi ∈
d and observations yi ∈
. The goal is to recover the latent data-generating process f(·). In most practical settings, the underlying process is perturbed by noise ε. That is,| yi = f(xi) + εi, i ∈ {1, …, n}, |
. In GPR it is assumed that f(·) ∼ ![[capital G, script]](https://www.rsc.org/images/entities/char_e112.gif)
(m(·), k(·,·)). This assumption is called the prior. By linearity of expectation, [yi|xi] = m(xi) |
| Cov[yi|xi, yj|xj] = k(xi, xj) + σ2nhi,j, |
t×d. Under the GP prior on f(·), the finite set of training and test outputs follows a joint multivariate normal distribution. That is,
n×n, K* ∈
n×t, and K** ∈
t×t are covariance matrices. To make predictions at the test points X*, one can leverage the conditional distribution of the test outputs given the training data
. This is done with the finite-dimensional conditional distribution
![]() | (13) |
. If the elements of θ are not chosen by the modeler, they must be inferred from the sample data
. Furthermore, one might be interested in comparing the performance of several GP models and selecting the best-performing model. The evidence (i.e., marginal likelihood) accomplishes both objectives.The evidence is given by
| p(y|X) = ∫p(y|f)p(f|X)df, |
y|X ∼ (μ, K + σ2nI), |
, that is
![]() | (14) |
![]() | (15) |
In the SI section S1.3, we describe several alternate GPR model structures. For completeness (see section 3.5), we compare these model alternatives. Ultimately, we find that the model structure described above performs well for all six thermophysical properties, balancing model performance with complexity. Thus, all of the results in the main text focus on the model structure defined in eqn (15) unless otherwise explicitly noted.
GP models were implemented using GPflow109 (version: 1.13.1) with the limited-memory Broyden–Fletcher–Goldfarb–Shanno bound (L-BFGS-B) algorithm to perform maximum likelihood estimation.
The L-BFGS-B algorithm was chosen for convenience since hyperparameter optimization is computationally inexpensive, scipy.minimize is integrated as part of GPflow, and L-BFGS-B is a popular and robust gradient-based optimization algorithm.110 We found that this algorithm, used in conjunction with multistarts, was reliable and performant. Alternative global search algorithms such as the simplicial homology global optimization (SHGO) algorithm111 or other libraries designed for hyperparameter optimization, such as Optuna,112 may be explored as future work.
Hyperparameter tuning was repeated ten times to avoid local hyperparameter solutions. In the first training pass, all hyperparameters were initialized at 1.0. In subsequent repeats, the length scale (
) and α were uniformly sampled from the bounds [10−5, 100]. σ2f was selected from a log-normal distribution with bounds [0, 1.0] and σ2w was always initialized at 1.0.
The optimization bounds for α were set to [10−5, 5 × 103] and all other hyperparameters were optimized within the limits [10−5, 102]. We checked the condition number of the kernel matrix K to ensure the GP models were reasonably scaled.
The stratified sampling algorithm is robust to the choice of random seed (see SI subsection S2.5). For several of the properties, such as Tb, Tc, Pc, and Vc, other random seeds did not change the stratified sampling train/test splits. However, some changes in the train/test splits and consequently in the results were observed for ΔHvap and Tm when different random seeds were used. The results of using ten additional random seeds are summarized in Table S7 for ΔHvap and Tm.
As noted above, we performed a single, global training/testing split using multilabel/multifeature stratified sampling, yielding the training set (80%) and the testing set (20%). Then, hyperparameter optimization was performed by minimizing the negative LML as described in subsection 2.3.5. For each property, the optimal hyperparameter set (see Table S2) was used to make predictions on the testing set once. These predictions are used to calculate the final model performance metrics reported in this work.
![]() | (16) |
![]() | (17) |
![]() | (18) |
![]() | (19) |
![]() | (20) |
is the average value of yexp.
The GCGP method provides significant correction to the systematic bias in the JR GC models (see Fig. 4). The coefficient of determination (R2) values of the predictions of the GCGP test set are ≥0.85 for five out of six and ≥0.90 for four out of six properties modeled in this work. The MAPE values of the test set are less than 5.5% for five of the six properties modeled. These prediction accuracy metrics are competitive when compared to other ML-related efforts in the literature81,85,116–120 to predict some of the properties modeled in this work. Some of these methods in the literature utilize tens to hundreds of input features,81,85,116,120 with some requiring quantum mechanical calculations of molecular descriptors116,121 or energy minimization of molecular structures117 to generate input features. The GCGP method uses only two input features derived from fast and straightforward GC-based calculations. Furthermore, the same input feature type is used for all properties, potentially eliminating the need to individually determine a unique set of input features for every material property prediction task, which is the current norm in the literature.
The GCGP method provides the greatest improvement for Tm, for which the JR GC method exhibits the greatest systematic bias (see Fig. 2). Performance metrics for the original JR GC method for Tm are poor: test set R2 = −0.29 and MAE = 75.0 K. In contrast, the proposed GCGP method is much more accurate for Tm with the test set R2 = 0.73 and MAE = 40.6 K. This is remarkable because the GP has only two input features: the molecular weight and the GC predictions, which often exhibit significant bias.
As discussed in section 2.2.1, molecules with very different functional groups and molecular structures can have similar values of Tm. For example, the aliphatic hydrocarbon 2-butyne with molecular formula C4H6 and the aromatic compound N,N-dibutylaniline with molecular formula C14H23N both have the same Tm value of 240.95 K according to the CRC Handbook of Chemistry and Physics.19 These values agree with the values reported in the NIST WebBook.106 This convoluted or unclear link between molecular constitution and structure with Tm makes it difficult for the GP to learn and correct the systematic bias in the JR GC predictions for Tm. This may also explain why the JR GC method performs extremely poorly for Tm prediction. Other works in the literature102,121,122 have encountered similar challenges in using ML techniques for the prediction of Tm. Hughes et al.121 reported that Tm was the most difficult property to predict among the several properties they considered in their work.
Hughes et al. used 168 2D and 53 3D (221 total) molecular descriptors obtained from quantum mechanical calculations. The best testing set R2 obtained for Tm in their work was 0.46.121 Li et al.102 used deep learning with protein sequences as input features for predicting Tm for proteins and obtained a testing set R2 of 0.75 for Tm. Venkatraman et al.122 used several ML techniques using semi-empirical (PM6) electronic, thermodynamic, and geometrical descriptors to predict Tm for ionic liquids. The testing set R2 values ranged from 0.53 to 0.67 for different ML techniques. The GCGP Tm predictive performance is thus competitively comparable to other (more complicated) methods in the literature for Tm, potentially offering better predictive performance while maintaining computational efficiency and parsimoniousness. Table S3 in the SI shows the effect of different settings of the white noise kernel variance (σ2w) on the model training metrics for Tm in our work.
The JR GC method also shows significant systematic bias for Tb and Tc. The application of the GCGP method significantly increased the testing set R2 values from 0.43 to 0.85 and from 0.67 to 0.94 for Tb and Tc, respectively. The results for Tm and Tb show that the GCGP method greatly improves the predictive accuracy of simple GC-based models, especially for scenarios where the GC models have extremely poor predictive performance.
The GCGP method also provides correction to observable systematic bias even when the systematic bias is small, and the overall predictive accuracy of the JR GC method is very high. The results for Vc in Fig. 4 demonstrate this. The testing set R2 for the JR GC prediction of Vc is 0.98. The GCGP method did not increase R2 for the test set, and thus it may seem that there was no bias correction obtained by applying the GCGP method. The MPE value for the GC prediction of Vc for the test set is −1.54%, while the GCGP MPE for the test set is −0.08%. A comparison of the MPE for the predictions of Vc, coupled with visual observation of Vc results in Fig. 4(d), allows us to infer that the systematic underestimation of Vc for molecules with higher MW and Vc in the GC predictions was corrected. The prediction error was no longer observably systematically biased using the GCGP method. A significantly negative MPE indicates systematic underestimation, as is the case for the GC-only predictions. This is in agreement with the observed Vc results in Fig. 4(d) for both the training and testing set results.
In one study, Cao et al. used a 424-dimensional GC-based fragmentation as inputs to GPs, and obtained testing set R2 of 0.891, 0.986, 0.435, and 0.887 for Tb, Vc, Tc, and Pc, respectively.86 More recently, Cao et al. used 231-dimensional GC-based-fragmentation inputs to a GP as well as other ML models. They obtained testing set R2 of 0.882, 0.788, 0.749, and 0.621 for Tb, Tc, Pc, and ΔHvap,298K, respectively.85 In this work, we obtained testing set R2 of 0.85, 0.97, 0.94, 0.92, and 0.93 for Tb, Vc, Tc, Pc, and ΔHvap,Tb, respectively. This suggests that the GCGP method has superior overall predictive performance compared to the far more complicated GC-fragment inputs to GP methods in the literature. We, however, caution against overinterpreting this comparison. A direct benchmarking of these methods using the same data is recommended as future work.
Overall, the GCGP method offers a novel approach for accurately and efficiently predicting thermophysical properties, is applicable to a wide range of properties, and utilizes a significantly lower number of input features compared to most of the other predictive ML-based models in the literature. Section 3.3 provides more discussion of the ΔHvap and Pc results.
![]() | ||
| Fig. 5 Error vs. thermophysical property for (a) JR GC and (b) GCGP models. Cornflower blue (red) represents MAE (RMSE). Solid colors (stripes) represent the training (testing) data. | ||
Fig. 5 shows the GCGP models are more accurate than the JR GC predictions for all of the properties. The only exception appears to be that the test error metrics in the JR GC model (Fig. 5(a)) are marginally less than those of the GCGP model for ΔHvap (Fig. 5(b)) as also observed in Fig. 4. This is due to the nuanced bias in JR GC ΔHvap predictions, which is discussed in more detail in section 3.3.
Fig. 5(b) shows that for all models, there is more error in the test set than in the training set. This trend is reasonable, as one would expect to see slightly more error in out-of-sample predictions. The only exception is the RMSE value for Pc. The training set RMSE for GCGP Pc prediction is marginally higher than the testing set value. We consider this to be an artifact of the train/test split. Furthermore, the RMSE values of the training and testing sets for Pc are almost identical.
The stratified sampling method used in this work is robust to the choice of random seeds in train/test splits; however, for Tm and ΔHvap, different random seeds give slightly different train/test splits. Table S7 shows that for most random seed choices, the training performance metrics are better than the testing performance metrics, and the training set errors are generally lower than those of the testing set, as expected. Furthermore, Table S7 shows that for all random seed choices, the model performance metrics for the testing set are not widely different, indicating that the GCGP method is robust to the choice of train/test splits.
![]() | ||
| Fig. 6 Comparison of GCGP and JR GC ΔHvap predictions for five highly fluorinated molecules not in the original training or testing data sets. Error bars visualize 95% prediction intervals. Experimental (out-of-sample) data were taken from Yaws' Critical Property Data for Chemical Engineers and Chemists.18 | ||
There were only two highly fluorinated molecules and one highly nitrated molecule in the collected experimental data for ΔHvap. The highly fluorinated molecules had the lowest JR GC ΔHvap predictions, and the highly nitrated molecule had the highest JR GC ΔHvap predictions in the data set (see Fig. 2). Our use of stratified sampling based on the input features ensured that the data for these three molecules were placed in the training set. To demonstrate that the GCGP method indeed learned and was able to correct for the unique chemical constituent-based systematic bias for ΔHvap, we obtained additional experimental data for five highly fluorinated molecules from Yaws' Critical Property Data for Chemical Engineers and Chemists as available in the Knovel database.18 We obtained JR GC ΔHvap predictions for these molecules. We then applied the GCGP method (using the GCGP ΔHvap model previously trained using the original training data) to also predict ΔHvap for these out-of-sample molecules with GCGP predicted uncertainties (see Fig. 6).
Fig. 6 shows how well the GCGP method corrects systematic bias and significantly improves the accuracy of ΔHvap predictions for highly fluorinated molecules. None of the five molecules in Fig. 6 were present in the original ΔHvap data set (both training and testing) used in this work. No highly fluorinated molecules were in the testing set in the original data set, as the two highly fluorinated molecules in the original data set were placed in the training set by the stratified sampling method. Interestingly, the GP leveraged sparse training data from the region of the input feature space corresponding to highly fluorinated molecules and was able to correct the systematic bias in JR GC ΔHvap predictions with high accuracy. This further underscores the power of the GCGP method. Similar results can be expected for ΔHvap predictions for highly nitrated compounds and for Pc predictions for highly brominated compounds.
This result is notable when considering that, unlike most ML methods in the literature, which utilize input features that encode the chemical identity of molecules in detail, our approach does not explicitly provide the chemical identity of molecules to the GPs. Our GPs are not explicitly informed about the presence or absence of certain chemical moieties, yet they perform well in correcting systematic bias that arises from the presence and quantity of these chemical moieties in molecules.
Fig. 7 shows the percentage of GCGP 95% prediction intervals that overlap with the experimental values for both the training and testing sets. We observe that for the training sets for all properties, the percentage of data points whose 95% prediction intervals overlap with the experimental data points is greater than 95%, with Pc and Tm being the only exceptions with 93.42% and 94.35%, respectively.
![]() | ||
| Fig. 7 Percentage of GCGP predictions that match with experimental data within predicted 95% confidence interval. | ||
A more interesting analysis is how well the prediction intervals overlap with the experimental values for ‘unseen’ data (testing set). Remarkably, for all six properties, the percentage of the testing set predictions with 95% prediction intervals overlapping with the experimental values is greater than 90% and greater than 94% for four of the six properties modeled.
GP predicted uncertainties for ΔHvap for highly fluorinated molecules (out-of-sample data) are shown in Table S5 with 95% prediction intervals visualized as error bars in Fig. 6. These predicted uncertainties are higher than the average uncertainties in the training and testing set predictions for the original ΔHvap data. The high uncertainties are expected due to the sparsity of data in the input feature space corresponding to highly fluorinated molecules in the training dataset.
Therefore, the 95% prediction intervals from the GCGP method are reliable for unseen or new molecules and have a greater than 90% empirical likelihood of representing the range of the true values even in the absence of experimental data. This is particularly important when screening new molecules for a range of applications using the GCGP method.
In assessing the sensitivity of the GCGP method to kernel design and model structure, we will focus our discussion on the LML defined in eqn (14). Fig. 8(a)–(f) show the LML for each thermophysical property investigated. Each of the four models studied in this work makes different assumptions about the relationship between the GP output (predictions of yexp) and its features (MW and/or yGC). Table 2 shows the mean and kernel function used for each GP model such that yexp ∼ GP =
(μ, K(X)) where X represents the model-specific feature(s).
![]() | ||
| Fig. 8 LML (eqn (14)) vs. model architecture 1–4 (eqn (S1)–(S4)). I = isotropic kernel, A = anisotropic kernel. (a) Heat of vaporization, ΔHvap, (b) critical pressure, Pc, (c) boiling temperature, Tb, (d) critical temperature, Tc, (e) melting temperature, Tm, and (f) critical volume, Vc. | ||
| Model | Mean function (μ) | Kernel function (K(X)) |
|---|---|---|
| 1 | 0 | K(yGC, MW) |
| 2 | yGC | K(MW) |
| 3 | yGC | K(yGC, MW) |
| 4 | β0MW + β1yGC + β2 | K(yGC, MW) |
For each of the four model architectures investigated, five isotropic parameterizations of different kernel functions were assessed. The LML of the anisotropic RQ kernel is also shown to allow comparison between the anisotropic and isotropic kernels for the six thermophysical properties studied.
We note that the formulation of the LML does not explicitly and fully account for model complexity that may arise due to differences in the number of parameters in the mean function, especially for low-data scenarios, as we have in this work.
We have applied information from computed LML values, keeping in mind the limitation highlighted above. Uncertainties in computed LML values may arise from randomness in train/test splits, randomness in kernel hyperparameter initialization during retrainings, uncertainties in the optimized hyperparameters, and other factors. In the following discussions, LML values within a 1% difference or an absolute LML difference of 1.0 from each other (whichever is greater) are considered similar. More details are provided in the SI subsection S2.4.1.
For the RQ kernel, we find that the LML values for anisotropic kernels are similar to those for isotropic kernels for all properties except ΔHvap as shown in Fig. 8. Similar results are observed for all other anisotropic kernels, compared to their isotropic counterparts, regardless of the kernel functional form. This shows that the GCGP method is robust to ARD application. Isotropic and anisotropic kernels provide similar performance with the GCGP method. Based on these results, we chose to implement the final model using isotropic kernels for all properties.
Also, Fig. 8 shows that model 2 performs the worst for all thermophysical properties. This result is as expected, as model 2 is not complex enough to be informative. Furthermore, model 2 is the only model that utilizes a single descriptor (MW). Thus, MW alone is not a good enough descriptor to model GC discrepancy. Taken as a whole, these results justify our decision to include both molecular weight and GC prediction as descriptors.
Model 3 was found to give slightly better LML values compared to model 1 overall. Model 3 has a more physically meaningful and intuitive mean function with no additional trainable parameters compared to model 1. Model 1, however, performed slightly better than model 3 for properties that had very poor GC predictions, such as Tm. This is expected since the use of the JR GC predictions as the mean function is less valid when the GC predictions are poor.
Model 4, with almost double the number of trainable parameters, with three additional parameters compared to other models, had similar LML values compared to model 3 for Tm and Vc. Model 4 had slightly better LML values compared to model 3 for other properties. Considering the significantly higher number of additional trainable parameters in model 4, while offering only a slight improvement in LML values compared to model 3 in general, we chose model 3 for the final model implementations. We, however, note that models 1, 3, and 4 all offer good and reliable predictive performance with the GCGP method.
Finally, we find that given the selection of model 3 and isotropic kernels for final model implementation, the RQ kernel with an additional trainable parameter known as the shape parameter α, has more flexibility to model the range of properties studied in this work, regardless of the smoothness (or roughness) of the surface to be learned. Further discussion is provided in the SI subsection S2.4.2 and the kernel choice rankings in the Table S6.
Regardless of kernel choice, ARD application, or model structure (with the exception of model 2), the GCGP method generally gives good and comparable predictive performance. Therefore, the GCGP method is robust to kernel choice and design and also robust to model structure, with the exception of overly simplistic modeling choices like model 2.
Notably, Tm has the largest training dataset and, consequently, a likely more heterogeneous dataset compared to other properties modeled in this work. We first examine if the limitation in the predictive performance of the GCGP method for Tm is a result of the relatively much larger and (likely more heterogeneous) dataset for Tm. We perform an analysis in which we implement the GCGP method for molecules found in both the Tm and Vc (which has the highest predictive accuracy) datasets. We also repeat the analysis for molecules found in both the Tm and ΔHvap (which has the smallest data size) datasets. The results are presented in Table S4 of SI section S2.2. The results show that the limitation in the predictive accuracy for Tm persists even for smaller datasets, with the intersectory dataset of Tm and ΔHvap showing worse predictive performance, while the intersectory dataset of Tm and Vc show similar predictive performance compared to the model that used all of the available Tm training data. Note that even the smallest dataset (ΔHvap) modeled in this work is significantly diverse, containing molecules across several tens of families of organic compounds.
As discussed in subsection 3.1, Tm is a challenging property to model using ML due to the hard-to-decode relationships between molecular structure and Tm, as several very structurally different molecules can have similar Tm.
To explore ways to improve the GCGP predictive performance for Tm, alternative or additional physics-informed descriptors were considered. Specifically, the enthalpy of fusion ΔHfus is related to Tm through the entropy of fusion ΔSfus. It has been reported in the literature that for many organic molecules, the relationship between ΔHfus and Tm is linear. This relationship is known as Walden's rule.123,124 We obtained GC predicted ΔHfus data from the JR GC model for all molecules in the original Tm datasets for which the JR GC parameters for ΔHfus were available. This resulted in a subset of 5563 data points. A single train/test split was performed on this new data subset using the approach already described in subsection 2.4 based on Tm and MW only. This fixed train/test split was used for all subsequent analyses performed. Fig. S9 shows data visualization of ΔHfus with experimental Tm. Fig. S9 shows that MW normalized GC ΔHfus offers a clearer trend with experimental Tm compared to ΔHfus alone.
Table 3 presents the results for several implementations of the GCGP method for Tm with ΔHfus either as a replacement for MW or as an additional input. Using ΔHfus/MW as an additional feature (see (e) in Table 3) resulted in a notable improvement in both the training (as shown by the LML values) and predictive performance metrics compared to the case of using only MW and GC as input features (see (a) in Table 3).
| Input set | LML | Train | Test | ||||
|---|---|---|---|---|---|---|---|
| R2 | MAPE/% | MAE/K | R2 | MAPE/% | MAE/K | ||
| a | −3517 | 0.732 | 12.66 | 39.26 | 0.736 | 12.62 | 39.31 |
| b | −3469 | 0.729 | 12.68 | 39.30 | 0.735 | 12.89 | 39.89 |
| c | −3678 | 0.717 | 12.93 | 40.38 | 0.708 | 13.19 | 40.93 |
| d | −3425 | 0.762 | 11.99 | 36.98 | 0.755 | 12.30 | 38.07 |
| e | −3198 | 0.952 | 5.23 | 16.27 | 0.774 | 11.62 | 35.96 |
This result is interpretable, considering the opposing effects MW has on Tm. Increasing MW generally increases Tm due to increased enthalpic interactions, but up to a certain threshold. At significantly higher MW, entropic limitations due to less efficient molecular packing as the molecules get bigger become significant, resulting in a counteracting effect on Tm. The input set (e) has both MW and an inverse of MW multiplied by ΔHfus. This possibly enables the GP to better capture this competing enthalpic–entropic effect of MW on Tm compared to the other alternatives ((a)–(d)) in Table 3.
The improved performance of the GCGP method for Tm using input set (e) is better than most other, more complicated methods in the literature.101,121,122 The improved performance is comparable to that from the work of Cao et al.,81 which used a 424-dimensional higher-order GC-based fragmentation input to a GP coupled with a warping function. They obtained a testing set R2 of 0.779, which is similar to the testing set R2 of 0.774 obtained in this work with the GCGP method using only three easy-to-compute, physics-informed, and interpretable input features.
This result demonstrates that the GCGP method is tunable for improved predictive performance by using additional physics-informed descriptors from simple first-order GC models. Other descriptors that can be used to enhance the GCGP predictive performance for Tm include Tc, Vc, Pc, and Tb, among others. The addition of these additional descriptors may significantly enhance the predictive accuracy of the GCGP method for Tm, potentially yielding one of the most accurate methods in the literature for predicting this challenging property across various diverse classes of organic molecules, while maintaining simplicity, efficiency, and improved interpretability.
![]() | ||
| Fig. 9 Comparison of (a) training (left) and (b) prediction (right) timings for all properties. The labels Tm,2 and Tm,3 represent timing tests for the original Tm model with two input features (Fig. 4(f)) and the enhanced Tm model with three input features (input set e in Table 3). The training set sizes are provided in parentheses in (a). Batch size in (b) is the number of molecules for which predictions are made after model training. | ||
All timing tests were performed in single-threaded mode on systems running the Red Hat Enterprise Linux operating system (version 9.6). The systems are equipped with the AMD EPYC 7532 CPU (32 cores/64 threads, base frequency 2.40 GHz, boost up to 3.30 GHz) with 250 GB of total memory. For each test, a total of ten replicates on the same machine were obtained. Fig. 9 shows the average computational times along with one standard deviation obtained from the ten replicates.
The computational time required for the full deployment of the GCGP method on new data will, of course, include the time required to obtain the GC inputs for use in the GP models. Shi and Borchardt reported that the JRgui software takes approximately 9 minutes to process 4450 SMILES strings on Windows 10 with Intel Core i7-4790 CPU, 3.6 GHz, and 16.0 GB memory.103 This includes the time for computing and exporting 187 additional descriptors from RDKit, which are not used in the GCGP method. Developing software tailored specifically to the GCGP method could significantly reduce the time needed to obtain GC inputs. This provides an opportunity for future work.
Fig. 9 shows that the time required for training the GP model, as well as making predictions on new data, is predominantly dependent on the size of the training data. This is an expected result for GPs. Fig. 9 shows that the training time required increases for the Tm model that uses an additional input feature compared to that which uses only two input features. However, the time required to make predictions on new data does not change significantly with an increase in the number of input features.
Fig. 9 shows that the GCGP method is reasonably fast for making predictions on new data, even when the GP was trained on several thousand data points, as is the case for Tm. The GCGP method requires only a few seconds to make predictions for several thousand molecules for the cases of Tm and Tb, while being more than an order of magnitude faster for other properties, as modeled in this work, with smaller training data. Thus, the GCGP method offers a route for fast and reliable property predictions for high-throughput screening of chemical systems for materials discovery.
We conclude by highlighting some limitations and opportunities for future research related to the development and application of the proposed GCGP method.
First, the accuracy of the GCGP method may be limited by the accuracy of the input GC method, as is the case for Tm in this work. This provides an opportunity for tunability for improved predictive accuracy for properties that are difficult to predict, such as Tm. As demonstrated in subsection 3.6, additional physics-informed descriptors that can be obtained from simple GC-models, tailored to the target property, can be used to tune and improve the predictive accuracy of the GCGP method for a given target property. Furthermore, another simple way to improve the prediction of Tm, for example, using the GCGP approach, is to switch to a more accurate but still simple GC method for predicting GC Tm. In fact, such a GC method already exists.125 Alternatively, we can use the same structural unit definition as the JR GC method, but design and parameterize a more accurate GC model functional form to provide a more accurate input for the GCGP method.
A second limitation is that the GCGP method requires existing GC models for the molecules of interest. One way to overcome this limitation for molecules that cannot have their properties predicted due to the limitations of unavailable parameters in a given GC method is to have their properties predicted by switching the GC method to another one that is able to predict their properties. This may entail developing a multi-GCGP method that is capable of receiving GC prediction inputs from multiple GC methods to help mitigate the limitation of an individual GC method's inability to cover all of chemical space. For this to work successfully, the identity of the GC method providing prediction input for a given molecule has to be encoded and provided as an additional input feature to the GP. A simpler but less elegant solution may be to build multiple separate GCGP models for the same property, each covering some area of chemical space that other GC methods may not cover.
A third limitation of the GCGP method is that its ability to reliably predict the properties of isomers is limited by the underlying GC method's capacity to distinguish between isomers. Higher order GC methods have been developed to help mitigate some of the challenges with property prediction involving isomers using GC methods.46,47 An interesting opportunity will be to incorporate low-dimensional topological indices such as the Weiner index,126 the Zagreb indices,127 and Randic index128 as additional inputs to the GP. This will have a drawback of higher input feature space dimensions, but can potentially greatly improve the differentiability of isomers for property prediction using the GCGP method.
There have been works in the literature where higher-order GC-based fragmentations have been used as direct inputs to GPs.81,85 It would be interesting to see how the GCGP method performs when used in an implementation involving the direct input of first-order JR GC fragments to the GPs for property prediction. To implement such a model, JRgui103 or similar software would need to be modified to allow outputs of the JR GC fragments in addition to computed properties. This provides additional opportunity for future work.
Furthermore, another future opportunity is to extend the GCGP method to predict properties under varying conditions of temperature and possibly pressure. This may be achieved by adding temperature as an input feature to the GP and training against sufficient data to capture the temperature dependence of the target property.
Finally, a contribution that would be highly valuable is integrating the GCGP method with CAMD workflows. The improved predictive accuracies and easily accessible, reliable uncertainty estimates from the GCGP method could result in a significant improvement in the reliability and robustness of CAMD workflows for identifying optimal molecules and processes across various applications.58
All codes and final results are also available on the project's GitHub (https://github.com/MaginnGroup/GCGP/tree/master) repository.
Supplementary information is available. See DOI: https://doi.org/10.1039/d5me00126a.
| This journal is © The Royal Society of Chemistry 2025 |