Open Access Article
Miruna T.
Cretu
ab and
Jesús
Pérez-Ríos
*b
aDepartment of Chemistry, Imperial College London, London SW7 2AZ, UK
bFritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, 14195 Berlin, Germany. E-mail: jperezri@fhi-berlin.mpg.de
First published on 11th January 2021
We show that by using intuitive and accessible molecular features it is possible to predict the temperature-dependent second virial coefficient of organic and inorganic compounds with Gaussian process regression. In particular, we built a low dimensional representation of features based on intrinsic molecular properties, topology and physical properties relevant for the characterization of molecule-molecule interactions. The featurization was used to predict second virial coefficients in the interpolative regime with a relative error ≲1% and to extrapolate the prediction to temperatures outside of the training range for each compound in the dataset with a relative error of 2.1%. Additionally, the model's predictive abilities were extended to organic molecules unseen in the training process, yielding a prediction with a relative error of 2.7%. Test molecules must be well-represented in the training set by instances of their families, which are high in variety. The method shows a generally better performance when compared to several semi-empirical procedures employed in the prediction of the quantity. Therefore, apart from being robust, the present Gaussian process regression model is extensible to a variety of organic and inorganic compounds.
![]() | (1) |
.
Two-body interactions are the most relevant interactions to the macroscopic properties of a gas,5 hence B2 values have been tabulated for many compounds.6 Since B2 can be derived from intermolecular potentials, the latter can be obtained from experimental B2 through a proper parametrisation of the potential function.7,8 This is conducive to the calculation of fluid properties such as enthalpy of vaporisation9 and transport coefficients.9–11 The knowledge of B2 also helps to estimate critical points12 and optimum conditions for crystal growth, which would otherwise require extensive screening experiments.13
When it comes to the determination of the second virial coefficient, computational cost and experimental obstacles often come into play. The theoretical approach to estimate B2 from the interaction potential was developed ever since the 1930s14,15 and is adapted nowadays to more complex potential functions. However, the process is computationally expensive for all but simple molecules. Furthermore, experimental procedures give accurate results for certain ranges of temperature, however they are faced with the challenge to acquire reliable compressibility data.16 In the case of empirical approaches, the law of corresponding states17 leads, in some cases, to very accurate results, whereas in other situations, the accuracy is low.
To provide an alternative to the traditional methods of calculating B2, we propose tackling the problem within the new paradigm of data-intensive science.19 The existence of a large and high quality database of temperature-dependent second virial coefficients6 fulfills the most vital prerequisite for the application of machine learning. The choice of input features for learning is then a matter of physical and computational intuition. Among notable previous works on B2 estimation is the one of Di Nicola et al.,20 which uses thermodynamic input features and artificial neural networks (ANN) to predict B2 with high accuracy. This method, however, requires the construction of a complex ANN, together with the knowledge of five thermodynamic properties, which are difficult to obtain, as discussed above. As the authors also suggest, this method should only be used when “high accuracy is required”,20 due to its complexity. Furthermore, the prediction of second virial coefficients has been addressed before by Mokshyna et al.,21 by assuming a functional form of the dependency of B2 on temperature. The methodology involved modelling molecular structures using the Simplex Representation of Molecular Structure (SiRMS),22 based on which a Random Forest model was trained to predict second virial coefficients.
In this paper, we propose the prediction of second virial coefficients of organic and inorganic compounds in a simple, universal manner. Our approach is based on Gaussian Process Regression (GPR) fed with a low dimensional input featurization scheme (see Fig. 1). To offer perspective to the reader, it is of importance to introduce the potential contexts in which the prediction of second virial coefficients is desired. That is, one might want to predict the quantity outside of an already studied temperature range, as well as predict second virial coefficients for new molecules. Our work addresses both aspects, but we conclude that for accurate results, the new molecule should belong to classes of compounds already studied in the training set.
![]() | ||
| Fig. 1 Schematic representation of the method designed for the prediction of B2(T) using Gaussian process regression. A chosen molecule is characterized by a set of input features obtained using RDKit.18 From the input data, the trained GPR model is used to predict B2(T) for the desired molecule. | ||
We show that the chosen features succeed to incorporate the most relevant characteristics of the second virial coefficient, in that the model succeeds to predict B2 of an unseen molecule with a relative error of 2.7%, over whole ranges of temperatures. Moreover, the power of our model is reinforced by the successful extrapolation (relative error 2.1%) to temperatures outside of the training range for any molecule in the dataset. Our method's universality stems from its applicability to compounds belonging to a wide range of families and from the availability and accessibility of input features for any compound. The simplicity stems from the facile practice to generate input features and from the ease of applying computationally inexpensive GPR (for the number of data points considered in this work). Different featurization combinations were tested to yield the best, lowest dimensional scheme finally. All the input data were generated using RDKit,18 an open-source toolkit for cheminformatics implemented in Python. While most of the features we used are basic molecular properties of compounds, the Morgan fingerprint is a representation of the connectivity of atoms in a molecule.23 This mostly caters for molecular characterisation and for identifying common fragments within different molecules.
The diversity of data is notable with regard to the physical and chemical properties of molecules. For instance, the inorganic compounds cover a broad spectrum of molecules and atoms starting from noble gas atoms to polyatomic molecules such as boranes. Whereas within the organic compounds, one finds ketones, which have important industrial applications,25 carbonyl compounds that appear as a natural product of pollution26 or siloxanes: an incredibly versatile class of molecules that has been proposed as a candidate for Bethe–Zeldovich–Thompson fluids,27 or that shows exciting properties as a surfactant.28
| f(x) ∼ GP(m(x),K(x,x′)). | (2) |
The models were developed and analysed using MATLAB's already implemented tools.
That being said, we devised what properties would be most relevant to describing the second virial coefficient, from a physical perspective. The features used in this work belong to three categories: physical properties that can describe molecular interactions (partial atomic charges and valence electrons), topology features that characterize similarity and complexity of compounds, deduced from cheminformatics (Morgan and E-state fingerprints) and intrinsic properties of molecules (molecular weight), to account for their different sizes. All of the input features are available and easy to compute, and in our case, they were generated using RDKit.18 A further explanation for the choice of physical and topological features is outlined below.
• The minimum and maximum partial charges of a molecule are correlated with the molecule's dipole moment. This is supported by the recent work of Veit et al., which implements a partial-charge model to predict dipole moments of molecules.31 The presence of a dipole moment in a molecule leads to a dipole–dipole interaction apart from the van der Waals interaction of non-polar molecules. The dipole moment of a molecule is proven to increase the attractive forces between molecules and therefore to lower B2 for a given temperature.32 This shows a direct relationship between B2 and the magnitudes of the minimum and maximum partial charges. The partial charges used in this work were computed using RDKit, which employs the procedure described by Gasteiger.33 This computes charge distribution in molecules based on the identities of individual atoms and their connectivities.
• Morgan fingerprints represent a well-known method for molecular characterization in terms of topology and connectivity within a molecule. In particular, a molecule is characterized by a fingerprint that contains 1024 bits, and each of these bits represents a fragment, i.e., a possible scenario of individual atoms and their environment (meaning all neighbouring atoms within a diameter of four chemical bonds) within the molecule. The “extended connectivity” of atoms is computed using Morgans extended connectivity algorithm.23 Therefore, the complexity of a molecule can be assessed by counting how many bits out of 1024 are needed to describe connectivities in a molecule, as well as element types, charges and atomic masses.23 Furthermore, Morgan fingerprints can be used to generate a similarity score to a reference molecule. This can be obtained through commands implemented in RDKit.18 In our study, the reference molecule was chosen to be the one with the highest number of nonzero bits in the Morgan fingerprint, i.e., the most complex molecule from this point of view: 2-ethylthiophene. In this way, a similarity score to the fingerprint of the reference molecule was attributed to each molecule in the database, as an input feature. Intuitively, this is a measure of comparison between environments, connectivities and chemical features within different molecules, which can further be relevant to comparing 2-body interactions for different molecules.
• The E-state fingerprint has also been used to characterize the molecules in the data set. This fingerprint is based on the electrotopological state indices of atoms within a molecule.23 These encode information related to the valence state, electronegativity of atoms and the molecule's topology. In particular, we translate the information for each molecule into a numerical descriptor through the ratio between the total summation of E-state indices for all atoms and the summation of the number of times each possible atom type appears in the molecule.
One of the most common error estimators is the mean absolute error (MAE), defined as
![]() | (3) |
In GPR, the predictions are being made after examining correlations between input features and observations in the training set. This is done without prior knowledge of the test set and, implicitly, no weighting on it, making the root mean squared error (RMSE), which is defined as follows:
![]() | (4) |
![]() | (5) |
![]() | (6) |
The results of using temperature, molecular weight, minimum and maximum partial atomic charges, and similarity of the Morgan fingerprint to that of a reference molecule, as input features, are shown in Fig. 3. The training was done on 5547 randomly selected data points, using 5-fold cross-validation, and 1386 predictions were made on the remaining “out-of-sample” points. It is easily noticed from the figure that most of the predicted values for the second virial coefficient agree with the true experimental values, translating into an excellent performance and predictive capability of the model at hand. This performance is characterized by an RMSE of 58 mL mol−1 and a normalized error of 0.8%, as it is shown in Table 1. For reference, a multiple linear regression (MLR) model with the same descriptors was implemented and registered an RMSE of ∼525 mL mol−1. This indicated the impact of non-linearity on the relationship between second virial coefficients and the proposed descriptors, and implicitly the requirement of a non-linear method such as GPR in this matter.
| Dimension | Features | Test RMSE (mL mol−1) | Test MAE (mL mol−1) | Test rE (%) |
|---|---|---|---|---|
| 4D | (T,MW,δmin,δmax) | 80 ± 6 | 27 ± 0.6 | 1.1 ± 0.1 |
| 5D | (T,MW,δminδmax, MFsimilarity) | 58 ± 2 | 22 ± 0.6 | 0.8 ± 0.03 |
| 5D | (T,MW,δmin,δmax, MFnonzeros) | 68 ± 7 | 23 ± 0.5 | 0.9 ± 0.1 |
| 5D | (T,MW,δmin,δmax, E-state) | 67 ± 4 | 24 ± 0.4 | 0.9 ± 0.1 |
| 5D | (T,MW,VE, MFnonzeros, MFsimilarity) | 155 ± 10 | 56 ± 3 | 2.1 ± 0.1 |
| 6D | (T,MW,VE,δmin,δmax, MFnonzeros) | 60 ± 2 | 22 ± 0.4 | 0.8 ± 0.03 |
| 6D | (T,MW,VE,δmin,δmax, MFsimilarity) | 57 ± 3 | 22 ± 0.4 | 0.8 ± 0.04 |
To further analyze the performance of our GPR model we have calculated its learning curve, i.e., the model performance as a function of the number of points in the training set while keeping the number of data points in the test set constant, which is shown in the inset of Fig. 3. As a result, it is observed that the model's learning capabilities are converged around 4000 data points of the training set. Therefore, the interpolation performance of our model cannot benefit from having a larger number of training points for the families of compounds already present in the training set.
The combination and the number of input features for our model were selected after the implementation of different featurization schemes and the comparison of their performances. The results of this procedure are shown in Table 1. Here, it is noticed that when the number of valence electrons is used as a predictor instead of the partial atomic charges, a much poorer performance is obtained, at the same dimensionality (5D). This is suggestive of the importance of minimum and maximum partial atomic charges as predictors in our model, presumably succeeding to account for the strength of interactions between molecules, more than just for their internal electronic structure. In addition, we notice that although the E-state fingerprint contains additional information concerning the valence state of atoms, it does not show an improved performance to that of the Morgan fingerprint in a 5-dimensional representation. Indeed, this correlates with our previous statement about the major role of partial charges in comparison with the number of valence electrons regarding molecular interactions.
To get a measure of the importance of individual predictors relative to each other, the automatic relevance determination (ARD)34 rational quadratic kernel function was used in GPR:
![]() | (7) |
![]() | ||
| Fig. 4 Ranking of predictors based on the characteristic length scale of each predictor, obtained with an ARD rational quadratic kernel. The symbols used in the figure were defined in the caption of Table 1. The errors associated with each weight are the result of performing 5 iterations. | ||
While Gaussian process regression serves well as an approach to smooth interpolation,29 it usually performs worse in extrapolation tasks. Our model illustrates well the former statement, exhibiting excellent interpolation in the previous subsection. In addition to this, the representation of input features, as well as the chosen kernel function prove to lead to reasonable extrapolation capability, with only a few molecules experiencing the limitations of the model. The rational quadratic kernel differs from the squared exponential one (a popular choice of kernel function for GPR) in that it contains an additional parameter (α) which determines the relative weighting between large and small-scale variations.29 This eventually leads to a better generalization of the long term trend, compared to a squared exponential kernel, and therefore better extrapolation. The fact that the same model achieves both interpolation and extrapolation translates into considerate choice of both input features and kernel function.
Nonetheless, the model succeeds to predict second virial coefficients for organic molecules with an RMSE of 195 mL mol−1 (see Fig. 6), which is indicative of the greater capability of the input features to describe organic compounds, rather than inorganic ones. This is naturally expected, as Morgan fingerprints incorporate valuable information for organic molecules concerning topology, element types and atomic charges. Fig. 7 reveals some examples of predicted, as well as true values for B2 plotted against temperature. Instances from three different organic families are presented, for which the predicted curve is smooth. At the same time, it can be seen that the inorganic molecule phosphine performs poorly compared to the other instances.
The model was trained using 5-fold cross-validation, having a training RMSE of 62 mL mol−1 and a test RMSE of 195 mL mol−1 (relative error 2.7%). Generally, the model predicts with great accuracy second virial coefficients for molecules well-represented in the training set, such as hydrocarbons (see Fig. 7a). However, the model is not transferable to molecules which have no resemblance to the training set, being limited, in this case, to “out-of-sample” compounds that interpolate. Nonetheless, the wide range of families present in our database offer an optimistic view towards the applicability of our model, as various commonly encountered classes of organic compounds are encompassed.
Finally, the performance of our model was analysed in contrast to the performances of established methods for the calculation of B2 for polar substances through empirical equations and through the corresponding states principle.37Table 2 contains deviations from experimental values of second virial coefficients calculated through four different methods. The data reinforces the points made previously about our model. That is, the model is characterised by a poor performance in predicting second virial coefficients for inorganic compounds, but proves a generally better performance relative to other methods in the prediction for organic compounds.
, where yi and yi* are experimental and calculated values of B2, respectively. Abbreviations for B2 calculation methods are as follows: PC: Pitzer and Curl (1957),35 TD: Tarakad and Danner (1977),36 ECS: Xiang (2002),37 GPR: Gaussian process regression method described in this work
| Compound | PC RMS | TD RMS | ECS RMS | GPR RMS |
|---|---|---|---|---|
| 1,3-Dimethylbenzene | 8 | 8 | 6 | 4 |
| Phenol | 10 | 6 | 6 | 2 |
| 1,2-Dichloroethane | 10 | 9 | 9 | 11 |
| 1-Propanol | 13 | 13 | 16 | 12 |
| Bromoethane | 9 | 17 | 9 | 13 |
| Water | 10 | 34 | 20 | 26 |
| 1-Butanol | 11 | 10 | 23 | 4 |
| 2-Pentanone | 21 | 7 | 6 | 7 |
| This journal is © the Owner Societies 2021 |