Predicting second virial coefficients of inorganic and organic compounds using Gaussian Process Regression

We show that by using intuitive and accessible molecular features it is possible to predict the temperature-dependent second virial coefficient of organic and inorganic compounds using Gaussian process regression. In particular, we find that a low dimensional representation of features based on intrinsic molecular properties, topology and physical properties relevant for the characterization of molecule-molecule interactions, succeeds to predict the second virial coefficient of any molecule with a relative error $\lesssim 1\% $.


I. INTRODUCTION
The long-standing goal to establish a relationship between the behaviour of a gas and its microscopic properties has admirably been achieved by the virial equation of state 1,2 . Besides offering a rigorous depiction of the pressure, p(T, ρ) over a wide range of temperature 3 , T , and density, ρ, the virial equation is founded on a solid statistical mechanics framework 4 . The virial equation, encapsulates the departure from ideality of a gas in an infinite series of temperature-dependent coefficients, B i (T ), which correspond to the molecular interaction in isolated clusters of size i. Therefore, B i (T ) is the i-th virial coefficient and is related to the role of i-body interactions in a system. In Eq. (1) R denotes the ideal gas constant and the series is truncated up to a certain cluster size N . Two-body interactions are the most relevant to the macroscopic properties of a gas 5 , hence B 2 values have been tabulated for many gases 6 . Since B 2 can be derived from intermolecular potentials, the latter can be obtained from experimental B 2 through a proper parametrisation of the potential function 7,8 . This is conducive to the calculation of fluid properties such as enthalpy of vaporisation 9 and transport coefficients [9][10][11] . The knowledge of B 2 also helps to estimate critical points 12 and optimum conditions for crystal growth, which would otherwise require extensive screening experiments 13 .
When it comes to the determination of the second virial coefficient, computational cost and experimental obstacles often come into play. The theoretical approach to estimate B 2 from the interaction potential was developed ever since the 1930s 14,15 and is adapted nowadays to more complex potential functions. However, the process is computationally expensive for all but simple molecules. Furthermore, experimental procedures give accurate results for certain ranges of temperature, a) Electronic mail: jperezri@fhi-berlin.mpg.de however they are faced with the challenge to acquire reliable compressibility data 16 . In the case of empirical approaches, the law of corresponding states 17 leads, in some cases, to very accurate results, whereas in other situations, the accuracy is low. To provide an alternative to the traditional methods of calculating B 2 , we propose tackling the problem within the new paradigm of data-intensive science 19 . The existence of a large and high quality database of temperature-dependent second virial coefficients 6 fulfills the most vital prerequisite for the application of machine learning. The choice of input features for learning is then a matter of physical and computational intuition. Among notable previous works on B 2 estimation is the one of Di Nicola et al. 20 , which uses thermodynamic input features and artificial neural networks (ANN) to predict B 2 with high accuracy. This method, however, requires the construction of a complex ANN, together with the knowledge of five thermodynamic properties, which are difficult to obtain, as discussed above. As the authors also suggest, this method should only be used when "high accuracy is required" 20 , due to its complexity.
In this paper, we propose the prediction of second virial coefficients of organic and inorganic compounds in a simple, universal manner. Our approach is based on Gaussian Process Regression (GPR) fed with a low dimensional input featurization scheme (see Fig. 1), yielding a prediction of the second virial coefficient with a relative error 1%. Our method's universality stems from its applicability to compounds belonging to a wide range of families and from the FIG. 2. Experimental values of second virial coefficients from the filtered database as a function of temperature. The two bar charts, in the inset, show classifications of inorganic and organic compounds in the dataset, respectively, as well as the number of data points for each class of compounds. The associated box plots for temperature and second virial coefficients values are also shown, with a 1.5 maximum whisker length.
availability and accessibility of input features for any compound. The simplicity stems from the facile practice to generate input features and from the ease of applying computationally inexpensive GPR (for the number of data points considered in this work). Different featurization combinations were tested to yield the best, lowest dimensional scheme finally. All the input data were generated using RDKit 18 , an open-source toolkit for cheminformatics implemented in Python. While most of the features we used are basic molecular properties of compounds, the Morgan fingerprint is a representation of the connectivity of atoms in a molecule 21 . This mostly caters for molecular characterisation and for identifying common fragments within different molecules.

II. THE DATASET
A comprehensive database of second virial coefficients for pure organic and inorganic substances is made available through the compilations of Dymond et al. and Gmehling et al. 22 , totalling over 9300 values for a temperature range from 0.63 to 1473.15 K. Subsequent to filtering, our dataset comprises 1720 inorganic and 5213 organic compounds, divided in diverse types of classes (see Fig. 2). While for some compounds, experimental values of B 2 (T ) were reported for more than 200 temperatures, for other substances there existed only one data point in the set. When different B 2 values were registered for the same compound at the same temperature, an average of the B 2 values was taken. Further filtering of the data was performed by leaving out compounds with less than 3 data points and by eliminating the values which were off the temperature-dependent trend and were therefore of unphysical meaning.
The diversity of data is notable with regard to the physical and chemical properties of molecules. For instance, the inorganic compounds cover a broad spectrum of molecules and atoms starting from noble gas atoms to polyatomic molecules such as boranes. Whereas within the organic compounds, one finds ketones, which have important industrial applications 23 , carbonyl compounds that appear as a natural product of pollution 24 or siloxanes: an incredibly versatile class of molecules that has been proposed as a candidate for Bethe-Zel'dovich-Thompson fluids 25 , or that shows exciting properties as a surfactant 26 .

III. GAUSSIAN PROCESSES
In the context of solving non-linear regression problems, Gaussian process regression (GPR) can be viewed as a nonparametric approach. In other words, GPR does not assume any functional form to find the fitting to a given data set. Rather, GPR employs a Gaussian distribution of functions to match the observed variables. Next, Bayesian inference, i.e., the estimation of the probability of an event given the occurrence of a previous one, allows a prior distribution of data to develop into a posterior one. In the case of GPR, the prior distribution, p( f |x) is a multivariate Gaussian distribution, usu-ally with a zero mean function m(x) and with a covariance matrix defined by a kernel designated by the user, K(x, x ), which stores information about the correlation between the input points 27 . The distribution of functions is therefore defined as: The posterior distribution, p( f |x, y), which is also normal multivariate, is obtained by conditioning the joint Gaussian prior distribution on the observations (y). This allows to make predictions (y * ) for new, unobserved data.

IV. FEATURIZATION METHODS
The choice of input features for our model was primarily guided by chemical and physical intuition, as well as by domain expertise. The features belong to three categories: physical properties that can describe molecular interactions (partial atomic charges and valence electrons), topology features that characterize similarity and complexity of compounds, deduced from cheminformatics (Morgan and E-state fingerprints) and intrinsic properties of molecules (molecular weight), to account for their different sizes. All of the input features are available and easy to compute, and in our case, they were generated using RDKit 18 . A further explanation for the choice of physical and topology features is outlined below.
• The minimum and maximum partial charges of a molecule are correlated with the molecule's dipole moment, which in turn has an influence on the interaction potential. The presence of a dipole moment in a molecule leads to a dipole-dipole interaction instead of the van der Waals interaction of non-polar molecules.
The dipole moment of a molecule is proven to increase the attractive forces between molecules and therefore to lower B 2 for a given temperature 28 . This shows a direct relationship between B 2 and the magnitudes of the minimum and maximum partial charges.
• Morgan fingerprints represent a well-known method for molecular characterization in terms of topology and connectivity within a molecule. In particular, a molecule is characterized by a fingerprint that contains 1024 bits, and each of these bits represents a fragment, i.e, a possible scenario of individual atoms and their environment (meaning all neighbouring atoms within a diameter of four chemical bonds) within the molecule. The "extended connectivity" of atoms is computed using Morgan's extended connectivity algorithm 21 . Therefore, the complexity of a molecule can be assessed by counting how many bits out of 1024 are needed to describe connectivities in a molecule, as well as element types, charges and atomic masses 21 . Furthermore, Morgan fingerprints can be used to generate a similarity score to a reference molecule. This can be obtained through commands implemented in RDKit 18 . In our study, the reference molecule was chosen to be the one with the highest number of nonzero bits in the Morgan fingerprint, i.e. the most complex molecule from this point of view: 2-Ethylthiophene. In this way, a similarity score to the fingerprint of the reference molecule is attributed to each molecule in the database, as an input feature.
• The E-state fingerprint has also been used to characterize the molecules in the data set. This fingerprint is based on the electrotopological state indices of atoms within a molecule 21 . These encode information related to the valence state, electronegativity of atoms and the molecule's topology. In particular, we translate the information for each molecule into a numerical descriptor through the ratio between the total summation of E-state indices for all atoms and the summation of the number of times each possible atom type appears in the molecule.
A. Model performance evaluation GPR, as a general fitting approach, needs a method to characterize its performance. In other words, an error estimation is needed for the proper evaluation of GPR models and the posterior identification of outliers of the model.
One of the most common error estimators is the mean absolute error (MAE), defined as where N is the total number of values in the data set, y i are the true values of second virial coefficients, and y * i are the predictions.
In GPR, the predictions are being made after examining correlations between input features and observations in the training set. This is done without prior knowledge of the test set and, implicitly, no weighting on it, making the root mean squared error (RMSE), which is defined as follows: a practical evaluation tool. The RMSE of predictions on the test data will be used along this work. However, when predicting physical or chemical quantities, it may be better to have a dimensionless error estimator. The normalized error (r E ), given as does not have units since it is defined as the ratio between the RMSE and the extension of the data. Therefore, the normalised error is an important error estimator regarding GPR, and it will be used throughout this work.

V. RESULTS
Second virial coefficients are learned at a given temperature through a GPR model and from molecular and cheminformatics-based properties of compounds. The results of using GPR based on a rational quadratic kernel (see appendix) and of using temperature, molecular weight, minimum and maximum partial atomic charges, and similarity of the Morgan fingerprint to that of a reference molecule, as input features, are shown in Fig. 3. It is easily noticed from the figure that most of the predicted values for the second virial coefficient agree with the true experimental values, translating into an excellent performance and predictive capability of the model at hand. This astonishing performance is characterized by an RMSE of 59.58 ± 1.43 mLmol −1 and a normalized error of 0.81 % as it is shown in Table I.
To further analyze the performance of our GPR model we have calculated its learning curve, i.e., the model performance as a function of the number of points in the training set while keeping the number of data points in the test set constant, which is shown in the inset of Fig. 3. As a result, it is observed that the model's learning capabilities are converged around 4000 data points of the training set. Therefore, the performance of our model cannot benefit from having a larger dataset.
The combination and the number of input features for our model are selected after the implementation of different fea-    Table I. Here, it is noticed that when the number of valence electrons is used as a predictor instead of the partial atomic charges, a much poorer performance is obtained, at the same dimensionality (5D). This is suggestive of the importance of minimum and maximum partial atomic charges as predictors in our model, presumably succeeding to account for the strength of interactions between molecules, more than just for their internal electronic structure. In addition, we notice that although the E-state fingerprint contains additional information concerning the valence state of atoms, it does not show an improved performance to that of the Morgan fingerprint in a 5-dimensional representation. Indeed, this correlates with our previous statement about the major role of partial charges in comparison with the number of valence electrons.

l O H s W L Q Y K E I V d u n G g S X 0 E C O A t q R A h r 4 A l r + 6 D r 3 W / e g N A / l H Y 4 j c A M 6 k L z P G c V M 8 s z D c r c H A q m X d B E e M A m 4 T N N y 0 T N L d s W e w F o k z o y U y A x 1 z / z q 9 k I W B y C R C a p 1 x 7 E j d B O q k D M B a b E b a 4 g o G 9 E B d D I q a Q D a T S b 3 p 9 Z J p v S s f q i y k m h N 1 N 8 T C Q 2 0 H g d + 1 h l Q H O p 5 L x f / 8 z o x 9 i / d h M s o R p B s u q g f C w t D K w / D 6 n E F D M U 4 I 5 Q p n t 1 q s S F V l G E W W R 6 C M / / y I m l W K 8 5 Z 5 f y 2 W q p d z e I o k C N y T E 6 J Q y 5 I j d y Q O m k Q R h 7 J M 3 k l b 8 a T 8 W K 8 G x / T 1 i V j N n N A / s D 4 / A H R a J X 5 < / l a t e x i t >
To get a measure of the importance of individual predictors relative to each other, the automatic relevance determination (ARD) rational quadratic kernel function (see appendix) is used in GPR. ARD allows the assignment of separate length scales for each predictor, instead of the same one for all of them. If an input's length scale is large, the distance one needs to move in the input space so that the function values become uncorrelated is also large, so that the covariance will become almost independent of that input. The predictor data is standardized to allow for consistency. In this way, a weight is assigned to each input feature, as shown in Fig. 4. The ranking is consistent with our previous evaluation of the featurization schemes' performances (see Table I): temperature is the most important, followed by partial atomic charges and/or molecular weight. Morgan fingerprint similarity is expected to TABLE I. Predictors ranking by the RMSE score from 5-fold cross validation (CV). The symbols used in the table are assigned as follows: T is the temperature, MW stands for the molecular weight, δ min , δ max are minimum and maximum partial atomic charges, respectively, MF nonzeros is the number of nonzero bits in the Morgan fingerprint, MF similarity is the similarity of the compound's fingerprint to that of the reference compound, E-state encodes information on the E-state fingerprint and V E is the number of valence electrons. The results are obtained using GPR on a total of 6933 training and test points, from 5 iterations.  perform better than the number of nonzero bits in the fingerprint. It is worth noticing that partial charges are better ranked than the number of valence electrons. Thus, confirming our intuition on the relevance of partial charges over number of valence electrons, which are supported by the results shown in Table I.
The applicability of a model can be evaluated by performing a residuals analysis. The best performing model was chosen for analysis, with a 5 dimensional featurization scheme and 5-fold cross validation GPR based on a rational quadratic kernel. For regression models, a normal distribution of residuals usually suggests that they are random and independent of the true B 2 values, meaning that the predictive information is well captured. While our residuals do not perfectly follow a normal distribution, the box plot in Figure 5 shows that the residuals are almost symmetrically distributed and that the data is very compact, gathering around the mean. This figure also shows the structures and corresponding temperatures of the main outliers of our model. It is noticed that the out-liers' temperatures and true B 2 values fall within the interquartile ranges (see Fig. 2), meaning that it is not the case that the model fails to extrapolate for extreme temperatures or B 2 values. Most of the outliers are organic compounds, with 6 of them having cyclic structures. It is interesting to notice this, since the database comprises more organic than inorganic compounds, so that one could have expected a more accurate prediction for organic compounds, rather than for inorganic. The inorganic molecules that are not well represented by our model are ammonia-D3, oxygen at temperatures within the interquartile range and helium at high temperatures (∼1300K), whose residuals cluster at low values of B 2 (see Fig. 5). Most of the outliers that can be seen in Figure 5 occur due to a tendency in overestimating B 2 for compounds at temperatures which are either a minimum or a maximum in the set of input data for a specific compound, when not many points are provided for it. This is a characteristic behaviour of fitting operations over a small number of points, that our GPR model appears to comply with.

VI. CONCLUSIONS
We have developed a method for estimating second virial coefficients using Gaussian process regression with a relative error 1%. This has been possible through the use of a low-dimensional representation of predictors based on accessible, intuitive, and reproducible molecular features, conveniently obtained through RDKit. There is no requirement to make a distinction between organic and inorganic or small and large molecules, as the input features succeed to describe well a generous number of types of compounds. The analysis of residuals generated by our model reinforces the applicability of the model to molecules belonging to a large variety of classes. When compared to traditional techniques used to calculate second virial coefficients, our method stands out in particular through its simplicity and through its efficiency, avoiding the difficulties posed by computational cost or by experimental obstacles. The input features are readily obtained through RDKit and the time required to train our best model is approximately 74 seconds on a 2 GHz Intel Quad-Core i5 machine. Finally, it is worth emphasising the important role the existence of a comprehensive and high quality database has played in this work.