Predicting second virial coefficients of organic and inorganic compounds using Gaussian process regression
We show that the temperature-dependent second virial coefficient of organic and inorganic compounds can be predicted with Gaussian process regression using intuitive and accessible molecular features. In particular, we built a low-dimensional feature representation based on intrinsic molecular properties, topology, and physical properties relevant to characterizing molecule-molecule interactions. With this featurization, the model predicts second virial coefficients in the interpolative regime with a relative error ≲1% and extrapolates to temperatures outside the training range of each compound in the dataset with a relative error of 2.1%. The model's predictive ability also extends to organic molecules unseen during training, with a relative error of 2.7%, provided that the test molecules are well represented in the training set by instances of their chemical families, which are themselves highly varied. The method generally outperforms several semi-empirical procedures used to predict second virial coefficients. The present Gaussian process regression model is therefore both robust and extensible to a variety of organic and inorganic compounds.
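As a rough illustration of the regression technique named above (not the authors' model, features, or data), the following sketch fits a zero-mean Gaussian process with a squared-exponential kernel to synthetic B(T) values generated from the van der Waals expression B(T) = b - a/(RT). The constants a and b are roughly argon-like, and the class name TinyGP, the kernel length scale, the noise level, and the use of temperature as the only feature are all assumptions made for this example.

```python
import math

R = 0.083145  # gas constant in L bar / (mol K)

def b_vdw(T, a=1.355, b=0.0320):
    """Synthetic target: van der Waals second virial coefficient in L/mol."""
    return b - a / (R * T)

def rbf(x1, x2, length):
    """Squared-exponential kernel with unit signal variance."""
    return math.exp(-0.5 * ((x1 - x2) / length) ** 2)

def cholesky(A):
    """Lower-triangular factor L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def chol_solve(L, rhs):
    """Solve (L L^T) x = rhs by forward and backward substitution."""
    n = len(rhs)
    y = [0.0] * n
    for i in range(n):
        y[i] = (rhs[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

class TinyGP:
    """Hypothetical minimal GP regressor for one scalar feature."""

    def __init__(self, length=1.0, noise=1e-6):
        self.length = length  # kernel length scale (assumed, not tuned)
        self.noise = noise    # noise variance added to the kernel diagonal

    def fit(self, xs, ys):
        self.xs = list(xs)
        # Center and scale the targets so a zero-mean, unit-variance prior fits.
        self.mean = sum(ys) / len(ys)
        self.scale = max(abs(y - self.mean) for y in ys)
        yn = [(y - self.mean) / self.scale for y in ys]
        K = [[rbf(a, b, self.length) + (self.noise if i == j else 0.0)
              for j, b in enumerate(self.xs)] for i, a in enumerate(self.xs)]
        self.L = cholesky(K)
        self.alpha = chol_solve(self.L, yn)
        return self

    def predict(self, x):
        """Return the posterior mean and standard deviation at x, in target units."""
        k = [rbf(x, xi, self.length) for xi in self.xs]
        mu = sum(ki * ai for ki, ai in zip(k, self.alpha))
        v = chol_solve(self.L, k)
        var = max(0.0, 1.0 - sum(ki * vi for ki, vi in zip(k, v)))
        return self.mean + self.scale * mu, self.scale * math.sqrt(var)

# Train on 200..600 K in 50 K steps; the feature is T / 50 so the spacing is 1.
train_T = list(range(200, 601, 50))
train_B = [b_vdw(T) for T in train_T]
gp = TinyGP().fit([T / 50 for T in train_T], train_B)

# Query a temperature that lies between two training points.
mu, sd = gp.predict(375 / 50)
print(f"B(375 K): predicted {mu:.5f} +/- {sd:.5f} L/mol, reference {b_vdw(375):.5f} L/mol")
```

A practical model would replace the single temperature feature with a vector of molecular descriptors and would optimize the kernel hyperparameters rather than fixing them by hand.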