Low-cost prediction of molecular and transition state partition functions via machine learning

We have generated an open-source dataset of over 30 000 gas-phase partition functions for organic chemistry species. With these data, a deep neural network estimator was trained to predict the partition functions of unknown gas-phase organic chemistry transition states. This estimator relies only on reactant and product geometries and partition functions. A second deep neural network was trained to predict the partition functions of chemical species from their geometry alone. Our models accurately predict the logarithm of the test-set partition functions with a maximum mean absolute error of 2.7%. This approach therefore provides a means to reduce the cost of computing reaction rate constants ab initio. The models were also used to compute transition state theory reaction rate constant prefactors, and the results were in quantitative agreement with the corresponding ab initio calculations, with an accuracy of 98.3% on the log scale.


Dataset
In Figure S1 we show the activation energies of the reactions in the dataset used to generate the partition function dataset. In Table S1 we show the partition function equations employed to compute $q(T)$. The rotational partition function was evaluated with the rigid-rotor expression

$q_\mathrm{rot} = \dfrac{8\pi^{2} I k_{B} T}{\sigma h^{2}}$,

where $I$ is the moment of inertia and $\sigma$ is the symmetry number, and the vibrational partition function with the harmonic-oscillator expression

$q_\mathrm{vib} = \prod_{i} \left(1 - e^{-h\nu_{i}/k_{B}T}\right)^{-1}$.

The electronic partition function was approximated as equal to the electronic ground-state degeneracy, $g_{\mathrm{el},0}$.

Search for optimal input features
For the Qest model, we screened a series of geometry-based input features, which included Autocorrelation (AC) [1], EncodedBonds (EB) [2], and Coulomb Matrix (CM) [3], as well as Smooth Overlap of Atomic Positions (SOAP) [4] and Many-Body Tensor Representation (MBTR) [5]. To create these, the MolML [2] and DScribe [6] software packages were used (see Section 2.2). Graph-based featurization techniques were not considered here and are planned for future work. For each featurization we tested feature and target standardization, considering both min-max scaling and Gaussian normalization. The model hyperparameters used during this screening are shown in Table S2.
Overall, we searched over 45 possible combinations of input feature, feature standardization, and target standardization, and found that EncodedBonds with feature min-max scaling and target normalization was optimal. The average performance of the different featurizers is shown in Figure S2. We repeated the standardization screening for QesTS using EncodedBonds, which identified no standardization as optimal.
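As an illustration, a screening loop of this kind can be organized as below; `featurize` and `cross_validate_model` are hypothetical placeholders for the actual featurization and cross-validated evaluation pipeline, and the 5 featurizers × 3 feature scalings × 3 target scalings yield the 45 combinations.

```python
from itertools import product
from sklearn.preprocessing import MinMaxScaler, StandardScaler

featurizers = ["AC", "EB", "CM", "SOAP", "MBTR"]
scalings = [None, MinMaxScaler, StandardScaler]  # none, min-max, Gaussian

results = {}
for name, x_scaler, y_scaler in product(featurizers, scalings, scalings):
    X, y = featurize(name)  # hypothetical: build features and targets
    if x_scaler is not None:
        X = x_scaler().fit_transform(X)
    if y_scaler is not None:
        y = y_scaler().fit_transform(y.reshape(-1, 1)).ravel()
    # hypothetical: cross-validated error of the DNN for this combination
    results[(name, x_scaler, y_scaler)] = cross_validate_model(X, y)

best = min(results, key=results.get)
```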

Figure S2. Average performance of the different featurizers. Models using EncodedBonds performed best on average, and the best-performing combination used EncodedBonds with feature min-max scaling and target normalization.

Featurization parameters
The parameters used to produce the input features are described in subsections 2.2.1 and 2.2.2. Smooth Overlap of Atomic Positions (SOAP) and Many-Body Tensor Representation (MBTR) input features were produced with the DScribe software, while Autocorrelation, EncodedBonds, and Coulomb Matrix features were produced with the MolML software.

DScribe Software
The DScribe software package [6] was used to generate SOAP and MBTR input features. The parameters used are listed below.

SOAP - Smooth Overlap of Atomic Positions
The parameters used to produce the SOAP features are listed in Python dictionary format. For a description of these parameters, we refer the reader to Ref. [6].
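For reference, a SOAP featurizer can be constructed with DScribe as sketched below. This assumes a recent DScribe release (2.x argument names), and the species and parameter values shown are illustrative placeholders rather than the values used in this work.

```python
from ase.io import read
from dscribe.descriptors import SOAP

# Illustrative SOAP configuration (placeholder values, DScribe 2.x names).
soap = SOAP(
    species=["H", "C", "N", "O"],  # element types covered by the featurizer
    r_cut=5.0,                     # radial cutoff (angstrom)
    n_max=8,                       # number of radial basis functions
    l_max=6,                       # maximum spherical-harmonics degree
    average="inner",               # average over atoms -> fixed-size vector
)

atoms = read("molecule.xyz")       # hypothetical input geometry
features = soap.create(atoms)      # one feature vector per molecule
```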

MBTR - Many-Body Tensor Representation
The parameters used to produce the MBTR features are listed in Python dictionary format. We refer the reader to Ref. [6] for a description of these parameters.
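Analogously, a two-body MBTR descriptor can be set up as follows; again this is a sketch with placeholder values using the DScribe 2.x interface, not the parameters actually employed in this work.

```python
from ase.io import read
from dscribe.descriptors import MBTR

# Illustrative two-body MBTR configuration (placeholder values).
mbtr = MBTR(
    species=["H", "C", "N", "O"],
    geometry={"function": "inverse_distance"},
    grid={"min": 0.0, "max": 1.0, "n": 100, "sigma": 0.02},
    weighting={"function": "exp", "scale": 0.5, "threshold": 1e-3},
    periodic=False,
    normalization="l2",
)

atoms = read("molecule.xyz")   # hypothetical input geometry
features = mbtr.create(atoms)  # fixed-length feature vector
```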

MolML Software
MolML [2] was used to define the EncodedBonds, Coulomb Matrix, and Autocorrelation features.
MolML determines parameters such as the element types present and the maximum number of atoms during featurization. Because the entire dataset could not be loaded into RAM at once, it was scanned for a set of three structures that together contained the full array of chemical elements in the dataset as well as the largest molecular system. The featurizers were fit to this minimal set and then used to transform the entire dataset in chunks, as sketched below. The minimal subset of systems is given below in xyz format and angstrom units:
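A sketch of this fit-once, transform-in-chunks pattern is given here; `load_xyz`, `minimal_subset_paths`, `all_paths`, and `chunked` are hypothetical helpers, and MolML represents each molecule as an (elements, coordinates) tuple.

```python
from molml.features import EncodedBond

feat = EncodedBond()

# Fit on the minimal subset covering every element type and the largest
# system; this fixes the element pairs and the feature-vector length.
minimal_set = [load_xyz(path) for path in minimal_subset_paths]
feat.fit(minimal_set)

# Transform the full dataset chunk by chunk to stay within RAM limits.
features = []
for paths in chunked(all_paths, size=1000):
    features.extend(feat.transform([load_xyz(p) for p in paths]))
```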

Optimization of hyperparameters
After screening input features and standardization (Section S2), we carried out a hyperparameter optimization. The search space is shown in Table S3.
The Tree-structured Parzen Estimator sampler [7,8], together with the median pruning algorithm, was used within Optuna [9] to identify the optimal hyperparameters (see Table S4), with a 5-trial startup for both models. This corresponds to 1258 and 7685 completed trials for the Qest and QesTS models, respectively.
The number of epochs over which to train each model with its optimal hyperparameters was determined using early stopping, averaged over the 5 folds. From this we found that 39 and 27 epochs were the optimal stopping points for the Qest and QesTS models, respectively.
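The corresponding Optuna setup can be sketched as follows; the objective body and search ranges are placeholders (the actual search space is given in Table S3), and `train_one_epoch` is a hypothetical helper.

```python
import optuna

def objective(trial):
    # Placeholder search space; the real one is listed in Table S3.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    n_units = trial.suggest_int("n_units", 64, 1024)
    val_loss = float("inf")
    for epoch in range(100):
        val_loss = train_one_epoch(lr, n_units)  # hypothetical helper
        trial.report(val_loss, epoch)            # feed the median pruner
        if trial.should_prune():
            raise optuna.TrialPruned()
    return val_loss

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=5),  # 5-trial startup
    pruner=optuna.pruners.MedianPruner(),
)
study.optimize(objective, n_trials=1000)  # placeholder trial budget
```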

Model Performance
The distributions of error on the test sets for the Qest and QesTS models are shown below.

Outliers
To limit extrapolation in the final model, outliers were dropped from the overall dataset as follows.
We identified reactions containing any example with 12 or more feature values lying 6 or more standard deviations from that feature's mean; this resulted in the removal of 1,108 reactions.
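A minimal sketch of this filter, assuming a feature matrix `X` of shape (n_examples, n_features) and an array `reaction_ids` mapping each example to its reaction:

```python
import numpy as np

z = np.abs((X - X.mean(axis=0)) / X.std(axis=0))  # per-feature z-scores
extreme_counts = (z >= 6.0).sum(axis=1)           # features >= 6 sigma out
bad_examples = extreme_counts >= 12               # 12 or more such features
bad_reactions = np.unique(np.asarray(reaction_ids)[bad_examples])

keep = ~np.isin(reaction_ids, bad_reactions)      # drop whole reactions
X_clean, ids_clean = X[keep], np.asarray(reaction_ids)[keep]
```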