MolE8: finding DFT potential energy surface minima values from force-field optimised organic molecules with new machine learning representations

The use of machine learning techniques in computational chemistry has gained significant momentum since large molecular databases are now readily available. Predictions of molecular properties using machine learning have advantages over traditional quantum mechanical calculations because they can be computationally cheaper without losing accuracy. We present a new extrapolatable and explainable molecular representation based on bonds, angles and dihedrals that can be used to train machine learning models. The trained models can accurately predict the electronic energy and the free energy of small organic molecules containing the atom types C, H, N and O, with a mean absolute error of 1.2 kcal mol−1. The models can be extrapolated to larger organic molecules, with an average error of less than 3.7 kcal mol−1 for molecules with 10 or fewer heavy atoms, which represents a chemical space two orders of magnitude larger. Rapid energy predictions for multiple molecules, up to 7 times faster than previous ML models of similar accuracy, are achieved by sampling geometries around the potential energy surface minima. The input geometries therefore do not have to be located precisely at the minima, and we show that accurate density functional theory energy predictions can be made from force-field optimised geometries with a mean absolute error of 2.5 kcal mol−1.

During our initial investigation, we found that the neural network models performed poorly on molecules containing trans-epoxide motifs. We optimised the two structures shown in Figure S2 to compare the cis and trans epoxide structures. The trans-epoxide is substantially less stable than the cis-epoxide, by 82.9 kcal mol−1. We therefore felt it was reasonable to remove the trans-epoxide structures from the database.
The trans-epoxide motif is characterised by a large angle around the carbon atom. All the angles around the two carbon atoms forming the epoxide motif in the two structures above are shown in Table S1. The angle highlighted in red is significantly larger in the trans motif than in the cis motif. We therefore reasoned that other highly strained molecules could be identified by examining the angles around carbon atoms. The critical large-angle cutoffs were chosen by investigating carbon motifs of different connectivity, Figure S3. For three-connectivity carbon atoms, we remove the molecule if [carbon angle − 120°] is larger than 35° or if [360° − sum of the three carbon angles] is larger than 20°. For four-connectivity carbon atoms, we found that an angle deviation larger than 29° is a reasonable cutoff point.

Figure S3. Different structural motifs for three-connectivity and four-connectivity molecules with large carbon angles.

Overall, molecules satisfying any of the following conditions were removed from the database (a minimal code sketch of these checks is given below):
• The molecule has a four-connectivity carbon with its largest angle deviation [angle − 109.5°] greater than 29°
• The molecule has a three-connectivity carbon with [360° − sum of the three angles] larger than 20°
• The molecule has a three-connectivity carbon with its largest angle deviation [angle − 120°] greater than 35°
• The molecule has a two-connectivity carbon with [180° − angle] greater than 20°

Figure S4. Energies (kcal mol−1) of molecular formulas with overlapping energy ranges (e.g. C8N0O0H4).

Figure S5. Box-plot diagram of the energy distributions of all molecular formulas whose maximum-minus-minimum energy ranges overlap. Plot g) shows the distribution of all the overlapping molecular formulas, and plots a) to f) show expanded, narrower energy ranges for clarity. The energy unit is kcal mol−1.

All neural networks utilised have two hidden layers, each with the same number of nodes as the input layer (761 features). The two hidden layers have ReLU activation functions and the output layer has a linear activation function. The first hidden layer weights are initialised from a random uniform distribution in the range −500 to 100, and the second hidden layer and output layer weights are initialised from a random uniform distribution in the range 0 to 0.1. The first layer biases are initialised from a random uniform distribution in the range 0 to 10, and the second hidden layer and output layer biases are initialised from a random uniform distribution in the range 0 to 0.01. All neural network models are trained using the mean squared error loss function. The weights are trained with a batch size of 64 via the Adam optimiser.52 The Adam parameters are set to the Keras defaults: β1 = 0.9, β2 = 0.999 and ε = 1.0 × 10−7. L2 regularisation is implemented to avoid overfitting; the weight and bias regularisation factors are set to 0.1 in all the hidden layers and the output layer.

The learning rate is an important parameter in ML model training. If the learning rate is too large, the error may diverge during training; if it is too small, the training time becomes too lengthy. The optimal learning rate may also depend on the number of input features. We therefore train the models with five different learning rates for each representation we test. The training epochs are set to 7000 for all the neural network methods; 7000 epochs is long enough for the training set and test set errors to converge (ESI, Figure S6).
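As referenced above, the removal criteria can be expressed as a short screening function. The sketch below is illustrative only: carbon_angles is a hypothetical helper returning, for each carbon atom, its connectivity and the list of bond angles (in degrees) centred on it, and the 109.5° reference for four-connectivity carbons is our reading of the cutoff, by analogy with the 120° reference used for three-connectivity carbons.

```python
def is_strained(mol) -> bool:
    """Return True if the molecule matches any of the removal criteria above."""
    # carbon_angles(mol) is a hypothetical helper: it yields
    # (connectivity, [angles in degrees]) for every carbon atom in mol.
    for connectivity, angles in carbon_angles(mol):
        if connectivity == 4 and max(angles) - 109.5 > 29.0:  # 109.5 deg reference is assumed
            return True
        if connectivity == 3 and 360.0 - sum(angles) > 20.0:  # the three angles around an sp2 carbon
            return True
        if connectivity == 3 and max(angles) - 120.0 > 35.0:
            return True
        if connectivity == 2 and 180.0 - angles[0] > 20.0:  # single angle around an sp carbon
            return True
    return False

# database = [mol for mol in database if not is_strained(mol)]
```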
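For concreteness, the network and training setup just described correspond, in Keras terms, to a model along the following lines. This is a minimal sketch assuming the 761-feature input; the learning rate shown is a placeholder, since five values are screened per representation (Table S3).

```python
from tensorflow import keras
from tensorflow.keras import layers, initializers, regularizers

n_features = 761  # number of input features for this representation

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    # First hidden layer: weights U(-500, 100), biases U(0, 10)
    layers.Dense(n_features, activation="relu",
                 kernel_initializer=initializers.RandomUniform(-500.0, 100.0),
                 bias_initializer=initializers.RandomUniform(0.0, 10.0),
                 kernel_regularizer=regularizers.l2(0.1),
                 bias_regularizer=regularizers.l2(0.1)),
    # Second hidden layer: weights U(0, 0.1), biases U(0, 0.01)
    layers.Dense(n_features, activation="relu",
                 kernel_initializer=initializers.RandomUniform(0.0, 0.1),
                 bias_initializer=initializers.RandomUniform(0.0, 0.01),
                 kernel_regularizer=regularizers.l2(0.1),
                 bias_regularizer=regularizers.l2(0.1)),
    # Linear output layer (same initialisation ranges as the second hidden layer)
    layers.Dense(1, activation="linear",
                 kernel_initializer=initializers.RandomUniform(0.0, 0.1),
                 bias_initializer=initializers.RandomUniform(0.0, 0.01),
                 kernel_regularizer=regularizers.l2(0.1),
                 bias_regularizer=regularizers.l2(0.1)),
])

# Adam with Keras defaults (beta_1 = 0.9, beta_2 = 0.999, epsilon = 1e-7);
# the learning rate below is a placeholder, varied per representation.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-6), loss="mse")
# model.fit(X_train, y_train, batch_size=64, epochs=7000)
```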
Table S3. Full list of the different representations and learning rates tested. In Trial A1, both the angle and dihedral connectivity information are included; in Trial A2, only the angle connectivity information is included; and in Trial A3, no connectivity information is included. All errors are stated in kcal mol−1.


Testing different feature generation for NN models
Our initial investigation focused on finding the optimal number of features for our proposed molecular representation. The number of features can be varied by choosing different bandwidths for the bond length, angle and dihedral KDE plots. As shown in Figure S7, larger bandwidths make the KDE plots smoother, and the smaller peaks in the CN bond length distribution are then not detected as maxima. We therefore explored different bond, angle and dihedral bandwidth combinations, as discussed later. We also varied the number of features by subdividing the angle and dihedral features according to the number of connections the carbon atom has. For the angle features, if the central atom of the three atoms forming the angle is carbon, the connectivity information is recorded. The KDE is then plotted separately for two-, three- and four-connectivity carbons. For example, in 4-methoxybutanal, the CCO angles in the carbonyl and the ether would be assigned to the same CCO group for the KDE plot without the connectivity information, but to different groups (CC4O and CC3O) with the connectivity information, Figure S7. For the dihedral features, if the central two atoms of the four atoms forming the dihedral are carbon, the connectivity information is recorded for two-connectivity carbons. The KDE is then again plotted separately for the two-connectivity dihedrals and all other dihedrals. For example, in 4-methoxy-2-hexyne, the CCCC dihedrals across the triple bond and the single bond would be assigned to the same CCCC group for the KDE plot without the connectivity information, but to different groups (CC2C2C and CCCC) with the connectivity information.
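As an illustration of this feature-generation step, the sketch below builds a KDE over a set of bond lengths with a chosen bandwidth and locates the local maxima that become feature centres. The bond-length values and the bandwidth are placeholders, not data from the paper; numpy and scipy are assumed.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical example: C-N bond lengths (angstrom) pooled from a database
cn_bond_lengths = np.array([1.15, 1.16, 1.27, 1.28, 1.34, 1.35, 1.46, 1.47, 1.47])

# The bandwidth controls smoothing: if it is too large, nearby peaks merge
# and the smaller maxima are no longer detected (cf. Figure S7).
kde = gaussian_kde(cn_bond_lengths, bw_method=0.05)

grid = np.linspace(cn_bond_lengths.min() - 0.1, cn_bond_lengths.max() + 0.1, 2000)
density = kde(grid)

# Local maxima of the KDE define the centres of the bond-length features
is_max = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
feature_centres = grid[1:-1][is_max]
print(feature_centres)
```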
We have also trained KRR and NN models without the NHx group features and with additional OH group features. For the KRR method, the models trained with NHx, without NHx and with OH gave MAEs of 1.232, 1.232 and 1.224 kcal mol−1 respectively. For the NN method, the models trained with NHx, without NHx and with OH gave MAEs of 1.796, 1.802 and 2.163 kcal mol−1 respectively. We have therefore chosen to include the NHx group features, but not the OH group features, in the feature generation.

Figure S8. A variety of different representations are tested. In Trial A1, both the angle and dihedral connectivity information are included; in Trial A2, only the angle connectivity information is included; and in Trial A3, no connectivity information is included. The mean absolute errors (MAE) are stated in kcal mol−1. The best performing method is highlighted in yellow.

Figure S9. The energy prediction KRR model hyperparameter search. The best performing method is highlighted in yellow.

The architectures tested include one hidden layer with q nodes, two hidden layers with q-q nodes, three hidden layers with q-q-q/2 nodes and three hidden layers with q-q-q nodes, where q is the number of features in the input layer. The molecules are allocated to the training and test sets using the molecular formula grouping described in section 2.1. The best performing model has two hidden layers, the same architecture as in section 2.1. Other NN hyperparameters that can potentially affect the performance include the degree of regularisation and the optimiser. We further explored different L2 regularisation factors, ranging from 0.05 to 0.20, and tested the stochastic gradient descent (SGD) and root mean squared propagation (RMSprop) optimisers from Keras for each regularisation factor, ESI Table S6. The optimal regularisation factor is 0.10, and the SGD and RMSprop optimisers do not perform better than the Adam optimiser (section 2.1).

For the Coulomb matrix (CM) representation, the number of input features has been set to 351. The NN is set up with the same ReLU activation function for the hidden layers and a linear activation function for the output layer. The weights and biases are also initialised using random uniform distributions. We initially tested one, two and three hidden layer NNs with different learning rates, Table S10. The convergence of the training set and test set errors justified the 7000 epochs used, Figure S10. A further hyperparameter search was then performed on the lowest MAE model, Table S11, in which different optimisers and L2 regularisation factors were tested. Once the optimal hyperparameters had been identified, we trained the NN using the CM representation on the 190, 570, 1900 and 5700 molecule training sets. A variety of different learning rates were evaluated, as shown in Table S12.
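To make the CM input size concrete: 351 features corresponds to the upper triangle (including the diagonal) of a Coulomb matrix padded to 26 atoms, since 26 × 27 / 2 = 351. The sketch below shows one common way such a representation is built; the row-norm sorting step is a standard convention for this representation rather than a detail taken from the text, and the 26-atom padding is our inference from the feature count.

```python
import numpy as np

MAX_ATOMS = 26  # 26 * 27 / 2 = 351 upper-triangle entries, matching the input size

def coulomb_matrix_features(charges, coords):
    """Flattened upper triangle of a row-norm-sorted, zero-padded Coulomb matrix.

    charges: (n,) nuclear charges; coords: (n, 3) Cartesian coordinates (angstrom).
    """
    n = len(charges)
    m = np.zeros((MAX_ATOMS, MAX_ATOMS))
    for i in range(n):
        for j in range(n):
            if i == j:
                m[i, j] = 0.5 * charges[i] ** 2.4
            else:
                m[i, j] = charges[i] * charges[j] / np.linalg.norm(coords[i] - coords[j])
    # Sort rows/columns by row norm so the representation is invariant to atom ordering
    order = np.argsort(-np.linalg.norm(m, axis=1))
    m = m[order][:, order]
    return m[np.triu_indices(MAX_ATOMS)]  # 351-dimensional feature vector
```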

ML Models and Hyperparameter Search
For the BOB and SLATM representations, we set up the NN with ReLU activation functions for the hidden layers and a linear activation function for the output layer. The mean squared error was used as the loss function with the Adam optimiser. The NN had two hidden layers, each with 351 nodes and an L2 regularisation factor of 0.1. We then trained the NN using the BOB and SLATM representations on the 190, 570, 1900 and 5700 molecule training sets. We again tested a variety of different learning rates for each training set, Table S12. The key results are summarised in Table S13.

Figure S10. The error convergence for the CM representation and NN training. The method used two hidden layers, a learning rate of 5 × 10−8, the Adam optimiser and an L2 regularisation factor of 0.10.
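The learning-rate screens reported in Table S12 amount to retraining the same architecture while varying a single hyperparameter. A minimal sketch of such a sweep, assuming a build_model helper that constructs the two-hidden-layer network described earlier and pre-built X_train/y_train/X_test/y_test arrays (the learning-rate values shown are placeholders, not those of Table S12):

```python
import numpy as np
from tensorflow import keras

learning_rates = [1e-5, 1e-6, 1e-7, 5e-8, 1e-8]  # placeholder values
results = {}

for lr in learning_rates:
    model = build_model(n_features=351)  # assumed helper: 2 hidden layers, L2 = 0.1
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr), loss="mse")
    model.fit(X_train, y_train, batch_size=64, epochs=7000, verbose=0)
    pred = model.predict(X_test).ravel()
    results[lr] = np.abs(pred - y_test).mean()  # test-set MAE

best_lr = min(results, key=results.get)
```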

Further remarks on MLR methods
MLR is the simplest machine learning approach, and we can use it to verify that our feature generation method is going in the right direction. In the main text, we mentioned that MLR gave an MAE of 25.7 kcal mol−1. If the predicted energies were completely random, the MAE would be significantly larger than 25.7 kcal mol−1. To demonstrate this, we wrote code that predicts the energy of each molecule in the test set by generating a random number between the highest and the lowest values in the y-vector; the MAE then increases to 75360.2 kcal mol−1. We also performed MLR on the Coulomb matrix (CM) representation, and the resulting MAE is 68.05 kcal mol−1. These results support our argument that an MLR model trained with well-constructed features performs far better than random prediction.

Figure S13. SchNet, PhysNet and MolE8 error convergence comparison.
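A minimal sketch of the two baselines, assuming feature matrices X_train/X_test and energy vectors y_train/y_test are already built (scikit-learn and numpy assumed; whether the random draws span the training or full y-vector is not specified in the text, so the training extremes are used here):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# MLR baseline: ordinary least squares on the molecular features
mlr = LinearRegression().fit(X_train, y_train)
mae_mlr = mean_absolute_error(y_test, mlr.predict(X_test))

# Random baseline: uniform draws between the extremes of the training energies
rng = np.random.default_rng(0)
y_random = rng.uniform(y_train.min(), y_train.max(), size=len(y_test))
mae_random = mean_absolute_error(y_test, y_random)

print(f"MLR MAE: {mae_mlr:.2f} kcal/mol, random MAE: {mae_random:.2f} kcal/mol")
```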