Benchmarking machine-readable vectors of chemical reactions on computed activation barriers

In recent years, interest has surged in predicting computed activation barriers in order to accelerate the automated exploration of reaction networks. Consequently, various predictive approaches have emerged, ranging from graph-based models to methods based on the three-dimensional structures of reactants and products. In tandem, many representations have been developed to predict experimental targets, which may hold promise for barrier prediction as well. Here, we bring together all of these efforts and benchmark various methods (Morgan fingerprints, the DRFP, the CGR-representation-based Chemprop, SLATMd, B2Rl2, EquiReact, and the BERT + RXNFP language model) for the prediction of computed activation barriers on three diverse datasets.
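Several of the benchmarked representations (the DRFP in particular) encode a reaction as a hashed binary vector built from the symmetric difference of local substructures between reactants and products. The following is a minimal toy sketch of that idea; the real DRFP extracts circular substructures from the molecular graphs, whereas here simple SMILES n-grams stand in for substructures, and the reaction strings are purely illustrative.

```python
import hashlib

def toy_difference_fingerprint(reactant_smiles, product_smiles, n_bits=2048, ngram=3):
    """Toy difference fingerprint in the spirit of the DRFP: hash the
    symmetric difference of local patterns between reactant and product
    (SMILES n-grams as a stand-in for circular substructures) into a
    fixed-length binary vector."""
    def patterns(smiles):
        return {smiles[i:i + ngram] for i in range(len(smiles) - ngram + 1)}

    diff = patterns(reactant_smiles) ^ patterns(product_smiles)
    bits = [0] * n_bits
    for p in diff:
        # Stable hash of each pattern onto a bit position.
        h = int(hashlib.sha256(p.encode()).hexdigest(), 16) % n_bits
        bits[h] = 1
    return bits

# Illustrative (hypothetical) propargylation-like reaction strings.
fp = toy_difference_fingerprint("CC=O.C#CC[Si](C)(C)C", "CC(O)CC#C")
```

Because only the symmetric difference is hashed, substructures far from the reactive center cancel out, which is what makes difference fingerprints compact descriptors of the chemical change itself.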

Note that in earlier works, S1,S2 only Gaussian kernels were considered for these representations. The inclusion of Laplacian kernels in the hyperparameter optimization improved the accuracies of the ML models for most of the datasets studied here.
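The two kernels differ only in the distance metric and decay: Gaussian uses the squared Euclidean distance, Laplacian the Manhattan distance. A minimal NumPy sketch of kernel ridge regression with both choices (the data here is synthetic; the actual models use the chemical representations and regularization settings described in the text):

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||_2^2 / (2 sigma^2))
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def laplacian_kernel(X, Y, sigma):
    # k(x, y) = exp(-||x - y||_1 / sigma)
    d1 = np.abs(X[:, None, :] - Y[None, :, :]).sum(-1)
    return np.exp(-d1 / sigma)

def krr_fit_predict(X_train, y_train, X_test, kernel, sigma, lam=1e-8):
    # Closed-form kernel ridge regression: alpha = (K + lam*I)^-1 y
    K = kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    return kernel(X_test, X_train, sigma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # synthetic "representations"
y = X.sum(axis=1)                     # synthetic "barriers"
pred_g = krr_fit_predict(X, y, X, gaussian_kernel, sigma=2.0)
pred_l = krr_fit_predict(X, y, X, laplacian_kernel, sigma=2.0)
```

In a hyperparameter search, the kernel choice simply becomes one more categorical dimension alongside sigma and the regularization strength.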

S1.2 Random Forest models
The best hyperparameters are found via a Bayesian search over the parameter space detailed in Table S2.
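The search iterates over sampled hyperparameter sets and keeps the one with the lowest cross-validated error. As a library-free illustration of the mechanics, here is a random-search stand-in (the actual work uses Bayesian optimization, and the ranges and objective below are hypothetical placeholders, not those of Table S2):

```python
import random

# Hypothetical RF search space illustrating the structure of such a table.
SEARCH_SPACE = {
    "n_estimators": (50, 500),         # integer range
    "max_depth": (2, 30),              # integer range
    "max_features": ["sqrt", "log2"],  # categorical choice
}

def sample_params(space, rng):
    params = {}
    for name, choices in space.items():
        if isinstance(choices, tuple):
            params[name] = rng.randint(*choices)   # inclusive integer range
        else:
            params[name] = rng.choice(choices)
    return params

def random_search(objective, space, n_trials=25, seed=0):
    rng = random.Random(seed)
    trials = [(objective(p), p)
              for p in (sample_params(space, rng) for _ in range(n_trials))]
    return min(trials, key=lambda t: t[0])  # lowest score wins

# Dummy objective standing in for a cross-validated MAE.
best_score, best_params = random_search(
    lambda p: abs(p["n_estimators"] - 300) + p["max_depth"], SEARCH_SPACE)
```

A Bayesian optimizer replaces the uniform sampling with a surrogate model that proposes promising regions, but the loop structure (sample, score, keep the best) is the same.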
The best parameters found for the DRFP and MFP models are given in Table S3.

S1.3 Chemprop
The hyperparameter space to be searched is as implemented in chemprop S3 version 1.6.1, and summarized again in Table S4.
The best parameters resulting from the search are summarized in Table S5.

Table S5: Best parameters resulting from the hyperparameter search for each dataset for the Chemprop model.

S1.5 EquiReact
The optimal parameters for the EquiReact models are taken from Ref. S4.

S2 Data augmentation for language models
To verify whether the inclusion of data augmentation was beneficial, models were tested with 10 SMILES randomizations (rand) and with none; no intermediate numbers of randomizations were tested. The optimal set of hyperparameters listed in Table S6 was used, and models were trained with a batch size of 32. The resulting MAEs are summarized in Table S8. Since the models showed either improvement or no change with data augmentation, we used 10x data augmentation for the results in the main text.
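SMILES randomization exploits the fact that the same molecular graph admits many valid SMILES strings, one per choice of starting atom and traversal order. In practice this is done with RDKit (`Chem.MolToSmiles(mol, doRandom=True)`); the toy sketch below reimplements the idea for acyclic, single-bonded molecules only, purely to make the mechanism concrete:

```python
import random

def randomized_smiles(atoms, bonds, seed=None):
    """Toy SMILES randomization for acyclic, single-bonded molecules:
    emit a depth-first SMILES starting from a random atom, with a random
    neighbor order. Real augmentation uses RDKit's doRandom=True."""
    rng = random.Random(seed)
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    def dfs(i, parent):
        out = atoms[i]
        neigh = [j for j in adj[i] if j != parent]
        rng.shuffle(neigh)
        for k, j in enumerate(neigh):
            sub = dfs(j, i)
            # All branches but the last go in parentheses.
            out += sub if k == len(neigh) - 1 else "(" + sub + ")"
        return out

    return dfs(rng.randrange(len(atoms)), None)

# Ethanol as a 3-atom chain C-C-O: the four possible outputs are
# CCO, OCC, C(C)O and C(O)C.
variants = {randomized_smiles(["C", "C", "O"], [(0, 1), (1, 2)], seed=s)
            for s in range(20)}
```

Feeding several such strings per reaction to a language model multiplies the effective training set without adding new chemistry, which is why the augmented runs can only match or improve the unaugmented ones here.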

S3 RXNMapper confidence
RXNMapper reported the average confidences listed in Table S9 for each of the three datasets. The confidence is especially low for the Proparg-21-TS dataset, owing to how foreign its chemistry is to the data on which RXNMapper was pre-trained.

S4 SMILES for Proparg-21-TS

S4.1 Failed conversion
The Proparg-21-TS dataset S5 contains 754 structures of intermediates before (2) and after (3) the rate-limiting stereocontrolling transition state of the catalytic benzaldehyde propargylation reaction (Fig. S1). For one of the entries in the dataset S5,S6 (labelled as 3jbp3R), the 2 structure corresponds to a non-covalent complex between 1 and benzaldehyde. xyz2mol from cell2mol failed to produce a disconnected molecular graph, thus we excluded this entry from training.

S4.2 Comparison of xyz2mol, fragment-based and stereochemistry-enriched SMILES
xyz2mol from cell2mol correctly determined atom connectivity from the xyz coordinates but failed to assign bond types and atom charges. For example, for 2 (Fig. S2a) in entry 1abp1R, S5 the resulting SMILES string corresponds to the unreasonable structural formula shown in Fig. S2b. To address this issue, we built an alternative set of SMILES strings using dataset-specific knowledge. We will refer to these as "fragment-based" SMILES. They are constructed as follows.
Different entries of the dataset vary by: a) the substituents on the bipyridine N,N′-dioxide catalyst; b) the ligand arrangement around the Si center; c) the conformation of the coordinated benzaldehyde, leading to different enantiomers. Since the catalysts were assembled from a library of fragments S5,S6 and the core structures of 2 and 3 persist across the dataset, the SMILES can be constructed using simple combinatorial rules. The resulting SMILES string for 2 of entry 1abp1R is constructed accordingly; the stereochemistry of the phenyl propargyl ketone product in 3 is indicated with @ and @@ tags. This resulted in a set of injective SMILES for the Proparg-21-TS dataset. We note that the procedure we used to atom-map the SMILES (graph matching with the atom-mapped graphs obtained from xyz) could have switched the two Cl atoms, possibly affecting the Chemprop results.

Performance of the 2D-based methods with the three types of SMILES strings (from xyz2mol, fragment-based, and stereochemistry-enriched) is compared in Table S10. While the SMILES quality improves from the xyz2mol to the combinatorial strings, only the MFP benefits slightly from the change; most methods are unaffected. Including stereochemistry information leads to a marginal improvement in most cases, notably for Chemprop, but actually degrades other models, including the DRFP. Unfortunately, the SMILES-based methods are not written to exploit stereochemistry information. The DRFP, for example, looks for circular substructures in reactants and products, and the presence of stereochemistry flags may confuse the notion of these substructures. These results point to weaknesses of current 2D-structure-based methods in handling datasets that vary in stereochemistry, even when the stereochemistry is explicitly encoded in the SMILES strings.
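The atom-map transfer by graph matching mentioned above amounts to finding a label- and bond-preserving mapping between two atom orderings of the same molecule. A brute-force toy version for tiny graphs (production code would use a proper isomorphism algorithm, e.g. VF2 as in networkx; the ethanol example and all names here are illustrative):

```python
from itertools import permutations

def match_graphs(labels_a, edges_a, labels_b, edges_b):
    """Brute-force atom matching for tiny molecular graphs: find a
    mapping from graph A onto graph B that preserves element labels and
    bonds. Returns {index_in_a: index_in_b} or None. Exponential cost,
    so only suitable as an illustration on very small graphs."""
    n = len(labels_a)
    if n != len(labels_b):
        return None
    target = {frozenset(e) for e in edges_b}
    for perm in permutations(range(n)):
        if any(la != labels_b[p] for la, p in zip(labels_a, perm)):
            continue  # element labels must agree under the mapping
        if {frozenset((perm[i], perm[j])) for i, j in edges_a} == target:
            return dict(enumerate(perm))
    return None

# Ethanol written with two different atom orders: C-C-O vs O-C-C.
mapping = match_graphs(["C", "C", "O"], [(0, 1), (1, 2)],
                       ["O", "C", "C"], [(0, 1), (1, 2)])
```

The Cl-swap ambiguity noted in the text is exactly the failure mode of such matching: when two atoms have identical labels and equivalent bonding environments, multiple valid mappings exist and the matcher picks one arbitrarily.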

Table S2 :
Search space for the Bayesian optimization of hyperparameters for RF models.

Table S7 :
Best model hyperparameters for EquiReact for the three datasets, evaluated on the first set of random splits, as reported in Ref. S4.

Table S10 :
Comparison of the 2D models' MAEs [kcal/mol] for the Proparg-21-TS dataset on different sets of SMILES. The BERT+RXNFP results are given for datasets without data augmentation, 10-fold cross-validated, run for 5 epochs. The hyperparameters are those for the xyz2mol SMILES strings.