Angus Keto a, Taicheng Guo b, Nils Gönnheimer c, Xiangliang Zhang b, Elizabeth H. Krenske a and Olaf Wiest *c
aSchool of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
bDepartment of Computer Science and Engineering, University of Notre Dame, USA
cDepartment of Chemistry and Biochemistry, University of Notre Dame, USA. E-mail: owiest@nd.edu
First published on 28th March 2025
Practical applications of machine learning (ML) to new chemical domains are often hindered by data scarcity. Here we show how data gaps can be circumvented by means of transfer learning that leverages chemically relevant pre-training data. Case studies are presented in which the outcomes of two classes of pericyclic reactions are predicted: [3,3] rearrangements (Cope and Claisen rearrangements) and [4 + 2] cycloadditions (Diels–Alder reactions). Using the graph-based generative algorithm NERF, we evaluate the data efficiencies achieved with different starting models that we pre-trained on datasets of different sizes and chemical scope. We show that the greatest data efficiency is obtained when the pre-training is performed on smaller datasets of mechanistically related reactions (Diels–Alder, Cope and Claisen, Ene, and Nazarov) rather than on a >50× larger dataset of mechanistically unrelated reactions (USPTO-MIT). These small bespoke datasets were more efficient in both low re-training and low pre-training regimes, and are thus recommended alternatives to large diverse datasets for pre-training ML models.
Transfer learning20,21 involves retraining an existing ML model on a new domain of chemistry (Fig. 1A) and in many cases improves model accuracy while reducing training costs, because a new model does not have to be built from scratch. It requires two related datasets: a pre-training dataset used to train an initial model, and a second dataset used to re-train (fine-tune) the model on the target reactions or domain. In theory, the principles shared between these two domains can be learned during pre-training and leveraged during fine-tuning to produce a more effective model than training on the target data alone. However, the ideal relationship between the pre-training and re-training datasets in reaction prediction is not clear: should the emphasis be on the size of the pre-training dataset, on molecular structure, or on similarity of the reaction mechanisms? Chemical intuition would posit that the mechanism, specifically the electron flow, contains the most applicable information, but exploiting it requires a model that properly encodes this information. In contrast, the data-hungry nature of neural networks suggests that a significantly larger (by one or more orders of magnitude) and more diverse dataset would be more effective.
Fig. 1 (A) General transfer learning workflow. (B) Pericyclic reactions and their mechanistic similarities.
We investigate the following question: in situations where data are scarce, do models pre-trained on mechanistically related reactions require less data than models pre-trained on diverse reaction data? We address this question through studies of two target classes of pericyclic reactions: [3,3] rearrangements (Cope22,23 and Claisen24,25 rearrangements) and [4 + 2] cycloadditions (Diels–Alder26,27 reactions) (Fig. 1B). These reactions were chosen not only for their synthetic utility as atom-economical transformations,28–31 but crucially because they share a common mechanistic feature: the shuffling of electrons around a six-membered cyclic transition state. They are compared with, and pre-trained on, datasets of Ene reactions, which share the cyclic movement of six electrons, and Nazarov cyclizations, a 4-electron electrocyclic reaction. Our work examines whether ML models can recognize these shared mechanistic principles, in this case when predicting the major product of these reactions.
For transfer learning, we also generated pre-training datasets of different sizes and chemical scope: (1) 80% of ∼480000 diverse reactions from the USPTO-MIT database,8,32 (2) 80% of 9537 Diels–Alder reactions (DA1), (3) 40% of 9537 Diels–Alder reactions (DA2), (4) 80% of 3289 Cope and Claisen rearrangements, (5) 80% of 2322 Ene reactions, (6) 80% of 1029 Nazarov cyclizations with the reactant and product represented in their charge-neutral forms (Naz1), and (7) 80% of 1029 Nazarov cyclizations with the reactant and product represented in their protonated forms (Naz2). The Jupyter notebooks to regenerate these datasets using a Reaxys license are available as described in the ESI.†
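For reference, a split such as "80% of 3289 Cope and Claisen rearrangements" can be generated along the following lines. This is a minimal sketch, assuming the reactions are stored as reaction SMILES (one per line); it is not the Reaxys notebook provided in the ESI, and the file name is a placeholder.

```python
# Minimal sketch of constructing an 80/10/10 random split from a file of
# reaction SMILES (one reaction per line). File name is hypothetical.
import random

def split_dataset(path, train_frac=0.8, val_frac=0.1, seed=0):
    with open(path) as f:
        reactions = [line.strip() for line in f if line.strip()]
    random.Random(seed).shuffle(reactions)
    n_train = int(train_frac * len(reactions))
    n_val = int(val_frac * len(reactions))
    train = reactions[:n_train]                 # e.g. 80% for (pre-)training
    val = reactions[n_train:n_train + n_val]    # 10% for validation
    test = reactions[n_train + n_val:]          # remaining 10% for testing
    return train, val, test

# e.g. one random split of the Cope/Claisen reactions
train, val, test = split_dataset("cope_claisen_reactions.smi", seed=42)
```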
(1) The NERF (non-autoregressive electron redistribution framework) algorithm.33 NERF predicts the changes in the edges of a molecular graph (corresponding to the changes in bond order that define a chemical reaction; see the sketch after this list) using connectivity and nodes characterised by atom type, aromaticity, charge, and positional and segment embeddings. Its design principles33 and performance11 have been documented previously.
(2) Chemformer,17 a natural language processing (NLP) model built on the Bidirectional Auto-Regressive Transformers (BART)34 architecture.
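To make the learning target of a graph-editing model such as NERF concrete, the sketch below extracts the bond-order changes between an atom-mapped reactant and product with RDKit, using a Cope rearrangement of 1,5-hexadiene as the example. This is an illustration of the electron-redistribution idea, not NERF's implementation; the helper function name is ours.

```python
# For each mapped atom pair, return (product bond order) - (reactant bond order);
# only pairs whose bond order actually changes are kept. These edge changes are
# the kind of "electron redistribution" a graph-editing model is trained to output.
from rdkit import Chem

def bond_order_changes(rxn_smiles):
    reactant_smi, product_smi = rxn_smiles.split(">>")
    changes = {}
    for smi, sign in ((reactant_smi, -1), (product_smi, +1)):
        mol = Chem.MolFromSmiles(smi)
        for bond in mol.GetBonds():
            i = bond.GetBeginAtom().GetAtomMapNum()
            j = bond.GetEndAtom().GetAtomMapNum()
            key = tuple(sorted((i, j)))
            changes[key] = changes.get(key, 0.0) + sign * bond.GetBondTypeAsDouble()
    return {pair: delta for pair, delta in changes.items() if abs(delta) > 1e-6}

# Cope rearrangement of 1,5-hexadiene: the C3-C4 sigma bond breaks, a new
# C1-C6 sigma bond forms, and the two pi bonds shift.
rxn = "[CH2:1]=[CH:2][CH2:3][CH2:4][CH:5]=[CH2:6]>>[CH2:3]=[CH:2][CH2:1][CH2:6][CH:5]=[CH2:4]"
print(bond_order_changes(rxn))
```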
For each pre-training dataset, 10 separate pre-trained NERF models were created using 10 random splits. To reduce computational cost, only one USPTO-MIT model, trained on the split used by Jin et al.,32 was created. The most accurate model from each set of 10 was then fine-tuned on the CC training data. This fine-tuning was performed on 10 random splits at each of five different ratios of CC training data (between 10% and 85%), giving 50 transfer-learned models per pre-training dataset. Top-1 accuracy (i.e. accuracy according to the most confident prediction) was used.
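A common way to score Top-1 accuracy for product prediction is to compare the single most confident predicted product with the recorded product after SMILES canonicalization; the sketch below assumes this convention and may differ in detail from the exact evaluation script used here.

```python
# Hedged sketch of a Top-1 accuracy computation: a prediction counts as correct
# only if the model's most confident product matches the recorded product.
# Canonical-SMILES matching is an assumption about the comparison criterion.
from rdkit import Chem

def canonical(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def top1_accuracy(top1_predictions, true_products):
    hits = sum(canonical(p) is not None and canonical(p) == canonical(t)
               for p, t in zip(top1_predictions, true_products))
    return hits / len(true_products)

# toy example: first prediction is the same molecule written differently (correct),
# second prediction is a different molecule (incorrect)
print(top1_accuracy(["C1=CCCCC1", "CCO"], ["C1CCC=CC1", "CCC"]))  # 0.5
```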
The black line in Fig. 2A depicts the baseline situation in which no pre-training was undertaken before training the CC model. Without any pre-training, the NERF model achieves predictive accuracies of >90% only when 80% of the CC dataset (2795 reactions) is used for training. This shows that the CC dataset is sufficiently large to develop an effective NERF model (>90% accuracy) without pre-training, but only if a large fraction of the dataset is used for training.
Next, we investigated the effect of pre-training with different reactions. Fig. 2A shows the results obtained with Diels–Alder (orange), USPTO-MIT (red), Ene (purple), and Nazarov (green) reactions as the pre-training datasets. For the pre-trained models, any result above the baseline indicates that the pre-training step has enhanced the model's predictive accuracy. The most relevant data split is the lowest training regime, where only 10% (328 training reactions) of the CC dataset is used, as this most closely mirrors the low-data scenarios common in developing areas of chemistry. All six pre-training datasets prove beneficial here, but the greatest benefit came from pre-training with Diels–Alder data: DA1 and DA2 achieved accuracies of 76.0% and 73.1%, respectively, compared to the baseline of 62.7%. Pre-training on the USPTO-MIT dataset had a moderate benefit (68.9%), while pre-training on the Ene and Nazarov datasets was least beneficial (64.1–66.7%). These results illustrate the trade-off between using mechanistically similar pre-training datasets and using larger but more general ones. Even though the Diels–Alder pre-training datasets were 48 times smaller than the USPTO-MIT dataset, the mechanistically related Diels–Alder reactions were the more efficient pre-training sources. The difference in effectiveness between DA1 and DA2 (which is half the size of DA1) shows, in analogy to the results in Fig. S1† discussed above, that pre-training dataset size affects accuracy, in line with the observed lower performance of the smaller Nazarov datasets. The standard deviations (Table S1†) range between 0.6% and 2.3% for all approaches, indicating that the model performances and pre-training benefits are robust.
The benefit of pre-training drops off as more training data is introduced. When 85% of the CC dataset is used for training, the highest-performing pre-trained model (DA1) has a Top-1 accuracy of 92.3%, compared to the baseline of 90.7%; all other models are within 0.9% of the baseline. In high-training regimes, there may be fewer areas of chemical space that these pre-training datasets can help elucidate.
As with the Cope and Claisen predictions, the lowest data regime for Diels–Alder reactions is the most relevant and shows the greatest performance improvement from pre-training. All pre-training approaches were beneficial. Pre-training on USPTO-MIT gave the highest accuracy (82.9%) when 10% of the Diels–Alder dataset was used for training, while pre-training on Cope and Claisen rearrangements was next best, with an accuracy of 78.5%. The latter result is noteworthy given that the CC pre-training dataset is ∼145× smaller than the USPTO-MIT dataset. As the amount of training data increases, the effect of pre-training drops off noticeably: when 40% or more of the Diels–Alder reactions are used for training, only USPTO-MIT pre-training delivers a noticeable increase in accuracy relative to the baseline. Model performance is again robust, with standard deviations of 0.5–2.3% (Table S2†).
All pericyclic pre-training datasets are smaller (≤3289 reactions) than the Diels–Alder dataset used for the baseline model, and consequently better accuracy is obtained from pre-training with the large USPTO-MIT dataset even though it contains largely unrelated reactions. While the pericyclic pre-training reactions impart specific reactivity information, the USPTO-MIT dataset imparts a broad and general understanding of reactivity owing to its large size and the diversity of reactions it contains.
To understand which knowledge gaps in the training data the pre-training was helping to fill, the performance increase across different Diels–Alder sub-categories was investigated relative to a no-pre-training baseline (Fig. 3). For further comparison, the pre-training approaches were also compared with a non-pre-training approach that simply used selected additional training examples (reactions with 17 triazines and 32 oxazoles) reported previously.11 This illustrates not only which area of chemical space benefits the most, but also whether pre-training or manual data mining is more effective. The analysis was conducted on the 80:10:10 split for comparability. The pre-training and additional-training approaches all increased the overall Top-1 accuracies. Underrepresented sub-categories, including intramolecular, aromatic, and hetero-Diels–Alder reaction centres, showed the largest improvements. Pre-training with the USPTO-MIT dataset is the most beneficial approach, but focused datasets (here, Cope and Claisen pre-training) can be effective at a small fraction of the size of a generalized dataset. This assumes that the needed mechanistic information is contained in the small dataset and that the model is capable of using it. The alternative approach of adding key training examples shows that extracting them from the literature, or even carrying out the corresponding experiments, can also be effective; however, this may not be feasible for some chemistries. Overall, the best approach for a new dataset will depend on factors including the training and pre-training dataset sizes as well as whether untapped literature data are available.
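The sub-category analysis amounts to grouping test reactions by label and computing Top-1 accuracy within each group. The sketch below shows this bookkeeping with illustrative labels; it is not the authors' analysis code.

```python
# Hedged sketch of a per-sub-category accuracy breakdown (cf. Fig. 3): given
# each test reaction's sub-category label and whether its Top-1 prediction was
# correct, report the accuracy within each category. Labels are illustrative.
from collections import defaultdict

def accuracy_by_subcategory(records):
    """records: iterable of (subcategory, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for category, correct in records:
        totals[category] += 1
        hits[category] += bool(correct)
    return {c: hits[c] / totals[c] for c in totals}

example = [("intramolecular", True), ("intramolecular", False),
           ("aromatic", True), ("hetero-DA", False)]
print(accuracy_by_subcategory(example))
```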
In Fig. 4, the performance of each NERF model is compared against the non-pre-trained NERF baseline of 62.7% Top-1 accuracy. When equally sized pre-training datasets were used, the most accurate Cope and Claisen predictions were obtained from models pre-trained on Diels–Alder reactions (+3.7%, Fig. 4A). Pre-training on Ene and Nazarov reactions also increased performance, but to a lesser extent (+0.7 to +1.9%). USPTO-MIT pre-training had a negative impact on accuracy, suggesting that the benefits of USPTO-MIT pre-training seen in Fig. 2A arose from the general chemical understanding provided by the entire dataset and cannot be replicated by selecting only a small subset of its reactions. This again supports the hypothesis that the more mechanistically similar the pre-training dataset, the more effective the pre-training.
The same equal-size pre-training approach was then applied to the prediction of Diels–Alder reactions, using the same number of training reactions (328), which represents 3% of the Diels–Alder dataset (Fig. 4B). This ensured comparability and allowed us to further investigate the impact of pre-training when the baseline Top-1 accuracy (39.5%) is very low. Because of this low baseline, pre-training had an outsized impact here. In agreement with Fig. 4A, mechanistically related pre-training data proved more effective, as seen in the +13.1% increase in Top-1 accuracy when pre-training on Cope and Claisen reactions. Pre-training on Ene reactions had a beneficial effect of +5.4%, while Nazarov reactions produced only a very minor change (+0.2%). USPTO-MIT pre-training, in contrast, decreased the accuracy by 10.5%, reinforcing the conclusion that the improvement seen with USPTO-MIT data requires the entire dataset to be used for pre-training.
To further understand the effectiveness of pericyclic vs. non-pericyclic pre-training data, we visualized reaction fingerprints (rxnfp35) of these reactions in two dimensions using UMAP36 (Fig. 5). The clear separation between the pericyclic datasets and the USPTO-MIT dataset reinforces the distinctness of the reactions in these datasets. The pericyclic reactions, by contrast, lie closer together and in some cases even overlap. Diels–Alder reactions occupy large areas of chemical space near and between the Cope and Claisen rearrangements, suggesting why Diels–Alder reactions are the most effective pre-training source for Cope and Claisen reactions: they are both diverse and mechanistically relevant.
Fig. 5 UMAP of rxnfp fingerprints of the USPTO-MIT, Diels–Alder, Cope and Claisen, Ene, and Nazarov reaction datasets.
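A projection of this kind can be produced along the following lines, assuming the rxnfp and umap-learn packages. The UMAP parameters shown are illustrative and not necessarily those used for Fig. 5.

```python
# Sketch of the Fig. 5 workflow, assuming the rxnfp and umap-learn packages:
# BERT-based reaction fingerprints are computed for reaction SMILES and then
# projected to two dimensions with UMAP.
from rxnfp.transformer_fingerprints import (
    RXNBERTFingerprintGenerator, get_default_model_and_tokenizer)
import umap

def embed_reactions(reaction_smiles):
    """Return a 2D UMAP embedding of rxnfp fingerprints for reaction SMILES."""
    model, tokenizer = get_default_model_and_tokenizer()
    generator = RXNBERTFingerprintGenerator(model, tokenizer)
    fingerprints = generator.convert_batch(reaction_smiles)   # list of vectors
    reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
    return reducer.fit_transform(fingerprints)

# In practice the full USPTO-MIT, Diels-Alder, Cope/Claisen, Ene and Nazarov
# reaction lists would be concatenated and the coordinates coloured by dataset:
# coords = embed_reactions(uspto_rxns + da_rxns + cc_rxns + ene_rxns + naz_rxns)
```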
Compared with Chemformer, NERF appears to be more efficient at utilizing pre-training data and derives greater improvements in accuracy from pre-training. The major cause of this difference is the learning target of the two models, i.e., 'what should be learned'. Although both Chemformer and NERF are neural networks based on the Transformer architecture, Chemformer is trained to learn the correlation between an input sequence and an output sequence, while NERF is trained to learn the difference between reactant (input) and product (output). As a result, NERF is more likely to capture the mechanistic similarities that can be transferred between different reactions.
Footnote
† Electronic supplementary information (ESI) available: additional figures, explanations, link to GitHub, Reaxys reaction IDs, and link to Jupyter notebooks to regenerate the datasets. See DOI: https://doi.org/10.1039/d4dd00412d