Matthew Ball ab, Dragos Horvath b, Thierry Kogej a, Mikhail Kabeshov a and Alexandre Varnek *b
a Molecular AI, Discovery Sciences R&D, AstraZeneca, 431 83 Gothenburg, Sweden
b Laboratory of Cheminformatics, University of Strasbourg, 67081 Strasbourg, France. E-mail: varnek@unistra.fr
First published on 6th August 2025
The selection of optimal reaction conditions is a critical challenge in synthetic chemistry, influencing the efficiency, sustainability, and scalability of chemical processes. While machine learning (ML) has emerged as a promising tool for predicting reaction conditions in computer-aided synthesis planning (CASP), existing approaches face several significant challenges, including data quality and sparsity, the choice of reaction representation, and method evaluation. Recent studies have suggested that these models may fail to surpass literature-derived popularity baselines, underscoring these problems. In this work, we provide a critical review of state-of-the-art ML techniques, identifying innovations which have addressed the key challenges facing researchers when modelling conditions. To illustrate how relevant reaction representations can improve existing models, we perform a case study of heteroaromatic Suzuki–Miyaura reactions derived from US patent data (USPTO). Using Condensed Graph of Reaction-based inputs, we demonstrate how this alternative representation can enhance the predictive power of a model beyond popularity baselines. Finally, we propose future directions for the field beyond improving data quality, suggesting potential options to mitigate data issues prevalent in existing literature data. This perspective aims to guide researchers in understanding and overcoming current limitations in computational reaction condition prediction.
We can start by defining what ‘conditions’ consist of. Conditions are the contents ‘above the arrow’ in a chemical reaction, defining the physicochemical environment under which a reaction occurs – see Fig. 1. This can consist of ‘reagents’: chemical species which take part in the reaction, but do not contribute a heavy atom to the product. Examples of ‘reagents’ include solvents, catalysts, ligands and bases in the case of a Suzuki coupling, but the scope of these ‘reagents’ will vary as a function of the reaction type being investigated. ‘Conditions’ also comprise physical parameters like temperature, pressure and time (and countless more), all of which influence the rate and feasibility of a reaction.
Fig. 1 Introducing reaction condition prediction. Because of the large possible scope of conditions, a decision must be made when creating models to limit the scope of ‘conditions’ considered.
For modelling purposes, conditions can be encoded in the form of some vector, c. The definition of such a vector is a key challenge in reaction informatics: what is the best way to encode the ensemble of different species and parameters – reagents, temperature and pressure for example – in a single numeric vector? This vector requires a clear structure, containing elements associated with reagents and thermodynamic parameter values. At its simplest, this is a one-hot encoded vector, where the presence of a species is marked by the corresponding entry in the vector, and this is frequently used in condition prediction.30–32 To make these labels more general, simple empirical categories can be used, like ‘hydrophobic/polar/protic’ for solvent, or ‘(Lewis) acid/base’ for catalyst. Whilst the predictions of these targets may be less specific, they can help mitigate data sparsity, which will be discussed in a later section. Moving towards the continuous space, descriptors might be calculated from the structure of the reagents.33–35 Alternatively, agents may be characterised by their experimental properties, like dielectric constant or Kamlet–Taft values.36
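As an illustration, a one-hot condition vector can be assembled by concatenating one block per reagent role. The sketch below uses a tiny hypothetical vocabulary; real models derive their vocabularies from the training data.

```python
# A tiny hypothetical vocabulary; one block of the vector per reagent role.
SOLVENTS = ["THF", "MeOH", "toluene"]
BASES = ["K2CO3", "Et3N"]

def encode_conditions(solvent: str, base: str) -> list:
    """One-hot encode a (solvent, base) pair into a single vector c."""
    vec = [0] * (len(SOLVENTS) + len(BASES))
    vec[SOLVENTS.index(solvent)] = 1
    vec[len(SOLVENTS) + BASES.index(base)] = 1
    return vec

c = encode_conditions("MeOH", "K2CO3")  # -> [0, 1, 0, 1, 0]
```

Continuous variables such as temperature would be appended as extra (binned or raw) elements of the same vector.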
Therefore, for some reaction r under conditions c, reactivity modelling can be formulated as:
ŷ = f(r; c) | (1)
The reaction outcome ŷ can be – ranging from the most accurate but least available to the most empirical and most common – a reaction rate constant, a yield value or simply a binary classifier (feasible/infeasible). ŷ can therefore be categorical (in the case of feasibility) or continuous. Any continuous prediction can be reformulated as a categorical one, by selecting a cutoff for ‘acceptable’ values of yield, rate etc. In general, feasibility models are the most popular, given that the presence of a reaction in a reaction database implies its feasibility, unless explicitly labelled as ‘failed’, which is unfortunately not customary.27,28 Therefore, in the absence of negative data, feasibility models act as ‘one-class classifiers’ – for two-class classifiers, either experimental failures or assumedly infeasible ‘decoy’ examples must be provided.37
This formulation is the generalisation of single-molecule quantitative structure–property relationship (QSPR) approaches to reactions. But there are additional challenges that must be considered in reaction informatics: the added complexity of reactions (compared to single molecules), resulting from the consideration of multiple reacting species and how they interact; in addition to the increased data pressures, like quality and sparsity, that the consideration of reaction conditions imposes.
Like classical QSPR, eqn (1) can be used to obtain ‘optimal’ condition predictions either directly or indirectly. By selecting different conditions, we can evaluate f(r; c) to predict the reaction outcome of interest. Then, we can select the conditions which lead to the most desirable outcome. This is equivalent to selecting the set of conditions c, from the available set 𝒞, that maximises the objective function f(r; c) for a given reaction r. Formally this can be expressed as:
ĉ = argmax_{c∈𝒞} f(r; c) | (2)
ŷ_opt = max_{c∈𝒞} f(r; c) = f(r; ĉ) | (3)
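This ‘indirect’ route can be sketched as a brute-force search over a candidate set; the lookup-table scoring function below is a hypothetical stand-in for a trained yield or feasibility model f(r; c).

```python
# Hypothetical surrogate for a trained model f(r; c): maps (reaction,
# conditions) to a predicted outcome score. Values are illustrative only.
def f(reaction: str, conditions: tuple) -> float:
    scores = {("MeOH", "K2CO3"): 0.82, ("THF", "Et3N"): 0.55}
    return scores.get(conditions, 0.0)

def best_conditions(reaction: str, candidates: list) -> tuple:
    # eqn (2): keep the candidate condition set with the highest score
    return max(candidates, key=lambda c: f(reaction, c))

candidates = [("MeOH", "K2CO3"), ("THF", "Et3N")]
c_hat = best_conditions("ArBr.ArB(OH)2>>Ar-Ar", candidates)
```

In practice the candidate set 𝒞 must be enumerable, which is feasible for ‘local’ models but quickly becomes intractable for ‘global’ ones.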
This paper will mainly focus on the ‘direct’ prediction of conditions: what are the conditions required for the reaction to proceed to give the desired outcome (e.g. a maximal yield, a feasible reaction etc.).
ĉ = g(r; y) | (4)
In eqn (4), like any ‘inverse QSPR’ approach, condition prediction requires navigating many-to-many mappings between reactions and viable conditions.2 This is the idea that a single reaction can occur under multiple different conditions, and inversely that a single set of conditions can be used for multiple reactions. Due to this, machine learning (ML)-based approaches to reaction condition prediction are diverse and highly dependent on the problem setup and dataset used. Raghavan et al. introduce the concept of ‘global’ models and ‘local’ models.45 ‘Global’ models are trained on large amounts of literature data, often spanning a wide range of reaction types, and aim to generalise across reaction space, like those in ref. 30, 46 and 47. In contrast, ‘local’ models might focus on a single reaction type and a well-defined set of reactants and conditions.45 Examples of ‘local’-type models span from models for conditions of Michael additions48 to C–N couplings.49,50 This ‘global’/‘local’ classification is somewhat arbitrary, as the applicability domains of these models vary in focus and comprehensiveness along a continuum. Ultimately, as the focus of a model shifts from a ‘global’ analysis of all reactions listed in a database to targeted modelling of specific reactions around selected reactants, the ‘conditions’ requiring consideration may also collapse to a subset of locally relevant options. Subsequently, the methods applied to predicting conditions require adaptation, paying attention to the constraints of the dataset of interest, and the scope of the conditions to be predicted.
This work examines the unique challenges of reaction condition prediction, particularly those related to data quality, model design (input and output) and evaluation. We then move on to review the state-of-the-art ML approaches, highlighting their progress and limitations. Finally, we present a case study of heteroaromatic Suzuki–Miyaura reactions from the USPTO dataset curated by Beker et al.1 In particular, our case study assesses the impact that reaction representation – how a reaction equation is encoded – has on the predictive power of condition prediction models. Here, we utilise Condensed Graph of Reaction (CGR) fragment representations51 to explore if this reaction encoding can improve models' predictive power, beyond a strong popularity baseline. To conclude, we provide an outlook on the field, identifying key directions for future research and development.
As introduced by Raghavan et al., ‘global’, large-scale datasets typically cover a wide range of different reaction classes, with high substrate diversity but limited condition exploration for a given substrate.45 A small collection of these can be found in Table 1. Models trained on these datasets are capable of suggesting conditions over a wide range of reaction types. However, Afonina et al. found that predictions of a ‘global’ model (ref. 30) on a smaller, more focused dataset containing only hydrogenation reactions were not satisfactory, losing out to a simple popularity-based model. They hypothesised that the poorer performance is a result of the model not being biased towards a specific reaction type.2 Even when ‘global’ datasets are filtered to include only a single reaction type, the resulting models often show poor generalisability and applicability to industrial use cases, such as the screening of high-yielding conditions for new reactions.52 In this case, it was suggested that these general datasets are too biased towards specific reagents for given reaction types to yield useful condition/yield prediction models for prospective applications.
In contrast, ‘local’, small-scale datasets cover a much smaller range of reaction classes, with lower substrate diversity but higher condition exploration for each substrate. Models trained on this sort of data21,38,59 can show more satisfactory results, and crucially better predictive power within their applicability domains38 than models trained on ‘global’ datasets. The downside is that such models cannot be expected to generalise to other reaction types, due to the narrow scope of the training data. The other issue is data availability, as many smaller-scale datasets originate from proprietary ELNs within pharmaceutical companies.
Errors within chemical reaction data can arise in the form of missing reactants, reagents or products; mis-assigned reaction roles; incorrect SMILES representations; and incorrect atom-mapping. There are a number of approaches for dealing with these issues, by either resolving the problems or removing the reaction from the dataset. As discussed by Gimadiev et al., reactions should undergo four steps of curation before they can be used for reactivity modelling: chemical structures curation, transformation curation, reaction conditions curation and endpoints curation63 (see Fig. 2).
Fig. 2 Summarising key features of reaction data sources and the subsequent steps required to curate these sources.
The exact details of the chemical structures curation are usually a subset of steps suggested by Fourches et al.: detection of valence violations, ring aromatisation, normalisation of specific chemotypes, standardisation of tautomeric forms and the splitting of ions, among others.65 ‘Transformation curation’ aims to resolve issues with unbalanced reactions, atom-to-atom mapping, reaction role assignment and duplicate detection. For unbalanced reactions, dealing with missing reagents, reactants and products can be done using ML tools by suggesting replacements for these missing species,46,66,67 and this improved data quality was shown to improve model performance in product prediction.46 Alternatively, rule-based tools can be used to fill missing small molecules and balance reactions.68,69 The same consideration needs to be paid to the representation of reaction conditions, where text-based entries for reaction conditions must be collected and mapped to the appropriate SMILES string.64
When considering reaction condition prediction specifically, the role-assignment of reagents is critically important, but non-trivial. Many existing ‘global’ approaches divide reagents into roles such as catalysts, solvents and agents (which encapsulate additives, acids, bases etc.).30–32,47 A single reagent can perform multiple different functions across different reaction types (or even within a single reaction), leading to challenges when assigning a reagent to a particular class. This is particularly pronounced when considering a wide range of reaction types, as is the case in ‘global’ models. For such models, it is often the case that a reagent role simply cannot be assigned beyond ‘Agent’, ‘Solvent’ or ‘Catalyst’.30,47,70 As a result, the ‘Agent’ category contains a large number of distinct classes, producing a more challenging classification problem. Another aspect of conditions curation is understanding which reagents take part in the reaction, and which ‘reagents’ are part of other procedural processes, for example workups or purification. Higher-fidelity labelling of reaction roles, as provided by modern databases such as ORD,23 could lead to higher quality datasets for condition modelling. Furthermore, trusting the labelling of reaction roles from large datasets such as USPTO can lead to issues: reaction components are frequently mislabelled,64 leading to ambiguity over what is a reactant versus a reagent. To rectify this, atom-mapped reaction equations can be used to determine which species are reactants, by identifying those that contribute ‘heavy atoms’ to the product. Once reactions are in a standardised format and the roles of all components have been assigned, duplicate reactions need to be dropped. Duplicate entries are common, due to scientists adopting transformations reported elsewhere in the literature. Additional treatment of rare conditions may also be required, as Wigh et al. report that the removal of these entries can improve the performance of condition prediction models.64 It is crucial to adopt standardised curation protocols, not only to benefit reactivity prediction tasks but to enable fair comparisons of model performance.
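The atom-mapping heuristic for separating reactants from reagents can be sketched as follows. The mapped reaction SMILES and the regex-based parsing are purely illustrative; real pipelines would rely on a cheminformatics toolkit.

```python
import re

# Atom-map numbers appear as ':<n>]' inside bracket atoms of a mapped SMILES
MAP = re.compile(r":(\d+)\]")

def split_roles(mapped_rxn: str):
    """Species sharing an atom-map number with the product contribute
    heavy atoms and are treated as reactants; the rest are reagents."""
    lhs, _, product = mapped_rxn.split(">")
    product_maps = set(MAP.findall(product))
    reactants, reagents = [], []
    for species in lhs.split("."):
        maps = set(MAP.findall(species))
        (reactants if maps & product_maps else reagents).append(species)
    return reactants, reagents

# Illustrative mapped reaction: methyl bromide + hydroxide -> methanol
reactants, reagents = split_roles("[CH3:1][Br:2].[Na+].[OH-:3]>>[CH3:1][OH:3]")
```

Here the unmapped sodium counter-ion is demoted to a reagent, while both mapped species on the left are kept as reactants.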
Briefly, experimental noise refers to noise caused by human or experimental error – for example, errors in experimental protocol that cause the loss of product. This results in large variance in recorded yields for reactions performed under the same conditions, as Voinarovska et al. show.29 Depending on how yield information is used in the modelling process, the extent to which this affects condition prediction models varies. For models which don't use yield information at all (and assume all non-zero-yield reactions are successful), these problems shouldn't affect model performance. Conversely, where yield information directly influences model training, high variance in yield could hypothetically lead to the incorrect ‘optimal’ conditions being identified.
Selection bias refers to the tendency of chemists to select established conditions (or the reagents that are simply available in the lab) when performing reactions. This leads to large imbalances in datasets, where few conditions are explored, and models trained on such data can learn little more than popularity trends.1
The final type of bias discussed by Strieth-Kalthoff et al. concerns the reporting of results, and particularly the bias of high yielding ‘successful’ results. This issue is further exacerbated by the common practice of reporting only the optimal outcome from a series of identical experiments, often without accompanying error estimates, which further complicates modelling. As a result, there are large imbalances in the distributions of yields across a data source, which prevents models from learning which reactions don't work and ultimately reduces performance.27 Maloney et al. called for an improvement in the reporting of experimental yields, and an increase in the amount of these ‘low yielding’ reactions being reported, thereby making these reactions more common in chemical reaction databases.28
Despite these biases, there are approaches which can counteract them (although these have their own issues which need to be considered). For example, it has been demonstrated that the introduction of synthetic ‘negative’ data (labelled, impossible reactions) in appropriate quantities can lead to improved performance in yield prediction27 or retrosynthesis applications.37 Alternative approaches include the sampling of ‘hard negative’ conditions. These ‘hard negatives’ (incorrect reagent or solvent predictions assigned a high probability by the model) were combined with true labels to generate diverse training examples to help the model distinguish between correct and incorrect conditions.32 Schwaller et al. artificially expanded existing data via data augmentation using permuted and randomised reaction SMILES strings, resulting in an improvement in R2 of up to 0.15 for a yield prediction model.71 Various forms of data augmentation have also been applied in the prediction of retrosynthesis7,72 or reaction products.73 Other options to leverage existing chemical knowledge include transfer learning, which has shown promise in modelling reactivity,3,74 although in certain cases this ‘transfer’ of information can hinder a model's predictive capabilities via ‘negative’ transfer.75 This emphasises that, although these strategies can help, care must be taken to ensure that the additional data does not cause a decrease in performance, and that its introduction does not bring significant biases with it.
While data quality, bias, and sparsity are critical challenges in any reactivity modelling, they manifest themselves differently in condition modelling due to the many-to-many relationship between reactions and potential conditions, which we will explore now.
In our case study (see later), we use the CGR, aiming to strike a balance between an informative representation and computational cost. The CGR encodes a reaction as a single pseudo-molecule, arising from the superposition of the reactant and product graphs of molecules in a reaction51 (see Fig. S2). This pseudo-molecule contains ‘dynamic’ bonds, representing the bonds that are broken or made during the reaction. However, this requires atom mapping, which is not trivial, and even state-of-the-art computational tools79,80 cannot achieve perfect accuracy.81 The requirement for atom mapping aside, CGRs have emerged as a powerful representation for chemical reactions and have shown strong performance when used as input for the prediction of reaction properties such as activation energies, rate constants and protecting group reactivity.82–84
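The core CGR idea – superposing mapped reactant and product graphs and labelling ‘dynamic’ bonds – can be sketched on bond dictionaries keyed by atom-map numbers, a toy stand-in for real molecular graphs (which would require atom-mapping and a toolkit to construct).

```python
# Bonds are dictionaries mapping (map1, map2) atom-map pairs to bond order.
def dynamic_bonds(reactant_bonds: dict, product_bonds: dict) -> dict:
    """Return the CGR 'dynamic' bonds as (order_before, order_after) pairs."""
    changes = {}
    for bond in set(reactant_bonds) | set(product_bonds):
        before = reactant_bonds.get(bond, 0)  # 0 = no bond
        after = product_bonds.get(bond, 0)
        if before != after:
            changes[bond] = (before, after)
    return changes

# SN2-like toy example: the C-Br bond breaks, a C-O bond forms
r = {(1, 2): 1}
p = {(1, 3): 1}
```

Unchanged bonds carry no ‘dynamic’ label and simply appear in the superposed pseudo-molecule as ordinary bonds.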
Most often, a one-hot encoded c vector is targeted, where the presence of a given reagent is indicated by a binary label to be predicted by the approach. Continuous variables, like temperature or pressure, are often treated in the same way by ‘binning’ the variable into discrete categories2 (or can be modelled as a regression task30,47).
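Such ‘binning’ of a continuous variable can be sketched with arbitrary, illustrative bin edges:

```python
import bisect

# Illustrative bin edges (degrees Celsius) giving four discrete classes;
# real bin boundaries would be chosen from the dataset's distribution.
EDGES = [0, 25, 100]
LABELS = ["cryogenic", "ambient", "heated", "high-temperature"]

def bin_temperature(t_celsius: float) -> str:
    """Map a continuous temperature onto a discrete class label."""
    return LABELS[bisect.bisect_right(EDGES, t_celsius)]
```

The class label then occupies a one-hot block of the output vector c, exactly like a categorical reagent choice.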
When modelling with ‘local’ datasets, where data sparsity may be less pronounced, modelling variables at a higher fidelity may be possible. As a further benefit of this scenario, some condition factors may be sine-qua-non prerequisites for the given class of reactions, and therefore already known – hence no longer explicitly included in the output vector c. In contrast, when modelling with ‘global’ datasets, where the prerequisites for the conditions may vary across reaction types, this is not possible.
‘Global’ models are often evaluated using classification metrics like top-k accuracy.30,32,47,85 However, the ‘ground truth’ in literature-derived datasets is inherently ambiguous: multiple valid conditions may exist for a reaction, but only a subset are documented. For example, a model predicting methanol instead of the ‘ground truth’ ethanol for a polar protic solvent is penalised equivalently to one predicting toluene, even though methanol is chemically plausible but untested. Conventional metrics fail to distinguish between chemically invalid predictions and valid-but-unexplored alternatives.
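For reference, top-k accuracy over ranked condition predictions can be computed as below; as just discussed, a chemically plausible but unrecorded prediction still scores zero under this metric.

```python
# Top-k accuracy: the fraction of reactions whose recorded 'ground truth'
# condition appears among the model's k highest-ranked suggestions.
def top_k_accuracy(ranked_predictions, truths, k=3):
    hits = sum(truth in ranked[:k]
               for ranked, truth in zip(ranked_predictions, truths))
    return hits / len(truths)

preds = [["EtOH", "MeOH", "THF"],     # truth MeOH ranked 2nd -> top-3 hit
         ["toluene", "DMF", "MeCN"]]  # truth water absent -> miss
truths = ["MeOH", "water"]
```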
For ‘local’ models, multiple condition sets may be successfully applied to a single reaction. In such cases, ranking-based evaluation metrics such as mean reciprocal rank or the Kendall tau coefficient can be used to assess performance,78 with ‘true’ rankings based on the outcomes of each condition set. Similarly, when yield prediction is being used to ‘screen’ conditions, one could also use the Spearman correlation coefficient or average yield percentile ranking,35 which emphasise relative performance of conditions over absolute error. Of course, these approaches are less applicable for global models, where the ranking of all possible conditions is unfeasible, and a given reaction may only have a single condition label associated with it.
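Mean reciprocal rank, for example, can be computed as follows over ranked candidate condition sets:

```python
# MRR: the average of 1/rank of the true condition set within each ranked
# candidate list (zero contribution when it is absent entirely).
def mean_reciprocal_rank(ranked_predictions, truths):
    total = 0.0
    for ranked, truth in zip(ranked_predictions, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)

preds = [["c1", "c2", "c3"],   # truth c2 at rank 2 -> contributes 1/2
         ["c2", "c3", "c1"]]   # truth c2 at rank 1 -> contributes 1
truths = ["c2", "c2"]
```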
The ‘gold-standard’ for the evaluation of models would include the testing of predictions in the lab, alongside top-k accuracy, as done by Schilter et al.86 This is particularly important when a model's prediction disagrees with the ‘ground truth’. However, access to experimental validation is not always possible (and is resource-intensive), so other in silico metrics can also be used. As an example, Wang et al. used the Solvent Similarity Index87 to determine how similar the predictions of ‘incorrect’ solvents were to the ground truth.47 Of course, no in silico metric of similarity can replace experimental validation, but it can provide further insight into the ‘chemical reasoning’ of a model.
Another alternative is to use condition clustering, where reagents with similar chemical properties are categorised in the same cluster. The intuition behind this follows directly from above: in general, we might expect reagents with very similar chemical properties to react in the same way. This approach can be applied post-prediction,77 aiming to evaluate model performance whilst accounting for data sparsity and the many unlabelled positive examples in reaction condition datasets. On the other hand, such an approach can be applied in data pre-processing, reducing the number of classes that a model might need to predict, and subsequently improving performance.1,2 We explore this concept further in our case study (see Section 7). This is comparable to the concept of ‘binning’ in yield prediction, where the underlying variance in yield data makes modelling of exact yields difficult;29 but effective, useful tools can still be developed by considering yield as a discrete class, such as ‘zero yield’, ‘low yield’ or ‘high yield’.
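Cluster-aware evaluation can be sketched as a lookup into a hypothetical property-based clustering of reagents; real clusterings would be derived from descriptors or experimental properties.

```python
# Hypothetical property-based solvent clusters, for illustration only.
CLUSTERS = {"MeOH": "polar protic", "EtOH": "polar protic",
            "THF": "polar aprotic", "toluene": "apolar"}

def cluster_match(predicted: str, truth: str) -> bool:
    """Credit a prediction when it lies in the same cluster as the truth."""
    return (predicted in CLUSTERS and truth in CLUSTERS
            and CLUSTERS[predicted] == CLUSTERS[truth])
```

Under this relaxed criterion, predicting methanol against an ethanol ‘ground truth’ is credited, while toluene is not.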
The final consideration to be made, as in any reactivity modelling, is testing that a model has learnt meaningful chemistry rather than exploiting underlying patterns in the data.88 For example, adversarial tests using uninformative representations of the reactants (e.g. random or one-hot encodings) can illustrate the improvement that a chemically meaningful representation brings to the problem of predicting appropriate conditions.78
We have seen how condition prediction presents a unique challenge due to its inherently many-to-many nature. This complexity, combined with dataset sparsity and bias, impacts every stage of model development: from input representation and output encoding to evaluation. The choice of both input representation and output encoding is closely tied to the nature of the dataset and should be carefully considered, particularly for ‘global’ models. Furthermore, standard evaluation metrics in ‘global’ models often fall short, due to the ambiguity of ‘ground truth’ labels. Therefore, it is critical that the evaluation of such models should include experimental validation (in the ideal case), or at the very least careful analysis of ‘incorrect’ predictions, to gain a better insight into a model's performance.
To begin, we refer back to the introduction, and the different definitions of ‘optimal’ conditions. The majority of existing approaches focus on selecting conditions for a given substrate pair which produce the highest yield of the desired product,38,43,59 though some methods focus on discovering and predicting ‘general’ reaction conditions.39,40,89 We will predominantly focus on the former, analysing models based on their architecture. For models that aim to optimise the yield, we can classify models in the same way we did the data: ‘global’ models and ‘local’ models.45 ‘Global’ models refer to models that are trained on large amounts of literature data contained within datasets such as US Patent & Trademark Office (USPTO),53 Reaxys,56 Pistachio90 or the Open Reaction Database,23 and can be applied to many different reaction types. On the contrary, ‘local’ models are trained on a single, specific reaction type (often) using HTE data.
Furthermore, it is important to distinguish possible problem setups employed to predict conditions (see Fig. 5). Most ‘global’ models aim to solve some form of ‘classification’ task: which reagent(s) from a selection of reagents are the most appropriate for the input reaction? With this in mind, we begin by analysing ‘global’ models (see Table 3).
However, the similarity approach does have issues, as similarity searching on large databases can become very slow, and requires special approaches like FAISS.93,94 This can make similarity searching impractical, despite its interpretable nature. The other issue is that similar reactant structures can often exhibit very different reactivity; for example, changing the substitution pattern of aromatic systems like indoles can cause vastly different reactions to occur (electrophiles reacting at C(3) versus at nitrogen). Structural encodings must be able to capture this subtle change in reactivity, which might not be possible through simple fingerprints, and more complex DFT-based featurisation methods might be better suited to capturing these differences. Increasing the complexity of models can capture more of this reactivity information, which can lead to better performance, as we will discuss next.
Both Afonina et al. and Chen and Li also treat condition prediction as a classification problem, though with some similarities to ‘label ranking’ (see Fig. 5).2,32 Afonina et al. use a ‘Likelihood ranking model’ which enumerates all conditions including acid, base, temperature, pressure and catalyst, encoding reactions using ISIDA CGR fragment descriptors,51 before using a neural network to output the most likely conditions for that reaction. This approach showed strong performance, improving on the work of Gao et al. for hydrogenation reactions (73% top-1 accuracy on a retrospective test set), although performance on the prospective test set showed that a popularity baseline was comparable in performance, achieving correct top-1 predictions 68% of the time. This method requires the enumeration of all conditions, and for large datasets covering many reaction types, enumerating all combinations of conditions is computationally infeasible.2 Chen and Li employed a neural network that shares many characteristics of the ‘likelihood ranking model’. Using a two-stage condition generation and ranking approach, they leveraged a ranking model alongside a generation model to generate plausible conditions prior to ranking, avoiding the need for the enumeration of all possible conditions.32 Again, this yielded good results, finding an exact match to the true condition with the top-1 suggestion 53% of the time. Interestingly, in a short case study, the authors found that the model suggested conditions which were used in the publication but were not recorded in the reaction database. This reiterates the importance of not only recording all reactions performed in reaction databases, but also that care should be taken when evaluating models purely based on top-k accuracy.
The key to all feed forward neural network approaches is the choice of reaction descriptors. Whilst the fingerprints employed by Chen and Li and Gao et al. are computationally inexpensive to calculate, they may not capture the more complex electronic and steric effects that can explain reactivity patterns. With the development of methods to estimate complex descriptors and features in computationally inexpensive ways,96,97 future models may be able to take advantage of this. Alternatively, researchers can look towards more complex architectures, such as graph neural networks and transformers to generate more information-rich encodings for the reactions in order to improve performance, which we will see in the next section.
Applied to condition recommendation, the most notable examples of GNN application are Maser et al., Kwon et al. and Wang et al.31,77,85 Maser et al. used ‘attended relational’ graph convolutional networks (AR-GCNs) to predict conditions for a collection of different coupling reactions, including Suzuki, Negishi and C–N couplings. The models showed good predictive performance over a popularity baseline (31–42% improvement for top-1 predictions). In addition, this model has an accompanying analytical framework, providing interpretability analysis on the learned feature weights to understand the reasoning behind different predictions. However, the performance of this model was marginally worse (by 2% in top-1 accuracy) than tree-based methods also used in the publication on the smaller Pauson–Khand dataset. The authors suggested that the smaller dataset size makes the GCN more prone to overfitting, which made tree-based modelling more suitable here.31
Extending this approach, Kwon et al. used GNNs to encode both reactants and products, combining this with a variational auto-encoder (VAE)108 to predict conditions.85 In comparison to both Gao et al. and Maser et al., this approach resulted in a higher accuracy when allowing multiple predictions from the VAE. However, this approach is more time-consuming than the others, and no comparison was performed in which the models from ref. 30 and 31 could predict multiple conditions.
Finally, Wang et al. use a combination of templates and condition-clustering alongside a D-MPNN acting on CGRs. This work exemplifies one of the first uses of condition clustering to improve performance, by increasing the diversity of predictions and acknowledging the many-to-many nature of condition prediction.77 By incorporating this clustering, the top-1 accuracy of their method jumps from 45% to 66%, a significant increase. Zhang et al. take a slightly different approach; they encode their reactant and product as graphs before passing them through a GNN pretrained on atom-level and bond-level tasks. The molecular-level descriptors from the GNN are passed to a second NN along with a one-hot encoded reaction template, and this is used to predict the most likely solvents and catalysts for a reaction. However, in the prediction of the solvent and catalyst, the identity of the other reaction component is not considered.76 Nonetheless, these models could predict the correct catalyst and solvent 59% and 42% of the time respectively. Ignoring the inter-dependence of the conditions is likely to lead to some drop in accuracy, because the identity of one reagent, together with the reaction, will determine the identity of the other reagents. Modelling this dependence is a key part of reaction condition prediction.
GNNs clearly show promising performance in predicting appropriate conditions, indicating that the representation these models learn is comparable to (and sometimes better than) simpler fingerprint descriptors. Moving beyond graphs, reactions can also be described by their SMILES strings, to which natural language processing (NLP) methods can be applied; the final architecture we will look at is the transformer.
Another similar approach was taken by Andronov et al., who repurposed the MolecularTransformer described by Schwaller et al. for reagent prediction.5,46 The final example to leverage the transformer architecture is MM-RCR by Zhang et al.115 This uses a combination of the previous architectures, taking a multimodal reaction input consisting of SMILES, graphs and text, on top of a large language model (LLM) to predict conditions. It achieves state-of-the-art performance on the same dataset curated by Wang et al. Their ablation study demonstrates the benefits of a multimodal representation, showing significant (up to 17%) improvement over the same model using a single data modality.
To conclude this section, whilst ‘global’ condition prediction models are highly desirable (and many such models perform to a strong level), the level of detail that can be afforded without making the dataset too sparse means that finer-grained details of a reaction such as timing, pH and others are often ignored, despite their importance to synthesis planning. Furthermore, the lack of consistent benchmarking datasets until the work of Wang et al. and Wigh et al. has meant there has not yet been a wide-scale comparison of the existing methods, including performance by reaction class or failure modes, which represents a potential area for future work. When tested on focused reaction datasets, these ‘global’ models can also struggle, as exposure to many different types of reaction can add ‘noise’ to predictions, as found by Afonina et al.2 In contrast, smaller-scale models can be tailored to specific reactions, allowing the aforementioned parameters to be predicted, and enabling the incorporation of domain-specific descriptors which enhance performance,38 as we discuss next.
In a related approach, Eshel et al. use classifiers to assign ranks in order to select conditions for aldehyde deuteration and C–H activation reactions. They incorporate expert knowledge about the reactivity of conditions relative to the substrates they are applied to, in order to inform the choice of ordinal ranking algorithms, thereby working in a similar manner to Shim et al. by ranking conditions against one another.117 Both of these recent works suggest that ranking methods could be a strong approach for condition recommendation, particularly in the small-data regime.
All of the above approaches carry out their experiments in an ‘Iterative Learning’120 workflow, designing and creating datasets specifically for building models of reactivity.
The alternative approach is to use existing datasets. As previously discussed, a yield prediction model can be trained and applied to small, focused datasets, with the conditions predicted to lead to the optimal outcomes being selected for testing (see Fig. 5b). As representative examples, Schwaller et al. created Yield-BERT, a transformer-based model to predict reaction yields, trained it on a small fraction of a dataset, and prospectively screened the rest of the dataset to identify promising conditions.17 Atz et al. used a graph-transformer neural network in a similar manner to screen conditions for a Suzuki-type cross-coupling reaction.121 Both examples illustrate how yield prediction can be incorporated into condition recommendation, provided conditions can be enumerated.
Of course, scaling these approaches to a ‘global’ level is challenging, requiring predictions for all possible combinations of conditions which would be computationally intensive. It is possible that these yield-prediction models could be used as a final computational screen of ‘feasible’ conditions suggested by a different model, analogous to Chen and Li.32 However, for ‘local’ datasets, the yield-prediction route offers a viable method of evaluating and suggesting reaction conditions.
Whilst challenges like data sparsity and evaluation remain, we have seen how progress in reaction condition prediction can come from advances in model architecture. However, progress can also come from rethinking fundamental aspects of modelling, such as data representation. As the preceding discussion of modelling has shown, data representation can strongly influence predictive performance. To illustrate this, we apply models to CGR-based reaction representations and demonstrate improved performance over traditional reaction representations.
To demonstrate the impact of reaction representation, we select a different method to encode reactions: CGR fragments. We wanted to see if this encoding could produce models of improved predictive power and crucially, outperform a challenging literature baseline.1 Afonina et al. introduced a method combining a multitask neural network and likelihood ranking based on CGR fragments which can produce lists of viable conditions for hydrogenation reactions,2 and we adopt a similar strategy here.
The USPTO dataset was downloaded directly from ref. 1, and the same curation as in that publication was applied, splitting the solvents into 6 ‘coarse’ classes and 13 ‘fine’ classes, and the bases into 7 classes.1 We choose not to predict the identity of the Pd source, ligand or temperature, to enable comparison with ref. 1, which only predicts solvent and base. Full details of the identities of the clusters can be found in the Supplementary Material. The reaction itself is split into reactants and products, leaving 2 reactants and a single product. Following this procedure, we perform atom mapping using Chython,79 and an additional duplicate check, removing all reactions with the same mapped reaction equation, ‘coarse’ solvent class, ‘fine’ solvent class and base class. This leaves us with fewer reactions (5,219) than the original publication (5,434).
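The duplicate check described above can be sketched as follows. This is a minimal illustration of keying each reaction on its mapped equation plus the three condition classes; the records shown are hypothetical placeholders, not entries from the actual dataset.

```python
# Sketch of the duplicate check: reactions sharing the same mapped reaction
# equation, 'coarse' solvent class, 'fine' solvent class and base class are
# collapsed to a single record. The tuples below are hypothetical examples.
records = [
    ("[CH3:1]Br>>[CH3:1]O", "aqueous", "water", "carbonate"),
    ("[CH3:1]Br>>[CH3:1]O", "aqueous", "water", "carbonate"),  # exact duplicate
    ("[CH3:1]Br>>[CH3:1]O", "aqueous", "water", "phosphate"),  # different base
]

seen, unique = set(), []
for rec in records:
    key = rec  # (mapped equation, coarse solvent, fine solvent, base)
    if key not in seen:
        seen.add(key)
        unique.append(rec)
# 3 records collapse to 2 unique reaction/condition combinations
```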
We then split the dataset using 5 × 5 cross validation (CV), using stratified sampling on the ‘fine’ solvent class. Whilst this differs from Beker et al., who use random 5 × 5 CV, stratified sampling ensures that the evaluation is more reliable, given the unbalanced nature of both the base and solvent targets.1
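A 5 × 5 stratified CV of this kind can be set up directly in scikit-learn. The sketch below uses randomly generated stand-ins for the descriptor matrix and the ‘fine’ solvent class labels; in practice the labels would come from the curated USPTO dataset.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical stand-ins: 100 reactions, 8 descriptors, 3 solvent classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.integers(0, 3, size=100)  # 'fine' solvent class labels

# 5 repeats of 5-fold CV, stratified on the solvent class labels,
# so each fold preserves the overall class proportions of y.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0)
folds = list(cv.split(X, y))  # 25 (train, test) index pairs in total
```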
To generate the model input, ISIDA fragment descriptors51 were generated for each reaction. We used the same procedure as set out in ref. 2, generating atom and bond-centred fragments of length two to four atoms using ISIDA Fragmentor 2017, wrapped by CIMTools.122 We used the same additional settings as that publication, namely Formal Charge encoding and all fragments formation, creating fragments with both ‘dynamic’ and ‘regular’ bonds. Fragments occurring fewer than five times were removed, and the resulting vectors were scaled to zero mean and unit standard deviation. Finally, incremental PCA was performed to get a final CGR fragment vector of length 1500 for each reaction. For a schematic, see Fig. S1.
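The scaling and projection steps of this pipeline can be sketched with scikit-learn. ISIDA Fragmentor itself is an external tool, so the fragment count matrix below is a random hypothetical stand-in, and we use 150 components rather than the 1500 used in the study to keep the example small. Whether "occurring fewer than five times" counts total occurrences or reactions containing the fragment is an assumption here; we filter on the latter.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import IncrementalPCA

# Hypothetical stand-in for CGR fragment count vectors (ISIDA output);
# real vectors are far longer and sparser than this toy matrix.
rng = np.random.default_rng(42)
counts = rng.poisson(1.0, size=(2000, 300)).astype(float)

# Remove rare fragments (here: present in fewer than five reactions).
keep = (counts > 0).sum(axis=0) >= 5
counts = counts[:, keep]

# Scale to zero mean / unit standard deviation, then reduce with
# incremental PCA to a fixed-length descriptor per reaction.
scaled = StandardScaler().fit_transform(counts)
ipca = IncrementalPCA(n_components=150, batch_size=500)
X = ipca.fit_transform(scaled)  # one 150-d vector per reaction
```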
We created four machine learning models based on vectors formed from the PCA projection of CGR fragment count vectors: a Random Forest (RF); a Gradient Boosting Machine (GBM); a similarity search (kNN) and a multitask neural network (MTNN), similar in architecture to the best model from the work of Beker et al.1 We used ChemProp,83,98,101 based on the D-MPNN architecture, as an additional test of CGRs as a reaction representation for condition prediction. For the RF, GBM, MTNN and D-MPNN models, the hyperparameters were tuned using Optuna,123 once per iteration of the 5 × 5 CV. These hyperparameters were used to test the models across the rest of the folds within that repetition.
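The per-repetition tuning loop can be sketched as follows. The study uses Optuna; to keep the example dependency-light we substitute a plain random search over an illustrative (not the actual) search space, on synthetic stand-in data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in for the PCA-projected CGR fragment vectors.
X, y = make_classification(n_samples=300, n_features=40, n_informative=10,
                           n_classes=3, random_state=0)

# Random search standing in for Optuna: sample hyperparameters, score by CV,
# keep the best. In the study this is run once per repetition of the 5x5 CV.
rng = np.random.default_rng(0)
best_score, best_params = -np.inf, None
for _ in range(5):  # a handful of trials for illustration
    params = {"n_estimators": int(rng.integers(50, 200)),
              "max_depth": int(rng.integers(3, 15))}
    score = cross_val_score(RandomForestClassifier(random_state=0, **params),
                            X, y, cv=3).mean()
    if score > best_score:
        best_score, best_params = score, params
# best_params is then reused across the remaining folds of the repetition
```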
As set out in ref. 2, we transformed the independent predictions for solvents and bases into a ranked list of combinations of these reagents using a likelihood ranking approach. To do this, we first enumerate all combinations of the solvents and bases. We then determine the probability of each combination by multiplying the probabilities of the solvent and base within that combination and, finally, rank the combinations in order of probability. The only difference from Afonina et al. is that we do not take the mean of the negative log-likelihoods (and minimise it), but rather maximise the probability directly. See Fig. S3 for more information.
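The enumerate–multiply–rank procedure is compact enough to sketch directly. The class names and probabilities below are hypothetical model outputs, chosen only to illustrate the ranking.

```python
from itertools import product

# Hypothetical per-class probabilities from the two independent models.
solvent_probs = {"aqueous": 0.6, "alcohol": 0.3, "ether": 0.1}
base_probs = {"carbonate": 0.5, "phosphate": 0.4, "hydroxide": 0.1}

# Enumerate all solvent/base pairs, score each by the product of its
# component probabilities, and rank from most to least likely.
ranked = sorted(
    ((s, b, ps * pb)
     for (s, ps), (b, pb) in product(solvent_probs.items(),
                                     base_probs.items())),
    key=lambda t: t[2], reverse=True)

# The top-ranked combination here is ("aqueous", "carbonate").
```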
The resulting statistical analysis, using the workflow suggested by Ash et al.,124 demonstrates that these results are statistically significant (see Fig. S9 and S10). The other CGR-based models were also tested, but for clarity of the plots, only the best model (the MTNN) is shown. Comparisons between the CGR-based methods can be found in the SI. Therefore, to answer the question of the case study, ‘Can machine learning methods improve significantly upon literature baselines on this dataset?’, these results suggest that an alternative representation, the Condensed Graph of Reaction, can outperform this baseline on the independent predictions of solvents and bases.
However, synthetic chemists require combined predictions of all components in a chemical reaction, since solvents and bases may be incompatible, or not lead to a reaction, despite the individual components being sufficient in other cases. Therefore, we combined these independent predictions using the likelihood ranking approach, to give an indication of the performance of such a model when predicting combinations of reagents; the results can be found in Fig. 6. We can see that the gap between the CGR-based model and the popularity benchmark is now wider, and similarly for the Morgan fingerprint model, although, to our understanding, Beker et al. did not test (or optimise) their Morgan fingerprint models on a combined reagent prediction task. Nonetheless, these results demonstrate that this CGR-based model can improve on the strong literature popularity benchmark. This is potentially because CGRs explicitly encode more information than Morgan fingerprints, where the transformation is not directly represented. Since CGRs require atom mapping, the reaction centre is explicitly encoded, rather than being implicitly encoded as in other fingerprints based on the individual reactants.
Additionally, we wished to illustrate the benefit of expert-assigned reagent classification in enabling fairer model evaluation. First, we generated ‘exact’ predictions for both the base and the solvent, then applied the same clustering post-prediction to highlight how clustering increases model accuracy, suggesting that when models make ‘incorrect’ predictions, these predictions are still chemically relevant. The results can be seen in Fig. 7. It can also be seen that clustering in pre-processing can lead to improved performance, compared with predicting the exact reagent and clustering post-prediction.
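The post-prediction clustering comparison amounts to scoring the same predictions under two notions of correctness. The sketch below uses a hypothetical solvent-to-class mapping and toy predictions to show how an ‘exact’ miss can become a clustered hit.

```python
# Hypothetical mapping from exact solvents to 'coarse' classes, illustrating
# how clustering an exact prediction post hoc can turn a miss into a hit.
cluster_of = {"water": "aqueous", "methanol": "alcohol", "ethanol": "alcohol"}

# Toy ground-truth and predicted exact solvents (hypothetical).
y_true = ["ethanol", "water", "methanol", "ethanol"]
y_pred = ["methanol", "water", "ethanol", "water"]

# Accuracy on the exact labels vs. accuracy after mapping both the truth
# and the prediction to their coarse classes.
exact_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
clustered_acc = sum(cluster_of[t] == cluster_of[p]
                    for t, p in zip(y_true, y_pred)) / len(y_true)
# exact: 1/4 correct; after clustering: 3/4 correct
```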
Our case study of Suzuki–Miyaura reactions demonstrates that existing machine learning methods can overcome popularity metrics by using an appropriate representation. By using a CGR-based representation, we developed models that outperformed the existing state-of-the-art on the USPTO Suzuki dataset. Despite this, further improvement of the models is possible. Alternative classification metrics (see Fig. S7) show that, despite the higher accuracies, these models still require improvements to truly ‘learn’ the underlying chemistry being modelled. This underscores the need for further improvements, either through the use of more complex architectures (though this does not always help, see ref. 1) or through other strategies, for example refining solvent and base clustering to address class imbalance and data sparsity (provided that clusters are both chemically meaningful and useful to end users). Furthermore, this modelling ignores the presence of many other variables in Suzuki reactions, like temperature, Pd-source and ligand. With an increase in the number of variables, the condition space expands, exacerbating the data sparsity problem and increasing the importance of methods to mitigate it.
Nonetheless, this example highlights the critical role of data representation in reaction condition modelling, aligning with our broader argument that thoughtful representation design is key to unlocking improvements in model performance. However, other challenges discussed in this perspective, such as data sparsity and selection bias, remain unresolved in this case study. Bridging these gaps will require continued exploration of strategies such as data augmentation, chemically informed clustering, or more advanced machine learning architectures.
Despite these limitations, the ability of models to outperform popularity benchmarks provides a step forward in bridging the gap between computational predictions and practical applications, even with existing literature data.
The challenges with reaction data are well-documented,27–29 but we expect that, with the increased awareness among synthetic chemists of the importance of holistic reporting of experiments, data quality will continue to improve, resulting in improved models. Initiatives like ORD promote standardised recording of reaction data, which will act to counteract the existing biases. However, bridging the gap between existing datasets and the ‘ideal’ datasets of the future will require continued innovation, such as incorporating procedural data,70 data augmentation71 and innovative sampling techniques,32 to maximise existing data and create generalisable, robust models.
In reaction condition prediction, the many-to-many relationship between a reaction equation and feasible conditions requires that models should predict multiple conditions for a single reaction equation, and the format of this output is dependent on the task at hand. Although the prediction of ‘exact’ reagents has its place in reaction optimisation, we believe that existing data requires condition predictions to adopt a more general condition encoding. As the scope of reactions considered increases – moving towards a ‘global’ model – and the data becomes sparser, we suggest that model outputs should generalise, for example through the categorisation of similar reagents in order to reduce the number of classes that a model is required to predict from. ‘Local’ models remain valuable in scenarios where data sparsity is less of a concern, such as carefully curated datasets with high condition coverage for each reaction equation. In this case, higher-fidelity condition predictions are possible, and the requirement for output ‘generalisation’ diminishes. With improving large-scale data quality, increasing fidelity of predictions from ‘global’ models may be possible in the future. In the meantime, the selection of an appropriate ‘general’ condition encoding remains an area for future work, and such a representation should incorporate chemical knowledge whilst compressing condition space to mitigate existing data concerns.
We provide an overview of existing models, through the lens of ‘global’ and ‘local’ models, following from the classifications of Raghavan et al.45 These different approaches have leveraged different representations, like strings (in the case of transformers), graphs and reaction fingerprints. Our case study highlights the critical role of reaction representation in reaction condition modelling, emphasising that thoughtful representation design is key to unlocking improvements in model performance. In particular, using reaction representations that explicitly encode the reaction transformation occurring, like the CGR, can improve upon the performance of other representations (like Morgan fingerprints).
By leveraging higher-quality data, the condition prediction models of the future will improve upon the current generation of models. However, during this transition period, we believe that developing novel encodings for both the input and output of these models can enhance their practical applicability for synthetic chemists.
This journal is © The Royal Society of Chemistry 2025