Bayesian optimisation for additive screening and yield improvements – beyond one-hot encoding †

Reaction additives are critical in dictating the outcomes of chemical processes, making their effective screening vital for research. Conventional high-throughput experimentation tools can screen multiple reaction components rapidly. However, they are prohibitively expensive, which puts them out of reach for many research groups. This work introduces a cost-effective alternative using Bayesian optimisation. We consider a unique reaction screening scenario evaluating a set of 720 additives across four different reactions, aiming to maximise UV210 product area absorption. The complexity of this setup challenges conventional methods for depicting reactions, such as one-hot encoding, rendering them inadequate. This constraint forces us to move towards more suitable reaction representations. We leverage a variety of molecular and reaction descriptors, initialisation strategies and Bayesian optimisation surrogate models, and demonstrate convincing improvements over random search-inspired baselines. Importantly, our approach is generalisable and not limited to chemical additives, but can be applied to achieve yield improvements in diverse cross-couplings or other reactions, potentially unlocking access to new chemical spaces that are of interest to the chemical and pharmaceutical industries. The code is available at: https://github.com/schwallergroup/chaos.


Introduction
As demonstrated in the space of chemical reactions, BO is particularly well suited for trading off exploration and exploitation in the low-data regime. 28–38 Surprisingly, most BO studies report one-hot encoding (OHE), which contains limited chemical information, to perform remarkably well. 31,35 This recurring observation raises an important question: why does OHE, with its inherent simplicity, manage to deliver competitive results? For instance, Shields et al. 32 compared OHE to more elaborate reaction representations such as quantum mechanical (QM) descriptors. The study found no significant difference in optimisation performance, stating that these two representations are "largely indistinguishable". This conclusion emerges from evaluating BO across several reaction datasets, including the optimisation of Buchwald–Hartwig reactions. Consider the case of the Buchwald–Hartwig dataset: five distinct reactions with 790 data points each, covering four variable components to optimise over: base, ligand, aryl halide and additive. Our study, while bearing similarities in examining four different Ni-catalysed photoredox decarboxylative arylation ‡ reactions with 720 data points per reaction, 39 has a distinguishing feature: all other reaction components remain fixed, except for the additive being screened. Consequently, the resulting OHE vectors create an orthogonal space where the number of dimensions equals the number of data points, making it difficult for any machine learning method to grasp valuable patterns. This inherent constraint forces us to think beyond OHE and leverage alternative representations to combine with BO and pinpoint the optimal additives for given chemical reactions. Accordingly, we have examined representations that not only address these limitations but also ensure computational efficiency on par with OHE.
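The orthogonality problem described above can be made concrete with a small, hypothetical illustration: when only the additive varies, one-hot encoding the 720 additives yields a feature matrix equal to the 720 x 720 identity, so every pair of distinct data points is equally dissimilar and no structure is available for a model to learn from.

```python
import numpy as np

# Hypothetical illustration: with one variable component, one-hot encoding
# assigns each of the n additives its own dimension, so the feature matrix
# is the n x n identity matrix.
n_additives = 720
ohe = np.eye(n_additives)

# Pairwise dot products: 1 on the diagonal, 0 everywhere else -- every
# additive looks maximally different from every other one.
similarity = ohe @ ohe.T
print(np.allclose(similarity, np.eye(n_additives)))  # -> True
```

Because all off-diagonal similarities are exactly zero, a surrogate model sees no shared information between additives, which is precisely why OHE fails in this setting.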

Additives are critical for altering the reactivity and outcome of chemical reactions. 40,41 According to the IUPAC Gold Book definition, additives are "substances that can be added to a sample for any of a variety of purposes". 42–45 Identifying optimal additives can significantly enhance reaction efficiency, selectivity and yield, leading to cost-effective and sustainable chemical processes. 46,47 In this study, we introduce a BO-based approach for efficient exploration of the additive § search space. Subsequently, we explore a range of representation methods to determine the most appropriate ones for uncovering additive-induced yield improvements. This approach not only streamlines experimental design and optimisation but also holds immense promise for various applications within the field of chemistry. While Prieto Kullmer et al. 39 screened these compounds using high-throughput experimentation (HTE), not all laboratories can access robotic platforms. Synthetic chemists, however, could highly benefit from using BO to discover the optimal additives, allowing them to improve a reaction without the need for exhaustive (and expensive) testing of all possible combinations (Fig. 1). Compared to existing applications of BO to chemical reactions (e.g., Buchwald–Hartwig reactions 48 ), the additive dataset is substantially more challenging. Firstly, OHE is ill-suited for this task as it results in high-dimensional vectors, with only one active dimension per additive. The resulting extreme sparsity and lack of shared information make it difficult to address the complexities of the dataset. This kind of representation limits the use of machine learning models, which can struggle to extract valuable insights. While seemingly intuitive, we empirically confirm these shortcomings, with details in the results section. As we demonstrate, applying BO in this setting fails to improve over random search. Secondly, the additives in this paper exhibit greater structural diversity than the components screened in other HTE studies. This distinctiveness significantly increases the computational demands for generating human-labelled atomic or local QM descriptors. We overcome these limitations by using computationally efficient reaction and molecular representations with a maximal diversity initialisation scheme and flexible surrogate models. Finally, the inherent complexity of the dataset coupled with the limited predictive signal ¶ between the representations and the output (yield) poses a significant challenge for optimisation. Existing research, however, suggests that the application of BO can still help reach promising results even in those scenarios. 49 Despite these challenges, we demonstrate that augmenting BO with adequate reaction representations, initialisation schemes and appropriate surrogate models results in an efficient search towards the best-performing additives in less than 100 evaluations while using as little as ten initialisation reactions.
The structure of this paper is as follows: Section 2 details the data and the representations, Section 3 covers methodology, followed by a presentation and discussion of results in Section 4. We conclude and offer future directions in Sections 5 and 6.

Data
Fig. 1 Visualisation of the Bayesian optimisation pipeline for additive screening. Starting from the HTE dataset, 39 we extract either additive SMILES or reaction components to generate reaction SMILES. We propagate these SMILES through a molecular encoder (i.e., fingerprints, fragprints, xtb, cddd, mqn, chemberta) or a reaction encoder (rxnfp, drfp) into features. The built features allow us to select initial points, leveraging methods like clustering, to set up the Gaussian process surrogate model. The BO loop then runs for a predetermined number of iterations with the objective of reaching the global optimum, which corresponds to the highest UV210 product area absorption. § The term "additive" refers to a single selection from a set of 720 examined additives.
¶ The term "limited signal" refers to the low validation scores indicating poor alignment between the data representations and the desired output in the low-data regime. This complexity results in a challenging modelling scenario.
Prieto Kullmer et al. 39 set out to improve the reactivity of challenging Ni-catalysed photoredox decarboxylative arylation reactions in a high-throughput experimentation setup. They examined different cross-coupling reactions on separate reaction plates, each containing the same set of diverse additives. The aim was to determine additives that can further enhance the reaction efficiency of already highly reactive substrates. In total, the dataset consists of 720 additives used in four distinct reactions. We provide a brief description of each reaction below, with detailed explanations available in the ESI: † Reaction plate 1: investigates the impact of additives on the decarboxylative C–C coupling of Informer X2 (a highly reactive aryl halide substrate) 50 and cyclohexanoic acid.
Reaction plate 2: explores the influence of additives on the coupling between 3-bromo-5-phenylpyridine and cyclohexanoic acid.
Reaction plate 4: assesses the effect of additives on the coupling between Informer X2 and hexanoic acid.

Data representation
Different representations impact the efficiency and accuracy of optimisation by capturing unique chemical aspects. 52,53 In each of the four reactions within the screening dataset, the additive stands out as the sole variable component, with all other components kept fixed.
This property offers two primary ways to encode these reactions: one by isolating the additive and the other by considering the holistic reaction. The former pertains to molecular representations of the additive, while the latter could involve reaction fingerprints or global QM descriptors.

Molecular descriptors
2.2.1 Traditional cheminformatics descriptors. Describing molecules through molecular fingerprints is a common approach in computational chemistry. 54,55 Together with mqn descriptors, 56–58 they offer representations summarising molecular structure. In our experiments, we leveraged both mqn descriptors and Morgan fingerprints (referred to as fingerprints henceforth). Additionally, we explored combining fingerprints with encoded fragments of a molecule (computed using RDKit 59 ), essentially forming a more comprehensive representation (aptly coined fragprints 60–63 ). The enriched fragprints provide insights into both the overall structure and the specific constituents of the molecule. Though computationally efficient compared to descriptors involving intensive human labour or simulations, traditional cheminformatics descriptors might not capture the complexity of chemical interactions.
2.2.2 Local QM descriptors. This need for higher-fidelity representations brings us to local quantum mechanical (QM) descriptors. Chemically meaningful representations offer advantages, especially in the low-data regime. 64 Previous studies employed mixtures of molecular and atomic QM descriptors to enrich the feature space. 32,48 However, local atomic QM descriptors are computationally expensive and require deep domain knowledge. Additionally, they are typically limited to molecules with similar functional groups, and they may not be suitable for the broad diversity of additives in the screening dataset. 39 Given these limitations, we explored an alternative approach using xtb features. 65,66 Xtb, short for "extended tight-binding", offers a balanced trade-off between computational cost and chemical accuracy. It captures information about molecular orbitals, charges and other quantum mechanical properties, which are especially valuable when the electronic structure plays a central role in defining the outcome of the reaction. Full local QM descriptors, however, remain less accessible for broader applications due to their computational expense and domain-specific requirements.
2.2.3 Data-driven descriptors. Though rich in chemical significance, QM descriptors require careful selection and rigorous preprocessing to ensure the captured information is relevant and accurate. Traditional cheminformatics descriptors resolve these issues, but at the price of severe oversimplification. Data-driven methods, on the other hand, stand out as compelling alternatives, offering a versatile and scalable way to represent chemical data while balancing computational efficiency and the capture of complex chemical interactions.
We focus on data-driven methods that utilise simplified molecular-input line-entry system (SMILES) representations. 67 SMILES codes are textual representations of molecules that encode the molecular graph structure in a simple string format. Their textual nature allows employing advanced machine learning models originally designed for natural language processing tasks. 68–73 In this study, we specifically employ two data-driven molecular descriptors. First, CDDD (Continuous and Data-Driven molecular Descriptors), which translates between semantically equivalent but syntactically different molecular representations such as SMILES and InChI. 70 Second, ChemBERTa, a BERT-based model pre-trained on a large corpus of chemical SMILES strings using an optimised pretraining procedure. 71,74,75

Reaction descriptors
Translating from molecular descriptors to reactions poses an interesting challenge. For instance, Schneider et al. 76 compute the reaction fingerprint by subtracting the molecular fingerprints 54,55 of the reactants from those of the products. Another approach is to concatenate different reaction components and create an information-rich final vector. Although this method offers considerable flexibility, it comes with the 'curse of dimensionality': 77 concatenated vectors can quickly increase in size based on the number of reaction components. This property can limit their general applicability, as a variable number of reaction components creates variable-sized vectors, which are inconvenient for machine learning models. A straightforward yet effective alternative is one-hot encoding (OHE). This technique maps each component of the reaction to a unique binary vector, where a single active dimension indicates the presence of that specific component. To represent the entire reaction, we can concatenate these one-hot encoded vectors, resulting in a fixed-size binary vector that serves as a surprisingly effective representation for Bayesian optimisation of chemical reactions, although, as already mentioned, it is less suitable for our use case.
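The concatenated one-hot scheme above can be sketched as follows. This is a minimal, hypothetical example (the component names and the `concat_ohe` helper are illustrative, not the authors' code): each variable component gets its own one-hot vector over its vocabulary, and the reaction is the concatenation.

```python
import numpy as np

def concat_ohe(choices, vocabularies):
    """Encode a reaction as the concatenation of one-hot vectors,
    one per variable component (e.g. base, ligand, ...)."""
    parts = []
    for choice, vocab in zip(choices, vocabularies):
        v = np.zeros(len(vocab))
        v[vocab.index(choice)] = 1.0  # single active dimension per component
        parts.append(v)
    return np.concatenate(parts)

# Hypothetical vocabularies for two variable components:
bases = ["K3PO4", "NaOtBu"]
ligands = ["XPhos", "SPhos", "dppf"]
x = concat_ohe(["NaOtBu", "dppf"], [bases, ligands])
print(x.tolist())  # -> [0.0, 1.0, 0.0, 0.0, 1.0]
```

The resulting vector has fixed size (the sum of the vocabulary sizes), which is what makes this encoding convenient when several components vary, and degenerate when only one does.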
Recent approaches have looked to map reactions directly to a fingerprint, independent of the number of reaction components and the underlying representation. Schwaller et al. 69 derived data-driven reaction fingerprints (RXNFP) directly from the reaction SMILES by employing transformer models 78 trained for reaction type classification tasks. Reaction SMILES is an extension of the regular SMILES notation that represents not just a single molecule but entire chemical reactions. It includes the SMILES strings of reactants and reagents on one side (separated by dots) and the product on the other side, separated by the special character ">>". The benefit of this approach is its ability to map reactions to highly versatile continuous representations regardless of the number of reaction components. However, using rxnfp in this project's scope might not be adequate, since additives play a relatively minor role in reaction type classification. On the other hand, Probst et al. 79 introduced the differential reaction fingerprint (DRFP). This representation is based on the symmetric difference of two sets generated from the molecules listed left (reactants and reagents) and right (products) of the reaction arrow, using a method that captures the environments around atoms in a radial manner, termed 'circular molecular n-grams'. This design makes them extremely flexible, effectively encoding the interplay of diverse reaction elements and maintaining robust performance in scenarios where either a single or multiple reaction components may vary.
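The symmetric-difference idea behind DRFP can be illustrated with a deliberately simplified toy (this is not the real drfp package: character n-grams stand in for circular molecular n-grams, and the bit size and hashing scheme are arbitrary choices for the sketch):

```python
import hashlib

def ngrams(side, n=3):
    """Toy stand-in for circular molecular n-grams: character n-grams."""
    return {side[i:i + n] for i in range(len(side) - n + 1)}

def toy_drfp(reaction_smiles, n_bits=64):
    """Illustrative DRFP-like fingerprint: hash the symmetric difference
    of the substructure sets drawn from the left and right sides of the
    reaction arrow into a fixed-size binary vector."""
    left, right = reaction_smiles.split(">>")
    diff = ngrams(left) ^ ngrams(right)  # symmetric difference of the two sets
    bits = [0] * n_bits
    for s in diff:
        h = int(hashlib.md5(s.encode()).hexdigest(), 16)
        bits[h % n_bits] = 1
    return bits

fp = toy_drfp("CCO.CC(=O)O>>CC(=O)OCC")
print(sum(fp) > 0)  # -> True
```

Because only substructures that change across the arrow survive the symmetric difference, the fingerprint highlights the transformation itself, independent of how many components appear on either side.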

Methods
In this section, we detail our methodological approach to using Bayesian optimisation for the chemical dataset in question. We first describe the specific Bayesian optimisation framework employed and its necessary elements, such as the surrogate model, acquisition functions and the strategies employed to initialise the BO search.
Several components play crucial roles in determining the outcome of a BO-based search strategy. Firstly, the representation of chemical reactions dictates how the model interprets the data. Secondly, the kernel choice in the surrogate model shapes the learned relationships between data points. Thirdly, the initialisation strategy influences the starting point and path of the optimisation process. Lastly, the acquisition function guides the decision on where to sample next.
We applied BO on a dataset of 720 screened additives across four unique reactions, aiming to maximise the UV210 product area absorption. To evaluate the BO approach, we initiate the search with a set of 10 starting points. The optimisation process runs for up to 100 iterations, during which we monitor the performance against the remaining dataset, comprising over 600 data points. We measure the success of the optimisation by assessing how many of the top-performing reactions we identify during these iterations. For this reason, we define a top-n neighbourhood metric as the set of n reactions with the highest yield for each reaction plate. The motivation behind the top-n neighbourhood search is to provide a diverse set of high-performing additives, giving researchers more flexibility in their choice based on factors such as availability, complexity and price. This approach allows for a more flexible and pragmatic selection of additives and reflects the practical constraints and requirements of real-world applications. To find the optimal configuration, we carry out a grid search over combinations of parameters, namely data representation, kernel, initialisation strategy and acquisition function, repeating the runs across 20 different seed values to ensure robust findings. The limitations of one-hot encoding (OHE) on this dataset directed us towards the exploration of other molecular and reaction representations, both computationally and chemically reasonable, while steering away from the intensive demands of quantum molecular descriptors. For data representation, we extensively evaluated fingerprints, fragprints, mqn and xtb features, data-driven cddd and chemberta descriptors, and holistic reaction representations such as rxnfp, drfp and OHE. We used a Gaussian process surrogate model and assessed the influence of different kernels (Matern, Tanimoto, Linear). To select the 10 starting points we used an initialisation strategy (random, clustering, or maximising the minimum distance between the selected points), and for guiding the search towards promising regions we compared acquisition functions: upper confidence bound (UCB) and expected improvement (EI). The core objective of this study was to identify whether BO can emulate or even surpass the outcomes of HTE and, if so, under which configuration. We used the first of the four available reactions to evaluate the combinations of parameters over 20 different seed-runs and finally carried out the optimisation loop for the remaining reactions using the best-performing setup.
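The top-n neighbourhood metric admits a compact sketch (a hypothetical helper, not the authors' evaluation code): it is simply the fraction of the n highest-yield additives that appear among the points queried so far.

```python
def top_n_discovery(yields, queried_idx, n=5):
    """Fraction of the n highest-yield additives found among the
    queried indices (illustrative sketch of the top-n metric)."""
    top_n = sorted(range(len(yields)), key=lambda i: yields[i], reverse=True)[:n]
    found = len(set(top_n) & set(queried_idx))
    return found / n

# Toy data: the top-3 additives by yield are indices 1, 3 and 5.
yields = [0.1, 0.9, 0.4, 0.8, 0.2, 0.7]
print(top_n_discovery(yields, queried_idx=[1, 3, 0], n=3))  # -> 0.666...
```

Tracking this fraction over BO iterations gives the convergence curves discussed in the results.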
Below, we delve deeper into each of the necessary BO elements (Table 1), explaining our choices and their implications.
Table 1 Overview of the variables tested in Bayesian optimisation including kernel types, initialisation methods, acquisition functions and reaction representations.We ran each combination through 20 different seed-runs to ensure statistical significance and replicability

Bayesian optimisation
We can dene many problems in scientic discovery as a global optimisation task of the form x * ¼ arg max x˛X f ðxÞ; (1) where f : X /ℝ is a function over a design space X .As previously discussed, the molecular and reaction design space can be both discrete and continuous, and can consist of structured data representations such as graphs and strings.Eqn ( 1) is a blackbox optimisation problem as we do not know the analytic form of f or its gradients and may only query f pointwise.Furthermore, evaluations of f require laboratory experiments and are high-cost and time-consuming.Lastly, our observations of f are subject to a (potentially heteroscedastic 80,81 ) noise process.BO 82 is an adaptive strategy that has recently emerged as a powerful solution method for black-box optimisation problems with proven success in applications including machine learning hyperparameter optimisation, 83,84 chemical reaction optimisation, 32 protein design, 85 and as a sub-component in AlphaGo 86 and Amazon Alexa. 87The ESI 1 provides pseudo-code for bo and more details on the algorithm.† The readers who wish to delve further into the mechanics and philosophy of Bayesian optimisation can refer to a vast collection of standout resources.For a more application-focused introduction, the documentation for Meta's Adaptive Experimentation (Ax) Platform offers a comprehensive yet accessible overview. 88Complementary, those seeking a rigorous understanding with mathematical foundations can refer to. 89

Surrogate model
The backbone of Bayesian optimisation is a surrogate model approximating the complex relationships and dependencies within the data. A surrogate model is a probabilistic method that acts as a replacement for the true objective function. For its role in BO, the surrogate must combine two primary components: a prediction model and uncertainty estimates. The prediction model produces the mean function value (subject to measurement noise) across the input space. The uncertainty estimates quantify the model's confidence in its predictions. This definition allows a variety of models to act as surrogate components in the Bayesian optimisation setup. Any model that can output predictions over the input space and confidence over those predictions is a potential choice for a surrogate model. A favoured selection is often a Gaussian process because of its flexibility, simplicity, and ability to capture complex functions with relatively few hyperparameters to tune (admitting second-order optimisers such as L-BFGS-B 90 for the marginal likelihood loss function).
Gaussian processes easily adapt to different problem domains by changing the kernel function, which defines the covariance structure between input points. In Gaussian processes, kernel functions measure the similarity between data points in the input space. This similarity is then used to predict the function value for a new input by considering its proximity to previously evaluated data points. Different kernel functions can capture different types of relationships between data, 91,92 and their choice plays a significant role in determining the properties of the surrogate model, such as smoothness, periodicity, and stationarity. Selecting kernel functions that are appropriate for the chosen reaction representations is essential in the context of reaction optimisation.
Among the kernels developed for chemical reactions, we find the Tanimoto kernel 62,93,94 effective for binary representations due to its ability to quantify structural overlap. The Linear kernel is often sufficient if the descriptors are informative enough or the problem has a linear nature. Additionally, we consider the Matern kernel for its flexibility in capturing varying degrees of smoothness in the data, making it a suitable choice for more complex reaction spaces.
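The Tanimoto kernel mentioned above has a simple closed form for binary fingerprints, k(a, b) = ⟨a, b⟩ / (‖a‖² + ‖b‖² − ⟨a, b⟩), which reduces to the Jaccard similarity of the active bits. A vectorised sketch (assuming plain binary feature matrices, not any particular GP library):

```python
import numpy as np

def tanimoto_kernel(X, Y):
    """Tanimoto (Jaccard) similarity matrix for binary fingerprints:
    k(a, b) = <a, b> / (|a|^2 + |b|^2 - <a, b>)."""
    X, Y = np.asarray(X, float), np.asarray(Y, float)
    dot = X @ Y.T                         # shared active bits
    xx = (X * X).sum(axis=1)[:, None]     # active bits in each row of X
    yy = (Y * Y).sum(axis=1)[None, :]     # active bits in each row of Y
    return dot / (xx + yy - dot)

a = [[1, 1, 0, 1]]
b = [[1, 0, 0, 1]]
print(tanimoto_kernel(a, b))  # -> [[0.6666...]]
```

For identical fingerprints the kernel equals 1, and it decays towards 0 as the bit overlap shrinks, which is why it pairs naturally with fingerprint and drfp representations.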

Acquisition function
A exible probabilistic surrogate model captures prior beliefs about the black-box objective f(x) guiding the acquisition function aðx; DÞ towards promising regions of the search space.The acquisition function balances between the exploration of uncertain regions and the exploitation of high-yield areas.More specically, exploration refers to sampling points in the design space where the model's prediction uncertainty is high, while exploitation involves sampling points where the model predicts high function values.This trade-off is central to the success of Bayesian optimisation, as it ensures that the method does not prematurely converge to suboptimal solutions.6][97][98] In the context of chemical reaction optimisation, computational overhead from BO is oen negligible compared to the time and resource drain of actual chemical experiments.

Design spaces: reaction versus BO conguration
In reaction optimisation, two design spaces serve distinct yet interlinked roles. The "reaction design space" covers the possible combinations of reaction components and conditions, while the "BO configuration design space" entails the model parameters and optimisation frameworks facilitating exploration of the reaction design space.
The reaction design space contains potential combinations of additives, reactants, catalysts, solvents, and reaction conditions such as temperature, pressure and concentration. In this study, the focus narrows down to a set of possible additives.
On the other hand, the Bayesian optimisation configuration design space includes model parameters and optimisation frameworks that enable effective exploration of the reaction design space. Here we explore parameters such as the choice of reaction representation, kernel function and data initialisation method. Understanding the interplay between these factors is key to achieving efficient search and optimisation. For example, the kernel choice may depend on the reaction representation, which, as a consequence, dictates the optimisation success.

Model initialisation
Initialising the BO algorithm with a diverse set of sample data is one of the determining factors for effective reaction optimisation. 34 Using Gaussian processes as the surrogate models allows us to operate effectively within the low-data regime due to their well-calibrated uncertainty estimates. For a detailed description of Gaussian processes in the context of structured inputs, see ref. 61 and 62. In the domain of chemistry, operating within a low-data regime is often the norm rather than the exception. Furthermore, chemists might face a dual incentive when empowered with BO solutions: starting the optimisation process early to save time and resources, while also needing a diverse set of data to initialise the optimisation models effectively.
A diverse and representative initial sample improves the calibration of the surrogate model. 100,101 This selection leads to increased precision in uncertainty measurements and, subsequently, more accurate model predictions. To achieve this, we employ maximum diversity initialisation schemes that enable us to explore the structured search space of reactions and select a representative sample of points to accelerate the optimisation process. These schemes include clustering, maximal coverage, and a random sampling baseline.
3.5.1 Clustering-based initialisation. We utilise the k-means clustering algorithm to group the available data into several clusters. This algorithm partitions the data into k clusters, each defined by a centroid located at the mean of the points in that cluster. We select the data points closest to the centroids as the initial points for the Gaussian process surrogate of the BO search. This approach ensures a set of diverse initial points that qualitatively describe the entire search space, taking into account the structure of the data. To unify the clustering method across different representations (both continuous and binary), we first perform a principal component analysis (PCA), narrowing down the representations to the 10 most significant principal components. Although we considered other methods, including k-medoids** with different distance metrics, k-means demonstrated better convergence in our experiments (Fig. 2).
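The clustering-based scheme can be sketched with a minimal Lloyd's k-means (a self-contained toy, not the authors' implementation; the PCA step is omitted here and the feature matrix is assumed to be already low-dimensional):

```python
import numpy as np

def kmeans_init_points(X, k, n_iter=50, seed=0):
    """Run a minimal Lloyd's k-means on feature matrix X, then return the
    index of the real data point closest to each centroid; these indices
    seed the Gaussian process surrogate."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid, then recompute means.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    # Pick the actual data point nearest each final centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(d.argmin(axis=0).tolist()))

# Two well-separated toy blobs: one initial point is drawn from each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.0, 10.1]])
print(kmeans_init_points(X, 2))
```

Selecting real data points (rather than the centroids themselves) matters here, because each initial point must correspond to an additive that can actually be run in the lab.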
3.5.2 Maximal coverage initialisation. The maximal coverage algorithm, also known as the farthest-point-first algorithm or maxmin sampling, is another method useful for surrogate model initialisation. This method iteratively adds subsequent data points by selecting those that maximise the minimum distance to already selected data points, thereby increasing the coverage of the search space. The process begins with a randomly selected point and continues until we reach the desired number of initial points. Depending on the nature of the data representation, we can employ custom distance metrics such as Jaccard or Euclidean to effectively cover the unique chemical space.
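The greedy maxmin procedure fits in a few lines (an illustrative sketch; the starting index and distance function are pluggable, matching the Jaccard/Euclidean choice mentioned above):

```python
def maxmin_sample(points, n, dist, first=0):
    """Farthest-point-first (maxmin) sampling: greedily add the point whose
    minimum distance to the already selected set is largest."""
    selected = [first]
    while len(selected) < n:
        best = max(
            (i for i in range(len(points)) if i not in selected),
            key=lambda i: min(dist(points[i], points[j]) for j in selected))
        selected.append(best)
    return selected

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
pts = [(0, 0), (1, 0), (10, 0), (5, 5)]
print(maxmin_sample(pts, 3, euclid))  # -> [0, 2, 3]
```

Note how the nearby point (1, 0) is skipped in favour of the two far-flung ones, which is exactly the coverage behaviour desired for initialising the surrogate.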

3.5.3 Random sampling initialisation.
Finally, we consider random sampling as a simple yet effective baseline initialisation process. While it does not actively seek diversity or exploit any structure of the dataset, it serves as a reference-point initialisation scheme that is particularly convenient in high-dimensional spaces. A primary drawback of this method is its lack of strategy or guidance, which may lead to poor coverage of the search space compared to the previously mentioned methods. Random initialisation is also more prone to redundancy, i.e., the possibility of selecting similar points, thereby reducing the diversity of the initial points and potentially resulting in a slower convergence rate.

Results & discussion
This section provides a comprehensive assessment of the BO approach when applied to the additive screening dataset. With the methodological procedures outlined in the Methods section established, we now turn to the results obtained from varied configurations and parameters, including reaction representations, surrogate model kernels, data initialisation strategies and acquisition functions.
To reiterate, we focus on identifying the top-performing reactions within the evaluated BO iterations. This is measured using the top-n neighbourhood metric, aiming for a selective and diverse array of high-yield reactions. As a compromise between the top one and top 10 discovered additives, and aiming for a clear visualisation, we show the percentage of the top five performing additives discovered during the optimisation process across different representations in Fig. 3. The same plot shows the significant importance of the reaction representation choice for the success of the BO strategy.

Fig. 2 t-SNE visualisation 102 of the fragprints representation of Reaction 1 in the latent space. The colours describe the clusters, highlighting the central additives with their corresponding molecular structures. We discover the phthalimide additive, identified as the best overall additive in the original study, within the initial clustering. This compelling side effect of the clustering demonstrates its ability to effectively describe the latent space and identify appropriate initial additives.

** The k-medoids method is similar to k-means but uses the most centrally located data point in a cluster (medoid) to represent that cluster. This method can employ various distance metrics, allowing it to be more flexible based on the data representation type.

Given the unique nature of additive screening, we can encode reactions using either reaction or molecular descriptors. As the additive is the only variable component in additive screening, it uniquely describes each data point per reaction in the dataset. However, reaction representations, like drfp, inherently capture more comprehensive information by considering the interplay of all reaction components. This representation emerged as particularly effective, especially when combined with the Matern kernel, contrasting with our expectations about the binary-tailored Tanimoto kernel. Moreover, the Matern kernel dominates the other alternatives over the majority of representations, highlighting its adaptability and robustness.
Focusing on the internal structure of additives only, both fingerprints and fragprints emerge as strong contenders. The slight advantage of fragprints suggests the potential relevance of molecular fragments in the context of the evaluated additives. Among the continuous representations, data-driven, feature-rich representations such as cddd underperform in BO tasks despite having higher validation scores in model-fit-related metrics (see Fig. 1 in the ESI †). This outcome may be due to the over-complexity of this representation (continuous 512-dimensional vectors) accompanied by the constraints of a low-data regime. While cddd can capture intricate chemical features and relationships, it also introduces a high degree of complexity into the model, which can be challenging to decipher with only a small number of points in the initialisation.
Importantly, we are oen inclined to associate the complexity of the feature with its dimensionality.While the connection can be made for continuous representation, in binary representations, the high dimensionality oen takes on a different meaning due to the nature of the input space.As a consequence, binary data translates to "practical" dimensionality that is generally lower than what one might encounter in a Euclidean space.For example, binary representations in our experiments, such as ngerprints and drfp, form 512dimensional design spaces (Table 2), but the complexity they introduce to the model is signicantly lower compared to the 512-dimensional continuous cddd representations, enhancing the BO performance as a result.
Another data-driven reaction representation, rxnfp, follows a similar path to cddd, as shown in Fig. 4. The same argument about coupling rich features with a low-data setup applies as for cddd, yet with a considerable difference in the information encoded by the two representations. rxnfp allows us to encode the whole reaction, including the interrelation between the additive and the other reaction components. However, the design of rxnfp may not be well-suited for the task at hand. Out of the box, the rxnfp representation aims to capture the global information of a reaction, including all reactants, reagents, and the transformation itself. It encodes information valuable for distinguishing reaction types. In the unique setup of additive screening, where the only variable component is the additive, this global reaction information may dilute the effect of the additive, given its limited role in this task, and therefore undermine the performance of BO.
The xtb features, on the other hand, include properties related to the additive's electronic structure and molecular properties and result in low-dimensional continuous representations. However, similar to drfp, they show an increased sensitivity to the choice of kernel. The discovery of the phthalimide ligand additive in the original study, and the consequent mechanistic understanding it provided, 39 served as the initial reasoning why xtb features might be an effective representation for the BO search in this paper. The specific electronic properties of phthalimide, such as its electron-withdrawing capacity, significantly influence the oxidative addition step. These properties play a crucial role in facilitating the reaction by stabilising the transition state or the reactive intermediates.

Table 2 An overview of the different representations used in the Bayesian optimisation process along with their respective dimensions and types. The dimension column indicates the number of features in each representation, while the type column specifies the nature of the data: binary, mixed (for fragprints, since they include encoded fragments on top of the fingerprint representation) or continuous. The table presents the diversity of the data representations explored in this study, illustrating the range of complexity and information encapsulated in each.

The remaining representations, mqn, chemberta and, as anticipated, OHE, show below-par performance, indicating their limited utility in the BO search. Given its inherent design, we expected OHE to result in poor exploitability of the data and therefore to limit the model in learning from this representation. As a consequence, the outcome is often worse than random search, as shown in Fig. 4. Similarly, mqn and chemberta perform on par with random search.

Following on the inuence of various reaction representations, we evaluated the remaining parameters and represented the results in the Table 3. Alongside the data representation, the choice of kernel, initialisation strategy and acquisition function further dictate the success of the Bayesian optimisation process.The table provides an aggregate overview of the performance of each of these parameters, measured in terms of the percentage rate of identifying the top one and top ve additives and the validation R 2 score evaluated on the remaining 610 additives aer the 100 BO iteration starting from the 10 initial compounds.The Matern kernel stands out, achieving the highest success rate in identifying valuable additives, albeit with noticeable variance.Tanimoto and Linear kernels, display lower success rates and inability to adapt to diverging underlying data distribution coming from different reaction representation alternatives.Moreover, the Linear kernel, while having the highest R 2 score, performs the least in terms of identifying top additives.As mentioned, this result conrms the premise that the bestperforming combination in terms of Gaussian process regression, does not necessarily yield the best results in a Bayesian optimisation setting.This observation underscores the importance of considering the interplay between representations, initialisation strategies, and the broader optimisation context when evaluating performance.The choice of initialisation which determines the starting points for the BO process impacts the trajectory towards the optimal values.Cluster-based initialisation, possibly due to its capability to capture diverse regions of the search space, achieves better BO performance scores.The ucb acquisition function slightly outperforms EI for the BO metrics.However, the R 2 score is noticeably higher for EI, signalising that this acquisition function tends to uncover points that improve the surrogate model t, but fails on leading towards 
optimal values in the search space.For a more comprehensive evaluation of different parameters and their inuence on BO search, refer to Table 1 in the ESI.† In summary, the combined inuence of reaction representation, kernel choice, initialisation strategy and acquisition function shapes the BO's ability to efficiently navigate the search space and identify high-yield additives.The results emphasise the importance of rational parameter selection in achieving the full potential of BO for chemical optimisation.
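As a sketch of the loop evaluated above, the following combines a Matern-kernel Gaussian process surrogate with a ucb acquisition over a fixed candidate set. It uses scikit-learn rather than the surrogates from our codebase; the array shapes, beta value and function interface are illustrative assumptions:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def bo_ucb(X, y, init_idx, n_iter=20, beta=2.0):
    """Discrete-candidate BO: Matern-kernel GP surrogate + UCB acquisition.

    X: (n, d) candidate representations; y: (n,) objective values,
    revealed one point per iteration to mimic running the reaction.
    """
    observed = list(init_idx)
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
        gp.fit(X[observed], y[observed])
        mu, sigma = gp.predict(X, return_std=True)
        ucb = mu + beta * sigma            # upper confidence bound
        ucb[observed] = -np.inf            # never re-select a measured point
        observed.append(int(np.argmax(ucb)))
    return observed, float(max(y[observed]))
```

Swapping the acquisition line for expected improvement, or the kernel for a Tanimoto or Linear alternative, reproduces the parameter grid aggregated in Table 3.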
Building on our analysis, we proceeded to fix the optimal choices for the kernel, acquisition function and initialisation strategy. Specifically, we employed the Matern kernel, the ucb acquisition function and the clustering initialisation method. With these choices set, we observed the Bayesian optimisation paths over 100 iterations, averaged across 20 seeds, for each of the representations and reaction plates. Fig. 4 reveals the resulting patterns and illustrates the strengths and limitations of each representation in the given setup. fragprints, combined with clustering initialisation, begin the optimisation at a substantially higher level but tend to plateau more quickly. Similarly, clustering initialisation works well for xtb features, but they have limited success in reaching the optimal additives. Impressively, however, both of these representations tend to uncover additives from the higher end of the complex long-tailed target distribution early in the BO loop, facilitated by the clustering of the design space. On the other hand, drfp, even though it starts from a set of additives with lower objective values, exhibits consistent growth, eventually steering towards the optimum. The cddd representation fails to reach the high-yielding regions of the search space, underscoring the idea that it is not ideally suited for the optimisation task at hand. The fingerprints representation, despite its third-place position in the previous analysis (see Fig. 3), shows mixed results across reaction plates in this specific setup, often performing similarly to random search. This result highlights the sensitivity of BO to the alignment of representation and chosen parameters, as the best configuration for this representation included EI as the acquisition function and the Tanimoto kernel for the surrogate model. Meanwhile, rxnfp, in combination with the Matern kernel, lags behind, reinforcing the notion of its optimal pairing with simpler kernels. As expected, OHE consistently performs among the worst, underperforming even against random search. Its inherent sparsity and lack of inter-data-point information render it ill-suited for the task. As a comparison, we also evaluated reaction-level representations (OHE, rxnfp and drfp) on the Buchwald–Hartwig dataset. Interestingly, OHE has been reported to perform particularly well on this data. Notably, in line with the findings from our primary study, drfp exhibited consistent and robust performance, showcasing its universal applicability in Bayesian optimisation scenarios across datasets with differing requirements and constraints. For more details on the results on this dataset, refer to Section A.5 in the ESI.†

Conclusion
Bayesian optimisation is a powerful optimisation method that steers the exploration of the search space towards more promising regions. It is especially valuable in chemistry, where it must trade off exploration and exploitation in the low-data regime. 32 This study showcases the effectiveness of BO, supported by appropriate reaction representations, initialisation strategies and surrogate model specification, in guiding the discovery of optimal additives in chemical reactions. The results highlight the importance of selecting suitable priors for optimal BO performance. We observed that drfp, when combined with the clustering initialisation method and the robust and adaptive Matern kernel, consistently outperformed both the one-hot encoding and random search baselines in identifying top-performing additives. Other representations have their merits, such as molecular fingerprints complemented with encoded fragments benefiting from the clustering and uncovering points on the higher end of the target distribution during the initialisation stage of BO. Similarly, xtb features facilitate clustering but show mixed performance across different reactions, emphasising their narrower applicability. Data-driven representations, although rich and expressive, demonstrated difficulties performing with limited data.

Future work
This research underscores the potential of using BO to accelerate additive discovery in chemical reactions, paving the way for more efficient experimental design and optimisation in the field of chemistry. The reaction type and its unique chemical features influence the performance of specific chemical representations in the optimisation process. In addition, devising methods to evaluate the fit of different representations for distinct sets of reactions could enhance the optimisation process, leading to more accurate and reliable results. Future research should focus on determining the optimal reaction representation, or possibly a dynamic combination of representations, for employing BO on different reaction types while incorporating domain knowledge: for example, switching from one reaction representation to another during the BO search. This strategy would allow us to combine the benefit of initialising the search at higher objective values with reaching the optimum, or to incorporate data-driven descriptors only once enough data has been collected for their optimal performance.
In this regard, several factors warrant further development. Firstly, potential biases in the dataset and assumptions made in the modelling could impact the generalisability of the results to other chemical reactions. Future work should focus on validating the methodology using diverse datasets and reaction types to ensure robustness and applicability across different contexts. Secondly, while this study investigated several reaction representations and initialisation strategies, additional research should explore alternative representations and strategies that may further improve the performance of BO in additive discovery by adapting to specific reaction types. For example, data-driven representations, although powerful, failed to deliver encouraging results for BO in this study. They could benefit from custom, specifically designed surrogates or fine-tuning strategies on the datasets at hand. By addressing these future research directions and refining the BO methodology, the chemical research community can benefit from further advancements in this powerful optimisation approach, ultimately contributing to a more efficient and comprehensive understanding of chemical reactions and their optimisation potential. The research can also extend to a broader range of chemical reactions and applications, such as high-throughput settings where batches of reactions can be evaluated simultaneously. 103,104

Fig. 3
Fig. 3 Bar plot showcasing the performance of different kernels within each representation.The Y-axis represents the percentage of the top 5 discovered additives.Each bar within a representation is colour-coded to indicate a specific kernel.The X-axis enumerates the various representations tested.The black dashed line connects the average performance of different representations, calculated by averaging across all kernels, initialisation methods, acquisition functions and seed-runs.For each kernel within a representation, the performance metrics are averaged across all initialisation strategies, acquisition functions and seed-runs.
The xtb features, encoding these electronic properties, should provide a detailed and nuanced representation of the additive. In scenarios where the additive's electronic structure is the primary determinant of its performance, xtb features might offer a significant advantage, but they omit other crucial information. Moreover, xtb features demand custom calculations, and they might not be ideal in cases where other factors, such as steric effects, define the reaction outcome.

Fig. 4
Fig. 4 Comprehensive visualisation of the yield distribution, Bayesian optimisation (BO) traces, and kernel density estimation (KDE) plots for the different reaction representations combined with the Matern kernel, clustering initialisation and ucb acquisition function. The left panel displays the UV210 product area absorption distribution (used as a proxy for yield). The middle section contains the BO traces for each representation, with the dotted line marking the optimal value for each reaction. The right panel shows the KDE of the accumulated best objective selected during the 100 BO iterations for each representation. drfp outperforms the other representations on all reaction plates, while fragprints demonstrate superior performance in early iterations.

Table 3
Performance metrics, aggregated over various parameters and 20 seed-runs, for different combinations of kernels, initialisation methods and acquisition functions. Metrics include the mean and standard deviation of the percentage of top 1 and top 5 yielding additives discovered during the 100 BO iterations. R² scores are evaluated on a held-out set comprising the remaining 610 additives after excluding the initial 10 points and the 100 selected by BO.

© 2023 The Author(s). Published by the Royal Society of Chemistry. Digital Discovery