Mengjia Zhu abc, Austin Mroz de, Lingfeng Gui af, Kim E. Jelfs de, Alberto Bemporad b, Ehecatl Antonio del Río Chanona *a and Ye Seol Lee *g
aDepartment of Chemical Engineering, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK. E-mail: a.del-rio-chanona@imperial.ac.uk
bIMT School for Advanced Studies Lucca, Lucca, 55100, Italy
cDepartment of Chemical Engineering, University of Manchester, Manchester, M13 9PL, UK
dDepartment of Chemistry, Imperial College London, White City Campus, W12 0BZ, UK
eI-X Centre for AI In Science, Imperial College London, White City Campus, W12 0BZ, UK
fSchool of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh, Scotland EH14 4AS, UK
gDepartment of Chemical Engineering, University College London, London, WC1E 6BT, UK. E-mail: lauren.lee@ucl.ac.uk
First published on 28th October 2024
Experimental design plays an important role in efficiently acquiring informative data for system characterization and deriving robust conclusions under resource limitations. Recent advancements in high-throughput experimentation coupled with machine learning have notably improved experimental procedures. While Bayesian optimization (BO) has undeniably revolutionized the landscape of optimization in experimental design, especially in the chemical domain, it is important to recognize the role of other surrogate-based approaches in conventional chemistry optimization problems. This is particularly relevant for chemical problems involving mixed-variable design spaces with mixed-variable physical constraints, where conventional BO approaches struggle to obtain feasible samples during the acquisition step while maintaining exploration capability. In this paper, we demonstrate that integrating mixed-integer optimization strategies is one way to address these challenges effectively. Specifically, we propose the utilization of mixed-integer surrogates and acquisition functions, methods that offer inherent compatibility with discrete and mixed-variable design spaces. This work focuses on piecewise affine surrogate-based optimization (PWAS), a surrogate model capable of handling medium-sized mixed-variable problems (up to around 100 variables after encoding) subject to known linear constraints. We demonstrate the effectiveness of this approach in optimizing experimental planning through three case studies. By benchmarking PWAS against state-of-the-art optimization algorithms, including genetic algorithms and BO variants, we offer insights into the practical applicability of mixed-integer surrogates, with emphasis on problems subject to known discrete/mixed-variable linear constraints.
Traditional experimental planning methods, such as full factorial designs, fractional factorial designs, and mixture design,2,3 often involve testing a set of selected conditions exhaustively on a predefined or a trial-and-error scheme, which can be time-consuming, resource-intensive, and impractical for complex systems.4 For example, the extended time frame associated with translating material discoveries into market-ready products, exceeding 20 years on average, presents a significant challenge for the timely implementation of novel technologies in industrial applications.5,6 Moreover, traditional methods rely heavily on expert knowledge, which introduces bias into an already complex design space. To address these challenges, different approaches have been studied and developed over the years to optimize the experimental planning process.
For example, recent developments in high-throughput experimentation (HTE)7,8 have significantly broadened the scope of experimental capabilities, allowing for the collection of several thousand data points within a constrained set of conditions in a reasonable time frame. The process of HTE implies a substantial enhancement in the efficiency and scale of data acquisition compared to traditional experimental methods.9 Yet, deciding on the constrained set of conditions to test remains challenging. One proposed solution is to integrate machine learning methods with HTE to efficiently navigate the search space.10
Generalized subset designs and model-driven approaches such as model-based design of experiments (MBDoEs) have emerged as an answer to the shortcomings of the traditional factorial design of experiments. Generalized subset design involves efficiently selecting representative subsets from a large pool of experimental points by employing combinatorial structures like orthogonal arrays and Latin squares. These subsets are transformed into symmetrical/hypercube representations, where optimal design planes are generated and later reverted to the original experimental space.11 Generalized subset designs are robust but cannot be adaptively updated with information gained and, therefore, can be inefficient when the design space is large. On the other hand, MBDoE aims to fit a reasonably well-performing semi-empirical or empirical model with a small number of information-rich samples.12–14 The parameters of the model can be estimated by different optimality criteria. For example, the commonly used D-optimal design15 can be used to select samples to estimate the parameters by maximizing the determinant of the information matrix. One limitation of MBDoE is its restricted exploration capability,4 and the need for the system to be well understood for an accurate model to be derived. Additionally, MBDoE primarily relies on predefined experimental designs, which may inadvertently overlook certain regions of the experimental space. This limitation can impede the discovery of patterns or relationships within the data. For example, MBDoE may fail to consider the impact of unconventional temperatures or unique solvents if these factors are not incorporated into the initial model assumptions, thereby potentially missing out on superior reaction conditions that could significantly improve the yield of a target product.
Recently, more flexible and adaptive data-driven approaches, such as Bayesian optimization (BO) methods,16–20 have been proposed as a solution, making it possible to adjust the experimental planning dynamically based on accumulated information. For instance, different surrogate-based optimization methods21–23 have been developed over the years. By incorporating Bayesian principles, the experimental space can be navigated more effectively, allowing for more comprehensive exploration and a better chance of uncovering hidden insights that might be overlooked by conventional DoE approaches. This emphasizes the importance of considering alternative approaches, particularly those with enhanced exploration capabilities, to ensure a more thorough and insightful understanding of the underlying phenomena. Prior works have extensively explored surrogate modeling techniques such as Gaussian processes,24 radial basis functions,25,26 and neural networks27,28 to approximate the underlying correlations between the design variables and the desired outcomes.
While BO-based and other surrogate-based approaches have been successfully employed in many applications, including material discoveries,29–35 chemical synthesis,4,32,36 and engineering design,32,37–42 their effectiveness is limited when addressing optimization problems with mixed-variable domains. These domains encompass data structures characterized by a combination of continuous, integer, and categorical variables, and are frequently encountered in real-world chemical problems. As implemented in the current BoTorch43 package (v-0.11.3), a common method employed by BO approaches to handle mixed-integer/mixed-variable cases involves iterating through all potential integer/categorical values in an outer loop while optimizing the continuous variables. This process becomes cumbersome and impractical with an increasing number of integer/categorical variables.4,43 To avoid exhaustive enumeration, Daulton et al.44 proposed a probabilistic reparameterization (PR) approach. In this method, discrete variables are sampled from a probability distribution that is parameterized by continuous variables. This allows for more efficient optimization of mixed-variable problems by optimizing over the continuous reparameterization rather than the discrete space directly.
The presence of discrete or mixed-variable constraints further complicates the optimization process. Constraints can be known or unknown a priori; in this paper, we limit the scope to problems with known constraints. Various BO methods have been developed for constrained problems in continuous spaces. For example, known constraints can be specified upfront to define the feasible region, which is then used to define an indicator function for feasibility. The indicator function can then be coupled with the acquisition strategies, such as expected improvement, to guide the search toward feasible samples.45 Also, polytope sampling46 strategies can be used to generate initial feasible samples. However, these methods were originally designed for continuous and smooth constraint landscapes, making them less effective and often inefficient when applied to discrete or mixed-variable constrained problems, particularly when dealing with a large number of constraints.
To address the challenges posed by constrained mixed-integer spaces in experimental planning, in this work, we propose a different framework – the use of mixed-integer surrogates and acquisition functions, for which we adopt the recently developed optimization package, piecewise affine surrogate-based optimization47 (PWAS). PWAS is selected as it is tailored to address linearly-constrained discrete and mixed-variable domains enabling direct incorporation of discrete/categorical decision variables in the optimization process, which provides a more realistic representation of the problem. Additionally, PWAS can handle mixed-variable linear equality/inequality constraints, which are commonly encountered in physical and chemical systems, ensuring the proposal of feasible solutions throughout the design procedure.
The paper is organized as follows. In Section 2, we discuss the general problem formulation of the experimental planning optimization problem, focusing on problems with discrete and mixed-variable design space. In Section 3, we review existing optimization methods for the target problem. Following that, in Section 4, we lay out the overall steps of PWAS, which we proposed to employ to address experimental planning optimization with mixed-variable domains. We then discuss the implementation and performance of PWAS through three case studies: Suzuki–Miyaura cross-coupling, crossed barrel, and reacting solvent design in Section 5. Conclusion and future research directions are summarized in Section 6.
The general mathematical formulation of the targeted problem is:
X* ∈ arg min_X f(X),   (1)
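To make formulation (1) concrete in code, the following is a minimal sketch of how such a mixed-variable problem could be specified; the variable names, bounds, and the single linear constraint are hypothetical placeholders rather than the actual case-study definitions.

```python
import numpy as np

# Minimal sketch of a mixed-variable problem of the form (1).
# All names, bounds, and the example constraint are hypothetical placeholders.
problem = {
    "continuous":  {"temperature_C": (30.0, 110.0), "conc_M": (0.05, 0.50)},
    "integer":     {"equiv_base": (1, 5)},
    "categorical": {"solvent": ["MeOH", "THF", "MeCN", "DMAc"]},
    # Known linear inequality constraints A_ineq @ x <= b_ineq acting on the
    # ordered numeric part x = [temperature_C, conc_M, equiv_base].
    "A_ineq": np.array([[1.0, 0.0, 10.0]]),   # e.g., temperature_C + 10*equiv_base <= 140
    "b_ineq": np.array([140.0]),
}

def objective(x_cont, x_int, x_cat):
    """Black-box objective f(X): in practice, run the experiment or
    simulation for the proposed conditions and return the measured response."""
    return 0.0  # placeholder value
```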
• Random search: samples are drawn uniformly at random within the search domain, without any encoding of categorical variables or optimization steps.
• Genetic: different genetic algorithm implementations are available.56,57,59–65 In this paper, we consider the evolutionary algorithm implemented in the Distributed Evolutionary Algorithms in Python (DEAP) package.56,57 DEAP handles categorical variables by label encoding them. The solving process involves two phases: an initial sampling phase and an iterative sampling phase. Samples are first drawn randomly within the search domain. Iterative samples are then generated through crossover, mutation, or a combination of both, depending on whether the randomly generated probabilities for executing crossover or mutation exceed the default thresholds. Evolutionary algorithms often balance exploitation and exploration through crossover and mutation without explicitly utilizing the input–output correlations. Therefore, when the design space is large, they may require a large number of experiments or simulations to attain the desired outcomes, making them unsuitable for expensive-to-evaluate problems.66 To address this issue, different surrogate-assisted evolutionary algorithms have been proposed.67,68 Nevertheless, surrogate model selection and relevant parameter tuning remain challenging,68 and incorporating constraints within the framework can be non-trivial.69 An illustrative DEAP sketch is given after this list.
• Hyperopt:58 hyperopt, utilizing the TPE algorithm, offers a framework specifically designed to facilitate the application of Bayesian optimization to hyperparameter selection.70 TPE inherently handles categorical variables, eliminating the need for explicit encoding. The solving process comprises two phases: initial sampling and active learning. During the initial sampling phase, samples are randomly drawn from the design space. In the active-learning phase, TPE operates by optimizing a loss function over a tree-structured configuration space. For categorical (discrete) variables, it employs the estimation of distribution algorithm (EDA) approach, where candidate points are sampled according to binomial distributions.71 For continuous variables, the covariance matrix adaptation evolution strategy (CMA-ES, gradient-free)72 is utilized. TPE is computationally cheap and simple compared to many other algorithms within the BO framework.58 However, incorporating a relatively large number of constraints is challenging and is not currently implemented.58,73 Watanabe and Hutter73 attempted to address this challenge by integrating the acquisition function of constrained BO by Gardner et al.74 However, the proposed approach considers the probability of constraint improvement and still allows infeasible samples,73,74 which may be selected and tested in simulations but cannot be queried in real experiments. It is also worth mentioning that this approach was proposed to address problems with unknown constraints. A minimal hyperopt sketch is given after this list.
• BoTorch:43 this package implements BO based on GPs. In this framework, categorical variables are one-hot encoded, and users can select different kernels. For instance, for a fully categorical design space, the Hamming distance kernel is commonly used. For mixed-integer cases within the reaction optimization domain, the Matérn 5/2 kernel is recommended for both continuous and integer (discrete) variables.4,22 Similar to other BO-based approaches, the solving process includes two stages: the initial and active-learning stages. During the initial sampling stage, samples are randomly chosen from the search domain. During the active-learning stage, the next sample to evaluate is determined by optimizing the acquisition function (e.g., expected improvement). We also note that, for mixed-integer cases, BoTorch finds the next point to test by iterating through all possible integer values in an outer loop and optimizing the remaining continuous variables while keeping the integer values fixed; it then selects the integer value that returns the best evaluation of the acquisition function (a schematic sketch of this enumeration strategy is given after this list). This step can cause the computational time to increase significantly, especially as the number of integer (discrete) variables and/or the number of possible options for these variables grows. An alternative approach utilizing PR was proposed and implemented in ref. 44 to address this issue but has not yet been merged into the main BoTorch repository. Additionally, constraint handling for mixed-integer cases has not been implemented,43 as conventional approaches, such as trust regions, cannot be trivially integrated with the current framework.
• EDBO:4 this package also implements BO based on GPs and considers three types of descriptors to encode the categorical variables: density functional theory (DFT),75 Mordred,76 and one-hot encoding. Here, continuous descriptors such as DFT and Mordred are used to encode categorical variables so that GP kernels designed for continuous design spaces can also be used/tested.
Different from BoTorch, EDBO pre-trains the GP model with data from the literature for the following two reactions: Suzuki–Miyaura reaction,77 consisting of 3696 reactions in the dataset, and the Buchwald–Hartwig reaction, whose training dataset consists of 3960 unique reactions.75 While EDBO, similar to other BO methods such as BoTorch, improves with additional training samples and can suggest optimal solutions even with poorly trained surrogate models,78 its key advantage lies in its pre-training phase. This pre-training makes EDBO highly efficient, particularly for optimizing reaction conditions when the target reactions share similar features with the training datasets.
Since the target problems often involve only categorical variables with finite options, the workflow of EDBO involves pre-generating all possible combinations. The solving process involves an initial-sampling and an active-learning stage. Samples are randomly selected from the pre-generated combinations during initial sampling. In the active-learning stage, rather than searching within defined bounds, EDBO exhaustively enumerates the entire domain, selecting the point with the lowest acquisition function evaluation. For mixed-integer cases, continuous variables first need to be discretized. The enumeration procedure in the acquisition step can become computationally expensive as the number of discretization steps increases, highlighting the trade-off between computational time and achieving a better representation of the original domain. Regarding constraint handling, there is currently no implementation integrated into EDBO.
In addition to the methods discussed, we also evaluated Gryffin,19 a technique known for its recent advances in optimization problems with continuous and categorical design spaces. Gryffin proposed a new strategy for selecting categorical variables based on expert knowledge. While Gryffin was included in the initial comparison, its performance in our specific case studies did not yield significant improvements over the other methods. Therefore, the detailed results and performance metrics for Gryffin are provided in the ESI (see Section 1† therein) for completeness. In summary, different methods have been proposed to handle discrete and mixed-variable problems, and several approaches have been tailored to experimental planning problems in the chemistry domain. Nevertheless, few approaches explicitly handle known discrete and mixed-variable constraints to ensure feasible queries, and, to the best of our knowledge, no such approach has been implemented for discrete and mixed-variable constrained experimental planning problems.
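To illustrate the evolutionary approach described above (see the Genetic entry), below is a minimal, hedged sketch of a DEAP setup for a fully label-encoded categorical problem; the level counts and the objective are placeholders, and the exact Olympus/DEAP configuration used in our benchmarks may differ.

```python
import random
from deap import base, creator, tools, algorithms

# Label-encoded categorical variables (placeholder level counts).
N_LEVELS = [4, 3, 7, 11, 4]

def evaluate(ind):
    # Placeholder objective: in practice, return the (negative) measured yield.
    return (-float(sum(ind)),)          # DEAP expects a tuple of fitness values

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("individual", tools.initIterate, creator.Individual,
                 lambda: [random.randrange(n) for n in N_LEVELS])
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evaluate)
toolbox.register("mate", tools.cxTwoPoint)                       # crossover
toolbox.register("mutate", tools.mutUniformInt,                  # mutation
                 low=[0] * len(N_LEVELS), up=[n - 1 for n in N_LEVELS], indpb=0.2)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=10)                                   # initial random sampling
pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.3,
                               ngen=5, verbose=False)            # iterative sampling
```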
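Similarly, a minimal sketch of the TPE workflow in hyperopt (see the Hyperopt entry) is shown below; the search space and the stand-in experiment function are hypothetical and only illustrate how hp.choice handles categorical variables without explicit encoding.

```python
from hyperopt import fmin, tpe, hp, Trials

# Hypothetical mixed-variable search space; hp.choice handles categorical
# variables natively, so no explicit encoding is required.
space = {
    "base":   hp.choice("base", ["K3PO4", "KOH", "CsF", "Et3N"]),
    "ligand": hp.choice("ligand", ["PPh3", "XPhos", "SPhos"]),
    "temp_C": hp.uniform("temp_C", 30.0, 110.0),
}

def run_experiment(params):
    # Stand-in for the real experiment/simulation (hypothetical response).
    return 50.0 + 0.2 * params["temp_C"]

def objective(params):
    # hyperopt minimizes the returned loss, so negate the yield.
    return -run_experiment(params)

trials = Trials()
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=trials)
```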
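Finally, the outer-loop enumeration described for mixed-integer acquisition optimization (see the BoTorch entry) can be sketched schematically as follows; this is not BoTorch code but a generic illustration with a placeholder acquisition function, showing how the cost grows with the number of discrete combinations.

```python
import itertools
import numpy as np
from scipy.optimize import minimize

def acquisition(x_cont, x_disc):
    # Placeholder acquisition value (lower is better); in practice this would
    # be, e.g., the negative expected improvement under a fitted GP surrogate.
    return float(np.sum((x_cont - 0.3) ** 2) + 0.1 * sum(x_disc))

disc_levels = [range(4), range(3), range(7)]     # label-encoded discrete variables
cont_bounds = [(0.0, 1.0), (0.0, 1.0)]           # scaled continuous variables

best = None
for combo in itertools.product(*disc_levels):    # 4 * 3 * 7 = 84 inner optimizations
    res = minimize(acquisition, x0=np.full(len(cont_bounds), 0.5),
                   args=(combo,), bounds=cont_bounds, method="L-BFGS-B")
    if best is None or res.fun < best[0]:
        best = (res.fun, combo, res.x)

print("next point to test:", best[1], best[2])
```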
Before going into the explanation of the algorithm, let us outline how samples are processed. The optimization variables are pre-processed before being fed into PWAS. Specifically, continuous variables are scaled to [−1, 1]. As for integer variables, two strategies are implemented: if the number of combinations of integer variables is smaller than a predefined minimum, PWAS treats these integer variables as if they were categorical; otherwise, each integer variable is assigned an auxiliary continuous variable scaled to [−1, 1]. The auxiliary continuous variables are later used to fit the surrogate, while the original integer variables are kept to formulate the constraints (if present) to ensure feasibility; the integer variables and the auxiliary continuous variables are linked through the scaling parameters. Regarding categorical variables, PWAS handles them by one-hot binary encoding, i.e., each categorical variable Z_i with n_i possible options is encoded into the subvector [z_{d_{i−1}+1} … z_{d_i}]^T ∈ {0,1}^{n_i}, for all i = 1, …, n_d, where d_i = n_1 + … + n_i and d_0 = 0. All encoded categorical variables are then combined into a single vector z ∈ {0,1}^{d_{n_d}}.
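As a minimal illustration of this pre-processing step, the following sketch scales a continuous variable to [−1, 1] and one-hot encodes a categorical variable; the example values are arbitrary and the helper functions are not part of the PWAS package.

```python
import numpy as np

def scale_to_unit(x, lb, ub):
    """Map a continuous (or auxiliary integer) value from [lb, ub] to [-1, 1]."""
    return 2.0 * (np.asarray(x, dtype=float) - lb) / (ub - lb) - 1.0

def one_hot(level_index, n_levels):
    """One-hot (binary) encode a categorical variable with n_levels options."""
    z = np.zeros(n_levels)
    z[level_index] = 1.0
    return z

# Example values (arbitrary): a temperature in [30, 110] C and a categorical
# variable with 4 options, of which option index 2 is selected.
x_scaled = scale_to_unit(70.0, 30.0, 110.0)   # -> 0.0
z = one_hot(2, 4)                             # -> array([0., 0., 1., 0.])
```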
Like many surrogate-based approaches, the solution process of PWAS unfolds in two phases: the initial sampling phase and the active-learning phase (see Fig. 2). Due to the nature of our target problems, where experiments and simulations are expensive to run, we assume a fixed evaluation budget, i.e., only Nmax experiments/simulations are allowed. During the initial sampling phase, PWAS draws Ninit samples from the feasible domain. Different initial-sampling strategies are applied depending on the specifics of the problem.47 Since there is limited or no information about the problem during the initial sampling stage, static DoE approaches, i.e., space-filling approaches, can be a good starting point, especially when no complex constraint is involved. We also stress that using static DoE approaches initially does not restrict the exploration capability of PWAS in the active-learning stage, as it would for standard DoE approaches. In general, when only box constraints are present, the Latin hypercube sampling (LHS) method is employed.79 In cases where linear equality and/or inequality constraints exist alongside integer and/or categorical variables, the algorithm initially employs LHS and then discards any infeasible samples. Should the number of feasible samples generated be insufficient, two strategies are used: (1) if only continuous variables are present, a polytope sampler, specifically the double description method,80 is used; (2) if integer/categorical variables are present, a method that solves mixed-integer linear programming (MILP) problems sequentially is used. These MILP problems use the exploration functions discussed later in this section as their objective functions while enforcing the given constraints, thereby yielding feasible samples. This strategy is particularly useful when constraints are hard to satisfy by random sampling.
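A minimal sketch of the LHS-plus-filtering idea is given below, assuming hypothetical bounds and a single linear inequality constraint; it is not the PWAS implementation, which additionally falls back on polytope sampling or sequential MILPs when too few feasible samples remain.

```python
import numpy as np
from scipy.stats import qmc

# Hypothetical box bounds for three variables and one known linear
# inequality constraint A @ x <= b (placeholders, not the case-study values).
lb = np.array([0.0, 1.5, 6.0])
ub = np.array([200.0, 2.5, 12.0])
A = np.array([[1.0, 50.0, 0.0]])
b = np.array([250.0])

sampler = qmc.LatinHypercube(d=3, seed=0)
X = qmc.scale(sampler.random(n=200), lb, ub)       # LHS design in original units
feasible = X[(X @ A.T <= b).all(axis=1)]           # discard infeasible samples
print(len(feasible), "feasible initial candidates")
```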
After obtaining the initial list of samples to test, experimentation or simulation is conducted to obtain their respective evaluations. These are then used to build a surrogate model. In PWAS, we use PARC81 to fit piecewise affine surrogates. The general procedure of PARC is illustrated in Fig. 3. PARC is a block-descent algorithm that first groups samples into Kinit clusters. Clusters containing fewer samples than a predefined minimum threshold are discarded, resulting in Kupdated remaining clusters. We then fit piecewise affine (PWA) separation functions among these clusters to form Kupdated partitions of the design space. Within each partition, a PWA surrogate function is fitted to make predictions. We conducted a comparative study to evaluate the effectiveness of the PWA surrogate function on a set of benchmark functions. The results indicate that PWA can provide a better fit to the underlying function than GPs with RBF, Matérn, and linear kernels, particularly when the number of training samples is limited or when the problem involves discontinuous or sharp transitions caused by categorical variables. Full details and analysis can be found in the ESI, Section 2.†
A cost function comprising separability and predictability indexes is then calculated, which balances between the enhancement of separability among different partitions and the improvement of predictability within each partition. PARC terminates when either the difference in the value of the cost function between two consecutive evaluations falls below a predefined tolerance, or the number of iterations surpasses a predetermined maximum limit. Otherwise, PARC reassigns samples to Kupdated clusters based on the newly fitted PWA separation functions and then iterates the procedure until termination criteria are met.
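As a simplified illustration of the piecewise-affine surrogate idea (not the full PARC algorithm, which jointly learns PWA separation functions, discards undersized clusters, and reassigns samples in a block-descent loop), the sketch below clusters the inputs with K-means and fits one affine model per cluster on synthetic data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

# Synthetic piecewise target: two different affine laws on each side of x0 = 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(60, 3))
y = np.where(X[:, 0] > 0, 2.0 * X[:, 1] + 1.0, -3.0 * X[:, 2])

K = 4
km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)           # partition the inputs
models = [Ridge().fit(X[km.labels_ == k], y[km.labels_ == k]) for k in range(K)]

x_new = np.array([[0.2, -0.5, 0.1]])
y_hat = models[km.predict(x_new)[0]].predict(x_new)   # prediction from the assigned partition
```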
In the active-learning phase, at each iteration, we query a new sample and refine the surrogate function by re-partitioning and re-fitting the surrogates using PARC, incorporating both existing and newly acquired samples. As merely minimizing the surrogate may overlook the global optimum, PWAS combines the surrogate function f̂(X) with a distance-based exploration function E(X), i.e., a(X) = f̂(X) − θE E(X), where a(X) is an acquisition function that balances the exploitation of the PWA surrogates against the exploration of the design space. Here, E(X) is a pure exploration term that quantifies the distance between X and the existing samples X1, …, XN and is independent of the function evaluations. A user-defined trade-off parameter θE is introduced, with a default value of 0.5; this default has been validated across a variety of benchmarks.47 In PWAS, two types of exploration functions are considered: a max-box approach for continuous and integer variables, and the Hamming distance for binary-encoded categorical variables. These exploration functions are selected because they can be directly reformulated as mixed-integer linear programs (MILPs); therefore, the acquisition function can be minimized with a standard MILP solver to determine the next sample to test, with the option to directly incorporate linear constraints, if present. The selected experiment or simulation is then carried out to evaluate the objective function. This iterative process continues until the maximum number of iterations (Nmax) is reached.
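The acquisition trade-off can be illustrated with the following sketch, which evaluates a(X) = f̂(X) − θE·E(X) over a small candidate set of one-hot encoded points using a Hamming-distance exploration term and a dummy affine surrogate; in PWAS itself, the acquisition is minimized exactly as a MILP over the constrained mixed-integer domain.

```python
import numpy as np

theta_E = 0.5   # default exploitation/exploration trade-off parameter

def exploration(z_cand, Z_seen):
    # Smallest Hamming distance between the candidate and any tested sample.
    return np.min(np.sum(np.abs(Z_seen - z_cand), axis=1))

def acquisition(z_cand, Z_seen, surrogate):
    return surrogate(z_cand) - theta_E * exploration(z_cand, Z_seen)

# Dummy affine surrogate standing in for the fitted PWA model (illustration only).
surrogate = lambda z: float(z @ np.array([0.3, -0.1, 0.8, 0.2]))
Z_seen = np.eye(4)[:2]                     # two previously tested one-hot samples
candidates = np.eye(4)                     # all four categorical options
scores = [acquisition(z, Z_seen, surrogate) for z in candidates]
next_sample = candidates[int(np.argmin(scores))]
```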
All the case studies are solved on an Intel i7-8550U 1.8 GHz CPU laptop with 24 GB of RAM, with all the results available in the GitHub repository at https://github.com/MolChemML/ExpDesign.
Optimization variables | # Options |
---|---
Aryl halide (X) | 4 |
Boronic acid derivative (Y) | 3 |
Base | 7 |
Ligand | 11 |
Solvent | 4 |
Total # of possible combinations | 3696 |
We employ PWAS to solve the optimization problem and benchmark its performance against established optimization libraries: genetic, hyperopt, BoTorch, and EDBO. Additionally, we consider the results obtained from random search as the baseline. The characteristics of the approaches used in these libraries have been discussed in Section 3. We note that random search, genetic (with DEAP v-1.4.1), hyperopt (v-0.2.7), and BoTorch (v-0.6.6) have been interfaced in the Olympus22 package; therefore, we use the algorithmic structure implemented in the package for benchmark tests with their default parameter values. A customized forked version tailored for our testing is also available on GitHub at https://github.com/mjzhu-p/olympus (Branch “pwas_comp”). Regarding EDBO, categorical variables, namely distinct chemical entities, are one-hot encoded and the default setting is used with minor changes to the original package to allow customized input for the number of initial samples (see the changes in the forked version at https://github.com/mjzhu-p/edbo/tree/pwas_comp). As for PWAS, the default setting47 is used, i.e., the number of initial partitions (Kinit) is set to 10, with the trade-off parameter between exploitation and exploration (θE) set to 0.5. The categorical variables are one-hot encoded. We note that this study specifically focused on using the one-hot-encoded reaction representation, in line with the original benchmark problem. While more advanced encoding methods, such as differential reaction fingerprints,86 and the integration of expert knowledge to narrow down the design space87 have been shown to influence optimization performance and offer promising search strategies, our objective was to maintain consistency with the benchmark to ensure comparability with previous methods. For readers interested in advanced encoding techniques, we refer to the work of Ranković et al.88
For each optimization method, we conduct 30 repetitions for statistical analysis. Given that the goal of our study is to assess the performance of the algorithms on case studies where only a small number of experiments/simulations can be done due to time and resource constraints, we cap the maximum number of experiments at 50 within each repetition.
While EDBO achieves the highest yield after 50 iterations, PWAS demonstrates competitive performance, surpassing all other tested methods in its ability to identify optimal reaction conditions that maximize the reaction yield. It is important to note that EDBO is expected to outperform other methods in this case study, given that the GP model used in EDBO was pre-trained using the entire Suzuki–Miyaura reaction dataset (3696 reactions; see also Section 3).4 In contrast, PWAS showed similar results without requiring the pre-training step, highlighting its capacity to perform effectively with minimal prior data. To provide a clear demonstration of the efficiency of each method, we present a boxplot in Fig. 6. This visualization represents the number of iterations required by each method to achieve a top-20 ranked yield. Each data point on the plot represents the outcome of one specific run, and the statistics presented are derived from 30 repetitions to ensure robustness. On average, both PWAS and EDBO require significantly fewer iterations to attain a top-20 ranked yield when compared to other methods. This suggests their superior efficiency and potential time and resource savings in practical applications. Overall, the comparable performance of PWAS with EDBO demonstrated in this case study shows the ability of mixed-integer surrogates to more efficiently optimize the parameter space with no prior knowledge of the system, which has major implications for situations where prior data (literature or otherwise) are not available or difficult/expensive to obtain, a very common scenario in the chemical sciences.
Fig. 7 Schematic representation of a crossed-barrel design,38 illustrating the optimization variables, where θ is the twist angle of the columns (degrees), r is the outer radius of the columns (mm), n is the number of hollow columns, and t is the thickness of the hollow columns (mm).
As depicted in Fig. 7, a crossed barrel has n hollow columns with outer radius r and thickness t, twisted at an angle θ. Here, n, r, t and θ are the design variables we want to optimize, whose data types (discrete/continuous) and domains are outlined in Table 2. Due to the involvement of continuous variables (θ, r, t), exhaustively enumerating all possible combinations is impractical. Hence, an emulator, as recommended by Hickman et al.,22 is utilized to simulate the process and therefore make it possible to sample over the whole feasible domain. The emulator was modeled as a Bayesian neural network (BNN)22 and trained on over 2500 HTE data points collected by Gongora et al.,38 where the weights and biases are modelled with normal distributions. We note that the trained emulator serves the purpose of method comparison in this case study. Nevertheless, the accuracy of the trained model could be improved if more data were provided.
Optimization variables | Type | Domain |
---|---|---
Number of hollow columns (n) | Integer | [6, 12] |
Twist angle of the columns (θ) [degree] | Continuous | [0.0, 200.0] |
Outer radius of the columns (r) [mm] | Continuous | [1.5, 2.5] |
Thickness of the hollow columns (t) [mm] | Continuous | [0.7, 1.4] |
The challenges of this case study involve balancing trade-offs between conflicting mechanical properties, such as strength (the ability to resist an applied force without being damaged) and ductility (the ability to stretch without breaking), and incorporating mixed-integer design choices. This case study is selected due to the mixed-integer nature of the problem and the availability of an adequate number of HTE experimental data to train an emulator.22,38 Similar procedures can be followed to design chemical-related units, e.g., chemical reactors, if data acquisition is possible via experiments or high-fidelity simulations.
We solve this optimization problem with the same set of methods employed in the Suzuki–Miyaura cross-coupling case study, using the implementations interfaced in the Olympus package for random search, genetic, hyperopt, and BoTorch. As discussed in Section 3, the method implemented in the EDBO package requires a pre-defined discrete search space, meaning that the continuous variables, θ, r, and t, need to be discretized. Here, we consider three discretization schemes for the search domain, evenly divided and spaced by: 100, 10, and 10 points (EDBO_1); 10, 10, and 10 points (EDBO_2); and 10, 5, and 5 points (EDBO_3), respectively. These configurations yield 70000, 7000, and 1750 possible combinations to form the search domain. As for PWAS, two strategies are used to handle integer variables, as detailed in the pre-processing step in Section 4, and the same values introduced in Section 5.1 are used for Kinit and θE. Similar to the Suzuki–Miyaura cross-coupling case study, we run 30 repetitions for statistical analysis. Within each run, we include 10 initial experiments and then allow a maximum of 50 iterations.
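For illustration, the sketch below builds the discretized candidate list corresponding to the EDBO_2 setting (10 × 10 × 10 grid for θ, r, t and 7 integer levels for n); the grid construction is our own illustration, not code from the EDBO package.

```python
import itertools
import numpy as np

# Discretize the continuous variables so the whole search space becomes a
# finite candidate list (EDBO_2 setting described in the text).
n_levels = list(range(6, 13))                    # integer variable n: 7 options
theta = np.linspace(0.0, 200.0, 10)
r     = np.linspace(1.5, 2.5, 10)
t     = np.linspace(0.7, 1.4, 10)

candidates = list(itertools.product(n_levels, theta, r, t))
print(len(candidates))                           # 7 * 10 * 10 * 10 = 7000
```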
 | Random | Genetic | Hyperopt | BoTorch | PWAS | EDBO_1 | EDBO_2 | EDBO_3
---|---|---|---|---|---|---|---|---
Average | 1.85 | 1.77 | 2.80 | 398.68 | 35.36 | 272.54 | 227.54 | 212.92 |
Std | 0.44 | 0.35 | 0.71 | 260.71 | 2.00 | 67.61 | 2.52 | 20.38 |
Fig. 10 The Menschutkin reaction of phenacyl bromide and pyridine. In the illustration, Solvent 2 is preferred as it lowers the free energy compared to Solvent 1.13,91
A set of 46 functional groups was selected as molecular building blocks. The solvent was represented by integer variables indicating the number of each functional group present in the solvent molecule. To ensure that only chemically feasible combinations of functional groups are generated during the optimization process and to limit the size of the solvent, a set of chemical feasibility and complexity constraints was imposed. For instance, constraints were used to enforce the octet rule12 (an illustrative feasibility check is sketched after the table below). Since the solvent designed must be in the liquid phase at reaction conditions, the normal melting point (Tm) and the boiling point (Tb) of the solvent were added as design constraints. Two physical properties, namely the flash point (Tfp) and the octanol/water partition coefficient (Kow), as well as the oral rat median lethal dose (LD50) of the solvent, were constrained to reduce health, safety, and environmental impacts. In total, the problem consists of 115 linear inequality and 5 linear equality constraints, where one auxiliary categorical variable and 7 binary variables were introduced to formulate the constraints. The types of design variables and the property prediction models used are summarized in Tables 4 and 5. For a comprehensive description of the mathematical formulation used, the reader is referred to Gui et al.13 and Section 2.3 therein.
Description | Notes |
---|---
Number of functional group types | 46 (integer) |
Number of auxiliary variables introduced for chemical feasibility | 1 (categorical) and 7 (binary) |
Number of inequality/equality design constraints | 115 (linear)/5 (linear) |
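As an illustration of how such structural feasibility constraints act on a group-count representation of a solvent, the sketch below checks one commonly used CAMD valence (octet) rule for acyclic molecules; the group set shown is a small hypothetical subset and does not reproduce the full constraint set (115 inequalities, 5 equalities) of Gui et al.13

```python
import numpy as np

# Illustrative functional groups and their valences (hypothetical subset).
valence = np.array([1, 1, 2, 3])        # e.g., CH3-, OH-, -CH2-, >CH-
counts  = np.array([2, 0, 1, 0])        # candidate solvent: 2x CH3 + 1x CH2 (propane-like)

def satisfies_octet_rule(counts, valence):
    # Acyclic valence (octet) rule often used in CAMD: sum_i n_i * (2 - v_i) == 2.
    return int(np.dot(counts, 2 - valence)) == 2

print(satisfies_octet_rule(counts, valence))     # True for CH3-CH2-CH3
```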
In the following, we provide an overview of the DoE-QM-CAMD13 method. DoE-QM-CAMD employed a multiparameter solvatochromic equation94,95 to correlate solvent properties and the logarithm of the rate constant:
ln k = c_0 + Σ_i c_i d_i,   (2)
where the d_i denote the solvent descriptors entering the solvatochromic equation and the c_i are coefficients estimated by multiple linear regression (MLR).
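For illustration, fitting the coefficients of an MLR model of the form of eqn (2) amounts to an ordinary least-squares regression of ln k on the solvent descriptors, as sketched below with synthetic placeholder data (not the QM data of the case study).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder descriptors (columns, e.g., hydrogen-bond acidity A,
# basicity B, dielectric function) and placeholder ln k values.
D = np.array([
    [0.00, 0.76, 0.48],
    [0.33, 0.56, 0.46],
    [0.00, 0.88, 0.49],
    [0.43, 0.64, 0.41],
    [0.00, 0.69, 0.47],
])
ln_k = np.array([-6.8, -7.2, -6.5, -7.4, -6.9])

mlr = LinearRegression().fit(D, ln_k)
print(mlr.intercept_, mlr.coef_)          # c_0 and the c_i of eqn (2)
```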
In contrast, PWAS, similar to most surrogate-based optimization strategies,29,96,97 solves the problem by employing an active-learning technique with exploration capability. It systematically identifies optimal solvents for examination, effectively balancing the trade-off between exploring new possibilities for model improvement and exploiting known knowledge of reaction kinetics. As opposed to DoE-QM-CAMD, which assumes a linear relationship between the expert-derived solvent properties and ln k, PWAS adopts PWA surrogates to represent the correlations between the functional groups within the designed solvent and ln k, where the relationship between solvent properties and ln k is learned implicitly. As discussed in Section 4, PWAS leverages PARC81 to fit the surrogates, where PARC first clusters samples into Kinit initial partitions. The initial partitions are then optimized by balancing between enhancing separability, which relies on similarities among different solvents (in this context, functional groups), and improving the predictability of the surrogate function within each partition, in this case the input–output correlations. Here, the output (ln k) correlates with the properties of the designed solvent and can therefore be implicitly learned during surrogate fitting, offering insights not available when solely considering individual functional groups.
Consistent with previous case studies, default parameters are utilized when solving the problem with PWAS, including a maximum of 50 experiments, with an initial set of 10 samples. It is important to highlight that the complete QM reaction constant data were generated by exhaustively enumerating all 326 feasible solvents within the defined design space,98 enabling the sampling of new solvents without the need for additional calculations and providing the true rank of the solvents.
Rank | Chemical formula | ln k (QM)a | ln k (Pred)a
---|---|---|---
1 | CH3NHCHO | −5.92 | −5.92
2 | OHCH2NO2 | −6.46 | −6.49
3 | CH2OHCH2NO2 | −6.72 | −6.69
4 | (CH3)2SO | −6.82 | −6.82
5 | (CH2)2OHCH2NO2 | −6.93 | −6.93
6 | CH3CHOHCH2NO2 | −6.96 | −6.97
7 | CH2… | −6.98 | −6.97
8 | CH… | −7.00 | −6.98
9 | (CH2)3OHCH2NO2 | −7.10 | −7.10
10 | CHCH2… | −7.11 | −7.11
a k [L mol−1 s−1]: rate constant for the Menschutkin reaction; QM: ln k obtained from quantum mechanical calculations; Pred: predicted ln k.
Rank | Chemical formula | ln k (QM)a | ln k (Pred)a
---|---|---|---
1 | CH2OHCH2NO2 | −6.72 | −5.50
2 | (CH3)2SO | −6.82 | −5.59
3 | CH2OHCH2NO2 | −6.72 | −6.28
4 | CH2… | −6.98 | −6.66
5 | (CH2)2OHCH2NO2 | −6.93 | −6.74
6 | CH3CHOHCH2NO2 | −6.96 | −6.87
7 | (CH3)2COHCH2NO2 | −7.23 | −6.91
8 | CH… | −7.00 | −6.92
9 | CH2CH2… | −7.15 | −6.97
10 | CH3NHCHO | −5.92 | −7.00
a k [L mol−1 s−1]: rate constant for the Menschutkin reaction; QM: ln k obtained from quantum mechanical calculations; Pred: predicted ln k.
Before examining specific solvents, we first perform a sensitivity analysis using partial dependence plots (PDPs) and individual conditional expectation (ICE) plots to investigate the relative influence of solvent properties on the reaction rate constant, which we then use to assess the exploitation and exploration capability of PWAS (a minimal sketch of such a PDP/ICE analysis is given after Fig. 11). The seven representative descriptors considered are the refractive index at 298 K (n2), Abraham's overall hydrogen-bond acidity (A), Abraham's overall hydrogen-bond basicity (B), the dielectric constant at 298 K (ε), the macroscopic surface tension at 298 K (γ), aromaticity, and halogenicity, which are used in a successful quantum mechanical continuum solvation model.99 The descriptors of all feasible solvents are calculated using the group contribution method of Sheldon et al.100 As can be seen in Fig. 11, the fluctuation in PDPs is most pronounced for ε, while the fluctuation in ICEs is most significant for three properties, namely n2, B, and ε, indicating their strong marginal effect on predicting ln k. This finding aligns with established results for SN2 reactions. As reported in the literature,101–103 solvent effects on reaction kinetics stem from the fact that solvent–solute interactions stabilize the reactant(s) and the transition state to different extents. In general, when the transition state is (partially) ionic by nature and the reactants are neutral, a solvent with a larger dielectric constant, indicating greater polarity, or one with stronger hydrogen-bond basicity, meaning it is a more potent hydrogen-bond acceptor, can lower the free energy of the transition state more than that of the reactants, thereby reducing the overall free-energy barrier. It is also known that non-basic, polar aprotic solvents are preferred as they do not solvate the nucleophile strongly, making it more reactive and available for the reaction. Regarding the refractive index, although it does not directly reflect the polarity of solvents, it can greatly affect the solvation of reactants and the overall environment. Besides the dominant solvent descriptors, Fig. 11 shows that the relationship between the reaction rate and the solvent properties is not strictly linear. Additionally, there are dependencies among different properties, explaining why the MLR model (2) integrated into the DoE-QM-CAMD approach demonstrates inconsistent performance across the entire design space. While the MLR model could be improved by incorporating second-order and interaction terms, as discussed by Gui et al.,98 deciding which terms to include often resorts to a trial-and-error approach, which can be time-consuming and potentially hard to justify. In contrast, PWAS provides a systematic approach to decompose the solvent design space based on similarity with respect to the rate constants, making it possible to capture the nonlinear relationship between the solvent properties and the reaction rate without prior assumptions on the functional form, as we demonstrate in the following.
Fig. 11 Partial dependence plots (PDP) and individual conditional expectation (ICE) plots utilized to assess the influence of diverse solvent properties on the reaction rate across all feasible solvents. n2: refractive index at 298 K, B: Abraham's overall hydrogen-bond basicity, ε: dielectric constant at 298 K, A: Abraham's overall hydrogen-bond acidity, γ: the macroscopic surface tension at 298 K. Solvent properties are calculated based on the property prediction method of Sheldon et al.98,100 Note: each gray line represents the relevant changes of the descriptor for one feasible solvent.
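A minimal sketch of such a PDP/ICE sensitivity analysis is shown below, using scikit-learn's PartialDependenceDisplay on synthetic descriptor data; the fitted random forest and the data are placeholders standing in for the surrogate and the 326 feasible solvents.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

# Synthetic stand-in data: 326 "solvents" described by seven descriptors and a
# fake ln k response with a deliberate dependence on eps and B.
rng = np.random.default_rng(1)
X = rng.uniform(size=(326, 7))
y = -7.0 + 1.5 * X[:, 3] + 0.8 * X[:, 2] + 0.1 * rng.normal(size=326)

model = RandomForestRegressor(random_state=0).fit(X, y)
PartialDependenceDisplay.from_estimator(
    model, X,
    features=[0, 2, 3],                  # n2, B, eps
    feature_names=["n2", "A", "B", "eps", "gamma", "aromatic", "halogen"],
    kind="both",                         # "both" overlays the PDP and ICE curves
)
plt.tight_layout()
plt.show()
```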
Focusing on the most significant solvent descriptors, n2, B, and ε, three radar charts are plotted in Fig. 12 and 13, showing the relevant properties of the initial, first-10 active-learning, and last-10 active-learning samples. To facilitate comparison, all features are normalized to a range between 0 and 1 using min–max normalization, except for the dielectric constant. Since the dielectric constants of two solvents are significantly higher than the others, the dielectric constant is normalized relative to the remaining solvents, resulting in values exceeding 1 for these two solvents. Besides the radar charts, we also depict the structures of the solvents and their categorization based on the constituent functional groups in Fig. 14. Fig. 14a arranges the solvents in the sequence of optimization steps, distinguishing the initial and subsequent active-learning samples with a black line, while Fig. 14b arranges the solvents into partitions, with orange lines denoting partition boundaries.
By examining Fig. 12–14, it can be observed that PWAS explores diverse solvent structures, covering large ranges of solvent properties, during the initial sampling step, and then gradually converges toward clear patterns while maintaining an exploratory nature. These results demonstrate the effectiveness of PWAS at finding a diverse and promising set of solvents over the constrained mixed-integer and categorical domain. One major advantage of PWAS is that it allows one not only to group solvents with similar functional groups into the same partition, but also to place solvents with similar chemical properties into the same partition through the PARC mechanism. These similar findings across the functional group- and property-based design spaces highlight the adeptness of PWAS in identifying key implicit relationships, demonstrating its capability to effectively discern and utilize the links between functional groups and the ensuing solvent properties.
In summary, we compared the effectiveness of two approaches for the solvent design case: PWAS and DoE-QM-CAMD. Our findings reveal the strengths of each method: the DoE-QM-CAMD approach, utilizing an MLR model, demonstrates more robust predictive capabilities across the entire design space, while PWAS, employing PWA surrogates, better predicts reaction rates in proximity to optimal regions, which is important for our optimization objective. Furthermore, PWAS can learn correlations between solvent properties and reaction rates and offer valuable insights.
For the first two case studies, we compared the performance of PWAS with random search and four state-of-the-art methods representing different optimization strategies: an evolutionary method and three methods within the BO framework. Compared to the established methods, PWAS achieved comparable or superior performance in terms of the number of experiments required to reach satisfactory outcomes.
In the final case study, the complexity of the solvent design problem, with 5 linear equality and 115 inequality constraints, limited the applicability of the benchmark methods. These methods struggle to incorporate a large number of constraints while guaranteeing feasible experimental suggestions. In contrast, PWAS offers a distinct advantage by directly incorporating these constraints within its formulation during the acquisition step. This capability is particularly relevant in chemical optimization problems, where a large number of constraints are often essential to define a feasible and well-defined design space. These constraints ensure that proposed solutions are both synthetically achievable and safe for experimentation. Therefore, the results obtained from PWAS were compared with the recently proposed DoE-QM-CAMD approach.13 The comparison highlighted the effectiveness of PWAS in systematically exploring the mixed-integer solvent design space. This is mainly attributed to its ability to implicitly learn underlying correlations within the data and to explore uncertain regions of the design space effectively, ultimately identifying high-performing solvents. This level of analysis is particularly beneficial in the field of chemistry, as it provides a deeper understanding of the fundamental principles governing the underlying process.
In conclusion, while BO has undeniably revolutionized the landscape of optimization in experimental planning, especially in the chemical domain, it is important to recognize the potential of other surrogate-based approaches within conventional chemistry optimization problems. This is particularly relevant given the inherent complexity of many chemical problems, which are often characterized by mixed variables and a relatively large number of constraints, making it challenging for many widely adopted BO methods to obtain feasible samples during the acquisition step while still maintaining exploration capability. Nevertheless, BO as a general framework is highly adaptable and can incorporate various strategies to overcome these challenges. Here, we demonstrated that integrating mixed-integer optimization strategies is one effective way to do so.
Future work could be directed toward extending the capabilities of PWAS to handle nonlinear constraints, thereby enhancing its applicability across a wider range of problem domains, particularly those present in the physical world. Additionally, it would be useful to investigate adapting the acquisition strategies of PWAS to existing BO methods, especially those with tree-based kernels, to address the target problems.
Footnote
† Electronic supplementary information (ESI) available: GitHub repository. See DOI: https://doi.org/10.1039/d4dd00113c
This journal is © The Royal Society of Chemistry 2024 |