Open Access Article
Yehia
Amar
a,
Artur M.
Schweidtmann
b,
Paul
Deutsch
c,
Liwei
Cao
ad and
Alexei
Lapkin
*ad
aDepartment of Chemical Engineering and Biotechnology, University of Cambridge, Philippa Fawcett Drive, Cambridge, CB3 0AS, UK. E-mail: aal35@cam.ac.uk
bAachener Verfahrenstechnik – Process Systems Engineering, RWTH Aachen University, Aachen, Germany
cUCB Pharma S.A. Allée de la Recherche, 60 1070, Brussels, Belgium
dCambridge Centre for Advanced Research and Education in Singapore Ltd., 1 Create Way, CREATE Tower #05-05, 138602, Singapore
First published on 30th May 2019
Rational solvent selection remains a significant challenge in process development. Here we describe a hybrid mechanistic-machine learning approach, geared towards automated process development workflow. A library of 459 solvents was used, for which 12 conventional molecular descriptors, two reaction-specific descriptors, and additional descriptors based on screening charge density, were calculated. Gaussian process surrogate models were trained on experimental data from a Rh(CO)2(acac)/Josiphos catalysed asymmetric hydrogenation of a chiral α-β unsaturated γ-lactam. With two simultaneous objectives – high conversion and high diastereomeric excess – the multi-objective algorithm, trained on the initial dataset of 25 solvents, has identified solvents leading to better reaction outcomes. In addition to being a powerful design of experiments (DoE) methodology, the resulting Gaussian process surrogate model for conversion is, in statistical terms, predictive, with a cross-validation correlation coefficient of 0.84. After identifying promising solvents, the composition of solvent mixtures and optimal reaction temperature were found using a black-box Bayesian optimisation. We then demonstrated the application of a new genetic programming approach to select an appropriate machine learning model for a specific physical system, which should allow the transition of the overall process development workflow into the future robotic laboratories.
Attempts to utilise generalisations of fundamental physical knowledge of discrete variables can be traced back to the 1950s when Taft demonstrated that steric effects can be isolated, and developed some of the first steric parameters.17 This paved the way for ligand and substrate descriptors,18,19 which are now common in multivariate linear regression models, and mechanistic interrogation.20–23 Successful development towards solvent optimisation has been far more limited. Reizman and Jensen have attempted to bypass physical knowledge of solvents in their microfluidic reaction droplets platform.24 A flow experiment was combined with an algorithm based on sequential adaptive response surface methodology and optimal DoE. The algorithm identified DMSO as a promising solvent out of a pre-selected set of 10 solvents, for the alkylation of 1,2-diaminocyclohexane. As a black box method, it does not provide physical insights.
Exploring the solvent landscape more extensively, and simultaneously gaining physical insights, will almost certainly require a combination of statistics and chemical information. Murray et al. make use of a linear dimensionality reduction approach,25 first demonstrated for solvents in the 1980s separately by Chastrette et al.26 and Carlson et al.,27 to parametrise a solvent map using molecular descriptors. Murray et al. used this concept to obtain a solvent map and thus extend the traditional factorial DoE approaches. In this study we adopted the Principal Component Analysis (PCA) approach as one option for extracting features,28 or meaningful input variables, from the large dimensional descriptor space for the machine learning surrogate models. This is also referred to as feature engineering.29,30 In addition to the previous study on molecular descriptors of solvents,25 we also compute more reaction-specific descriptors, in order to produce more relevant features.
Struebing et al. describe a computer-aided molecular design strategy aimed at utilising physical knowledge that is conceptually similar to the one presented here.31 They used quantum mechanics to compute reaction rate constants in six initial solvents and constructed a qualitative surrogate model based on a linear free-energy relationship between the rate constant and five molecular descriptors. The subsequent step used mixed-integer linear programming to select the next solvent, whose performance was predicted via quantum mechanics as well. This single-objective method was tested on the Menschutkin reaction. Ultimately, nitromethane was identified and verified as the superior solvent. The rate constants for the solvents were shown to exhibit an approximate proportional relationship with the dielectric constant. This demonstrates the potential of a hybrid mechanistic-machine learning approach: it incorporates general prior knowledge into the data analysis framework, rather than leaving it to the algorithm to ‘rediscover’ this information.
The method developed in this study extends the growing use of machine learning in chemistry and chemical engineering,9,32 specifically an attempt to combine multi-objective DoE algorithms with physical knowledge in the form of solvent properties. The workflow and transition of chemistry knowledge into machine learning domain and process domain is illustrated in Fig. 1. We start with a library of 459 candidate solvents. Then, we acquire physical knowledge from property databases and through molecular simulations leading to 17 molecular descriptors. The transition of physical knowledge into the machine learning domain is achieved via a dimensionality reduction that provides features for the Gaussian process machine-learning models then used in Bayesian optimisation with lab experiments and analysis in the loop. The obtained results could be linked back to the physically-meaningful molecular descriptors and represent new, generic physical knowledge. The strategy is computationally inexpensive, in that it does not require high performance computing facilities. Furthermore, it is applicable to multi-objective problems, and to difficult reactions, exemplified here with a transition metal catalysed asymmetric hydrogenation.
![]() | ||
| Fig. 1 The workflow and transition of chemistry knowledge into machine learning domain and process domain, i.e., experiment and analysis. | ||
The example reaction under study is a Rh(CO)2(acac)/Josiphos(cyclohexyl/4-methoxy-3,5-dimethylphenyl) catalysed asymmetric hydrogenation of a complex chiral α-β unsaturated γ-lactam (I), shown in Scheme 1, used to produce UCB Pharma's new anti-epileptic drug Brivaracetam (II).33
![]() | ||
| Scheme 1 Rh–Josiphos catalysed asymmetric hydrogenation reaction of I, conditions, and objectives.34,35 | ||
The aim is to develop a workflow, which reduces experimental bias and, hence, explores non-intuitive solvent selections, leads to a predictive surrogate model which can be used in an optimisation, and also helps to develop new generic physical knowledge.
Physical knowledge about solvents is introduced to the machine learning context via a set of molecular descriptors. Apart from descriptor databases available in the literature, a source of descriptors can be computational software, such as COSMOtherm, which combines quantum chemistry and thermodynamics to predict properties. COSMOtherm has been used since the 1990s for a range of applications, such as the evaluation of new solvents for solid–liquid extraction of biopharmaceuticals,36 partitioning of organic substances into different phases,37 and predicting solute partition in multiphase complex fluids.38 Among the descriptors computed in this study, we explored screening charge density profiles.39,40 These information-rich ‘σ profiles’, which are histograms of screening charge density on the molecular surface, were converted to numerical descriptors per solvent, each defining a different segment of the profile, see Fig. 2. This concept has previously been used in multilinear regression models to correlate CO2 absorption and desorption capacities in various amine solvents,41 as well as rate constants in the case of a Diels–Alder reaction.42 This work was later extended to identify solvents that possess the best reaction performance under model uncertainty,43 which allows to explicitly account for uncertainty in the process of descriptor design, and further solvent selection/design workflow.
This paper comprises three sections. First, we present a hybrid mechanistic-machine learning approach to identify promising solvents in a large decision space. For this we evaluate different descriptors: (i) σ profiles only, and (ii) a dimensionality-reduced set of a larger set of 17 meaningful molecular descriptors. Then, we focus on the promising solvents identified in the first part, and algorithmically determine operating conditions (temperature and solvent mixtures). Thus, this work presents a two-step approach, utilising advantages of both a priori knowledge, and black box optimisation. Finally, we incorporate into the workflow of solvent selection an automated tool for determining optimal machine learning pipelines.
:
H2SO4 (98
:
2 v/v%) mobile phase). As the quality of machine learning models depends heavily on the quality of the training data, experiments were repeated two or three times; experimental error was determined and is shown with the results. Initially we screened 34 solvents from a diverse range of solvent classes, guided by expert knowledge. The aim was to cover as many types of solvents, but using molecules selected by expert synthetic chemists, based on their prior experience. From these we used the descriptors of 25 solvents as inputs to the algorithm in order to test the predictions on the other 9. There are numerous methods of selecting initial data sets; this expert-guided strategy allows for a comparison of algorithm-inspired suggestions and human expertise.44 The objectives of interest were conversion and d.e. We note that using ΔΔG‡ (difference in Gibbs free energy between diastereomeric transition states, commonly employed in data analysis for asymmetric catalysis), showed little difference in the models, compared to d.e. For this case study, we determined a priori that the relation between solvents is approximately consistent across most continuous variables, such as temperature, so that an initial focus on the discrete variable only, with other conditions held constant, is appropriate.
was used in preference over σ1–σ5 within the 17 descriptors.
| Descriptor (units) | Source |
|---|---|
| Molecular weight (g mol−1) | Stenutz45 |
| Density (g mL−1) | Stenutz45 |
| Molar volume (mL mol−1) | Stenutz45 |
| Refractive index (—) | Stenutz45 |
| Molecular refractive power (mL mol−1) | Stenutz45 |
| Dielectric constant (—) | Stenutz45 |
| Dipole moment (D) | Stenutz45 |
| Melting point (°C) | Stenutz45 |
| Boiling point (°C) | Stenutz45 |
| Viscosity (cP) | COSMOtherm39 |
ln Poctanol–water partition coefficient (—) |
COSMOtherm39 |
| Vapour pressure (mbar) | COSMOtherm39 |
| Henry's constant of H2 in solvent (bar) | COSMOtherm39 |
| ln(γ) activity coefficient of I in solvent (—) | COSMOtherm39 |
profiles segmented into three (—) |
COSMOtherm39 |
| σ 1–σ5 profiles segmented into five (—) | COSMOtherm39 |
| t 1–t4: principal components from PCA (—) | — |
An overview of the six models considered in this study is given in Table 2. Models 1–3 are used for DoE, and then models 4–6 are compared to model 3 to investigate model robustness when less chemical information is used.
In the solvents mixing study, the optimisation variables were (i) temperature, and (ii) solvent volume fractions xi=1,2,3. In total, four solvents were mixed, but only three were used as variables to avoid over-specifying the problem. The optimisation was conducted as a batch-sequential optimisation (five reactions per batch). To do this, two further algorithm modifications were made. First, we note that the reactor used does not allow different temperatures for parallel reactions in the same batch. Therefore, because temperature is one of the optimisation variables, we ran the algorithm in two steps. Step 1 is the normal algorithm run, using all variables including the reaction temperature, generating one recipe (‘recipe’ being the set of suggested reaction conditions to be tested experimentally). Step 2 is the application of the normal batch-sequential method in TS-EMO but holding temperature at the previous selection, and, hence, selecting four more recipes at that temperature. In Step 2 we also included the constraint that the solvent fractions xi=1,2,3 must sum to less than 1, so that the fraction of the fourth solvent would be the balance. As this did not work with the NSGA-II implementation normally used in TS-EMO, we switched to the ‘gamultiobj’ from the ‘Global Optimisation Toolbox’ within Matlab. Finally, we used the classification methodology Tree-based Pipeline Optimisation Tool (TPOT), which is described elsewhere;53 its source code is available on GitHub.
The hyperparameters of the surrogate model trained on all solvents tested in this study, shown in Table 3 (model 1, entries 1–5), give information on the impact of each input variable on each objective. This is known as automatic relevance determination, and refers to the length scales that parametrise the covariance matrix.54 The lower the value, the greater is the significance of the variable. Thus, σ3 (model 1, entry 3) bears the greatest influence on conversion, while σ1, σ2, and σ5 are most decisive towards d.e. One can rationalise this by noticing that the alcohols in Fig. 3 consistently show approximately 70% d.e., as they contain the same hydrogen bond accepting descriptor in the σ profiles, which is the information contained in σ5. Solvents of a similar class, such as alcohols and ketones, tend to cause similar d.e., information mostly contained in the extremes of a σ profile (σ1 and σ5). On the other hand, the neutral segment σ3 impacts mainly conversion, which is demonstrated by the differences in conversion reached with butanol, hexanol, and octanol.
When σ profiles were segmented into only three regions instead of five (model 2), the initial model trained on the same training data is significantly more accurate for conversion, as shown in Fig. S3.† The advantage of a smaller number of input variables in the surrogate model outweighs the potential loss of fidelity of chemical information through wider ranges of screening charge densities used. Indeed, using this model, some of the same solvents are suggested, including methyl pentanoate and propyl propanoate, and some new ones such as 5-nonanone and 1-nonanol (Table S2† shows detailed results of the outcomes). Both new solvents gave ≥90% conversion, and 1-nonanol is amongst the best solvents found in this reaction, with 70% d.e. A further iteration led to experimentally testing tert-butylamine.
Unlike conversion, d.e. is challenging to predict accurately, due to the limited range of the data. However, d.e. predictions appear to be better using model 1 (σ1–σ5) than model 2 (
). We have already identified that solvents primarily affect reactivity and only in part, d.e. Therefore, a model based on solvent descriptors cannot be expected to be accurate for d.e. in this case. Thus, it is more useful to think of d.e. prediction as a classification problem (50–60% as ‘low’, 60–70% as ‘high’). Using model 2, the final model correctly predicts the class of only five out of the nine data points, whereas seven out of nine are correctly classified using model 1. Solvent polarity, which is information contained in the profile extremes (non-neutral regions), is more impactful towards d.e. and is better described by σ1–σ5 than
. Hyperparameters using
are shown in Table 3 (model 2, entries 6–8).
| Strategy | Conversion > 90% | d.e. > 60% | Conversion > 90% and d.e. > 60% |
|---|---|---|---|
| Human | 1 of 34 (3%) | 13 of 34 (38%) | 1 of 34 (3%) |
| Algorithm | 12 of 18 (67%) | 10 of 18 (56%) | 7 of 18 (39%) |
The surrogate model for conversion that uses PCA (model 3) is superior to the model that uses σ profiles only (models 1 and 2). This is likely because model 3 includes further physico-chemical information, such as viscosity and solubility of a reactant. These descriptors capture information regarding phenomena that affect reaction conversion, such as mass transfer and molecular interactions. This highlights the importance of developing correct molecular descriptors for a specific reaction of interest. Model 3 does not improve the d.e. prediction; the additional descriptors do not contain extra important information for d.e., as compared to models 1 and 2. The hyperparameters are shown in Table 3 (model 3, entries 9–12), as well as the descriptors that most contribute to the principal components.
Whilst excellent conversions (>90%) could be achieved, attaining high d.e. proved to be more challenging, as the behaviour rendering this outcome requires specific interactions between the chiral ligand, and the substrate, and d.e. in this case is unlikely to reach >75% in any solvent under the same conditions for temperature and pressure. Therefore, the choice of chiral ligand is also key in this case, rather than just the solvent. The highly promising solvents described above are rarely reported in the vast body of literature on asymmetric hydrogenation. This is partially due to experimenter bias, and due to ease of access to the most commonly used laboratory reagents. Certainly, further analysis of the suggested solvents is required, specifically with respect of cost, supply chain and downstream separation.
(q2 = 0.76), model 1 using σ1–σ5 (q2 = 0.61), and finally using just one parameter, t1 (model 6, q2 = 0.59), capturing only 32% of the variance in the original data. For d.e., model 1 is best (q2 = 0.33), predicting 73% of the solvents to be in the correct category (‘high’ vs. ‘low’ d.e.), whereas the other models are weak. The cross-validated predictions for the best models for conversion and d.e. are shown in Fig. 4. These results suggest that an ensemble of GP models, each based on different descriptors, provides better results than using the same descriptors for the different objectives.
We found that the combined properties of 1-octanol and triethyl amine, produced superior results to either of the pure solvents (see Table S4†), and that the combined solvent σ profile fits the predictive models discussed earlier, where the combined σ profile is a linear combination of the concentrations of the two individual ones, as shown in Fig. S9.† In addition to temperature and the two solvents mentioned, we included two further solvents – 1-nonanol and tributyl amine – as optimisation variables to investigate the temperature-dependent mixed solvents' landscape of amines and alcohols, with the aim of identifying promising recipes and temperature operating conditions. The trade-off between reactivity and selectivity makes the temperature landscape complex and non-intuitive. Algorithm modifications are described in Materials and methods.
![]() | ||
| Fig. 5 Outcomes of initial and algorithm-suggested solvent mixtures and the determined Pareto front. Labels refer to experiments (entries in Table S5†). | ||
Table S4† shows the suggested recipes. The first five suggested reactions are at 82 °C, which was a temperature that achieved full conversion in all suggestions. In four out of five cases, an even higher selectivity than the training set entries that achieve full conversion. In a second algorithm-guided batch experiment, five mixtures at 65 °C were selected, showing in one case 96% and 70% (entry 25 in Table S4†). Three iterations were conducted, and outcomes of these recipes are shown in Fig. 5. The hyperparameters of the more accurate GP model (trained on all the available data) show that temperature is the most impactful variable towards each objective. Furthermore, alcohol content impacts conversion more than it does d.e., and that amine content impacts d.e. more. Out of the five recipes that lie on the determined Pareto front (entries 4, 20, 25, 28, 31 in Table S4†), four were selected by the algorithm. Cross-validated results show excellent GP predictive ability (qconversion2 = 0.91 and qd.e.2 = 0.88, shown in Fig. 6). The advantage of this approach was the lack of significant a priori knowledge, whereas the descriptors approach described earlier gives detailed explanation of the importance of certain descriptors, giving further mechanistic insight into the reaction. Unlike in the case of other descriptors, it is straightforward to compute σ-profiles for solvent mixtures using a linear weighting of the pure solvents' σ-profiles. This is an advantage that allows predicting outcomes of solvent mixtures using models such as those discussed earlier in the paper.
We adapted the pipeline optimization for our given problem domain. The aim was to determine its utility as a classification method, in conjunction with in silico modelling to amplify data, to navigate the descriptor space, and to optimise for superior solvents. The thresholds were set to 80% and 65% for conversion and d.e. respectively. The workflow is illustrated in Fig. 8. Given the large amount of data required to be used in classification and genetic programming, we decided to amplify a small set of 10 experimental data points to 100 by using the best GP surrogate models described earlier in the paper, specifically model 4 for conversion and model 1 for d.e. TPOT was then used to select some new solvents, which were tested experimentally and used to improve the accuracy of the amplified data by retraining the GP surrogate model. This loop is repeated until superior solvents are consistently found. The results show that this method rapidly classifies the experimentally confirmed excellent solvents as shown in Fig. 9, which shows reaction outcomes for the set of the initial 10 solvents in the first iteration (as individual data points). The experimentally verified solvents from Iteration 2 were used to improve the amplified data, which leads to very well-classified experimentally confirmed results in Iteration 3. Finally, these were incorporated into the feedback loop, and all solvents classified as ‘high’ were simulated in the highest fidelity versions of models 1 and 4 (for d.e. and conversion respectively), confirming excellent classification accuracy. Table S5,† which includes solvents described previously using TS-EMO, shows the detailed outcomes. Tables S6–S9† show the hyperparameters of models used in each iteration. Details of the best pipelines found, as well as classification accuracy at each iteration, are found in the ESI† as well.
![]() | ||
| Fig. 8 Strategy to investigate the applicability of Tree-based Pipeline Optimisation Tool (TPOT) to the solvent selection problem. | ||
Footnote |
| † Electronic supplementary information (ESI) available: Library of solvents and molecular descriptors. See DOI: 10.1039/c9sc01844a |
| This journal is © The Royal Society of Chemistry 2019 |