Probing machine learning models based on high throughput experimentation data for the discovery of asymmetric hydrogenation catalysts

Enantioselective hydrogenation of olefins by Rh-based chiral catalysts has been extensively studied for more than 50 years. Naively, one would expect that everything about this transformation is known and that selecting a catalyst that induces the desired reactivity or selectivity is a trivial task. Nonetheless, ligand engineering or selection for any new prochiral olefin remains an empirical trial-error exercise. In this study, we investigated whether machine learning techniques could be used to accelerate the identification of the most efficient chiral ligand. For this purpose, we used high throughput experimentation to build a large dataset consisting of results for Rh-catalyzed asymmetric olefin hydrogenation, specially designed for applications in machine learning. We showcased its alignment with existing literature while addressing observed discrepancies. Additionally, a computational framework for the automated and reproducible quantum-chemistry based featurization of catalyst structures was created. Together with less computationally demanding representations, these descriptors were fed into our machine learning pipeline for both out-of-domain and in-domain prediction tasks of selectivity and reactivity. For out-of-domain purposes, our models provided limited efficacy. It was found that even the most expensive descriptors do not impart significant meaning to the model predictions. The in-domain application, while partly successful for predictions of conversion, emphasizes the need for evaluating the cost–benefit ratio of computationally intensive descriptors and for tailored descriptor design. Challenges persist in predicting enantioselectivity, calling for caution in interpreting results from small datasets. Our insights underscore the importance of dataset diversity with broad substrate inclusion and suggest that mechanistic considerations could improve the accuracy of statistical models.


S1 Experimental settings S1.1 General considerations
All manipulations were, unless stated otherwise, performed under inert atmosphere in a nitrogen-filled glovebox.Chemicals, pre-catalysts, and anhydrous solvents were purchased from Sigma-Aldrich, STREM, Solvias, abcr, Santa Cruz Biotechnology, Ambeed, Kanto, Fisher Scientific, TCI, Sinocompound, and BLDpharm, and were used as received.Airand/or moisture sensitive materials were stored inside the glovebox.
A PTFE-coated magnetic stirring bar and 500 µl 1,2-dichloroethane (DCE) were added to each well.Plates were tumble stirred at room temperature, solubility was recorded (Y/N), and DCE was removed by parallel evaporation (Genevac EZ-2 Elite).For DCE-insoluble ligands this procedure was repeated with THF.
For complexation, a PTFE-coated stirring bar was added to each well and the metal precursor was (slurry) dispensed using a multichannel pipette (50 µl DCE well −1 ): [Rh(NBD) 2 ]BF 4 : 3.70 mg/9.90 µmol well −1 .Chiral ligands were dissolved/suspended in 500 µl DCE or THF. 100 µl ligand solution was dispensed to the metal precursor solution using a multichannel pipette.Reactor blocks (Para-dox, Analytical Sales and Services) were then closed and stirred overnight at room temperature (∼35 °C on tumble stirrer inside glovebox).Afterwards, solvent was removed by parallel evaporation.Plates were stored inside the glovebox and were used throughout the experimental campaign.

S1.2.2 Catalyst kit preparation
Catalysts were dissolved/suspended in 250 µl DCE (or the appropriate volume for 0.2 µmol/5 µl).A Parylene-C-coated stirring bar was added to each well (8x30 mm glass shells) and 5 µl of each catalyst solution was dispensed using a multichannel pipette (0.2 µmol catalyst per well).Care was taken to dispense to the bottom of each well.Solvent was removed from the source plates and kits by parallel evaporation.Kits were used immediately or stored inside the glovebox.

S1.2.3 Reaction execution
Pre-dispensed catalyst kits were used for all experiments.To each well a stock solution of 150 µl was added that contained the starting material (e.g., 4.4 mg well-1 of SM1 in 150 µl methanol or DCE to screen at 1 mol% Rh).The reactor block was closed with a pre-slitted PFA mat and Para-dox lid (Analytical Sales and Services).The reactor block was transferred to the parallel reactor system.
Experiments were performed in a custom-made parallel reactor system (Integrated Lab Solutions, Berlin, Germany, Figure S1).This reactor system was designed to fit four SBSsized well plates and offers individual control over gas composition (N 2 , H 2 , specialty, pressure (up to 100 bar), and reaction temperature (sub-ambient to 150 °C).Stirring is performed by tumble stirring.
Upon transfer of the reactor block to the reactor, the reactor was pre-heated to 25 °C, and the reactor headspace was flushed with N 2 (≥2 min at 5 l min −1 ) while the reactor lid was closed.The headspace was then flushed with H 2 (≥30 s) and pressurized to the desired pressure.Tumble stirring was then engaged to start the reaction.

S2 Additional information on internal data analysis
Three independent runs were performed with SM1 to assess the reproducibility of the approach.Results are reported in figure S4.

S3 Density Functional Theory calculations
3][4] This combination of functional and basis set were chosen in an effort to balance computational cost and accuracy.6][7][8] The nature of each stationary point was confirmed via frequency analysis.Thermochemical parameters (e.g.ZPE, finite temperature corrections and entropy contributions to Gibbs free energies) were computed from analytical frequencies (Hessian) at 298.15K and 1 atm.Single point calculations on free ligands extracted from the optimized metal-ligand complexes were done at the same level of theory to obtain ligand descriptors.A Natural Population Analysis (NPA) was performed using the NBO program version 3.1 as implemented in Gaussian 16.

S4 Details of featurization
The in-house developed Python package Open Bidentate Ligand eXplorer (OBeLiX) was used for the automated extraction and calculation of descriptors. 9An installation and usage guide can be found in its Github repository (https://github.com/EPiCs-group/obelix).
Model homogeneous catalyst structure were constructed, featuring a rhodium (Rh) metal center.The metal center was coordinated with a biphosphine bidentate ligand.Additionally, a norbornadiene (NBD) moiety was coordinated as a model substrate representative of the experimental protocol for pre-catalyst generation (Figure S5).In short, the workflow uses a graph method to find and enumerate the metal center and bidentate ligand donor atoms in these model structures.This enumeration was necessary for the calculation of local descriptors and orientation of descriptors such as quadrant/octant contributions of the buried volume.The enumeration of ligand donor atoms is based on a GFN2-xTB single-point calculation as implemented in Morfeus. 10,11For the bidentate ligands, the two coordinating atoms were distinguished based on their charge with the label 'min'/'max' denoting the least/most positively charged atom respectively.
Most calculated descriptors are self-explanatory and readily extracted using the default settings of Morfeus 11 or cclib. 12A detailed overview of all descriptors can be found in the 'C=C AH dataset.xlsx'Excel file.A more detailed explanation for the definition of the calculated dihedral angles and oriented buried volume are given in the text below.

S4.1 Buried volumes
The buried volume parameters were obtained utilizing the Morfeus package.This involved centering spheres of varying radii (ranging from 3.0 to 7.0 Å) on the Rh metal center and subsequently reporting the percentage of the sphere's volume occupied by the ligand.Additionally, the buried volume with a radius of 3.5 Å was calculated locally on both ligand donor atoms, referred to as buried volume donor max and -donor min.
Further analysis was conducted on the quadrant and octant contributions to the buried volume.These were defined using a buried volume radius of 7 Å.The donor min and donor max were oriented in the negative and positive x directions respectively, with the yaxis positioned perpendicular to the plane formed by donor min, donor max, and Rh.The quadrant buried volumes were defined by quadrants in the x,y plane, extending in both the positive and negative z directions.This approach ensures a thorough analysis of the buried volume parameters.An example of the steric maps are shown in Figure S6 for ligand L16, where the ligand structure and the oriented buried volume are shown next to each other.In this case, the P atom with index 22 (Figure S6a) is the max donor and thus points towards the positive x-axis.In the steric plot (Figure S6b), the t-butyl groups can be identified on the right, while the phenyl groups and their respective orientation are visible on the left.

S4.2 Dihedral angles
The NBD moiety comprised a central carbon atom, which was connected to two hydrogen atoms.To assess the spatial arrangement of the hydrogen atoms with respect to the metal center and the diagonally opposite phosphorus atom, the dihedral angles between each hydrogen atom, the metal center, and the corresponding phosphorus atom were calculated (Figure S7).The dihedral angles provide insights into the spatial orientation of the substrate and potential steric interactions between the metal center and ligand in the catalyst structure.

S5 Literature comparison of 3D DFT-based descriptors
In this section more details are given on the comparison of the descriptor library for overlapping ligands (111 out of 192) published by Sigman's group in collaboration with Genentech. 8The full analysis as described in this section can be found in the 'dft nbd model literature comparison.zip' in the SI which contains a Jupyter Notebook.Various aspects in the workflow to create the published descriptor library were observed to be slightly different from our study, such as the nature of the placeholder substrate and metal center (Rh(L)(NBD) in our case and Pd(L)(Cl) 2 in the published study), structure generation, geometry optimization, and descriptor calculation methods.Initially, we attempted to reconstruct the descriptor library via our methods to mitigate any difference in the descriptor calculation method.This was done by applying our OBeLiX workflow to the published xyz structures of catalysts.The comparison was done by extracting the subset of structures that are exactly the same, also in axial symmetry and orientation of stereocenters, and using our own OBeLiX package to calculate descriptors on them.The xyz files of the Pd(L)(Cl) 2 structures were extracted, and a DFT SP calculation was performed on these structures to derive the DFT-based descriptors (vide supra).It is crucial to emphasize that no supplementary geometry optimization was conducted; therefore, the comparison is based on the structures as originally published by Dotson et al. 8 The comparative study focused on a curated selection of descriptors chosen to comprehensively represent various catalytic properties.This encompassed the calculation of descriptors both on the complex and the free ligand, providing a holistic view.Three types of comparisons were conducted: 1. Global descriptors, capturing the overall electronic structure of the ligand.
2. Local descriptors, characterizing the environment around the metal center and ligand donor atoms.
3. Spatial arrangement via the buried volume, quantifying differences in the ligand's configuration.
For global descriptors, we selected the HOMO (figure S8 A), LUMO (figure S8 B), dipole moment (figure S8 C) and bite angle (figure S8 D).Interestingly, all descriptors exhibited a Pearson correlation coefficient (R 2 ) exceeding 0.75, except for the dipole moment.It is noteworthy that the dipole moment is particularly susceptible to variations in the ligand's conformation, exerting a significant influence on the steric environment.This nuanced effect remains unaccounted for in the bite angle, given its constrained measurement between three points, such as P-M-P in the case of PP ligands.
To compare the local environment of the donor atoms, the NBO charge, mulliken charge and buried volume on the donor atoms of the ligand were selected (Figure S9 A,B and   D).For the comparison, these descriptors were averaged over both donor atoms, implying that only the average contribution is compared.This was necessary since in the descriptor calculation, the donors are labeled based on their charge.To ease the comparison, we thus averaged the descriptor over both donors.The metal center's environment was compared

S16
using a buried volume at the metal center.The electronic descriptors on the donors showed good correlations (R 2 > 0.8), even after removal of two extreme cases (L17 (R)-BINAM-P and L139 (R)-CTH-BINAM).The correlations for steric descriptors are significantly worse, indicating a large difference in local steric environment, both around the metal center (Figure S9C) and the donor atoms (Figure S9D).Finally, the difference in the spatial arrangement of the ligand was compared through a quadrant/octant analysis (Figure S10).To do this, the buried volume at the metal center with a radius of 7 Å was separated into quadrant and octant contributions.To compare octant contributions regardless of specific orientation, we chose to compare the the minimum, maximum and average of octant contributions (Figure S10).This means that the minimum, maximum and average over the eight octants were calculated for both structures and compared.These comparisons show that although there is a trend, the correlation between the minimal contributions (R 2 = 0.402) and maximum contributions (R 2 = 0.586) are rather weak.However, the averaged octants show a reasonable correlation (R 2 = 0.72).This shows that although the extremes of the buried volume contributions might be different, the averages are similar.This also indicates that the local steric environment of the ligand is sensitive to small changes, e.g. to the nature of the placeholder substrate used in the metal-ligand complex (Rh(L)(NBD) in our case and Pd(L)(Cl) 2 in the published study).

S6 Principal Component Analysis
The first component explains 27% of the variance observed in the DFT NBD descriptors, while 11% of the variance is explained by the second component.In the score plot, each point is categorized by the family of ligands under investigation, such as bisphosphines, phosphine-amines, etc.This plot reveals the degree of similarity among catalysts based on their respective descriptors and three main clusters can be identified.Ligands belonging to the phosphoramidite class are distinguished by their positive values on both principal components, forming a distinct cluster in the upper right quadrant.Conversely, ligands classified as phosphine-amines are characterized by negative values on the second principal component, signifying a shared similarity in ligand properties within this category.Bisphosphine (PP) ligands are dispersed around the central region of the plot, indicating that their characteristics are representative of an average catalyst in this dataset.This observation aligns with expectations, given that bisphosphines constitute the majority of the ligands tested.
Our DFT NBD descriptors were binned into three categories: steric, geometric and electronic/thermodynamic.Correspondingly, the loading plot presented in figure SS11 is color-coded to reflect these categories.Steric descriptors correspond to negative values of the first principal component (PC1), whereas electronic descriptors are linked to extreme variations in both PC1 and PC2.This pattern allows us to infer that electronic properties are primarily responsible for differentiating the first two principal components in our analysis.

S7 Linear Regression
For in-domain approach we tested also Linear Regression for Conversion and DDG with up to three DFT-based descriptors (brute-force approach) for a total of 7770 linear regression models (with leave one out validation) for both regression tasks.Results are reported in Figure S12.Based on various tests with auto-ML and TPOT, the RF algorithm was suggested as a suitable non-linear model for our data.Subsequently, the RF model was implemented in our ML pipeline which can be found on our Github page (https://github.com/epics-group/obelixml-pipeline).The scikit-learn Python package was used for all functionalities included this pipeline.Feature importances were calculated using the deault Gini importance as implemented in SKlearn.An 80/20 train/test split was used for the out-of-domain approach, while for in-domain the median for each substrate's data was used.For each training, a 5-fold cross-validation method is applied.A grid search cross-validation method was used for selecting hyperparameters.Within this grid search, the options for each hyperparameter were: • 'max depth' = [5, 50, 100, None], *None applied only to OHE-based models • 'max features' = [3, 5], • 'min samples leaf' = [1, 2, 5, 10], • 'min samples split' = [2, 5, 10], • 'n estimators' = [50, 100, 200],

S9 Extended partially out-of-domain approach
In our study, we expanded the partially out-of-domain methodology with DFT-based descriptors for ligands to assess the impact of correlated substrates.Adopting a similar workflow, we trained models using not only half of the target substrate samples but also included samples from one of the other substrates, resulting in a total of 20 distinct models.The objective was to evaluate whether a correlation exists between the Pearson correlation scores observed for experimental values (as reported in figureS13) and the balanced accuracies achieved when training on one substrate and predicting on another, across all possible substrate pairs.This approach serves as an empirical investigation into the impact of correlated substrates.

S10 Naive out-of-domain approach
We analysed the generalisability of predictions of in-domain models in a naive out-of-domain fashion.Basically we compared the predictions obtained from model trained on i-th substrate with the corresponding experimental values for the j-th substrate as reported in the tables below.Clearly diagonal values report the in-domain scores as showed in the following tables, while off-diagonal values are scores obtained by the naive out-of-domain approach.We confirmed results obtained in the extended out-of-domain approach since as expected we observed higher scores for correlated substrates.Especially for reactivity prediction, BAs greater than 0.6 can be observed for naive out-of-domain predictions of the most correlated substrates: SM1,SM2 and SM3.

S11 Monte Carlo in-domain approach
To evaluate if the best-performing models for enantioselectivity could emerge from smaller subsets of related catalyst families, a Monte-Carlo data selection approach was utilized.
This method involved testing 1,000 random splits for each catalyst fraction, ranging from 90% to 10% of the entire catalyst set, in 10% decrements.Each subset was divided into an 80:20 training-test ratio, and Random Forest (RF) models were trained using DFTbased descriptors (see Figure S14).No clear pattern emerged that distinguished these highperforming subsets from the rest, such as by ligand family or class.This suggests that the high scores were likely due to chance correlations and test overfitting.

Figure S1 :
Figure S1: Parallel reactor system used in this work (Integrated Lab Solutions, Berlin, Germany).

Figure S3 :
Figure S3: Comparison of conversion and enantiomeric excess across three runs (#1,#2 and #3) for SM1, 16h, Methanol.Discrepancies are labelled and illustrated more in detail in the categorical scatter plots.

Figure S4 :
Figure S4: Validation of the utilized parallel pressure reactor (top) resulting in a standard deviation (bottom) for conversion of < 0.5% and a standard deviation of 1.5% for ee.

Figure
Figure S5: A selection of calculated descriptors.The precatalyst model structure is shown on the top.Examples of descriptors in various categories, such as geometric, global/local steric and electronic are also displayed.

Figure
Figure S6: (a) Visualization of the 3D structure of L16, a Josiphos ligand.(b) Oriented map of the buried volume occupied by the bidentate ligand.

Figure
Figure S7: (a) Side view of the dihedral angle comprised of the H-C nbd-Rh-P atoms.(b) Front view of the same dihedral angle.

Figure S8 :
Figure S8: The figure illustrates the correlation between four global descriptors-HOMO (A), LUMO (B), dipole moment (C), and bite angle (D)-derived from ligand structures analyzed in our study, compared to descriptors derived from the same ligand structures as used in literature.The R 2 values indicate strong correlations (> 0.75) for all descriptors except dipole moment.

Figure S9 :
Figure S9: The figure illustrates the correlation between four local descriptors-average NBO charge of donors (A), average Mulliken charge of donors (B), buried volume at 4 Å of metal center (C), and average buried volume of donors at 3.5 Å (D)-derived from ligand structures analyzed in our study, compared to descriptors derived from the same ligand structures as used in literature.The electronic descriptors on the donors showed good correlations (R 2 > 0.8), while the steric descriptors indicate a large difference in local steric environment.

Figure S10 :
Figure S10: The figure illustrates the correlation between the octant contributions of the buried volume at the metal center with a radius of 7 Å.The minimum (A), maximum (B) and average (C) over the eight octants were derived from ligand structures analyzed in our study and compared to descriptors derived from the same ligand structures as used in literature.The correlation between the averaged octants show a reasonable correlation, while the extremes show a weak correlation.

Figure
Figure S11: (left) PCA score and loading plots obtained from DFT-based descriptors, colored based on ligand families in the experimental screening set.

Figure S12 :
Figure S12: R2 scores distribution for brute-force in-domain Linear Regression modelling

Figure S13 :
Figure S13: Balanced accuracy (BA) or R2 scores vs Pearson correlation coefficients for the extended partially out of domain approach.The first substrate in label is the target substrate while the second is the one used in the training set.Orange dots highlight pairs containing SM4 and/or SM5 in label.

Figure S14 :
Figure S14: Monte Carlo in-domain approach, red crosses = DFT-based descriptors, blue dots= random descriptors, pink stars = ECFPs.Numbers reported refer to the fraction of ligands selected.