A genetic optimization strategy with generality in asymmetric organocatalysis as a primary target

A catalyst possessing a broad substrate scope, in terms of both turnover and enantioselectivity, is sometimes called “general”. Despite their great utility in asymmetric synthesis, truly general catalysts are difficult or expensive to discover via traditional high-throughput screening and are, therefore, rare. Existing computational tools accelerate the evaluation of reaction conditions from a pre-defined set of experiments to identify the most general ones, but cannot generate entirely new catalysts with enhanced substrate breadth. For these reasons, we report an inverse design strategy based on the open-source genetic algorithm NaviCatGA and on the OSCAR database of organocatalysts to simultaneously probe the catalyst and substrate scope and optimize generality as a primary target. We apply this strategy to the Pictet–Spengler condensation, for which we curate a database of 820 reactions, used to train statistical models of selectivity and activity. Starting from OSCAR, we define a combinatorial space of millions of catalyst possibilities, and perform evolutionary experiments on a diverse substrate scope that is representative of the whole chemical space of tetrahydro-β-carboline products. While privileged catalysts emerge, we show how genetic optimization can address the broader question of generality in asymmetric synthesis, extracting structure–performance relationships from the challenging areas of chemical space.


Reaction Database
The histograms in Figure S1 show the distribution of experimental enantioselectivity and percentage yield values in the database of Pictet-Spengler condensations curated from the literature.Out of 820 reactions, only 36% (295) of yields were reported, predominantly for high-yielding reactions (60% of reported yields are > 80%).If we analyze the coverage of the reaction space by Pictet-Spengler condensations with reported yields (Figure S2), we see that large areas of chemical space are excluded from the database.For these reasons, DFT computations and molecular volcano plots 1 are used to estimate turnover frequencies and probe catalytic activity (vide infra). 2 of the reaction space on the basis of the concatenated Morgan FingerPrints of the substrates, catalyst, and co-catalyst.Grey points represent all the reactions in the database (820), colored points (according to the experimental yield, 295) are the ones for which the yield is reported.

Linear Free Energy Scaling Relationships
. Linear Free Energy Scaling Relationships of catalytic cycle intermediates and TSs for the Scaling Relationship Set (only one enantiomeric pathway is considered).The x-axis is the chosen descriptor variable, ∆GRRS (2).The shaded areas denote the 95% confidence intervals.
The TOF molecular volcano plots were constructed based on the LFESRs shown in Figure S3.Detailed instructions are provided elsewhere. 3The open-source toolkit volcanic 3 was used to automatically generate the volcano plots.The input file for volcanic containing the relative Gibbs free energies of the Scaling Relationship Set is provided in Table S1.∆GRRS (2) was identified by volcanic to be the best descriptor variable for constructing the molecular volcano plots for C2 (mean R 2 = 0.57, mean MAE = 2.76 kcal/mol, MAPE = 0.81) and C3 (mean R 2 = 0.50, mean MAE = 3.15 kcal/mol, MAPE = 1.35) addition.Figure S4 shows an exemplary potential energy surface (PES), highlighting the existence of transition states with similar degree of TOF control. 4The location of the SRS reactions on the t-SNE plot of Figure 2 is reported in Table S2, along with the logTOF values for C2 addition and the experimental ∆∆G ‡ .
Table S1.Relative Gibbs free energies (kcal/mol) of the Scaling Relationship Set.This is the format of the input file for volcanic.

Bayesian ridge regression of enantioselectivity
An alternative representation of the database of Pictet-Spengler condensations catalyzed by dual-and single-hydrogen-bond donors (S/DHBDs, 407 reactions) was constructed by extracting with ᴍᴏʀғᴇᴜs 9 119 global and local molecular features 10,11 from the corresponding catalytic cycle intermediate 2 ("BD"-labelled enantiomer, optimized at the PCM(Toluene)/M06-2X-D3/Def2-TZVP//M06-2X-D3/Def2-SVP level of theory as described in the Computational Details).Features include frontier molecular orbital energies, solvent accessible surface areas, polarizabilities, NBO charges, NMR chemical shifts, local nucleo/electrophilicities, bond distances, angles, dihedrals, and Sterimol parameters.FMO energies, NBO charges, NMR shifts, polarizabilities, and dipole moments were computed at the M06-2X-D3/Def2-TZVP level, whereas all other electronic descriptors were extracted with ᴍᴏʀғᴇᴜs from computations at the GFN2-xTB level.Figure S6 shows the t-SNE plot of the featurized structures, indicating how reactions catalyzed by the same catalyst chemotype or involving the same substrate types are correctly grouped together.
We then used Bayesian ridge regression (BRR, a regularized variation of least-squares fitting) to parametrize a multivariate linear regression model 12 of the experimental ∆∆G ‡ values following our recently reported approach. 13The resulting expression [equation ( 1)] contains 26 parameters extracted from either the protonated tetrahydro-β-carboline species (blue terms), the conjugated base of the cocatalyst (green term), or from the non-covalently bound Brønsted acid (red terms) (Scheme S2).Leaveone-out cross-validation (Figure S7) affords R 2 of 0.53 and mean absolute error of 0.34 kcal/mol.Despite these promising results, the cost associated with the optimization of structures generated during an evolutionary experiment and the computation of molecular features for untested catalyst-substrates combinations prohibit the use of this model for fitness evaluation during genetic optimization with NaviCatGA.
. Multivariate linear regression of experimental ∆∆G ‡ values for Pictet-Spengler condensations catalyzed by single-and dual-HBDs (407 reactions).Bayesian ridge regression was used to parametrize the MLR models and provide uncertainty estimations (the error bars). 13 (1)  Table S3.List of flexible SMILES strings of organocatalyst scaffolds for the evolutionary experiments.S4.List of SMILES strings of R 1-2 substituents for HBD organocatalysts explored in the evolutionary experiments.
Table S5.List of SMILES strings of R [3][4] substituents for CPA organocatalysts explored in the evolutionary experiments.2) of the top individual in the HBD population for selected generations (i.e., when the identity of the best-performing catalyst changes) during a single-objective optimization experiment where only ∆GRRS(2)med is optimized.Each datapoint corresponds to a reaction in the GPS.Outliers and far outliers are indicated with filled circles and squares, respectively.

Specificity-oriented optimization on the HBD catalyst space
In the single-objective optimization experiment reported in Figure S11, activity [i.e., ∆GRRS( 2)med] improves form 4.1 to -2.5 kcal/mol over the course of 27 generations, approaching the volcano peak of -9.0 kcal/mol (Fig. 4D).Conversely, ∆∆G ‡ med remains equal to 0.9 kcal/mol and does not exceed 1.4 kcal/mol (generation 14), but the difference between the upper and lower whisker significantly decreases.The catalyst scaffold evolves from the thiosquaramide-pyrrolidino moiety in generation 1 to the squaramide-amide-based template in generation 27.In this generality-oriented experiment, the targets are scalarized hierarchically using Chimera 15 as follows: first, the median of the activity fitness score fi of catalyst candidate i across the GPS, defined (, a normalized gaussian distribution centered on the target x (-9 kcal/mol, the volcano peak), is maximized.Then, the median selectivity (∆∆G ‡ med) across the GPS is maximized with a 10% degradation margin.Finally, the standard deviations of ∆∆G ‡ med and median fi are minimized with a 25% compromise.This experiment differs from the one reported in Fig. 8 by the order of the first two targets, meaning that we allow ∆∆G ‡ med to be marginally decreased in order to improve activity.Over the course of 47 generations, enantioselectivity increases from ∆∆G ‡ med = 1.0 kcal/mol to 1.4 kcal/mol, while activity simultaneously improves from ∆GRRS(2)med = 2.9 kcal/mol to -2.1 kcal/mol (i.e., closer to the volcano peak of -9.0 kcal/mol).Compared to the experiment reported in Fig. 8, by inverting the order of priority of the two targets NaviCatGA explores candidates with higher median activity but lower median selectivity: the top organocatalyst found at the end of the evolutionary experiment lacks the amide-based template [-C(=O)NR2] which is associated with high generality, as evinced by the number of reactions in the GPS with ∆∆G ‡ < 1 kcal/mol.2) of the top individual in the HBD population for selected generations (i.e., when the identity of the best-performing catalyst changes) during a specificity-oriented optimization experiment.Each datapoint corresponds to a reaction in the GPS, the yellow diamond indicates reaction 13 [2-(5-(benzyloxy)-7-methyl-1H-indol-3yl)ethan-1-amine + ethyl 2-oxopentanoate).Outliers and far outliers are indicated with filled circles and squares, respectively.The optimization targets are ∆∆G ‡ and ∆GRRS(2) of reaction 13.
Figure S13 reports another specificity-oriented evolutionary experiment performed on the HBD catalyst space where we wish to simultaneously optimize the activity [∆GRRS (2)] and enantioselectivity [∆∆G ‡ ] of reaction 13 in the GPS.This Pictet-Spengler condensation features an unprotected β-arylethylamine (SubA) substituted at the 7-position of the indole ring, which has been found to be associated with low stereoselectivity due to the disruption of a key non-covalent interaction between the catalyst's amide O and the indole N-H (see Figures 10 and 11).Over the course of 44 generations, ∆∆G ‡ reaches the value of 0.7 kcal/mol, which is slightly higher than that predicted for the top organocatalyst from generation 32 in the multi-objective, generality-oriented experiment (0.4 kcal/mol, Fig. 8 and Fig. 10) or for the one from generation 37 in the single-objective experiment (0.5 kcal/mol, Fig. 9 and Figure S10).Conversely, the catalyst found in this NaviCatGA run is less general (∆∆G ‡ med = 1.6 kcal/mol) than from the Fig. 8 (1.9 kcal/mol) or from the SOO experiment (2.0 kcal/mol).Despite the improvement in predicted enantioselectivity, this transformation remains essentially an outlier in the GPS (the yellow diamond corresponding to reaction 13 lies on the lower whisker of the box plot of generation 44, Figure S13), indicating that HBD catalysts possessing the amide template [-C(=O)NR2] are ill-suited to impart high stereoselectivity in condensations involving 7-indolyl substituents.The best organocatalyst from generation 44 features a cyclohexyl urea motif with an unusual ethylbenzene substituent (also found in the specificity-oriented optimization of

Figure S1 .
Figure S1.Distribution of experimental ∆∆G ‡ values (left) and experimental yields (right) in the literature database of 820 Pictet-Spengler reactions.

Figure S5 .
Figure S5.Out-of-sample enantioselectivity predictions on 46 reactions 6-8 withheld from the literature database using the trained XGBoost/MFP model.Teal points indicate reactions involving geminallydisubstituted tryptamines, 6 as shown in Scheme S1.

Figure S6 .
Figure S6.2D t-SNE map 2 of the Pictet-Spengler reactions catalyzed by single-and dual-HBDs on the basis of the molecular (global and local) features of catalytic cycle intermediate 2.

Figure S8 .
Figure S8.Database of Brønsted acid templates for evolutionary experiments.X = O/S.

Figure S9 .
Figure S9.Box-and-whisker chart showing the evolution of ∆∆G ‡ and ∆GRRS(2) of the top individual in the HBD population for selected generations (i.e., when the identity of the best-performing catalyst changes) during the specificity-oriented optimization.Each datapoint corresponds to a reaction in the GPS, the yellow diamond indicates reaction 11 (Nβ-benzylserotonin + benzyloxyacetaldehyde).Outliers and far outliers are indicated with filled circles and squares, respectively.The optimization targets are ∆∆G ‡ and ∆GRRS(2) of reaction 11.
Figure S12.Box-and-whisker chart showing the evolution of ∆∆G ‡ and ∆GRRS(2) of the top individual in the HBD population for selected generations (i.e., when the identity of the best-performing catalyst changes) during a multi-objective optimization experiment with activity prioritized.Each datapoint corresponds to a reaction in the GPS.Outliers and far outliers are indicated with filled circles and squares, respectively.

Figure
Figure S13.Box-and-whisker chart showing the evolution of ∆∆G ‡ and ∆GRRS(2) of the top individual in the HBD population for selected generations (i.e., when the identity of the best-performing catalyst changes) during a specificity-oriented optimization experiment.Each datapoint corresponds to a reaction in the GPS, the yellow diamond indicates reaction 13[2-(5-(benzyloxy)-7-methyl-1H-indol-3yl)ethan-1-amine + ethyl 2-oxopentanoate).Outliers and far outliers are indicated with filled circles and squares, respectively.The optimization targets are ∆∆G ‡ and ∆GRRS(2) of reaction 13.

Table S2 .
Dimensions of the 2D t-SNE plot of the Pictet-Spengler database for the Scaling Relationships Set, along with the descriptor variable [∆GRRS(2), kcal/mol] and the logTOF values (1/s) for the C2 addition molecular volcano plot.The experimental ∆∆G ‡ value for these reactions is also reported.