 Open Access Article
Shigeru Yamaguchi
RIKEN Center for Sustainable Resource Science, 2-1 Hirosawa, Wako, Saitama 351-0198, Japan. E-mail: shigeru.yamaguchi.hw@a.riken.jp
    
First published on 6th July 2022
This review highlights recent advances (2019–present) in the use of MFA (molecular field analysis) for data-driven catalyst design, which enables improved selectivities/reaction outcomes in asymmetric catalysis. Successful examples of MFA-based molecular design are described together with how to design molecules by MFA, including how to generate and evaluate MFA-based regression models, and future challenges in MFA-based molecular design in molecular catalysis are discussed.
In the 1960s, Hansch and Fujita et al. applied the extended Hammett rule to predict biological activities of molecules,8 which led to the construction of the QSAR (quantitative structure–activity relationships) field6,9 and the Hansch–Fujita method is called classical QSAR. QSAR employs biological activities as the target variables. In this review, the target variables are product selectivity. Such regression analysis can be called QSSR (quantitative structure–selectivity relationship) or QSPR (quantitative structure–property relationship) modelling. According to the perspective paper ‘Understanding the roles of the “two QSARs”’10 published by Fujita and Winkler, QSAR/QSPR models can be roughly divided into two types:
Type I: Models for mechanistic interpretations by analysis of small sets of chemically similar molecules.
Type II: Models for predictive purposes relying on machine learning techniques using large and chemically diverse datasets.
The free energy relationships represented by the Hammett rule are classified as Type I because the main purpose of free energy relationships/the Hammett rule is an interpretation of reaction mechanisms through the data analysis of chemically similar datasets. This review article focuses on regression analysis in asymmetric catalysis. The target variables in asymmetric catalysis are logarithms of enantiomeric ratios, which correspond to free energy differences (ΔΔG‡) between the pathways that lead to major and minor enantiomers (Curtin–Hammett principle11). Thus, linear regression analysis in asymmetric catalysis can be regarded as free energy relationships. Free energy relationships in asymmetric catalysis have been investigated by the Sigman group.1 In 2008, Sigman and co-workers reported free-energy relationships/univariate regression analysis in asymmetric Nozaki–Hiyama–Kishi reactions using a classical steric descriptor, Taft–Charton parameters.12,13 Since then, the Sigman group has examined various descriptors, in particular descriptors that can be calculated on computers, such as Sterimol parameters,14 computed IR frequencies,15 and so on. They performed mechanistic interpretation and molecular design in molecular catalysis including asymmetric catalysis based on their modern physical organic chemistry framework.1
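The relationship between the enantiomeric excess and the target variable ΔΔG‡ can be sketched as follows. This is a minimal illustration, assuming ΔΔG‡ = RT ln(er) with er = (100 + ee)/(100 − ee); the function names and the default temperature of 298.15 K are assumptions made for the example and are not taken from the studies discussed here.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def ee_to_ddg(ee_percent, temperature=298.15):
    """Convert % ee into ddG (kcal/mol) via ddG = RT ln(er).

    The temperature default is an assumption; use the actual reaction temperature.
    """
    if not -100.0 < ee_percent < 100.0:
        raise ValueError("ee must lie strictly between -100 and 100")
    er = (100.0 + ee_percent) / (100.0 - ee_percent)  # enantiomeric ratio
    return R_KCAL * temperature * math.log(er)


def ddg_to_ee(ddg_kcal, temperature=298.15):
    """Inverse conversion: ddG (kcal/mol) back to % ee."""
    er = math.exp(ddg_kcal / (R_KCAL * temperature))
    return 100.0 * (er - 1.0) / (er + 1.0)
```

Under these assumptions, 94% ee corresponds to roughly 2.1 kcal/mol at 298 K, which illustrates why small energy differences translate into large changes in selectivity.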
In contrast to the above Type I QSPR, which mainly aims for mechanistic interpretation, the purpose of Type II QSPR is prediction. Although Type II usually employs large and chemically diverse datasets according to the aforementioned perspective paper,10 we refer to regression models that aim to quantitatively predict reaction outcomes as Type II in this article. For example, the Doyle group constructed a regression model to predict reaction yields in Buchwald–Hartwig reactions using Random Forests.16 While the authors collected test and training samples by a systematic combinatorial screening of similar catalysts, substrates, and reagents (i.e., analysis of a chemically similar dataset), the main purpose of their regression analysis was the quantitative prediction of reaction yields. Thus, we classify the above example as Type II QSPR. Denmark and co-workers reported another representative example of Type II QSPR/QSSR. They demonstrated the prediction of higher selectivity catalysts using molecular fields as descriptors and non-linear regression techniques such as support vector machines and neural networks.17 While they also employed the framework of MFA (i.e., the main topic of this review), their purpose was prediction and thus the analysis is classified as Type II QSSR in this article. The Glorius group reported Type II QSPR/QSSR modeling using Denmark's and Doyle's datasets along with molecular fingerprint descriptors (bit strings that represent molecular structures).18 The aforementioned perspective paper by Fujita and Winkler states: “One of the major drivers for the emergence of two main “camps” of QSAR researchers has been the increasingly arcane nature of the descriptors used in QSAR models generated by nonclassical (e.g., machine learning-based) methods that have become popular”.10 Thus, it should be noted that descriptors are important for judging the types of models.
In our opinion, however, the types of QSAR/QSPR models can be classified by purpose as described above (Type I: models for mechanistic interpretation, Type II: models for prediction), although further discussions regarding this classification will be required. As the Denmark and Doyle groups employed nonclassical machine learning-based methods such as neural networks and their purposes are prediction, we classify their models as Type II, although they employed highly interpretable and physically meaningful descriptors. This review mainly focuses on the MFA classified as Type I that provides mechanistic insights leading to molecular design with improved enantioselectivity in asymmetric catalysis.
Since the Lipkowitz and Kozlowski reports, MFA has been used for the analysis of asymmetric catalysis.26 In 2004, the Hirst group reported MFA in phase transfer asymmetric catalysis (Scheme 1a), in which the authors calculated descriptors from substituents R1 and R2 without considering catalyst structures (topomer CoMFA27).28 Denmark et al. also reported MFA in similar reactions29 (Scheme 1b), in which they employed an indicator field (vide infra) instead of the typical molecular field described above. Lei et al. reported MFA in Ru-catalyzed asymmetric hydrogenation of acetophenones30 (Scheme 1c).
Scheme 1 (a) Asymmetric phase transfer catalysts analyzed by Hirst et al. (ref. 28) (88 samples [training set: 70 samples, test set: 18 samples], data range: 16% ee–93% ee, R² = 0.82, q² = 0.72, R²pred = 0.69). (b) Asymmetric phase transfer catalysts analyzed by Denmark et al. (ref. 29) (data range: −28% ee–62% ee, R² = 0.94, q² = 0.79 (0.76*)) *leave 20% cross validation over 100 runs. (c) Ru-Catalyzed ketone hydrogenation reactions analysed by Lei et al. (ref. 30) (25 samples [training set: 20 samples, test set: 5 samples], data range: −99% ee–99% ee, R² > 0.99, q² = 0.80, R²pred = 0.97). Schemes 1–4 were adapted and modified with permission from CICSJ Bull., 2017, 35, 133. Copyright 2017 the Chemical Society of Japan.
The examples shown above employed LJ and coulombic potentials between probe atoms and molecules as molecular fields.
The Kozlowski group reported MFA that employed the quantum-mechanics (QM)-based interaction energy between probe atoms and molecules (QM-QSAR31).24,32,33 The target reaction was enantioselective addition of diethyl zinc reagents to aldehydes using chiral amino alcohols. The authors used transition-state structures that lead to major enantiomers (Scheme 2a(I)) for the calculation of molecular fields.24 The authors performed linear regression with two descriptors selected from the molecular fields by a simulated annealing method. They also employed catalyst structures32 (Scheme 2a(II)) and substrate structures33 for calculations of the molecular fields. The authors performed QM-based MFA in asymmetric lithiation–substitution of N-Boc–pyrrolidine as well.34
Scheme 2 (a) Asymmetric alkylation of aldehyde using β-amino alcohols analyzed by Kozlowski et al. (I) Analysis using molecular fields calculated from transition state structures (ref. 24) (18 samples [training set: 14 samples, test set: 4 samples], data range: 0% ee–98% ee, R² = 0.90, R²pred = 0.92). (II) Analysis using molecular fields calculated from catalyst structures (ref. 32) (31 samples [training set: 18 samples, test set: 13 samples], q² = 0.85 [leave-two-out cross validation], R²pred = 0.87). (b) The asymmetric lithiation–substitution of N-Boc–pyrrolidine analyzed by Kozlowski et al. (ref. 34) (16 samples, data range: 0% ee–97% ee, R² = 0.82, q² = 0.67).
MFA requires alignment based on, for example, a common catalyst skeleton for the calculations of molecular fields. An alignment-independent 3D-QSAR method, GRIND (GRid Independent Descriptor),22 has been applied to MFA in asymmetric catalysis by the Morao group35 using Kozlowski's and Lipkowitz's datasets (Fig. 2 and Scheme 1a). Bo et al. reported combinations of a QM-based method and GRIND for the calculations of molecular fields in asymmetric catalysis.36 Carbó et al. applied the GRIND-based MFA to the analysis of Rh-catalyzed asymmetric hydroformylation of styrenes37 (Scheme 3).
Scheme 3 Rh-Catalyzed asymmetric hydroformylation of styrenes analyzed by Carbó et al. (ref. 37) (21 samples, data range: 2% ee–94% ee, R² = 0.99, q² = 0.74). A quantum mechanical method was used for the calculations of molecular fields.
The MFA described above employed one of the conformers (e.g., the most stable conformers) for the calculations of molecular fields. MFAs using molecular fields calculated from the structures obtained from a trajectory of MD simulations (4D-QSAR21) and Boltzmann-weighted conformers (3.5D-QSAR) have been reported by the Hirst group.38 The target was asymmetric phase transfer catalysis shown in Scheme 4.
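The idea of weighting conformers by their Boltzmann populations can be sketched as follows. This is only an illustration under stated assumptions: each conformer is taken to have its own field vector and a relative energy in kcal/mol, and the population-weighted average of the fields is used as the descriptor. The function names and the default temperature are hypothetical, and the exact procedure of ref. 38 may differ.

```python
import numpy as np

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def boltzmann_weights(energies_kcal, temperature=298.15):
    """Return normalized Boltzmann populations for a set of conformer energies."""
    energies = np.asarray(energies_kcal, dtype=float)
    relative = energies - energies.min()  # shift for numerical stability
    weights = np.exp(-relative / (R_KCAL * temperature))
    return weights / weights.sum()


def weighted_field(fields, energies_kcal, temperature=298.15):
    """Average per-conformer field vectors using Boltzmann weights.

    fields: array of shape (n_conformers, n_grid_points).
    """
    weights = boltzmann_weights(energies_kcal, temperature)
    return weights @ np.asarray(fields, dtype=float)
```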
Scheme 4 Asymmetric phase transfer catalysts analyzed by Hirst et al. (ref. 38) (40 samples, data range: 30% ee–91% ee, CoMFA R² = 0.94, q² = 0.78, 3.5D-QSAR R² = 0.95, q² = 0.82, 4D-QSAR R² = 0.86, q² = 0.76).
As molecular fields, we used steric indicator fields, which are composed of indicator variables (0 or 1) and calculated as follows (Fig. 4b): (I) A set of Pd-enolate structures was optimized using the DFT method. (II) The coordinates of the molecules obtained in step I were aligned based on the common reactive site of the intermediates, which is shown in red in Fig. 4b(I). Atoms other than those of the β-ketoester and the equatorial Ar groups on the ligands were removed. (III) The structures were placed in a grid space with a unit cell size of 1 Å per side. The enolate α-carbon was set as the origin, and the xy plane was defined based on the enolate mean plane. The size of the grid space, which is centered at the origin, is 6 × 8 × 8 Å³. Each unit cell is regarded as an element of the descriptor vectors. Unit cells that overlapped with the van der Waals radius of any atom were assigned 1; all other cells were assigned 0. Columns in the descriptor matrix that exhibited small deviations were removed. The calculations of the molecular fields are further discussed in section 4.2 “How to calculate descriptors”. The MFA described in sections 3.2 and 3.3 also employed the steric indicator fields. The MFA in this section employed LASSO or Elastic Net regression39–41 instead of PLS regression, which is typically employed in MFA.
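Step III above can be sketched in a few lines. This is a minimal illustration, not the implementation used in the original studies: it assumes that the structures are already aligned with the origin at the grid centre, that a unit cell counts as occupied when its centre lies within the van der Waals radius of any atom, and that Bondi radii are used. The function names, the cell-centre test, and the `min_std` threshold are assumptions made for the example.

```python
import numpy as np

# Bondi van der Waals radii in angstrom (a small illustrative subset)
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52}


def indicator_field(symbols, coords, shape=(6, 8, 8), cell=1.0):
    """Return a flat 0/1 vector with one entry per unit cell.

    A cell is set to 1 if its centre lies within the van der Waals
    radius of any atom (assumed occupancy test), otherwise 0.
    """
    # Cell-centre coordinates along each axis, grid centred at the origin
    axes = [(np.arange(n) - (n - 1) / 2.0) * cell for n in shape]
    gx, gy, gz = np.meshgrid(*axes, indexing="ij")
    centres = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)

    coords = np.asarray(coords, dtype=float)
    radii = np.array([VDW_RADII[s] for s in symbols])
    # Distance from every cell centre to every atom
    dist = np.linalg.norm(centres[:, None, :] - coords[None, :, :], axis=-1)
    return (dist <= radii[None, :]).any(axis=1).astype(int)


def drop_low_variation_columns(X, min_std=0.0):
    """Remove descriptor columns whose standard deviation is <= min_std."""
    X = np.asarray(X)
    keep = X.std(axis=0) > min_std
    return X[:, keep], keep
```

Stacking the vectors of all training structures row by row gives the descriptor matrix; with the default grid each structure contributes 6 × 8 × 8 = 384 indicator variables before column filtering.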
The indicator fields and enantioselectivity values were correlated to generate regression models. The structural information extracted by regression analysis is shown in Fig. 4c along with an intermediate structure. The definition of important structural information shown in sections 3 and 4 is summarized below.
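How a sparse regression picks out a small number of unit cells can be sketched with a plain coordinate-descent LASSO. This is a simplified stand-in written with NumPy only, assuming the objective (1/2n)‖y − Xw − b‖² + α‖w‖₁; the function names and the values of `alpha` and `n_iter` are illustrative, and the original studies may have used different software and settings.

```python
import numpy as np


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (LASSO proximal step)."""
    return np.sign(value) * max(abs(value) - threshold, 0.0)


def lasso_cd(X, y, alpha=0.1, n_iter=200):
    """Fit LASSO by cyclic coordinate descent; returns (coefficients, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean  # centring handles the intercept
    w = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # constant column carries no information
            # Residual with the contribution of descriptor j added back
            residual = yc - Xc @ w + Xc[:, j] * w[j]
            rho = Xc[:, j] @ residual / n
            w[j] = soft_threshold(rho, alpha) / col_sq[j]
    return w, y_mean - x_mean @ w
```

Unit cells whose coefficients are shrunk exactly to zero drop out of the model, so only a few grid points remain to be visualized on the structures.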
Blue/red points correspond to molecular fields (i.e., unit cells shown in Fig. 4b(III)) with positive/negative regression coefficients, respectively. If molecular structures are on the blue/red points, enantioselectivity increases/decreases. Blue (red)/light blue (light red) points indicate that molecular structures overlap/do not overlap with the points.
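The colour convention above can be expressed as a small lookup. The sketch below is illustrative only; the function name and the tolerance `tol` used to treat a coefficient as zero are assumptions made for the example.

```python
def classify_points(coefficients, field, tol=1e-8):
    """Label each unit cell for one structure.

    coefficients: regression coefficient per unit cell.
    field: 0/1 indicator values of the structure being visualized.
    """
    labels = []
    for coef, occupied in zip(coefficients, field):
        if abs(coef) <= tol:
            labels.append("none")  # cell not selected by the model
        elif coef > 0:
            labels.append("blue" if occupied else "light blue")
        else:
            labels.append("red" if occupied else "light red")
    return labels
```

In this picture, light blue cells are candidate positions for new substituents, because occupying them is expected to increase enantioselectivity.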
We can obtain insights into asymmetric induction mechanisms based on the visualized information as the MFA is Type I QSPR. In this case, the blue points mainly exist on the Si-face and were observed around the aryl group on the ligands and the ester substituents on the substrates (Fig. 4c). This means the substituents formed a pocket around the reaction centre, hindering the reaction from the Si-face (Fig. 4d). On the other hand, the aryl group of the ligands and the ester-substituents on the Re-face were on the same side, indicating that the nucleophilic attack of the Pd-enolate on the fluorinating reagent (i.e., NFSI) proceeds smoothly from the Re-face.
Further comparison between the visualized structural information and the intermediate structures showed that the light blue points visualized on some intermediate structures could guide the design of molecules, as shown in Fig. 5a(I) and b(I) (yellow arrows). We designed a ligand and a substrate by introducing substituents to overlap with the light blue points. The intermediates composed of the designed ligand (6Pd) and the designed substrate (Bzh) both overlap with the blue points as shown in Fig. 5a(II) and b(II) (green arrows), and both intermediate structures make the pocket on the Si-face narrow. The calculated ΔΔG‡ values for the reactions using the designed molecules were excellent (Fig. 5c). The reaction using the designed substrate exhibited significantly improved enantioselectivity in comparison with those of the training samples (94% ee vs. up to 81% ee, Fig. 5c).
Fig. 5 Molecular design of (a) chiral ligand 6Pd and (b) substrate Bzh based on the MFA using intermediate structures and (c) the reaction using the substrate.
Fig. 6 Dataset for the MFA using computational screening data. Reprinted with permission from Bull. Chem. Soc. Jpn., 2022, 95, 271. Copyright, The Chemical Society of Japan.
To collect samples, TS calculations were performed using a combination of three NHC ligands (1Cu–3Cu) and six substrates (S1–S6). The range of the experimental ee (enantiomeric excess) was 18–73% ee. The MFA was performed using the calculated ΔΔG‡ values and the corresponding transition-state structures. The extracted and visualized structural information provided an insight into the asymmetric induction mechanism (see section 4.3, Fig. 14). Based on the obtained insight, chiral ligands 4Cu and 5Cu were designed by introducing substituents into the template molecules to overlap the light blue point designated by yellow and green arrows shown in Fig. 7a and b; these ligands exhibited improved calculated ΔΔG‡ values in comparison with the design template. The experimental enantioselectivity values in the reactions using the designed NHC ligands were higher in comparison with those in the training samples (87% ee vs. up to 73% ee). The MFA using computational screening data including the designed NHC ligands (30 training samples calculated from the combination of five ligands and six substrates) was then performed and, based on the visualized information, NHC ligands were designed again, which showed improved experimental enantioselectivity (96% ee vs. up to 89% ee) as shown in Fig. 7c and d. While 6Cu was an already examined optimum ligand in the related catalytic systems, 7Cu is a ligand that would not have been examined without the information obtained by the MFA using computational screening data.
Fig. 7 Molecular design based on the MFA using computational screening data and the experimental results. The results of MFA using (a), (b) 18 samples and (c), (d) 30 samples. As molecular fields, the indicator fields were calculated by a procedure similar to that shown in Fig. 4. The sizes of the grid spaces (unit cell size: 1 Å per side) were 6 × 6 × 6 Å³ for the 1st MFA (18 training samples) and 6 × 8 × 8 Å³ for the 2nd MFA (30 training samples).
Both the MFAs using computational and experimental screening data described in the previous section have particular strengths and these are usually complementary. The characteristics of the MFA using computational screening data are listed below.43
• We can collect training samples without experiments.
• The calculation cost is high (transition-state calculations).
• The calculated ΔΔG‡ values include less information in comparison with the experimental ΔΔG‡ values.
• The reaction mechanism must be known to some extent.
On the other hand, the MFA using experimental screening data and intermediate structures4 described in section 3.1 has the following characteristics:
• High-quality experimental data are required.
• The calculation cost is reasonable (ground-state calculations).
• The experimental ΔΔG‡ values provide a lot of information including solvent effects etc.
• This method is applicable even when reaction partner structures are unclear (we did not calculate descriptors from the reaction partner, i.e., NFSI as shown in section 3.1).
In summary, as experimental data include a lot of information that is difficult to reproduce by DFT calculations, such as solvent effects, the MFA using intermediate structures and experimental data can extract more information in comparison with the MFA using computational screening data. In some cases, however, it is not easy to collect high-quality data due to, for example, the use of expensive and synthetically difficult catalysts. In such cases, the MFA using computational screening data is useful.
A specific target is an asymmetric two-component iridium/boron dual catalyst system for α-C-allylation of carboxylic acids48 (Fig. 8). The target reaction proceeds as follows: the Ir catalyst activates the substrate to afford the Ir–π-allyl intermediate, and the chiral boron species activates the remaining carboxylate moiety to generate the chiral B-enolate species. The chiral B-enolate species attacks the chiral Ir–π-allyl complex to stereodivergently afford products (Fig. 8). Inversion of the absolute configuration of the chiral ligands on the B-catalyst shown in Fig. 9a changes the relative configuration of the products. Although the initial attempt of the reaction afforded products with excellent enantioselectivity, both the reactions using the S and R boron catalysts showed low diastereo- and regioselectivity (linear/branch selectivity; the structure of the linear product is shown in Fig. 9a). Thus, the purpose of the analysis is the improvement of regio- and stereoselectivities to selectively synthesize (2R,3R)- and (2S,3R)-products when using the S and R boron catalysts, respectively. Importantly, the Ir–π-allyl complexes are the well-established49 common intermediates in the diastereo- and regioselectivity determining step. Thus, molecular fields calculated from a set of Ir–π-allyl intermediate structures allow us to analyse four sets of reaction outcomes. While the boron enolate structures were not used for the calculation of the descriptors/molecular fields, the information about the boron catalysis is included in the experimental data. Thus, analysis using experimental ΔΔG‡ values and the molecular fields calculated from Ir–π-allyl complexes extracts and visualizes the information about how the Ir–π-allyl complex and the B-enolate interact with each other when the reaction proceeds.
Important structural information about the four selectivity outcomes visualized on the identical intermediate structures enables facile comparison of their selectivity-determining factors, thereby allowing control of the multiple reaction outcomes.
Fig. 8 Asymmetric iridium/boron hybrid catalysis for stereodivergent synthesis of α-allyl carboxylic acids.
Fig. 9 The result of the MFA and the molecular design in asymmetric Ir/B hybrid catalysis. (a) Dataset for the MFA in asymmetric Ir/B hybrid catalysis. (b) Overall design path. (c)–(e) Important structural information visualized on the Ir–π-allyl intermediates and the molecular design based on the structural information. The number in parentheses is the number of reactions used for the MFA. As molecular fields, the indicator fields were calculated using a procedure similar to that shown in Fig. 4. The size of the grid space (unit cell size: 1 Å per side) is 10 × 12 × 6 Å³. Adapted with permission from Cell Rep. Phys. Sci., 2021, 2, 100679. Copyright 2021 Cell Press.
The overall design process is summarized in Fig. 9b. Using the training data (two sets of 24 reactions) collected by screening a combination of 12 phosphoramidite ligands and two substrates (Fig. 9a), the MFA was performed. The training samples were selected mainly based on availability (for more details about the selection of the training data, see section 4.1). As shown in Fig. 9b, among the four regression models, the model for the b/l ratios in the reactions using boron ligand S was employed for molecular design. The important structural information visualized on the Ir–π-allyl intermediate structures is shown in Fig. 9c. Light blue points are found around the 3,4-positions of the binaphthyl skeleton. Four ligands 13Ir–16Ir were designed by introducing substituents to overlap with the light blue points. The structure of 15IrPr (the intermediate consisting of ligand 15Ir and substrate Pr) is shown in the right panel of Fig. 9c. The reactions using ligands 13Ir–16Ir showed improved regioselectivity. While regioselectivity improved, the diastereoselectivity values were not satisfactory. Thus, we collected additional training samples using the designed ligands and again performed the MFA using the 32 training samples. As shown in Fig. 9b, MFA using 32 samples led to the design of optimum ligands 17Ir for the boron ligand S and 18Ir for the boron ligand R. Here, we show the molecular design based on the MFA using the data obtained from the reactions using boron ligand R. The structural information for the b/l ratios and dr visualized by MFA is shown in Fig. 9d and e. The light blue points are observed around the 2-position of the fluorene moiety of 5IrPr. Therefore, we introduced a tBu group at this position, and the reaction using the designed ligand 18Ir showed excellent regio- and diastereoselectivity.
In summary, the analysis of 32 molecular structures with the MFA framework enabled the control of complicated organic reactions, stereodivergent asymmetric synthesis, indicating the powerful potential of our data-driven approach. The overview of molecular design in this complicated reaction can be found as a movie in the original literature (https://ars.els-cdn.com/content/image/1-s2.0-S2666386421004045-mmc7.mp4).
Fig. 11 Chemoinformatics-guided optimization protocol. (A) Generation of a large in silico library of catalyst candidates. (B) Calculation of robust chemical descriptors. (C) Selection of a universal training set (UTS). (D) Acquisition of experimental selectivity data. (E) Application of ML to use moderate- to low-selectivity reactions to predict high-selectivity reactions. Reproduced with permission from ref. 17. Copyright 2019 American Association for the Advancement of Science.
Regarding the statistical metrics, there have been long debates on the evaluation of regression models in QSAR/QSPR.51 One of the widely employed sets of indices for evaluating the quality of QSAR/QSPR models is the Golbraikh–Tropsha criteria.51 These criteria specify that the leave-one-out cross-validated coefficient of determination q², by itself, is insufficient for evaluating the model and that external validation is necessary. The following criteria must be satisfied to validate the model: (1) high q² and R²pred (coefficient of determination calculated from a test set) values must be obtained; (2) one of the coefficients of determination for the regressions of a test set through the origin (either predicted vs. observed values, R₀²pred, or observed vs. predicted values, R′₀²pred) should be close to R²pred; (3) the slope of a regression line of the predicted vs. observed (k) or observed vs. predicted (k′) values of a test set through the origin should be close to 1. These are described in greater detail below and an example to explain condition 3 is shown in Fig. 12.
Fig. 12 An example of regression between observed vs. predicted (a) and predicted vs. observed (b) activities for compounds from an external test set. Despite the high R²pred value and both k and k′ close to 1, the model is not highly predictive, because the regressions through the origin of the coordinate system are not close to the optimal regressions. Note that R₀²pred and R′₀²pred are substantially different from each other. Adapted with permission from ref. 51. Copyright 2002 ELSEVIER.
1. Coefficient of determination for a test set R²pred > 0.6.
2. Leave-one-out cross-validated coefficient of determination q² > 0.5.
3. (R²pred − R₀²pred)/R²pred or (R²pred − R′₀²pred)/R²pred < 0.1 and 0.85 < k or k′ < 1.15.
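The three conditions can be checked programmatically. The sketch below is one possible reading of the criteria and should be compared against ref. 51 before use: it assumes that R²pred is the squared Pearson correlation between observed and predicted test-set values, that k is the slope of the through-origin fit of observed on predicted values (and k′ the reverse), and that each through-origin R₀² is computed as 1 − SSE/SST for the corresponding fit. The function name and the returned dictionary keys are hypothetical.

```python
import numpy as np


def golbraikh_tropsha(observed, predicted, q2):
    """Evaluate the three conditions listed above for an external test set."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)

    r2 = np.corrcoef(obs, pred)[0, 1] ** 2
    k = (obs @ pred) / (pred @ pred)        # obs ~ k * pred through the origin
    k_prime = (obs @ pred) / (obs @ obs)    # pred ~ k' * obs through the origin
    r0_sq = 1.0 - np.sum((obs - k * pred) ** 2) / np.sum((obs - obs.mean()) ** 2)
    r0_prime_sq = 1.0 - np.sum((pred - k_prime * obs) ** 2) / np.sum(
        (pred - pred.mean()) ** 2
    )

    cond1 = r2 > 0.6
    cond2 = q2 > 0.5
    cond3 = ((r2 - r0_sq) / r2 < 0.1 or (r2 - r0_prime_sq) / r2 < 0.1) and (
        0.85 < k < 1.15 or 0.85 < k_prime < 1.15
    )
    return {
        "r2_pred": r2, "k": k, "k_prime": k_prime,
        "r0_sq": r0_sq, "r0_prime_sq": r0_prime_sq,
        "passed": bool(cond1 and cond2 and cond3),
    }
```

A model whose predictions are perfectly correlated with the observations but shifted by a constant offset fails condition 3, which is the situation the through-origin regressions are meant to detect.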
Our studies employed the above criteria to evaluate the regression models and test sets for the evaluations were selected based on PCA (principal component analysis) so that the test samples cover the entire descriptor space.43,46
We also employed k-fold cross-validation (k = 4 or 5 in our previous analysis) and y-randomization for the evaluation. In the case of the MFA in NHC–Cu catalysis (section 3.2), the regression models used for the design showed q² > 0.5 for 18 training samples and R², q², Q² > 0.5 and R²yrandom < 0.1 for 30 training samples. In the case of the MFA in Ir/B dual catalysis (section 3.3), the regression models showed R², q², Q² > 0.6 and R²yrandom < 0.2. Thus, at this stage, R², q², Q² > 0.5 and R²yrandom < 0.2 seem to be useful criteria for evaluating the MFA-based regression models, although further accumulation and discussion of examples are required regarding which criteria should be used to evaluate regression models in the MFA framework.
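Leave-one-out cross-validation and y-randomization can be sketched as follows. This is an illustration only: a NumPy ridge regression is used as a simple stand-in for the LASSO/Elastic Net models of the original work, q² is taken as 1 − PRESS/SST, and the y-randomization score is the mean training R² over models refitted to shuffled target values. All function names and default parameters are assumptions made for the example.

```python
import numpy as np


def ridge_fit_predict(X_train, y_train, X_test, alpha=1.0):
    """Fit ridge regression with an intercept and predict X_test."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + alpha * np.eye(X_train.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return (X_test - x_mean) @ w + y_mean


def q2_loo(X, y, alpha=1.0):
    """Leave-one-out cross-validated q2 = 1 - PRESS / SST."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i  # hold out sample i
        preds[i] = ridge_fit_predict(X[mask], y[mask], X[i:i + 1], alpha)[0]
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)


def r2_y_random(X, y, n_rounds=50, alpha=1.0, seed=0):
    """Mean training R2 of models fitted to randomly permuted target values."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_rounds):
        y_perm = rng.permutation(y)
        fitted = ridge_fit_predict(X, y_perm, X, alpha)
        sse = np.sum((y_perm - fitted) ** 2)
        scores.append(1.0 - sse / np.sum((y_perm - y_perm.mean()) ** 2))
    return float(np.mean(scores))
```

A model that captures a real structure–selectivity relationship should give a q² well above the y-randomization score; when the two are comparable, the apparent fit is likely due to chance correlation.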
We designed the molecules based on mechanistic insights obtained from the structural information visualized by MFA, meaning that we utilize the researchers’ intuition as well. This MFA framework uses the researchers’ intuition not only for the molecular design but also for the calculations of the descriptors/molecular fields. In all the cases in which molecules showing improved selectivity were successfully designed, the molecular structures around the reaction centre were used for the calculation of the molecular fields. We explain the details regarding this point using the MFA described in section 3.1. In the MFA of section 3.1, molecular fields were calculated from the structure around the reaction centre as shown in Fig. 4b(III). The structural information extracted by the MFA is shown in Fig. 4c. The same intermediate structure shown in Fig. 4c is again shown in Fig. 13 along with the information visualized by the MFA that employed the molecular field calculated from the whole Pd-enolate structures. The important structural information was observed far from the reactive site as marked by red arrows, which is not in accordance with our intuition. Moreover, in contrast to the result of the MFA shown in Fig. 4c, it is difficult to understand the asymmetric induction mechanism based on this structural information. Thus, dimension reduction of descriptors/molecular fields based on researchers’ intuition is required to extract meaningful information for mechanistic interpretation and molecular design.
Fig. 13 A result of the MFA using the whole structures of the Pd-enolate complexes for the calculations of molecular fields.
The first key point is the reduction of conformational flexibility. The Pd-enolate structures shown in Fig. 4b(I) are composed of BINAP–Pd catalysts and β-ketoesters. Their structures themselves have conformational flexibility to some degree. For example, the ester moiety of the β-ketoesters can be freely rotated. The complexation of catalysts and substrates reduces this conformational flexibility. Steric interactions with the Ar-group of BINAP derivatives hinder the rotation of the ester moiety on the substrates. This facilitates the determination of conformers that could be employed for the calculations of molecular fields.
The second point is alignment. Alignment of the molecules is required for the calculations of molecular fields as shown in Fig. 4b. MFA in medicinal chemistry is a ligand-based drug design and thus protein structures are not considered explicitly. Which parts of molecular structures are used as the standard for the alignment is one of the biggest problems in evaluating biological activities using MFA. On the other hand, in the MFA of asymmetric catalysis, intermediate and transition-state structures usually involve reactive sites. Thus, molecules can be easily aligned based on the reactive sites. Even when the reactive sites are flexible and are not suitable for the standard of alignment, the molecular structures can be aligned based on the chiral catalyst skeleton. MFA using a set of molecular structures aligned based on the reactive sites or chiral catalyst skeleton allows for the comparison of subtle structural differences that are important for selectivity outcomes and are difficult to capture only by researchers’ intuition (vide infra).
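Alignment on a common reactive site can be sketched with a rigid-body superposition. The source does not state which algorithm was used, so the Kabsch algorithm below is an assumed choice; the function names are hypothetical, and the atoms listed in `site_idx` are assumed to appear in the same order in both structures.

```python
import numpy as np


def kabsch_rotation(mobile, reference):
    """Proper rotation R minimizing RMSD so that R @ mobile_i ~ reference_i.

    Both inputs are (n, 3) arrays that have already been centred.
    """
    h = mobile.T @ reference
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def align_on_site(coords, ref_coords, site_idx):
    """Superimpose a whole structure using only the reactive-site atoms."""
    coords = np.asarray(coords, dtype=float)
    ref_coords = np.asarray(ref_coords, dtype=float)
    site, ref_site = coords[site_idx], ref_coords[site_idx]
    c_mob, c_ref = site.mean(axis=0), ref_site.mean(axis=0)
    rot = kabsch_rotation(site - c_mob, ref_site - c_ref)
    # Apply the site-derived transform to every atom of the structure
    return (coords - c_mob) @ rot.T + c_ref
```

Because the transform is fitted only on the shared atoms, differences in the rest of the structure, such as displaced aryl groups, are preserved and can be picked up by the molecular field.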
The third point is the structural change induced by interactions between catalysts and substrates. Most of the structural information used for the molecular design shown in section 3 is derived from the structural change. We explain the details about this point using Fig. 14. In Fig. 14, examples of the template molecules for the molecular design and the molecular structures that are the origins of the structural information used for the molecular design in the three MFAs described in section 3 are shown (origins of structural information means that the information disappears when removing the molecules from training samples).
In the case of the Pd-catalysed asymmetric fluorination reactions, the blue point used for the catalyst design is derived from the Pd-enolate structure bearing a tBu substituent on the β-ketoesters (e.g., 2PdtBu shown in Fig. 14a). Due to steric repulsion between the tBu group and the Ar group on the BINAP derivatives, the Ar group on the ligand in the Si face gets closer to the reactive site as shown in Fig. 14a (i.e., the pocket on the Si face explained in section 3.1 becomes narrow). On the other hand, the Pd-enolate structure bearing an iPr group instead of the tBu group does not overlap with the blue point. Thus, we can design the molecule based on Pd-enolate by introducing the substituents to overlap the blue point as shown in Fig. 5a.
In the case of the NHC–Cu-catalysed asymmetric carbonyl addition reactions, the blue point used for the catalyst design is derived from the transition-state structure bearing an iPr-substituent on the NHC ligand (e.g., 3CuS1 shown in Fig. 14b). Due to steric repulsion between the iPr group and the silyl substituent, the phenylene group on the ligand shows a positional change, thereby inducing a steric clash with the substrate in the transition state of the minor pathway (Fig. 14c). On the other hand, the transition-state structures in the major pathway do not show such interactions between the NHC ligands and the substrate as shown in Fig. 14c. The visualized structural information provides this mechanistic insight. We can design molecules by introducing the substituents into the template molecules to overlap the blue point as shown in Fig. 7.
In the case of the Ir-catalysed reactions, the blue point used for the catalyst design to improve regioselectivity is derived from the Ir–π-allyl intermediate structures of 5IrPr bearing a fluorene moiety (Fig. 14d). Due to steric repulsion between the binaphthyl skeleton and the fluorene moiety, the binaphthyl skeleton gets closer to the terminal allyl carbon, hindering the reaction that affords the undesired linear products. Thus, we can design molecules based on the Ir–π-allyl intermediate structure 1IrPr by introducing the substituents to overlap the blue point as shown in Fig. 9c and 14d.
The research regarding the Type I MFA-based data-driven catalyst design enabling the improvement of reaction outcomes is just starting and there are many issues that should be tackled. Some of them are introduced below as outlook.
The molecular fields used for the molecular design so far are the steric indicator fields. It should be possible to extract further information by using, for example, molecular fields representing electronic effects such as hydrogen bonding interactions. It should also be interesting to evaluate weak attractive non-covalent interactions by MFA using the steric indicator fields described in this review article. The weak non-covalent interactions such as London dispersion effects have been recently recognized as important enantioselectivity-controlling factors in asymmetric catalysis.52 The Sigman group demonstrated that interatomic distances between probe molecules (benzene) and substrates can be used as descriptors that represent CH–π and π–π interactions in asymmetric catalysis as shown in Fig. 15 (Dπ is the distance between probe molecules and substrates).53 The indicator fields include positional information (3D coordinate), meaning the MFA using the indicator fields can consider interatomic distances. Therefore, it should be worth examining whether or not our MFA framework enables the analysis of asymmetric catalysis in which non-covalent weak attractive interactions significantly affect enantioselectivity.
Another important future task is the MFA in molecular catalysis using reaction rates (e.g., TOF [turnover frequency]) as target variables. As described in section 2.1, target variables for the regression analysis in asymmetric catalysis are the logarithms of enantiomeric ratios, which correspond to free energy differences in the pathways that lead to each isomer (Curtin–Hammett principle11). Therefore, the target variables in asymmetric catalysis are physically meaningful and high-quality values. Moreover, enantioselectivity values can be collected by single-point measurements using HPLC or GC. Thus, regression analysis in asymmetric catalysis has recently been actively investigated.1 In contrast, MFAs using reaction rates, which are important target variables for evaluating molecular catalysis, are still scarce, probably because of the difficulty of collecting training samples. To measure reaction rates such as TOF, reactions must be monitored periodically, which is time-consuming. Moreover, catalytic reactions are typically composed of a combination of elementary reactions, such as oxidative addition and reductive elimination, while only one step (i.e., an enantio-determining step) is usually considered for the analysis in asymmetric catalysis. Although there are examples of the use of TOF/reaction rates as target variables for regression analysis in molecular catalysis,2,54 enhancing reaction rates by MFA-based data-driven catalyst design should also be tackled.
The MFA using intermediate or transition-state structures is a useful analytical technique that provides highly interpretable information on reactions, leading to the design of molecules showing improved selectivity. Analytical methods that enable the investigation of the details of molecular structures/properties (e.g., NMR and single crystal X-ray diffraction analysis) accelerate molecular science research. We have successfully controlled a complicated organic reaction, stereodivergent asymmetric synthesis, through MFA-based data-driven catalyst design as described in this review article. We expect that further trials to control challenging/complicated organic reactions by the MFA will open new avenues in the field of molecular catalysis/organic synthesis.
This journal is © The Royal Society of Chemistry 2022