Open Access Article
Baochen Li‡ a, Yuru Liu‡ a, Haibin Sun a, Rentao Zhang a, Yongli Xie a, Klement Foo b, Frankie S. Mak b, Ruimao Zhang c, Tianshu Yu c, Sen Lin a, Peng Wang a and Xiaoxue Wang *a
a ChemLex Technology Co., Ltd, 1976 Gaoke Mid. Rd, Shanghai, China 201210. E-mail: wxx@chemlex.tech
b Experimental Drug Development Centre (EDDC), Agency for Science, Technology and Research (A*STAR), 10 Biopolis Road, #05-01, Chromos, 138670, Singapore
c School of Data Science, The Chinese University of Hong Kong, Shenzhen (CUHK-SZ), 2001 Longxiang Boulevard, Shenzhen, China 518172
First published on 3rd September 2024
As a fundamental problem in organic chemistry, synthesis planning aims to design energy- and cost-efficient reaction pathways for target compounds. In synthesis planning, it is crucial to understand regioselectivity, i.e. the preference of a reaction for one site over competing reaction sites. Precisely predicting regioselectivity enables early exclusion of unproductive reactions and paves the way to designing high-yielding synthetic routes with minimal separation and material costs. However, combining chemical knowledge with data-driven methods to make practical regioselectivity predictions is still at an emerging stage. At the same time, metal-catalyzed cross-coupling reactions have profoundly transformed medicinal chemistry and have thus become one of the most frequently encountered reaction types in synthesis planning. In this work, we introduce, for the first time, a chemical-knowledge-informed message passing neural network (MPNN) framework that directly identifies the intrinsic major products of metal-catalyzed cross-coupling reactions with regioselective ambiguity. Integrating first-principles and data-driven methods, our model achieves an overall accuracy of 96.51% on a test set covering eight typical metal-catalyzed cross-coupling reaction types (Suzuki–Miyaura, Stille, Sonogashira, Buchwald–Hartwig, Hiyama, Kumada, Negishi, and Heck reactions), outperforming other commonly used model types. To integrate electronic effects with steric effects in regioselectivity prediction, we propose a quantitative method to measure steric hindrance. Our steric hindrance checker can successfully identify regioselectivity induced solely by steric hindrance. Notably, under practical scenarios, our model outperforms six experimental organic chemists, with an average of over 10 years of working experience in the organic synthesis industry, at predicting major products in regioselective cases. 
We have also exemplified the practical usage of our model by fixing routes designed by open-access synthesis planning software and improving reactions by identifying low-cost starting materials. To assist general chemists in making prompt decisions about regioselectivity, we have developed a free web-based AI-empowered tool. Our code and web tool have been made available at https://github.com/Chemlex-AI/regioselectivity and https://ai.tools.chemlex.com/region-choose, respectively.
Fig. 1 Regioselectivity in metal-catalyzed cross-coupling reactions. (a) An example of regioselectivity challenge in a Suzuki–Miyaura reaction. In the heteroaryl halide, there are two competing coupling sites, marked by blue and red circles. As reported by Saxty et al.,1 the blue circle denotes the major coupling site with the corresponding major product (with a yield of 82%) highlighted in the blue rectangle; (b) the general mechanism for Metal-Catalyzed Cross-Coupling Reactions (MCCCRs); (c) an overview of this work. The dataset comprising 9734 MCCCRs with regioselectivity ambiguity is licensed from Pistachio2 and CAS Content Collection.3 Calculated descriptors, together with atom and bond features are passed through Regio-MPNN to get the predicted probability for each candidate product. A steric hindrance checker guarantees that the predicted major reaction site is within the “safe” steric hindrance range. Black circles stand for carbon atoms, blue circles stand for nitrogen atoms, green circles stand for chlorine atoms, brown circles stand for bromine atoms, and filled circles stand for neighboring atoms considered at each message-passing step.
Concurrently, the recent integration of artificial intelligence (AI) into computer-aided synthesis planning (CASP) has profoundly revolutionized early-stage drug discovery and preclinical manufacturing process development.5–13 Not only useful for saving opportunity and tangible costs for bench scientists, fast and accurate computational methods to determine regioselectivity are also critical for designing green and efficient synthetic routes in CASP.14–16 Aligning with such unmet needs, computational methods to predict regioselectivity for various organic chemistry reactions have been long pursued. However, despite relentless endeavors over decades, developing high-performing predictive computational tools for regioselectivity remains a significant challenge. In the 1990s, with limited computational power, scientists focused on using feature engineering to learn the correlation between molecular descriptors and regioselectivity.17 Oslob and co-workers18 used the Quantitative Structure–Activity Relationship (QSAR) to predict the regioselectivity for palladium-catalyzed allylations. They calculated the energy for steric probing, the bond length from palladium to the reacting center carbon, and two dihedral angles, to fit the experimental selectivity data. Nonetheless, to ensure the performance, data points with relatively high activation energy were removed from their dataset, limiting their method's application and imposing unresolved bias. Banerjee and co-workers19 used various Machine Learning (ML) tools to analyze the outcome of catalytic regioselective difluorination reactions of alkenes and decipher the complex interplay of various molecular parameters and their non-linear dependencies. In their work, they used density functional theory (DFT) to compute molecular parameters for 66 alkenes and further discovered the dependencies between these molecular parameters and regiochemical outcomes. 
However, computing these parameters via DFT is relatively time-consuming, underscoring the need for more efficient approaches. To enhance computational efficiency, Hong et al.20 used ML models trained on DFT results to predict the descriptors of interest. They combined 32 key physical organic descriptors and developed a regression algorithm to predict regioselectivity for radical C–H functionalization of heterocycles. Their work achieved rapid and reliable prediction of descriptors through ML models. However, featurizing reactants solely with physical organic descriptors may miss critical structural information about the molecules. Ree and co-workers21 calculated Charge Model 5 (CM5)22 atomic charges and used them to predict the regioselectivity of electrophilic aromatic substitution reactions with a light gradient boosting machine. Similarly, Tomberg et al.23 focused on electrophilic aromatic substitution reactions. They computed the Fukui coefficient, partial charge, bond orders, and atomic solvent-accessible surface for each aromatic carbon and found that random forest models achieved the best performance for classifying carbons as active or inactive in electrophilic aromatic substitution. Guan et al.17 introduced a method that combines a graph-based molecular representation with selected computed quantum mechanical descriptors to predict regioselectivity in substitution reactions. They subsequently focused on the regioselectivity problem in nucleophilic aromatic substitution (SNAr) reactions and attached a checker at the end of their model to distinguish possible products with similar predictive scores.24 The works of Ree, Tomberg, and Guan all treated electronic properties as the major factors governing regioselectivity in the reactions they studied and did not explicitly take steric effects into account. 
Currently, there is no predictive model available that combines both electronic and steric effects with data-driven deep learning methods.
On the other hand, medicinal chemistry has been profoundly reshaped by metal-catalyzed cross-coupling reactions (MCCCRs) due to their impressive ability to forge carbon–carbon/carbon–heteroatom bonds between diverse chemical moieties, enabling the creation of compounds that are otherwise challenging to obtain.25 Palladium-catalyzed cross-coupling reactions, such as the Suzuki–Miyaura, Buchwald–Hartwig, Heck, and Stille reactions, have stood among the most popular reaction types in medicinal chemistry.25 As shown in Fig. 1b, the mechanism of MCCCRs generally involves:26 (1) oxidative addition converting an organic halide (RX) to L(n)MR(X) in the presence of catalyst L(n)M (L = spectator ligand). (2) Transmetallation converting L(n)MR(X) to L(n)MR(R′), where the source of R′(−) varies in different metal coupling reactions. For Suzuki, R′(−) comes from a boronic acid or the corresponding ester. For Buchwald, Heck, Sonogashira, Negishi, Stille, Hiyama, and Kumada reactions, the sources are amines, alkenes, alkynes, organozinc, -tin, and -silicon reagents and Grignard reagents, respectively. (3) Reductive elimination of L(n)MR(R′) to regenerate catalyst L(n)M and to release the resulting product R–R′. Among the three steps, oxidative addition is generally considered the rate-limiting step for most MCCCRs.27 The extent of this step is heavily influenced by the leaving group ability of X and the steric hindrance around the C–X bond.26 Therefore, it is possible to use quantitative descriptors, e.g. partial charge, Fukui index, and volumetric measures from conformational analysis, to describe the properties of the rate-limiting step, and further characterize the different kinetics between competing reaction sites where regioselectivity is considered. Additionally, regioselectivity-related experimental results have been widely reported in the literature, enabling data-driven methods to mine statistical rules behind the literature data. 
A computational model that accurately predicts regioselectivity is thus made possible by combining theoretical analysis with data-driven machine learning.
In this work, we propose Regio-MPNN, a Message Passing Neural Network (MPNN) backbone combined with chemical descriptors, to directly predict the intrinsic major products of MCCCRs with regioselectivity risks, as shown in Fig. 1c. We use atomic charges, Fukui indices, Nuclear Magnetic Resonance (NMR) chemical shifts, and bond lengths computed by DFT to train an MPNN model based on the graph representation of a molecule, and examine the steric hindrance effect on possible reactive sites through a steric hindrance checker. It is worth noting that, in practice, regioselectivity can also be affected by reaction conditions.28–30 However, the literature contains few records relating regioselectivity to reaction conditions, making it difficult to capture this effect using data-driven methods. Therefore, in this work, we focus only on the intrinsic properties of the reactants and ignore differences caused by reaction conditions. Our model exhibits an overall accuracy of 96.51% on a test set comprising eight types of metal-catalyzed cross-coupling reactions, demonstrating its suitability for practical use. Our model architecture outperforms other frequently used model types, including an Extended-Connectivity Fingerprint (ECFP)31 based multilayer perceptron (MLP), a descriptor based MLP, a sequence based model + MLP, and ml-QM-GNN,17 in prediction accuracy and robustness. We have also demonstrated that our steric hindrance checker is able to detect regioselectivity induced solely by steric hindrance, regardless of electronic effects. Furthermore, we invited six experimental organic chemists with an average of over 10 years of working experience in industry to compete with our model on regioselectivity tests. In this test, we randomly picked 100 reactions from the test set and collected the predictions of major products from the chemists and our model. 
Our model significantly outperformed the human chemists, demonstrating the advantage of combining machine learning methods with first-principles results. Finally, we show practical use cases of our model in synthesis planning and in saving material costs by finding more accessible starting materials. Our Regio-MPNN model is designed to predict the regioselectivity of a given MCCCR; however, it does not guarantee the feasibility of the MCCCR. Therefore, if the feasibility of the MCCCR is a concern, a separate reaction yield prediction model should be used. It is worth noting that reaction yield prediction can be considerably challenging and is beyond the scope of this work.32–34
The data cleaning process is as follows, taking Suzuki–Miyaura cross-coupling reactions as an example. First, we removed duplicate reactions. Then we filtered out reactions with yields below 40% to ensure that the product in the dataset indeed corresponds to the preferred reacting site. Next, we used SMARTS to match each reactant and classify reactants into two types: organohalides or organoborons.35 The SMARTS of the organohalide reactant is defined as “[F,Cl,Br,I,$(OS(=O)(=O)C(F)(F)F)][#6]”, which covers aryl halides and allyl halides. Here, the “generalized” halides are not limited to halogens but also include halogen-like functional groups, e.g. trifluoromethanesulfonate (OTf). The SMARTS of the organoboron reactant is defined as “B(O)O”, which matches a boronic acid or boronate ester. To be identified as a Suzuki reaction, a reaction must have at least one organohalide and one boronic substance. Another filter was then applied to keep only organohalides with two or more halogen or halogen-like leaving groups. Finally, we used the rdkit.Chem.rdChemReactions module to enumerate all possible products.36 The SMIRKS we used for the Suzuki reactions is “B(O)(O)([#6:1]).[Cl,Br,I,$(OS(=O)(=O)C(F)(F)F)][#6:2]>>[#6:1]-[#6:2]”. Among the products obtained by applying the reaction SMIRKS, the product that matches the one recorded in the reaction dataset is labeled 1 (the ground truth), and all other possible products are labeled 0. MCCCR types other than Suzuki reactions were treated similarly, each with its own chemical definitions. The corresponding SMIRKS for each reaction type can be found in our GitHub repository.37
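The cleaning and labeling steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the `Reaction` record, its field names, and the `min_sites` parameter (set to two here as an assumption) are hypothetical, and the candidate products are assumed to have been enumerated beforehand (for example by applying the reaction SMIRKS with RDKit, which is not done here).

```python
from dataclasses import dataclass

MIN_YIELD = 40.0  # percent; the yield threshold stated in the text


@dataclass(frozen=True)
class Reaction:
    """Hypothetical record; field names are illustrative, not the authors' schema."""
    reactants: tuple          # canonical SMILES of the reactants
    product: str              # canonical SMILES of the recorded product
    yield_percent: float
    n_leaving_groups: int     # halogen / OTf sites on the organohalide
    candidates: tuple = ()    # candidate products enumerated elsewhere


def clean(reactions, min_sites=2):
    """Drop duplicates, then low-yield records, then reactants without competing sites."""
    seen, kept = set(), []
    for rxn in reactions:
        key = (frozenset(rxn.reactants), rxn.product)
        if key in seen:
            continue  # duplicate reaction
        seen.add(key)
        if rxn.yield_percent < MIN_YIELD:
            continue  # recorded product may not be the preferred site
        if rxn.n_leaving_groups < min_sites:
            continue  # assumption: no regioselectivity question below min_sites
        kept.append(rxn)
    return kept


def label_candidates(rxn):
    """Label the recorded product 1 and every other enumerated candidate 0."""
    return [(cand, int(cand == rxn.product)) for cand in rxn.candidates]
```

Each surviving reaction then contributes one positive example and one or more negative examples, which is the form of supervision the text describes.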
A molecule is represented as a graph $G = (A, B)$ with $A \in \mathbb{R}^{l \times a}$ and $B \in \mathbb{R}^{l \times l \times b}$, where $A$ is the matrix of atom features and $B$ is the adjacency tensor of bond features. Here, $l$ is the number of atoms in the molecule, $a$ is the dimension of atom features, and $b$ is the dimension of bond features. Such a graph $G$ serves as the input for the MPNN framework. In an MPNN layer, there are two fundamental steps for a forward pass: a message-passing step and an update step. The number of stacked MPNN layers typically depends on the number of connected bonds to be considered around the central atom.44 Besides the conventional graph pairwise sum aggregator used in the message-passing step, other graph aggregators, such as the graph pooling aggregator and the gated attention aggregator,47 are also implemented to assess the impact of different graph aggregators on model performance.
A message-passing step with the pairwise sum aggregator reads,

[eqn (1)]

where $m_v^{t+1}$ denotes the new message of atom $v$ at the next time step $t + 1$, with dimension $l_m$; $N(v)$ denotes the neighboring atom indices of $v$; $h_v^t$ denotes the hidden state for atom $v$ at time step $t$, with dimension $l_h$; $h_n^t$ denotes the corresponding atom features for atom $n$ at time step $t$, where $n \in N(v)$; the initial hidden states $h_v^0$ and $h_n^0$ are vectors of length $a$ that denote the $v$th and $n$th elements of the atom feature matrix $A$, respectively; $B_{(v,n)}$ is a vector of length $b$ that denotes the $(v, n)$ element of the adjacency tensor $B$; $\|$ denotes vector concatenation; $M$ is the message neural network, a mapping $M(x_M) = \mathrm{ReLU}(W_M \cdot x_M + b_M)$, where ReLU is the rectified linear unit activation function and $W_M$ and $b_M$ are the weights and bias of $M$, respectively.
A message-passing step with the graph pooling aggregator reads,

$$m_v^{t+1} = M\left(h_v^t \,\|\, \mathrm{pool}_{n \in N(v)}\left(P\left(h_n^t \,\|\, B_{(v,n)}\right)\right)\right) \tag{2}$$

where $P$ is a neural network with weights $W_P$ and bias $b_P$; $\mathrm{pool}_{n \in N(v)}$ is the pool operator, which can be average pooling or max pooling over all neighbouring atoms; $M$ is the message neural network, a mapping $M(x_M) = \mathrm{ReLU}(W_M \cdot x_M + b_M)$, and $W_M$ and $b_M$ are the weights and bias of $M$, respectively.
A message-passing step with the gated attention aggregator reads,

[eqn (3)]

where $H$ is the number of heads in the attention mechanism, used to capture features from different representation subspaces; $\sigma$ is the sigmoid function; $G$ and $F$ are single fully-connected layers; $Q_k$, $K_k$, and $V_k$ are linear maps for computing the query, key, and value vectors of head $k$, respectively; $\cdot$ represents the dot product between two vectors; $M$ is the message neural network, a mapping $M(x_M) = \mathrm{ReLU}(W_M \cdot x_M + b_M)$, and $W_M$ and $b_M$ are the weights and bias of $M$, respectively.
In the update step,

$$h_v^{t+1} = U\left(h_v^t \,\|\, m_v^{t+1}\right) \tag{4}$$

where $W_U$ and $b_U$ are the weights and bias of the update neural network $U$, respectively.44
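To make the message-passing and update steps concrete, the sketch below implements eqn (2) (graph pooling aggregator) followed by eqn (4) in plain Python on a toy three-atom chain. It is only an illustration under stated assumptions: the layer sizes, the random initialisation, the toy bond features, and the use of ReLU inside `P` and `U` are choices made here (the text specifies ReLU only for `M`), and every atom is assumed to have at least one neighbour.

```python
import random


def relu(vec):
    return [max(0.0, x) for x in vec]


def linear(layer, vec):
    """Compute W.x + b for layer = (weights, bias), weights stored row by row."""
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]


def init_layer(n_out, n_in, rng):
    """Random weights and zero bias; an illustrative initialisation only."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def pool(vectors, mode="max"):
    """Element-wise max or average pooling over equal-length vectors."""
    columns = list(zip(*vectors))
    if mode == "max":
        return [max(col) for col in columns]
    return [sum(col) / len(col) for col in columns]


def mpnn_step(hidden, bonds, neighbours, P, M, U, mode="max"):
    """One message-passing step (eqn (2)) followed by one update step (eqn (4))."""
    new_hidden = {}
    for v, h_v in hidden.items():
        # P acts on each neighbour state concatenated with the bond features B(v,n).
        transformed = [relu(linear(P, hidden[n] + bonds[(v, n)])) for n in neighbours[v]]
        message = relu(linear(M, h_v + pool(transformed, mode)))  # m_v^{t+1}
        new_hidden[v] = relu(linear(U, h_v + message))            # h_v^{t+1}
    return new_hidden


# Toy chain 0-1-2 with hypothetical sizes: hidden 4, bond 2, P output 3, message 5.
rng = random.Random(0)
L_H, L_B, L_P, L_M = 4, 2, 3, 5
P = init_layer(L_P, L_H + L_B, rng)
M = init_layer(L_M, L_H + L_P, rng)
U = init_layer(L_H, L_H + L_M, rng)
neighbours = {0: [1], 1: [0, 2], 2: [1]}
bonds = {(v, n): [1.0, 0.0] for v in neighbours for n in neighbours[v]}
h0 = {v: [rng.random() for _ in range(L_H)] for v in neighbours}
h1 = mpnn_step(h0, bonds, neighbours, P, M, U, mode="max")
```

Calling `mpnn_step` repeatedly on its own output corresponds to stacking layers, so after T calls each atom's state reflects neighbours up to T bonds away.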
An overall pipeline of our model with the gated attention aggregator is shown in Fig. 2. First, input reactant pairs were fed into the DFT computation module or the finetuned qmdesc model (Fqmdesc)17 to obtain atom-wise descriptors (Hirshfeld partial charge, nucleophilicity, and electrophilicity) and bond-wise descriptors (bond length). At the same time, atom and bond features of the input reactant pairs were extracted using RDKit;36 the extracted features are described in ESI Table S2.† Atom features were concatenated with atom descriptors, and bond features were concatenated with bond descriptors. The concatenated atom and bond features were then fed into the MPNN module, with different aggregation strategies, as the inputs A and B, respectively, to perform representation learning for atoms. The representations of reaction center atoms were recorded at each step, and these representations were max-pooled across steps. The final atom representation was obtained through a single fully-connected layer that took the pooled representations and the corresponding atom descriptors as input. The rxnfp48 of the reaction of interest was also calculated to incorporate reaction details into our model. Unlike Guan et al.'s previous work on aromatic substitution reaction regioselectivity,17 which applies global attention between the atom representation and the reactant molecules to capture the possible impact of atoms beyond the immediate neighbors considered in message passing, we used a multi-head attention layer between the atom representation and the reaction rxnfp to capture the various possible relationships between the reaction center atoms and the entire reaction. The final atom representation, together with the attention vector, was used to compute the probability of being the main product for each candidate product. This probability was estimated as follows. We first calculated a score for each candidate product using eqn (5),
[eqn (5)]

where the left-hand side is the resulting score; $h_c^t$ is the hidden state of atom $c$ learned by the MPNN at time step $t$, and $c$ belongs to the reaction center atoms. Reaction center atoms are the atoms that undergo changes in their bonding pattern during the course of MCCCRs; for example, in a Suzuki–Miyaura reaction they are the carbon atom attached to the boronic acid or the corresponding ester group and the carbon atom attached to the reacting halogen or OTf group. The QM-computed or Fqmdesc-computed atom descriptors for atom $c$ have dimension $l_d$; $R$ is a neural network, a mapping $R(x_R) = W_R \cdot x_R$, with weights $W_R$; $Q_h$, $K_h$, and $V_h$ are linear maps for computing the query, key, and value vectors of head $h$, respectively; $l_r$ is the length of the reaction rxnfp; $D$ is the number of heads in the attention mechanism; and $S$ is a neural network, a mapping with weights $W_S$ and bias $b_S$. After computing the scores for each candidate product, the probabilities of being the main product for each candidate product were calculated by feeding the concatenation of the scores through a softmax activation function.
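The final normalisation step can be illustrated with a short Python sketch. It shows only the softmax over candidate scores described above; the function names and the example scores are hypothetical, and the scores themselves are assumed to come from the scoring network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over the scores of all candidate products."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_major_product(candidates, scores):
    """Return the candidate with the highest probability, plus all probabilities."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidates[best], probs


# Hypothetical scores for two candidate coupling sites of one reactant.
major, probabilities = predict_major_product(["site_Br", "site_Cl"], [2.1, -0.4])
```

Because the softmax is taken over the candidates of a single reaction, the probabilities of competing sites always sum to one, so the model ranks sites against each other instead of judging each in isolation.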
Fig. 2 Architecture of Regio-MPNN. We calculate the atom and bond descriptors using density functional theory (DFT) or a finetuned qmdesc model (Fqmdesc),17 while the atom features and bond features are computed using RDKit. We combine the quantum mechanics (QM) descriptors with the features and feed the overall input into the MPNN network, which outputs a learned embedding of the reaction center atom. We then use this embedding together with the atom descriptors to predict the probabilities for candidate products. A steric hindrance checker (shown in Fig. 1c) is used after Regio-MPNN to filter out sterically hindered results.
To get a deeper understanding of our dataset, we plot the competing reaction sites with regioselectivity risks in the overall cross-coupling reaction dataset in Fig. 3c. The vertical axis denotes the ground-truth halogens or halogen-like groups corresponding to the major product, and the horizontal axis denotes other potential reactive sites present in the same reactant. For instance, the deepest blue section in the heatmap represents that the most commonly seen competing site pair in MCCCRs is Br vs. Cl (3599 data points), where the actual reactions occur at the bromine-substituted sites. The total number in the heatmap exceeds the total number of our datasets (9734) because there may be more than two competing sites in one molecule.
…:1:1. To ensure the fidelity of the estimation of our model's generalization ability, we incorporated a stratified sampling strategy.53 In other words, reactions in the test set do not contain any reactants that are seen in the training or validation sets. In addition, we made sure that each reaction type is distributed among the three sets according to their distribution in the overall dataset shown in Fig. 3b.
We implement different model architectures to predict the main product for MCCCRs with regioselectivity risks (Table 1). The random guess refers to randomly picking a product from the candidate products and treating it as the main product. Besides the aforementioned graph representation in the MPNN framework, other frequently used representations for chemical compounds or reactions, such as Extended-Connectivity Fingerprints (ECFPs),31 Simplified Molecular Input Line Entry System (SMILES) sequences,54 and chemical descriptors, are used for the same prediction task. Detailed implementation of each representation is given in the ESI “Details for implementation of other models” section.† The robustness of each model is evaluated by five-fold cross-validation in which the data splitting also follows the stratified sampling strategy.53 The accuracy of each model is its average accuracy over the five data splits, and its robustness is the maximum deviation between this average and the accuracy of any single run. Accuracy and robustness results for the different models are shown in Table 1. The overall accuracy on the test set using only the MPNN is 62.33%, while it increases to 95.83% when Fqmdesc-calculated descriptors are integrated into the MPNN, emphasizing the necessity of introducing chemical descriptors to the model. The similar accuracies of the MPNN + DFT and MPNN + Fqmdesc models (95.61% vs. 95.83%) indicate that the difference between using DFT- and Fqmdesc-calculated descriptors is negligible. Therefore, Fqmdesc can be an effective alternative to DFT calculation for inference, reducing the total inference time by ∼20 000 times. To put this into context, calculating descriptors for the test set takes 1.5 days with DFT vs. 6 seconds with Fqmdesc.

To compare the impact of aggregation functions in the message-passing step, we implement the aggregation functions mentioned in Section 2.3 in our Regio-MPNN model. This comparison is made using Fqmdesc-calculated atom descriptors. The performance of each aggregation function is shown in Table 1. Among these functions, the gated attention aggregator (eqn (3)) exhibits the highest accuracy and strongest robustness. To verify the effectiveness of the Regio-MPNN model, we also conduct an experiment in which the multi-head attention mechanism in the score calculation is removed. In this variant, the final atom representation is directly concatenated with the reaction's rxnfp and fed into the fully connected layer, resulting in a model with an accuracy of 96.32 ± 0.86%. Another experiment changes the max-pooling of reaction center atom representations after T steps to average-pooling, resulting in a model with an accuracy of 96.33 ± 0.96%. Additionally, we apply the state-of-the-art regioselectivity determination model, ml-QM-GNN,17 which is designed for substitution reactions, to the metal-catalyzed cross-coupling reactions. We find that Regio-MPNN, using the gated attention aggregator, outperforms ml-QM-GNN in this task.
| Model | Structure info | Descriptors used | Aggregator in message passing steps | Accuracy (%) |
|---|---|---|---|---|
| Random guess | ✗ | ✗ | ✗ | 41.74 ± 4.83 |
| ECFP based MLP | ✓ | ✗ | ✗ | 46.34 ± 2.56 |
| Descriptor based MLP | ✗ | DFT | ✗ | 60.52 ± 5.98 |
| Descriptor based MLP | ✗ | Fqmdesc | ✗ | 61.40 ± 1.93 |
| Sequence based model + MLP | ✗ | ✗ | ✗ | 62.53 ± 5.32 |
| Sequence based model + MLP | ✗ | DFT | ✗ | 71.65 ± 3.05 |
| Sequence based model + MLP | ✗ | Fqmdesc | ✗ | 73.51 ± 2.63 |
| Regio-MPNN | ✓ | ✗ | Pairwise sum aggregator | 62.33 ± 1.74 |
| Regio-MPNN | ✓ | DFT | Pairwise sum aggregator | 95.61 ± 0.97 |
| Regio-MPNN | ✓ | Fqmdesc | Pairwise sum aggregator | 95.83 ± 0.95 |
| Regio-MPNN | ✓ | Fqmdesc | Average pooling aggregator | 95.51 ± 0.97 |
| Regio-MPNN | ✓ | Fqmdesc | Max pooling aggregator | 94.68 ± 1.13 |
| Regio-MPNN | ✓ | Fqmdesc | Gated attention aggregator | 96.51 ± 0.87 |
| Regio-MPNN (w.o. multi-head attention) | ✓ | Fqmdesc | Gated attention aggregator | 96.32 ± 0.86 |
| Regio-MPNN (average-pooling after T steps) | ✓ | Fqmdesc | Gated attention aggregator | 96.33 ± 0.96 |
| ml-QM-GNN17 | ✓ | Fqmdesc | Pairwise sum aggregator | 96.13 ± 0.91 |
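Each "accuracy ± robustness" entry in the table pairs the mean accuracy over the five data splits with the largest deviation of any single run from that mean. A minimal Python sketch of this calculation follows; the function name and the fold accuracies are hypothetical, not values from the paper.

```python
def accuracy_and_robustness(fold_accuracies):
    """Mean accuracy over the folds and the maximum deviation of any fold from it."""
    mean = sum(fold_accuracies) / len(fold_accuracies)
    robustness = max(abs(acc - mean) for acc in fold_accuracies)
    return mean, robustness


# Hypothetical accuracies (%) from five cross-validation runs.
mean_acc, max_dev = accuracy_and_robustness([96.0, 97.0, 95.0, 96.5, 95.5])
print(f"{mean_acc:.2f} ± {max_dev:.2f}")
```

Note that this spread is a worst-case deviation, not a standard deviation, so it is more sensitive to a single unusually good or bad split.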
We also evaluated the accuracy of prediction among different cross-coupling reaction types in the test set. As shown in Fig. 4a, four reaction types with relatively fewer data points (as shown in Fig. 3b), Kumada, Hiyama, Negishi, and Heck, have relatively higher percentages of erroneous regioselectivity prediction. This could be rationalized by the lack of data in the data set for these reaction types (Fig. 3b). Even though we have combined all the relevant data from Pistachio and CAS Content Collection3 datasets, the extent of data imbalance is still significant. To mitigate this, data scientists typically use measures such as data augmentation,55 under-sampling,56 and weighted loss.57 However, data augmentation may not be a suitable solution for the task at hand, because regioselectivity may vary even with small structural changes, which means the generation of new data points from scratch or from existing data points for rare reaction types would not be feasible. Under-sampling the majority reaction type leads to a small training set and would significantly reduce the ability to generalize our model (detailed under-sampling experiments are shown in the ESI “Experiment on under-sampling training set” section†). Unlike classification tasks, the scaling strategy in the loss function cannot be applied to our model either. Thus, in the future, the best solution to the data imbalance is to intentionally acquire more wet lab data from high throughput experiments, as proposed by recent literature.58–60 However, this approach is out of the scope of this work.
The reaction type that challenged our prediction model the most was Kumada coupling reactions (Fig. 4a). Surprisingly, as shown in Fig. 3b, Kumada reactions are not the rarest reaction type in the dataset. In order to understand the uniqueness of Kumada reactions, we conduct a fine-grained analysis of the data distributions. We notice the obvious difference in competing site distribution between all reactions in the overall dataset and Kumada reactions as shown in Fig. 3c and d. To quantify this difference, the Kullback–Leibler (KL) divergence,61 Jensen–Shannon (JS) divergence,62 and Bhattacharyya coefficient63 are computed with eqn (6) and the difference of competing site pairs among different reaction types is shown in Table 2.
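$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, \qquad D_{\mathrm{JS}}(P \,\|\, Q) = \tfrac{1}{2} D_{\mathrm{KL}}(P \,\|\, M) + \tfrac{1}{2} D_{\mathrm{KL}}(Q \,\|\, M), \quad M = \tfrac{1}{2}(P + Q), \qquad BC(P, Q) = \sum_{x} \sqrt{P(x)\, Q(x)} \tag{6}$$

where $P$ and $Q$ are the two competing-site-pair distributions being compared and $x$ runs over competing site pairs.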
| Reaction type | KL divergence | JS divergence | Bhattacharyya coefficient |
|---|---|---|---|
| Buchwald | 0.065 | 0.027 | 0.941 |
| Heck | 0.127 | 0.052 | 0.930 |
| Hiyama | 0.114 | 0.054 | 0.936 |
| Kumada | 0.219 | 0.101 | 0.876 |
| Negishi | 0.068 | 0.033 | 0.945 |
| Sonogashira | 0.164 | 0.058 | 0.909 |
| Stille | 0.142 | 0.058 | 0.919 |
| Suzuki | 0.035 | 0.011 | 0.975 |
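The three measures in the table can be computed from competing-site-pair counts with a few lines of Python. The sketch below uses the standard textbook definitions with the natural logarithm; the direction of the KL divergence (reaction type against the overall dataset), the log base, and the example counts are assumptions made here for illustration.

```python
import math


def normalise(counts):
    """Turn raw pair counts into a probability distribution."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def kl_divergence(p, q):
    """D_KL(p || q); requires q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)


def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_coefficient(p, q):
    """Overlap between two distributions: 1 for identical, 0 for disjoint support."""
    return sum(math.sqrt(p.get(x, 0.0) * q.get(x, 0.0)) for x in set(p) | set(q))


# Hypothetical (preferred site, competing site) counts, not taken from the dataset.
overall = normalise({("Br", "Cl"): 60, ("I", "Br"): 25, ("Br", "OTf"): 15})
one_type = normalise({("Br", "Cl"): 30, ("I", "Br"): 20, ("Br", "OTf"): 50})
kl = kl_divergence(one_type, overall)
js = js_divergence(one_type, overall)
bc = bhattacharyya_coefficient(one_type, overall)
```

Larger KL and JS values and a smaller Bhattacharyya coefficient all indicate that a reaction type's competing-site pattern departs further from that of the dataset as a whole.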
Among the 8 reaction types, Kumada reactions exhibit the highest KL and JS divergence as well as the lowest Bhattacharyya coefficient compared to the entire dataset, which indicates that the competing site pairs in Kumada reactions are significantly different from those in the entire dataset. To better understand this difference, we calculate an approximation of the posterior probability of reaction type k given competing site pairs i, j using eqn (7), and this posterior probability for Kumada reactions is shown in Fig. 4b.
![]() | (7) |
Interestingly, the competition between Br and OTf, with Br as the preferred site, is a signature of Kumada reactions. This is unexpected, since the reactivities of bromine sites and OTf sites in MCCCRs are normally considered closely similar and hard to distinguish,64 which could contribute to the relatively low accuracy of regioselectivity prediction for Kumada reactions. Moreover, among all the reaction types we consider, Kumada reactions require strict control of experimental conditions owing to the high reactivity of Grignard reagents with water or with functional groups in other reactants,65 which also makes Kumada reactions unique among MCCCRs. Figures of competing site pair distribution and approximated posterior probability for other reactions are available in ESI Fig. S4–S10.†
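One simple way to approximate the posterior probability of a reaction type given a competing site pair is a relative-frequency estimate: the share of reactions with that pair that belong to each type. The Python sketch below shows this estimate only as an illustration; it is an assumed form, not necessarily the exact expression in eqn (7), and the records are hypothetical.

```python
from collections import Counter, defaultdict


def posterior_by_pair(records):
    """Estimate P(reaction type | site pair) from (type, (preferred, competing)) records."""
    counts = defaultdict(Counter)
    for rxn_type, pair in records:
        counts[pair][rxn_type] += 1
    return {
        pair: {k: n / sum(by_type.values()) for k, n in by_type.items()}
        for pair, by_type in counts.items()
    }


# Hypothetical records: (reaction type, (preferred site, competing site)).
records = (
    [("Kumada", ("Br", "OTf"))] * 3
    + [("Suzuki", ("Br", "OTf"))]
    + [("Suzuki", ("Br", "Cl"))] * 4
)
posterior = posterior_by_pair(records)
```

A pair whose posterior is concentrated on one reaction type, as Br vs. OTf is for Kumada in this toy data, acts as a signature of that type.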
Fig. 5 Case study of steric impact on a series of Suzuki–Miyaura reactions. Regioselectivity may vary due to the crowded chemical environment around the reacting halogen group. Our steric hindrance checker can successfully capture this change. The yield reported here is from the literature.66
| Test taker | Accuracy |
|---|---|
| Chemist_1 | 94% |
| Chemist_2 | 91% |
| Chemist_3 | 81% |
| Chemist_4 | 80% |
| Chemist_5 | 78% |
| Chemist_6 | 74% |
| Regio-MPNN | 100% |
Fig. 6 Examples of erroneous predictions made by at least half of the human chemists. The reactions are from the 100 randomly sampled reactions with regioselectivity issues from the test set. The ground-truths are shown as the product;67–71 the sites predicted by our model are marked by blue circles; the sites erroneously predicted by human chemists are marked by red circles.
Fig. 7 Applications of the Regio-MPNN model. (a) An example of a regioselectivity mistake made by an AI-driven CASP system. This Buchwald–Hartwig reaction is used in a route designed by a CASP system with open online access. However, the ground truth73 demonstrates that the blue-circled site is more active than the mistakenly chosen red-circled one; (b) a snapshot of the prediction result from our Regio-MPNN web tool for the Buchwald–Hartwig reaction in (a), showing that our model can fix the above regioselectivity mistake; (c) an example of how regioselectivity models can potentially save material costs by replacing expensive starting materials (upper reaction scheme)74 with economical ones (lower) without significantly affecting the yield.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00244j
‡ These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2024