 Open Access Article
Sooyeon Moon‡ ab, Sourav Chatterjee‡ a, Peter H. Seeberger ab and Kerry Gilmore§ *a
aDepartment of Biomolecular Systems, Max-Planck-Institute of Colloids and Interfaces, Am Mühlenberg 1, 14476 Potsdam, Germany. E-mail: kerry.m.gilmore@uconn.edu
      
bFreie Universität Berlin, Institute of Chemistry and Biochemistry, Arnimallee 22, 14195 Berlin, Germany
    
First published on 26th December 2020
Predicting the stereochemical outcome of chemical reactions is challenging in mechanistically ambiguous transformations. The stereoselectivity of glycosylation reactions is influenced by at least eleven factors across four chemical participants and temperature. A random forest algorithm was trained using a highly reproducible, concise dataset to accurately predict the stereoselective outcome of glycosylations. The steric and electronic contributions of all chemical reagents and solvents were quantified by quantum mechanical calculations. The trained model accurately predicts stereoselectivities for unseen nucleophiles, electrophiles, acid catalysts, and solvents across a wide temperature range (overall root mean square error 6.8%). All predictions were validated experimentally on a standardized microreactor platform. The model helped to identify novel ways to control glycosylation stereoselectivity and accurately predicts previously unknown means of stereocontrol. By quantifying the degree of influence of each variable, we begin to gain a better general understanding of the transformation, for example that environmental factors influence the stereoselectivity of glycosylations more than the coupling partners in this area of chemical space.
Machine learning is a powerful tool for chemists6,7 to identify patterns in complex datasets from composite libraries or high-throughput experimentation.8 Chemical challenges including retrosynthesis,9 reaction performance10 and products,11,12 the identification of new materials and catalysts,13–15 as well as enantioselectivity16,17 have been addressed. However, a significant challenge is predictability of reactions involving SN1 or SN1-type mechanisms18 in the absence of chiral catalysts/ligands,19 due to the potentially unclear mechanistic pathways resulting from the instability of the carbocationic intermediate.16,17,20
Glycosylation is one of the most mechanistically complex organic transformations,20–22 where an electrophile (donor), upon activation with a Lewis or Brønsted–Lowry acid, is coupled to a nucleophile (acceptor) to form a C–O bond and a stereogenic center. This reaction involves numerous potential transient cationic intermediates and conformations and can proceed via mechanistic pathways spanning SN1 to SN2.23 The stereochemical outcome is determined by numerous permanent (defined by the starting materials) or environmental factors (defined by the selected conditions/catalyst) whose degree of influence, interdependency, and relevance are poorly understood.20,24,25 A systematic assessment of these factors on a flow platform allowed for the isolated interrogation of these variables. The empirical study indicated general trends/influences of these factors (Fig. 1) and hypothesized their relative rankings with respect to dominance.24 However, a data science approach is required to positively identify, quantify, and apply this knowledge for the accurate prediction of stereoselectivities of new coupling partners and conditions. While transfer learning has been applied to machine learning models for the prediction of selectivities of glycosylations (reported between preprint and publication of this work), the stereoselectivity of the couplings predicted was controlled by C-2 acyl protecting groups, which provide a well-established, highly reproducible means of stereocontrol in these reactions.26
Fig. 1 General representation of the potential mechanistic pathways of glycosylations leading to either the alpha (α) or beta (β) anomer of the formed C–O bond. The empirically-derived permanent and environmental factors and their influence on stereoselectivity are provided.24
A set of numerical descriptors that accurately describe the relevant steric and electronic parameters of all reaction participants – starting materials, reagents, and solvent – is key to building an accurate, extrapolatable model to predict the subtle nuances of stereoselectivity. The concise nature of the training set (268 data points, Table S1 (pS5–S9), ESI†)30,31 renders manual selection of descriptors – quantifying sterics/electronics – using chemical intuition32 particularly important.33
The training dataset is a lightly modified version of the dataset presented in our previous work,24 removing two subsets of data (variance of the residence time and nucleophile equivalents) and adding data for β-glucose electrophile (pS6, lines 68–74 and 101–106 of Table S1, ESI†) and three additional solvents (pS9, lines 238–268 of Table S1, ESI†). Two holdout datasets were experimentally generated (HD1, HD2). Holdout dataset 1 comprised new electrophiles, nucleophiles, acid catalysts, and solvents. Holdout dataset 2 comprised examples probing the influence of electrophile leaving group stereochemistry.
… : descriptors >10 : 1.34,35 The best-performing descriptors for each participant class were determined by the accuracy of the resultant trained models in predicting stereoselectivities of the relevant portions of holdout dataset 1 (e.g. determining the accuracy of predicting the novel electrophiles in HD1 with systematic screening of electrophile descriptors). Ten descriptors were identified that, along with temperature, allow for the assignment of quantified values to the relevant steric/electronic properties of the chemicals involved.
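As a quick arithmetic check of the >10 : 1 guideline, the sketch below divides the size of the training set (268 data points) by the number of model inputs. Counting temperature alongside the ten descriptors, for eleven inputs in total, is an assumption made here for illustration, and the function name is hypothetical.

```python
def data_to_descriptor_ratio(n_data_points: int, n_descriptors: int) -> float:
    """Return the number of data points available per descriptor."""
    if n_descriptors <= 0:
        raise ValueError("n_descriptors must be positive")
    return n_data_points / n_descriptors


# 268 training data points; 11 inputs assumed (ten descriptors plus temperature).
ratio = data_to_descriptor_ratio(268, 11)

# True when the training set satisfies the >10:1 guideline.
meets_guideline = ratio > 10
```

Under this assumption the training set provides roughly 24 data points per input, comfortably above the 10 : 1 threshold.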
The identified descriptors, described below (see potential descriptors Excel sheet for a list of all descriptors screened), are classified as either regressors (intra-/extrapolatable values) or categorical (binary values). While the model can be developed solely using regressor values, it exhibits marginally poorer overall accuracy for holdout dataset 1 and necessitates additional calculations (vide infra). The ability to interchange descriptors will facilitate the expansion of the developed model into adjacent or similar chemical subspaces as well as for multi-stage predictive algorithms, designing both reagents and environmental conditions to maximize the stereoselectivity of the desired transformation.
The key parameters needed to describe the electrophile were differences in the reactivity of the anomeric position and the orientations of the pyran ring substituents that may influence the selectivity through both conformational preferences36 and hyperconjugative interactions.37,38 The different leaving groups at the anomeric position were distinguished using the calculated 13C NMR chemical shift,39 which provided clearer distinctions between leaving groups than the 1H NMR shift40 of the anomeric proton. The relative orientations of the ether moieties around the pyran presented a challenge for descriptor selection, as our model performed well with both regressor and categorical descriptors. The accuracies of the three best-performing descriptors (proton J-couplings around the ring, dihedral angles of the C–O bonds, and treating the relative axial/equatorial orientations of the substituents as binary) are shown in Fig. 3. The binary classification is the most accurate and represents the simplest descriptor, and the loss of additional/more nuanced information provided by regressor values – e.g. the influence and nature of the leaving groups – is, at present, acceptable.
Observed nucleophile reactivity has been correlated with a range of parameters.41–43 Where available, Mayr's nucleophilicity and field inductive parameters correlate with glycosylation stereoselectivity.44 To ensure general applicability, the 17O NMR chemical shift of the oxygen nucleophile was calculated to capture the relevant hyperconjugative influences. The steric environment of the nucleophile was described by the exposed surface areas of the oxygen and α-carbon in a space-filling model (Fig. 4). While screening whether simple categorical descriptors can be utilized, specifically the whole values 0–3 to describe the substitution at the α-carbon (as opposed to the exposed surface area), we found that the regressor value proved superior (see ESI†).
The chosen environmental conditions – solvent, acid catalyst, and temperature – are even more influential on the stereoselectivity than the intrinsic properties of the nucleophile and electrophile (vide infra). While regressor values for similar species have been calculated previously, the identification of the descriptors for acid catalysts relevant to this transformation was critical. The conjugate base of the acid catalyst has a significant impact on glycosylation stereoselectivity,45 as evidenced by several studies observing an α-triflate intermediate20,46 – the product of the conjugate base trapping the oxycarbenium ion.47 Two values were identified that capture the nuanced role of this species (Fig. 5a): the HOMO energy value of the conjugate base and the exposed surface area of the oxygen or nitrogen anion in a space-filling model.
While the influence of the solvent in glycosylations48,49 has been categorized by polarity and donicity (coordinating ability) values,20 donicities are experimentally derived values and only available for select solvents. The calculated minimum and maximum electrostatic potentials describe the ability of the solvent to stabilize and interact with charged intermediates (Fig. 5b). These descriptors perform well, such that even previously unreported means of solvent-control over stereoselectivity are accurately predicted (vide infra).
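The descriptors described above can be pictured as a single fixed-length input vector per reaction. The following is a minimal sketch under stated assumptions, not the authors' implementation: the class and field names are hypothetical, the use of exactly two binary orientation flags is assumed only so that the vector has eleven entries, and all numbers are placeholders rather than calculated values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GlycosylationFeatures:
    """Hypothetical container for the model inputs of one reaction."""

    # Electrophile
    anomeric_c13_shift_ppm: float        # calculated 13C NMR shift of the anomeric carbon
    substituent_axial: Tuple[int, int]   # 1 = axial, 0 = equatorial (number of flags assumed)
    # Nucleophile
    o17_shift_ppm: float                 # calculated 17O NMR shift of the nucleophilic oxygen
    oxygen_exposed_area: float           # exposed surface area of the oxygen
    alpha_carbon_exposed_area: float     # exposed surface area of the alpha-carbon
    # Acid catalyst (conjugate base)
    base_homo_energy: float              # HOMO energy of the conjugate base
    anion_exposed_area: float            # exposed surface area of the O or N anion
    # Solvent
    esp_min: float                       # minimum electrostatic potential
    esp_max: float                       # maximum electrostatic potential
    # Conditions
    temperature_c: float

    def to_vector(self) -> List[float]:
        """Flatten all fields into one list of floats in a fixed order."""
        return [
            self.anomeric_c13_shift_ppm,
            *(float(flag) for flag in self.substituent_axial),
            self.o17_shift_ppm,
            self.oxygen_exposed_area,
            self.alpha_carbon_exposed_area,
            self.base_homo_energy,
            self.anion_exposed_area,
            self.esp_min,
            self.esp_max,
            self.temperature_c,
        ]


# Placeholder numbers for illustration only; none are computed or measured values.
example = GlycosylationFeatures(
    anomeric_c13_shift_ppm=95.0,
    substituent_axial=(1, 0),
    o17_shift_ppm=10.0,
    oxygen_exposed_area=5.0,
    alpha_carbon_exposed_area=3.0,
    base_homo_energy=-0.3,
    anion_exposed_area=4.0,
    esp_min=-0.05,
    esp_max=0.05,
    temperature_c=-50.0,
)
vector = example.to_vector()
```

Keeping categorical flags and regressor values in separate fields makes it straightforward to swap one descriptor for another without changing the rest of the vector.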
The trained RF model was then used to predict the stereoselectivities of the entirety of holdout dataset 1, containing unseen variants of each of the four chemical species in the reaction over the accessible temperature ranges (defined by the solvent and reactor). Holdout dataset 1 (see holdout dataset 1 Excel sheet of ESI†) was generated using the same reproducible microreactor platform24 as the training dataset. The results of these predictions, as compared to the experimentally observed selectivities, are presented as the percentage of alpha product formed versus temperature. The corresponding parity plots for each are also provided (Fig. 6).
While the training dataset contains only simple alkyl alcohols as nucleophiles, the model accurately predicts the stereoselectivities of disaccharide formation. The predicted values for the coupling of α-galactose imidate with both glucose and mannose C6 alcohols match the experimental data well, albeit predicting a less α-selective process than observed (RMSE: 6.9 and 4.2, Fig. 6d and e, respectively).
The model predicts more α-selective processes than experimentally observed in glycosylations using superacid 4,4,5,5,6,6-hexafluoro-1,3,2-dithiazinane-1,1,3,3-tetraoxide (C3F6S2O4NH) as acid catalyst. This deviation is seen at lower temperatures with galactose; however, the trend is correct and the RMSE is low (5.5, Fig. 6g). The weakest correlation of our model is observed for the C3F6S2O4NH-activated mannose coupling with tert-butanol in DCM (RMSE: 19.3). Here, a stereoselective plateau is predicted at low temperatures with α-selectivity around 60% – as was observed experimentally for other activators with mannose.24 However, experimentally the β-mannosylation product is mainly formed at low temperatures (−50 °C, 63% β-product). This finding is highly unexpected as β-mannosylation is challenging, generally requiring locked electrophile configurations.21 With C3F6S2O4NH, the perbenzylated electrophile ranges from a 63% β-selectivity at −50 °C to 98% α-selectivity at 30 °C (Fig. 6h).
Finally, the stereoselectivities of glucose and galactose α-imidate electrophiles with isopropanol were predicted for two new solvents (Fig. 6j and k). The strong influence of solvent48,51 on the stereoselectivity of glycosylations is nicely captured by the descriptors chosen, and the model is accurate across a wide temperature range for both α,α,α-trifluorotoluene (RMSE: 6.2) and 1,4-dioxane (RMSE: 4.5).
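The RMSE values quoted for each panel compare the predicted and experimentally observed percentage of α-product across the temperature series. A minimal sketch of that calculation is given below; the function name and the numbers are hypothetical and are not taken from the training or holdout datasets.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Root mean square error between two equally long series."""
    if len(predicted) != len(observed) or len(predicted) == 0:
        raise ValueError("series must be non-empty and of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical % alpha-product values at four temperatures (illustration only).
predicted_alpha = [20.0, 35.0, 50.0, 65.0]
observed_alpha = [24.0, 32.0, 50.0, 70.0]

# Same units as the inputs, i.e. percentage points of alpha-product.
error = rmse(predicted_alpha, observed_alpha)
```

Because the error is expressed in percentage points of α-product, an RMSE of about 5 means the prediction is typically within roughly five percentage points of the measured selectivity.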
Fig. 7 Prediction of novel mechanistic controls of glycosylation reactions using holdout dataset 2, with experimental data shown as points and predicted data shown as lines. The relevant experimental data for the α-electrophiles can be found in Table S1 of the ESI.† (a) Experimental results of coupling α/β-glucose electrophiles with iPrOH (Glc1α and Glc1β) in DCM and CHCl3. (b) Experimental results of coupling α/β-glucose electrophiles with iPrOH (Glc1α and Glc1β) in toluene and MTBE. (c) Prediction and experimental results of β-glucose electrophile (Glc1β) with EtOH in toluene. (d) Prediction and experimental results of β-glucose electrophile (Glc1β) with tBuOH in toluene. (e) Parity plot of EtOH and tBuOH nucleophile predictions with the β-glucose electrophile. (f and g) Prediction and experimental results of β-galactose electrophile (Gal1β) with iPrOH in DCM and toluene, respectively. (h) Parity plot for DCM and toluene solvent predictions of the β-galactose electrophile with iPrOH. Figure code: Glc1α (▲); Glc1β (★); EtOH (filled hexagon, point down); tBuOH (▼); DCM (filled pentagon); toluene (■); experimental values (data points) and predicted values (solid colored lines).
The ability to use solvent to turn on and off the influence of leaving group orientation on glycosylation stereoselectivity has, to the best of our knowledge, not previously been reported. While essentially identical behavior is observed in DCM and chloroform, a slight divergence in MTBE at low temperatures is observed, with an 11% difference at −50 °C where the β-electrophile reaches 96% α-selectivity. This variable becomes important in toluene. Glucose β-imidate electrophile yields almost unchanged stereoselectivity (∼60% alpha) over a 120 °C range! The orientation of the leaving group of the electrophile influences the stereoselectivity by more than 40% at −50 °C (Fig. 7b).
With these limited data in our training dataset (Fig. 7a and b), we tested the ability of our model to predict the influence of other factors on this to-date unreported phenomenon in holdout dataset 2, whose experimental values were obtained on the same microreactor platform as TD and HD1 (see holdout dataset 2 Excel sheet of ESI†). The stereoselectivity of glucose α-imidate with ethanol as nucleophile ranges from 10–54% α-product in toluene. The model predicts that the β-electrophile will behave differently, with a much less selective coupling overall (37–56% α-product) and a 27% difference in selectivity at low temperature compared to the α-electrophile. This prediction matches well with the experimental results, with an RMSE of 4.4 over the 120 °C range (Fig. 7c). The model predicts a less α-selective reaction at low temperatures than observed with t-BuOH as nucleophile (similar to what is observed using the α-electrophile, pS6, lines 82–88 of Table S1, ESI†), though at higher temperatures, the prediction matches well with experimental values (RMSE: 6.4, Fig. 7d).
Lastly, we sought to explore whether this additional mechanistic complexity exists for other electrophiles (Fig. 7f and g). In DCM, the coupling of α-galactose with isopropanol moderately favors the formation of the β-product (19–51% α-product from −50 to 30 °C, (pS7, lines 119–124, of ESI†)). The model predicts that the β-galactose electrophiles will give similar α-selectivity in DCM over the 80 °C temperature range (24–49% α-product), matching experimental results (RMSE 3.1, Fig. 7f). In toluene, the α-galactose electrophile exhibits a wide range of selectivities with isopropanol, from 10–69% α-product across the 130 °C range (pS7, lines 142–148, of ESI†). The model predicts a slight divergence (15%) in stereoselectivity at low temperatures when the β-galactose electrophile is used (25–64% α-product, −50 to 70 °C), though not as large as what is observed with β-glucose. This prediction again aligns with experimental results (RMSE: 3.7, Fig. 7g). Overall, the model correctly predicts the previously unknown ability to turn on and off the influence of the electrophile leaving group's orientation using solvents under otherwise identical conditions. We hypothesize the decrease of stereoselectivities for β-electrophiles when using toluene may result from an increase in the SN1-type pathways. The π-system of the solvent can more easily induce solvolysis of the more planar equatorial leaving group from both faces (as compared to the axial orientation), leading to an accessible oxonium ion instead of an α-triflate intermediate. Additional detailed mechanistic studies are required to discern the degree and nature of mechanistic control.
Fig. 8 Degree of influence of the eleven factors (defined and described above) influencing the stereoselectivity of glycosylations, rounded to the nearest whole number.
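To illustrate how a degree of influence might be expressed as a rounded whole-number percentage, the sketch below normalizes a set of raw importance scores so that they sum to 100 and rounds each share. It assumes the influences are reported as shares of a total; the factor names and raw scores are hypothetical placeholders, not values from the model.

```python
from typing import Dict


def importance_percentages(raw_scores: Dict[str, float]) -> Dict[str, int]:
    """Convert raw importance scores into whole-number percentage shares."""
    total = sum(raw_scores.values())
    if total <= 0:
        raise ValueError("raw scores must sum to a positive value")
    return {name: round(100 * score / total) for name, score in raw_scores.items()}


# Hypothetical raw scores for three placeholder factors (illustration only).
raw = {"factor_a": 2.0, "factor_b": 1.0, "factor_c": 1.0}
shares = importance_percentages(raw)

# Three equal factors show that individually rounded shares need not sum to 100.
equal_shares = importance_percentages({"x": 1.0, "y": 1.0, "z": 1.0})
```

Rounding each share independently is simple, but as the second call shows, the rounded values can total 99 or 101 rather than exactly 100.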
Footnotes
† Electronic supplementary information (ESI) available: Detailed experimental procedures, complete datasets, additional graphs and control studies, details regarding automation and instrumentation. Microsoft Excel worksheets listing of descriptors, the training set, and holdout datasets 1 and 2. Code availability: software available at https://github.com/DrSouravChemEng/GlyMecH. See DOI: 10.1039/d0sc06222g
‡ These authors contributed equally to this work.
§ Current address: Department of Chemistry, University of Connecticut, 55 N. Eagleville Rd, Storrs, CT, USA.
This journal is © The Royal Society of Chemistry 2021