Marc A. S.
Short
a,
Clare A.
Tovee
c,
Charlotte E.
Willans
b and
Bao N.
Nguyen
*b
aSchool of Chemical and Process Engineering, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, UK
bSchool of Chemistry, University of Leeds, Woodhouse Lane, Leeds LS2 9JT, UK. E-mail: b.nguyen@leeds.ac.uk
cThe Cambridge Crystallographic Data Centre, 12 Union Road, Cambridge, CB2 1EZ, UK. E-mail: tovee@ccdc.cam.ac.uk
First published on 22nd March 2023
A novel semi-automated, high-throughput computational workflow for ligand/catalyst discovery based on the Cambridge Structural Database is reported. Two potential transition states of the Ullmann–Goldberg reaction were identified and used as a template for a ligand search within the CSD, leading to >32000 potential ligands. The ΔG‡ for catalysts using these ligands were calculated using B97-3c//GFN2-xTB with high success rates and good correlation compared to DLPNO-CCSD(T)/def2-TZVPP. Furthermore, machine learning models were developed based on the generated data, leading to accurate predictions of ΔG‡, with 70.6–81.5% of predictions falling within ± 4 kcal mol−1 of the calculated ΔG‡, without the need for the costly calculation of the transition state. This accuracy of machine learning models was improved to 75.4–87.8% using descriptors derived from TPSS/def2-TZVP//GFN2-xTB calculations with a minimal increase in computational time. This new workflow offers significant advantages over currently used methods due to its faster speed and lower computational cost, coupled with excellent accuracy compared to higher-level methods.
While high-throughput experimental approaches have proven effective at finding suitable ligands from libraries and optimising reaction conditions,14–17 these are limited by the available ligand libraries. In silico ligand exploration allows faster access to the entire chemical space and can lead to the discovery of unexpected ligands. In addition, new developments in high-throughput computational techniques,18–21 and cheminformatics tools can underpin additional filters such as ligand cost/complexity, toxicity and availability for a variety of applications in different chemical sectors.22–24 However, research in this field has been hampered by a lack of suitable tools for the automated exploration of ligand space, while taking into account synthetic feasibility of the ligands.13,25–27 In this paper, we report an alternative approach which leverages the extensive Cambridge Structural Database (CSD) and its tools to explore ligand space in a relevant catalytic reaction. This has the benefit of avoiding the synthetic feasibility challenge completely, while still maintaining a very wide chemical space coverage.
The approach was demonstrated with the copper(I)-catalysed Ullmann–Goldberg reaction, an important C–N cross-coupling reaction which has been highlighted by pharmaceutical companies as a desirable synthetic tool in the near future due to its mild conditions compared to the palladium-catalysed counterpart and sustainability credentials.28 Despite this level of interest, the Pd-catalysed Buchwald–Hartwig coupling reaction is still preferred due to its reliability and better-developed ligands. Several different reaction mechanisms have been proposed for this reaction, e.g. oxidative addition,29,30 single electron transfer (SET),31,32 atom transfer,31 sigma bond metathesis,33 and π-complexation of Cu(I) on ArX,34,35 which can depend on the substrates and ligands.36–39 The mechanisms are further complicated by reversible deactivation of the catalyst by reversible disproportionation of a Cu(I) intermediate into deactivated species,40–42 and involvement of the base in the reaction mechanism.43 Thus, little understanding of ligand design, which can improve reaction yields, scope, catalyst loading and catalyst stability, has been reported. This reaction serves as an excellent case study for our automated ligand discovery approach using the CSD and high-throughput calculation of activation energy barriers (Fig. 2). Two transition states, oxidative addition (TSOA) and sigma bond metathesis (TSSig), were chosen in this study. The other possible mechanisms, which require an accurate description of either open-shell complexes or weak interactions, were excluded due to the high computational cost required in a high-throughput context (Fig. 1).44–47
![]() | ||
Fig. 1 The Ullmann–Goldberg coupling reactions studied and proposed transition states for oxidative addition (TSOA) and sigma bond metathesis (TSSig). |
![]() | ||
Fig. 2 Workflow overview for identifying ligands in the CSD, high-throughput calculation of ΔG‡, and ML prediction of ΔG‡ for in silico screening of catalysts. |
Catalytic intermediates IPip and IPyr (Fig. 1) are generated through standard complex generation with piperidine and 2-pyrrolidinone as coupling partners, in a singlet spin state. The structures are optimised before and after ligand addition using the Universal Force Field (UFF).50 Transition state structures are generated via ligand replacement of a transition state template (TSOA or TSSig), generated with GFN2-xTB, using 3,4,7,8-tetramethyl-1,10-phenanthroline (TMPHEN) as the ligand, and iodobenzene as the aryl halide coupling partner. TMPHEN is then replaced using the ligand replacement feature included in molSimplify by defining the coordinating atoms of the new ligand(s). The structure is subsequently optimised with a custom after-core constrained method with UFF, where the transition state ‘core’ is locked and only the ligand(s) are optimised.
For B97-3c//GFN2-xTB high-throughput calculations, the structures were first optimised and numerical frequencies were calculated with GFN2-xTB. Energy calculations were performed at the B97-3c level of theory using DMF as the solvent.
For transition state vetting, the eigenvector corresponding to the imaginary frequency should have motion along one of the transition state active bond stretching modes, with an overlap above the threshold S0 = 0.20 and 0.33 (eqn (2)) for TSOA and TSSig, respectively.
Descriptors for ML were chosen to describe steric and electronic properties of the respective complexes and transition states. All electronic descriptors were extracted from the B97-3c energy files, except the imaginary frequency which is from the GFN2-xTB frequency calculation. Electronic descriptors: HOMO energy, LUMO energy, Lowdin charge, bonded valence, atomic population, bond order and orbital charges. Bond descriptors are for all transition state active bonds and Cu–L bonds. Electronic descriptors for individual atoms are for transition state active atoms and the ligand coordinating atoms, L1 and L2. Steric descriptors: bite angle, change in bite angle, cone angle, sterimol B1, B5, L, % buried volume at 3.5 Å, 5 Å and 7 Å, solvent accessible surface area, Cu–L bond lengths, bond angles and change in bond lengths between the transition state and CuLX starting structure, transition state active bond lengths and bond angles. Cu–L and transition state active steric descriptors were calculated directly from bond lengths and bond angles from the .xyz files. All other steric descriptors were calculated using the Morfeus Python package.56
For these transition states, piperidine and 2-pyrrolidinone were selected as coupling partners in order to minimise conformational flexibility in the organometallic intermediates and transition states (Fig. 1). In each case, a transition state was generated using GFN2-xTB using TMPHEN as the ligand, and iodobenzene as the aryl halide. The structure was optimised to a transition state and the imaginary frequency was checked for the correct vibrational mode (Fig. 3). These transition states were used as a reference structure for ligand identification in the CSD.
The TSOA structure with TMPHEN as the ligand was imported into CSD-CrossMiner and coordinating_atom features were placed on each TMPHEN nitrogen atom and projected onto the copper atom. A bridge of two heavy_atom features between the two coordinating nitrogens was placed on the two bridging carbon atoms. The tolerance for each atom was set at 0.75 Å after manual tuning. The features were constrained to be intramolecular. The substrate sites were defined by placing excluded volume features on each atom of the substrates with a tolerance equal to the van der Waals radii of the base atom. Thus, the created pocket represents the space occupied by both substrates in the transition state, with a soft tolerance allowing the vdW radii of atoms to overlap with the excluded cavity, to allow for variations in individual transition states with different ligands and substrates. Ligands which pre-arrange in this manner will more likely favour the required geometry of the transition state. Only organic structures were included in the search by setting is_organic to True. The catalophore was saved as a .cm file.
The catalophore searches were conducted using the CatSD structural database, a carefully curated subset of the CSD, with the CSD-PythonAPI. The searches were conducted with a maximum molecular weight of 500 Da, a maximum root-mean-square-deviation (RMSD) in geometry between catalophore and the hit of 1,58,59 with Br, Cl, I, Li, Na, K, Ca, Mg, Be and transition metals excluded. SMILES code matching was used to remove duplicate structures. 3D structures were cleaned by assigning all unknown bond types, adding all missing hydrogens and setting all formal charges. Structures were exported in .mol format. In order to generate organometallic complexes with the ligands, the indexes of the coordinating atoms in the 3D structure file are required to define the bonds between the ligand and the metal centre. These were automatically identified for each ligand by matching the coordinates of the coordinating_atom features to the atoms located at those coordinates in the hit structure. The atom indexes, name of the structure file and charge of the ligand are exported as a molSimplify .dict file. This .dict file is used by molSimplify to obtain the data required for structure generation for each ligand.
For piperidine and 2-pyrrolidinone coupling partners, 14483 (ligands_CSD_Pip_set) and 18
886 (ligands_CSD_Pyr_set) unique structures were identified as potential ligands in the CSD, respectively.
Entry | Method/functional | MAE | Time (h) | |
---|---|---|---|---|
d(C–uL) (Å) | ∠(L–Cu–L) (°) | |||
1 | GFN0-xTB | 0.149 | 13.2 | 0.9 |
2 | GFN1-xTB | 0.073 | 4.6 | 2.0 |
3 | GFN2-xTB | 0.029 | 4.7 | 1.7 |
4 | HF-3c | 0.287 | 29.0 | 145 |
5 | PBEh-3c | 0.035 | 3.3 | 503 |
6 | B97-3c | 0.013 | 2.3 | 84 |
7 | B3LYP | 0.066 | 4.2 | 4931 |
8 | M06 | 0.034 | 3.4 | 7850 |
9 | M06-L | 0.020 | 4.1 | 7127 |
10 | TPSSh | 0.032 | 3.9 | 5435 |
11 | MPWLYP1M | 0.063 | 4.3 | 6508 |
12 | BP86 | 0.030 | 4.2 | 2143 |
13 | wB97xD | 0.030 | 4.3 | 9220 |
14 | B3LYP-D3(BJ) | 0.023 | 3.7 | 6179 |
15 | M06-D3(0) | 0.028 | 3.5 | 11![]() |
16 | M06-L-D3(0) | 0.019 | 4.3 | 5271 |
17 | TPSSh-D3(BJ) | 0.016 | 3.3 | 6172 |
18 | TPSS-D3(BJ) | 0.017 | 3.7 | 3886 |
19 | PBE0-D3(BJ) | 0.018 | 3.6 | 5502 |
20 | BP86-D3(BJ) | 0.026 | 4.6 | 4036 |
Without D3 dispersion correction, all methods overestimated bond lengths by several picometers due to the inability of the methods to account for London dispersion interactions and increased steric interactions at the metal centres. The GGA/meta-GGA functionals, BP86 and M06-L, perform better than the hybrid functionals when predicting bond lengths, with BP86 being the most computationally efficient (entry 9 and 12, Table 1). The inclusion of D3 dispersion correction led to significantly improved performance in all DFT methods. TPSSh, a meta-GGA hybrid including 10% HF exchange, showed the best predictions for both bond lengths and bond angles while keeping the computational cost relatively low. Importantly, GFN2-xTB calculations using xtb 6.3.0 with verytight optimisation criteria showed excellent structural agreement with the crystal structures with minimal computational cost (entry 3, Table 1). The composite method B97-3c gave even better results (MAE of 0.013 Å and 2.3°), requiring only 84 hours of single-core CPU time, compared to 6172 hours for TPSSh-D3(BJ) and 4036 hours for BP86-D3(BJ). In fact, B97-3c outperformed the best DFT functional TPSSh-D3(BJ), even with a full triple-ζ basis set. Thus, the GFN2-xTB and B97-3c methods are the most suitable computational methods for very high-throughput computational studies of the Ullmann–Goldberg reaction (Fig. 5).
Lastly, the accuracy of the high-throughput calculation of ΔG‡ for the Ullmann–Goldberg reaction was assessed. Grimme and co-workers have demonstrated that the B97-3c//GFN2-xTB combination can be as accurate as traditional DFT methods in optimising and calculating energy-related properties of stable organometallic compounds.21,60 However, its performance in calculating transition states is unknown. Thus, we benchmarked the accuracy of B97-3c//GFN2-xTB calculated ΔG‡ values (B97-3c single point energy with GFN2-xTB vibrational correction) against those obtained by the ‘gold standard’ method for the calculation of structure energies, CCSD(T), for 100 randomly selected literature ligands. Both energy calculations used the same GFN2-xTB optimised structures. Coupled cluster energies were calculated at the DLPNO-CCSD(T)/def2-TZVPP level of theory and compared to the B97-3c energy. The results are summarised in Fig. 6.
The activation energies barriers calculated by B97-3c were found to correlate reasonably well with those obtained with DLPNO-CCSD(T)/def2-TZVPP method (R2 = 0.5774 for a linear relationship y = 0.8938x + 8.3679). Therefore, the B97-3c derived values were scaled (see ESI,† section 2.2), achieving a mean average error (MAE) of 3.9 kcal mol−1 (Fig. 6). The majority of calculated ΔG‡ (89%) fell with 15× RMSE of the benchmarked values. Thus, the B97-3c method represents a good balance between computational time and accuracy for the calculation of activation energies of Ullmann–Goldberg reactions. Structures containing oximes and O–Cu–O 5-membered ring motifs correlate poorly between the two methods. However, only 8 ligands containing oximes have been reported for the Ullmann–Goldberg reaction (ligands_lit_set), and this was not deemed a significant problem for ligand exploration.
Based on the benchmarking results all optimisations and frequency calculations were performed using GFN2-xTB. Energy calculations were carried out using the B97-3c composite method.
Conversion of ligands into complexes using SMARTS/SMILES strings and rdkit package can suffer from conformational variation from those in the CSD. Thus, ligands are taken as .xyz or .mol 3D structure files derived from their structures in the CSD. For ligands_lit_set, SMILES strings were used due to the lack of suitable 3D structures in the CSD, and monodentate ligands were excluded for simplicity.
In order to automate the generation of transition states TSOA and TSSig, a different strategy was employed. The transition states generated with TMPHEN above were employed as templates and TMPHEN was substituted with the ligand of interest. These structures were then pre-optimised with a custom after-core constrained method using the universal force field (UFF), where the transition state ‘core’ is locked and only the ligand is optimised to ensure the transition state mode is preserved.
The structures generated by molSimplify were pre-optimised with GFN2-xTB with the TightOpt optimisation criteria with the transition state active atoms frozen. The resulting structure was considered close to the transition state and was then optimised using eigenvector following to the transition state. To ensure reliability in cases with a shallow PES the exact Hessian is calculated every five optimisation steps. The presence of a transition state is verified by the presence of a single imaginary frequency. Single point energies are calculated with B97-3c using the TightSCF criteria and SlowConv to improve the reliability of SCF convergence. It is worth noting that many potential energy surfaces for TSOA and TSSig are relatively flat, requiring frequency Hessian calculations which are more costly computationally.
Preliminary examination of automated results showed that the process is prone to generating wrong transition states, e.g. a methyl rotation, dissociation of reactants, no identified transition state and hydrogen transfer between the ligand and the amine/amide. In order to validate the computed transition states, the vetting criteria presented by Jacobsen et al. were used.61 This procedure is not based on an IRC calculation and, therefore, reduces the required computational time. The transition state structure must meet the following three criteria: i) exactly one imaginary frequency of the Hessian (a cutoff value of −40 cm−1 is used to remove structures with frequencies that could be considered as numerical noise); ii) the transition state active bonds (bonds being broken or formed) must be of an intermediate length (eqn (1)),
![]() | (1) |
|vstretchi·vts| ≥ S0 | (2) |
To assess the accuracy of the workflow in generating and optimising the transition states, 198 ligands from ligands_CSD_Pip_set were used as test cases. Both oxidative addition (TSOA) and sigma metathesis (TSSig) transition states with piperidine as the N-partner were generated and optimised. Manual inspection of the structures and visualisation of the imaginary frequency of a subset of ligands gave an optimal value of S0 = 0.20 and 0.33 for TSOA and TSSig respectively, for the Ullmann–Goldberg reaction. For intermediates CuLI and IPip, 98% of all structures containing bidentate ligands were successfully generated (Table 2). For optimisation, 80% of CuLI and 56% of IPip structures were correctly optimised in an automated manner. Compared to these, the optimisation of transition states was very successful, giving a success rate of 63% for TSOA and 88% for TSSig. The most common reason for optimisation failure is low imaginary vibrational frequencies (>−20 cm−1). This is likely due to poor starting structures generated from SMILES strings. To mitigate this on the CSD datasets, all structures (ligands and nucleophiles) were supplied as either X-ray or optimised 3D structures and the TightOpt criteria was used to aid the removal of small imaginary frequencies.
When this workflow was applied to the entire ligands_CSD_Pip_set and ligands_CSD_Pyr_set, the optimisation success rates for stable intermediates were significantly higher than those for ligands_lit_set, thanks to the initial 3D ligand structures supplied from the CSD. On the other hand, the success rates for transition states are significantly lower than those of ligands_lit_set. The success rate of finding and optimising TSOA (33%) was particularly low using ligands_CSD_Pip_set (Table 2). This result reflects that many potential ligands are not suitable for the Ullmann–Goldberg coupling reaction, as suggested by the experimental literature. TSSig is less sterically demanding than TSOA and consequently resulted in better success rates with both ligand sets. Similarly, as 2-pyrrolidinone is less sterically demanding than piperidine, the success rate in optimising TSOA is significantly higher with ligands_CSD_Pyr_set comparing to that with ligands_CSD_Pip_set.
The real-world runtime on a high-performance computing system using 4 cores and 4GB of RAM per job for ligands_CSD_Pip_set is ∼6 weeks for 14483 ligands. The comparative time for ligands_CSD_Pyr_set is ∼4 weeks for 18
886 ligands after fine-tuning.
![]() | ||
Fig. 7 Probability density for the activation energies of the TSOA (blue) and TSSig (red) transition states for piperidine (left) and 2-pyrrolidinone (right). |
![]() | ||
Fig. 8 TSOA activation energy against TSSig activation energy for piperidine (blue) and 2-pyrrolidinone (red). |
Dataset | Original | Highly correlated removed | Low importance removed |
---|---|---|---|
TS dependent | |||
PIP_set_TSOA | 78 | 75 | 20 |
PYR_set_TSOA | 128 | 111 | 24 |
PIP_set_TSSig | 91 | 90 | 36 |
PYR_set_TSSig | 130 | 121 | 41 |
TS independent | |||
---|---|---|---|
PIP_set_TSOA_NoTS | 67 | 60 | 10 |
PYR_set_TSOA_NoTS | 67 | 62 | 14 |
PIP_set_TSSig_NoTS | 67 | 61 | 27 |
PYR_set_TSSig_NoTS | 67 | 62 | 17 |
Common practice in ML relies on R2 and RMSE, which do not consider the level of noise in training data, to evaluate models. Consequently, a new metric was created to evaluate the models: % of ΔG‡ prediction within ± 4 kcal mol−1 (%ΔG‡ ± 4). This reflects the maximum accuracy of the model based on the error present in the DFT calculated activation energies. Calculated datasets were trimmed via binning to remove bias from the uneven distribution of activation energies. Entries with a Cu–L bond order of 0 were discarded as they were bonding in a monodentate manner.
Dataset | Best algorithm | R 2 | RMSE | %ΔG‡ ± 4 |
---|---|---|---|---|
Full descriptors | ||||
PIP_set_TSOA | ET | 0.49 | 7.90 | 75.5 |
PYR_set_TSOA | ET | 0.65 | 6.32 | 66.1 |
PIP_set_TSSig | ET | 0.39 | 5.93 | 77.9 |
PYR_set_TSSig | ET | 0.63 | 5.52 | 68.5 |
Trimmed descriptors | ||||
---|---|---|---|---|
PIP_set_TSOA | ET | 0.66 | 4.81 | 79.6 |
PYR_set_TSOA | ET | 0.71 | 4.86 | 71.3 |
PIP_set_TSSig | ET | 0.48 | 4.33 | 81.5 |
PYR_set_TSSig | ET | 0.66 | 4.95 | 70.6 |
The importance of each descriptor was evaluated using permutation importance over 50 runs.64 Those which showed a permutation importance of mean − 2 × std ≤ 0 (mean = average decrease in R2 over 50 runs, std = standard deviation of the decrease in R2 over 50 runs; this metric means that more than 95% of the permutation importance values from 50 runs are above 0) were removed. This led to a very significant reduction in the number of descriptors in each dataset (Table 3), improving the interpretability of the prediction models. Importantly, the removal of the redundant descriptors led to improvement in all metrics by a significant margin (Table 4).
For TSSig, ΔG‡ showed high dependence on the atomic population on the Cu centre, Cu–N and Cu–I bond orders and TS mode bond angles, HOMO/LUMO energies and orbital charges of the Cu s, d, N s and I d orbitals. The sigma metathesis pathway showed strong dependence on steric descriptors (distances and angles). The amide C–O bond length and C–N bond order are important properties for coupling with 2-pyrrolidinone. This implies that the ability of the ligand to modulate the electron density of the copper centre to bond to and weaken the amide bond in the nucleophile is important. For piperidine, no such trend was observed, indicating a lesser degree of influence from the amine partner beyond a direct sigma donation to the Cu centre.
Structure | Single-core computational time (h) | |||
---|---|---|---|---|
Optimisation | Energy + frequency | Total | % of time | |
Ligands_CSD_PIP_set | ||||
CuLI | 66 | 6244 | 6310 | 3.5 |
IPip | 169 | 9897 | 10![]() |
5.6 |
TSOA | 102![]() |
15![]() |
118![]() |
65.6 |
TSSig | 30![]() |
14![]() |
45![]() |
25.4 |
Ligands_CSD_PYR_set | ||||
---|---|---|---|---|
CuLI | 1173 | 8596 | 9789 | 6.0 |
IPyr | 1693 | 11![]() |
13![]() |
8.2 |
TSOA | 69![]() |
17![]() |
87![]() |
53.6 |
TSSig | 33![]() |
19![]() |
52![]() |
32.2 |
For this purpose, only descriptors based on IPyr and IPip were selected and subjected to the same descriptor trimming process (Table 3). Descriptors for individual atoms were extracted for Cu, N and ligand coordinating atoms (L1 and L2). For ligands_CSD_PYR_set, descriptors were also extracted for the amide C and O atoms. This resulted in four datasets, PIP_set_TSOA_NoTS, PIP_set_TSSig_NoTS, PYR_set_TSOA_NoTS and PYR set_TSSig_NoTS containing 67 descriptors and 1683, 3708, 3990 and 5798 ligands respectively (Fig. 9).
![]() | ||
Fig. 9 Overview of the machine learning model generation process, from computational data to final models. |
The results for the transition state independent models, using the same algorithms, are summarised in Table 7. Predictably, these new models show reduced accuracy compared to those with descriptors from the transition states, except for PYR_set_TSOA_NoTS which shows a slight improvement in R2 (0.67 to 0.69) and RMSE (5.16 to 4.97). Nevertheless, the significant reduction of computational time (5.7–10 times) for new predictions with only a small decrease in accuracy was promising.
Dataset | Best algorithm | R 2 | RMSE | %ΔG‡ ± 4 |
---|---|---|---|---|
Full descriptors | ||||
PIP_set_TSOA_NoTS | SVM | 0.32 | 6.56 | 79.6 |
PYR_set_TSOA_NoTS | SVM | 0.67 | 5.16 | 68.1 |
PIP_set_TSSig_NoTS | SVM | 0.57 | 3.74 | 83.7 |
PYR_set_TSSig_NoTS | ET | 0.66 | 4.83 | 75.8 |
Trimmed descriptors | ||||
---|---|---|---|---|
PIP_set_TSOA_NoTS | ET | 0.29 | 6.95 | 76.3 |
PYR_set_TSOA_NoTS | ET | 0.69 | 4.97 | 66.4 |
PIP_set_TSSig_NoTS | SVM | 0.56 | 3.78 | 82.8 |
PYR_set_TSSig_NoTS | ET | 0.64 | 4.98 | 75.8 |
PBE0 electronic descriptors | ||||
---|---|---|---|---|
PIP_TSOA_PBE0 | SVM | 0.40 | 6.11 | 82.3 |
PYR_TSOA_PBE0 | SVM | 0.71 | 4.59 | 72.6 |
PIP_TSSig_PBE0 | SVM | 0.69 | 3.78 | 84.6 |
PYR_TSSig_PBE0 | ET | 0.68 | 4.66 | 78.0 |
TPSS electronic descriptors | ||||
---|---|---|---|---|
PIP_TSOA_TPSS | ET | 0.34 | 6.03 | 80.5 |
PYR_TSOA_TPSS | SVM | 0.69 | 4.27 | 75.4 |
PIP_TSSig_TPSS | ET | 0.62 | 3.46 | 87.8 |
PYR_TSSig_TPSS | ET | 0.70 | 4.09 | 80.4 |
In order to improve the prediction models for ΔG‡ without calculating the transition states, some of the freed-up computational time was dedicated to the calculation of more accurate electronic descriptors using a higher-level method. The electronic descriptors were recalculated using three DFT methods: TPSS, TPSSh and PBE0. These were selected for their previous successful calculation of first-row transition metals complexes (TPSS) and increasing amount of Hartree–Fock exchange for more accurate bonding description (TPSSh and PBE0).65,66 The triple-ζ basis set def2-TZVP was used with all methods. The single-core CPU time for each energy method for 50 IPip/IPyr structures is 14, 34, 94 and 86 hours for B97-3c, TPSS, TPSSh and PBE0, respectively. The correlation of electronic descriptors and activation energy with those from DLNPO-CCSD(T)/def2-TZVPP were examined (see ESI,† Table S28), and PBE0 and TPSS were selected for prediction model building based on either accuracy or speed. In each case, the new ΔG‡ values from PBE0 or TPSS replaced the values from B97-3c in these models.
Surprisingly, the TPSS-based models outperformed those based on PBE0 with better RMSE metrics across all four datasets, requiring only a third of the CPU time compared to that of PBE0. On average, the computational time per new prediction of ΔG‡ was reduced from ∼10 h, for transition state based B97-3c//GFN2-xTB models, to ∼1 h using TPSS-def2-TZVP//GFN2-xTB descriptors based on IPyr or IPip. The predictions for the PIP_TSSig_TPSS dataset are well within the error of the calculated activation energies (3.9 kcal mol−1). These are very significant improvements to the B97-3c//GFN2-xTB models both with and without the inclusion of the transition state descriptors. This new method using ML algorithms to predict ΔG‡ provides an excellent balance between accuracy and speed for high-throughput ligand/catalyst development.
Lastly, the trend in descriptor importance is consistent between the transition state dependent and transition state independent models. The TSOA ΔG‡ is dependent on the ability of the ligand to modulate the electron density at the copper centre via the s/d orbitals. The lack of amine nitrogen descriptors within the top 5 most important descriptors, after which change in R2 drops significantly, implies that the interaction between the copper and nucleophile is of low importance, which can be justified by the lack of N participation in the oxidative addition transition state. The TSSig ΔG‡ is dependent on the ability of the ligand to create the correct orientation and modulate the electron density on the nucleophile nitrogen. In the case of an amide nucleophile, this is achieved through the weakening of the amide bond.
This workflow offers significant advantages over currently used methods due to its faster speed and good to excellent accuracy compared to higher-level methods. We expect this workflow to have wide applicability in catalyst design, ranging from pharmaceutical process development to novel catalyst design across multiple chemical areas. Further development toward fully automated processes is in progress and will benefit the wider chemistry community.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3cy00083d |
This journal is © The Royal Society of Chemistry 2023 |