Real-time prediction of 1H and 13C chemical shifts with DFT accuracy using a 3D graph neural network

Nuclear magnetic resonance (NMR) is one of the primary techniques used to elucidate the chemical structure, bonding, stereochemistry, and conformation of organic compounds. The distinct chemical shifts in an NMR spectrum depend upon each atom's local chemical environment and are influenced by both through-bond and through-space interactions with other atoms and functional groups. The in silico prediction of NMR chemical shifts using quantum mechanical (QM) calculations is now commonplace in aiding organic structural assignment since spectra can be computed for several candidate structures and then compared with experimental values to find the best possible match. However, the computational demands of calculating multiple structural- and stereo-isomers, each of which may typically exist as an ensemble of rapidly-interconverting conformations, are expensive. Additionally, the QM predictions themselves may lack sufficient accuracy to identify a correct structure. In this work, we address both of these shortcomings by developing a rapid machine learning (ML) protocol to predict 1H and 13C chemical shifts through an efficient graph neural network (GNN) using 3D structures as input. Transfer learning with experimental data is used to improve the final prediction accuracy of a model trained using QM calculations. When tested on the CHESHIRE dataset, the proposed model predicts observed 13C chemical shifts with comparable accuracy to the best-performing DFT functionals (1.5 ppm) in around 1/6000 of the CPU time. An automated prediction webserver and graphical interface are accessible online at http://nova.chem.colostate.edu/cascade/. We further demonstrate the model in three applications: first, we use the model to decide the correct organic structure from candidates through experimental spectra, including complex stereoisomers; second, we automatically detect and revise incorrect chemical shift assignments in a popular NMR database, the NMRShiftDB; and third, we use NMR chemical shifts as descriptors for determination of the sites of electrophilic aromatic substitution.


Node updating.
The message received by atom j is added to the current node feature vector, updating the node: ℎ " #$% = ℎ " # + , # ( -# " #$% + 1.5. Readout. Node features and edge features are updated three times through the above process. The final node features involving spatial and chemical environment information will pass through a series of dense layers to give the final chemical shift predictions. 1.6. Hyperparameter selection. The graph network model in the present work contains several adjustable hyperparameters, as discussed in reference 3. These hyperparameters include the distance cutoff dc, the size of edge and node features, and the number of updating loops. A small feature vector size cannot capture all necessary information, while too large feature vectors can lead to redundant descriptions, resulting in overfitting as well as an unstable model. We optimized the feature vector size by gradually increasing it until reaching a best performance regarding to the loss function for the dev set (validating set), which was obtained for vectors of length 256.
The distance cutoff dc plays an important role in the network accuracy and speed. We compare the performance of the DFTNN network model on the testing set as a function of distance cutoff (SI Fig. 2a). The optimal distance cutoff dc was selected as 5 Å.
The updating layers, colored by gray shadow in SI Fig. 1, are repeated several times to update the node and edge features. As shown in SI Fig. 2b, the mean absolute error (MAE) on validating set reaches minimum point for three updating loops.
The distance cutoff and number of updating loops influence how the network captures the spatial and chemical environments in a given molecule. For each individual atom, the distance cutoff determines the radius for a local atomic environment to be involved, while the updating layer loops allow indirect communication with a distant atom through intermediate atoms, allowing the network to capture global impacts. For example, such spatial and chemical environments are depicted in SI Fig. 3. In the limit of a single updating loop, information passed by the network is highly local, resulting in poor performance.

Implementation.
The graph convolutional network is implemented using Keras 4 and Tensorflow 5 . Scikit-learn 6 is used to scale prediction targets. RDKit 7 is used to extract interatomic distances and to encode the atoms as integer classes. The model was trained on a single GPU (Tesla K80) with a batch size of 32. Models were optimized using the Adam first-order gradient-based optimizer 8 and MAE loss function. The learning rate of 5x10 -4 and explicit learning rate decay of 4% every 70 epochs was used. Models were trained for 1200 epochs. Early stopping was employed by evaluating the validation loss every 10 epochs and using the model that yields that lowest validation loss. Figure 4. Schematical illustration for workflow generating 3D structures and calculated shielding constants (SC). Figure 5. Distributions of mPW1PW91/6-311+G(d,p)//M06-2X/def2-TZVP calculated 1 H/ 13 C shielding constants for DFT 8K. There are 117,997 1 H and 9,9105 13 C calculated chemical shifts.

Cleaning the experimental chemical shifts with GIAO DFT calculations.
Our transfer learning strategy is reliant upon access to several thousand experimental chemical shifts. These were taken from NMRShiftDB. The assignment of experimental NMR spectra to chemical structures, and of the chemical shifts to individual atoms, is generally performed manually. Erroneous structures and assignments are encountered in the scientific literature, and we cautiously assumed that this may also be possible within our chosen experimental data source. For this reason, we used the presence of statistical outliers in these data (when compared against the chemical shift predictions of DFT calculations) to remove potentially erroneous assignments. For example, extreme disagreements between experiment and DFT (e.g., Dd values > 20 ppm for 13 C) are rare, and generally indicative of incorrect assignments. However, there is a trade-off to be made here, since some of the larger errors may also be due to failures of DFT, and the inclusion of such points would be advantageous for the transfer learning stage of model training. We used a statistical definition of outliers in terms of the interquartile range (IQR), discarding data with DFT vs. experimental error below (Q1 -1.5*IQR) or above (Q3 + 1.5*IQR). This corresponds to -6.1 and 5.9 ppm, respectively. As a result, about 8% of data was discard, leaving 53,331 13 C chemical shifts. Discarded datapoints are accessible from GitHub (https://github.com/patonlab/CASCADE).

S7
Experimental chemical shifts collected in 2.1 and 2.2 were checked and cleaned by GIAO DFT calculated shielding constants obtained in 2.3. In the ideal case, chemical shifts can be calculated from GIAO DFT shielding constants by: where 657 is the calculated shielding constant of a reference sample, for example the 1 H or 13 C nuclei in tetramethylsilane (TMS). However, an alternative and generally more accurate approach is to obtain chemical shifts by applying empirical linear-scaling relationships 16 obtained from correlation between DFT and experiment. Herein, we use such a relationship to check for possible incorrect assignments of experimental chemical shift based on the presence of statistically significant outliers.
The process of checking and discarding incorrect assignments is shown in SI Figure 5. The remaining data are constructed into a database named Exp5K, which containing 5148 structures and 53334 13 C chemical shifts. The distribution of experimental chemical shifts in this database is shown in SI Figure 6. Figure 6. (a) correlation between GIAO DFT shielding constants and experimental chemical shift. (b) Histogram shows the error distribution for the correlation. The boxplot shows the lower quartile (Q1) and upper quartile (Q3). Whiskers show the minimum (Q1 -1.5*IQR) and maximum (Q3 + 1.5*IQR), which corresponds to -6.1 and 5.9 ppm. Data outside the dash line are considered as outliers due to incorrect assignment of experimental chemical shift, which are discarded, as colored in green in figure (a).

Supplementary Information Text 3 | Learning from experimental chemical shifts.
Performance of ExpNN-dft trained against experimental chemical shifts using DFT structure inputs are shown in SI Fig. 7. Figure 8. Scatter plot and histogram show the correlation between chemical shifts predicted from ExpNN-dft and observed experimentally for 13 C. The grey dash line in the scatter plot indicates a perfect correlation.

Supplementary Information Text 4 | Training on empirical MMFF structures.
Performance of ExpNN-ff trained against experimental chemical shifts using MMFF structure inputs are shown in SI Fig. 8. Figure 9. Scatter plot and histogram show the correlation between chemical shifts predicted from ExpNN-ff and observed experimentally for 13 C. The grey dash line in the scatter plot indicates a perfect correlation.

Supplementary Information Text 5 | Performance on the CHESHIRE testing set
Performance of three networks discussed in the present work, ExpNN-ff, ExpNN-dft, and DFTNN are benchmarked against an external validation set, CHESHIRE, which is outside of the NMR8K dataset. The CHESHIRE dataset is composed of small, rigid molecules containingb C, H, N, O and S. We compare these network performances against a fully QM approach (DFT geometry optimization and shielding tensor calculation) and a mixed approach (here termed FFDDT, which involves MMFF geometry optimization and DFT shielding tensor computation). The DFT method selected here is the same used in the generation of the DFT8K database, mPW1PW91/6-311+G(d,p)//M06-2X/def2-TZVP. Molecular structures with atom indices are shown in SI Figure9. Experimental and calculated chemical shifts are shown in SI Table1. We note here that the original CHESHIRE dataset contains 25 neutral molecules, from which one was removed for this study due to the presence of charged atoms in the Lewis structure which are not currently supported by our network. Figure 10. 24 CHESHIRE molecules tested. The number in legends associated with each molecule is the molecule ID from SI Table1. Numbers in molecules associated with each atom are atom indices from SI Table1. The cis-and trans-diastereomers of 1,3-hydroxymethylcyclohexane are distinguishable by our network. Assignment of the diastereomers based on the predicted 13 C chemical shifts is correct. The ring-flipped chair forms of methylcyclohexane are differentiated by our network. The two conformers provide different 13 C predictions, with the largest effect evident at the methyl group, which is oriented equatorially or axially in the two conformers. The final prediction is based on a Boltzmann weighting over all conformers, and compares favorably with experiment.

Supplementary Information Text 7 | Structural assignment
Chemical structure and atom indices as well as experimental and predicted 13 C chemical shifts for six structures in Figure 6 are shown here.

a.
Original