Maria Bånkestad,*ab Kevin M. Dorst,c Göran Widmalmc and Jerk Rönnolsb
aDepartment of Information Technology, Uppsala University, Sweden. E-mail: maria.bankestad@it.uu.se
bRISE Research Institutes of Sweden, Stockholm, Sweden
cDepartment of Organic Chemistry, Stockholm University, Sweden
First published on 22nd August 2024
Carbohydrates, vital components of biological systems, are well known for their structural diversity. Nuclear Magnetic Resonance (NMR) spectroscopy plays a crucial role in understanding their intricate molecular arrangements and is essential in assessing and verifying the molecular structure of organic molecules. An important part of this process is predicting the NMR chemical shift from the molecular structure. This work introduces a novel approach that leverages E(3) equivariant graph neural networks to predict carbohydrate NMR spectral data. Notably, our model achieves up to a threefold reduction in mean absolute error compared to traditional models that rely solely on two-dimensional molecular structure. Even with limited data, the model excels, highlighting its robustness and generalization capabilities. The model, dubbed GeqShift (geometric equivariant shift), uses equivariant graph self-attention layers to learn NMR chemical shifts, which is particularly valuable since stereochemical arrangements are characteristic of carbohydrate structures.
The inherent complexity of carbohydrate molecules in structural studies and stereochemical assignments stems from two key factors: their large number of stereocenters and the extensive possibilities for interconnecting monosaccharides. For example, combining two glucopyranosyl residues can yield as many as 19 distinct disaccharides, each with a unique structure.3 Additionally, variations in substitution patterns, like acetylation and sulfonation, further contribute to the complexity of carbohydrate structures. Determining carbohydrate structures by NMR spectroscopy can be a formidable task.4
The peaks observed in an NMR spectrum of a molecule provide valuable information about the nuclei present, such as the carbon and hydrogen isotopes 13C and 1H, their chemical surroundings, and how they are interconnected. Fig. 1 provides examples of 13C and 1H NMR spectra for a monosaccharide.
The position of a peak for a particular nucleus, indicated by its chemical shift δ (δH and δC for 1H and 13C chemical shifts, respectively), corresponds to the resonance frequency of the nucleus within a magnetic field. The local environment of the atom, especially the electron density in the vicinity of the nucleus, strongly influences this resonance frequency (see Fig. 2). Besides the atomic species of the studied nucleus, the primary factors influencing chemical shifts are the neighboring covalently bonded atoms within the molecule because the electronegativity of these nearby atoms correlates closely with the observed chemical shifts. Electron-withdrawing groups, like oxygen and fluorine, located near the observed nuclei deshield them, increasing their chemical shifts. Conversely, proximity to electron-donating groups enhances shielding, thereby decreasing the chemical shifts.
In molecular ring systems (appearing in carbohydrates), the orientation of a hydrogen atom, either axially or equatorially, significantly impacts its δH value. Similarly, for carbon nuclei in a ring system, the arrangement of the substituents they carry influences their δC value. Fig. 3, showing the 13C chemical shifts of α- and β-glucopyranose, illustrates this discrepancy. The change in configuration at the anomeric center not only affects the chemical shift of the highlighted anomeric carbon but also has a ripple effect, altering the shifts of all carbon atoms in the molecule. It is important to note that spatial interactions can influence chemical shifts beyond the effects of covalent bonds.5
A standard method for predicting the chemical shifts of carbohydrate molecules involves utilizing an extensive database of known carbohydrates.6 This approach entails comparing new carbohydrate structures with those existing in the database, making necessary adjustments for recognized patterns around glycosidic bonds.
The CASPER program7 relies on a relatively small set of NMR data of glycans. It uses approximations to predict chemical shifts of glycan structures not present in the database, which facilitates the coverage of a large number of structures. However, the reliance on these databases is less effective when new structures containing previously uncharacterized sugar residues are encountered.
Alternatively, chemical shifts can be estimated using Quantum Mechanical Density Functional Theory (DFT) calculations.8 While this technique is effective for many molecules, it comes with substantial computational demands, making it both costly and time-consuming. A notable advancement in carbohydrate chemical shift calculation was recently published by Palivec et al.9 and involves an in-depth simulation of the water environment surrounding the molecules under study. This approach employs molecular dynamics and DFT to calculate chemical shifts for small carbohydrate molecules, including mono-, di-, and one trisaccharide.
As previously mentioned, the relationship between a molecule and its chemical shift is intricate, suggesting the utility of artificial neural networks (ANNs), recognized as universal approximators, to model this relationship from data. Neural networks, a subset of machine learning methods, are adept at learning high-dimensional feature spaces and capturing subtle, intricate patterns within the data.10 For predicting chemical shifts, neural networks trained on carefully constructed datasets of experimental chemical shifts can account for various influencing factors, such as electronic environments, steric effects, and long-range interactions, leading to fast, accurate, and reliable chemical shift predictions. As early as 1991, Meyer et al.11 proposed using a feed-forward network to identify 1H NMR spectra for oligosaccharides. More recently, graph neural networks (GNNs) have emerged to predict chemical shifts.12 Some of these models use only the molecular structure (the atoms and their bonds) as input,13–15 while others incorporate atom–atom pairwise distances as additional input features.16,17
While these models demonstrate strong performance for numerous molecules, they struggle when dealing with molecules featuring complex stereochemistry, such as carbohydrates. It is reasonable to assume that these molecules must be treated as dynamic, three-dimensional entities for accurate representation, demanding a network capable of capturing this complexity. This study proposes a model that integrates the three-dimensional molecular structure while preserving the fundamental symmetries of the underlying physics of the molecule.
More specifically, we introduce an E(3) equivariant graph neural network, also known as a Euclidean neural network.18 Equivariance is a transformation property that ensures a consistent response when the input transforms. An example is the intramolecular forces holding the atoms of a molecule together: these forces are equivariant to rotation, since they rotate together with the molecule. An equivariant function preserves the relationship between input (molecule) and output (interatomic forces) under transformations, so an equivariant function deriving the interatomic forces yields forces that rotate with the molecule.
A Euclidean neural network is equivariant to the Euclidean group E(3), the group of transformations in Euclidean space comprising rotations, translations, and mirroring. Compared to a network that considers only pairwise distances, an equivariant network considers the relative positions of atoms, encompassing both pairwise distance and pairwise direction. Euclidean neural networks have recently gained recognition for their success in various chemistry applications, spanning from modeling molecular potential energy surfaces19 to predicting toxicity20 and studying protein folding.21
Our model, denoted GeqShift (geometric equivariant shift), is a GNN that utilizes equivariant graph self-attention layers22 to learn chemical shifts, particularly when stereochemistry is crucial. These attention layers update the node features by considering features of nearby nodes, so-called neighbors, weighting them with so-called attention weights to emphasize the most important information. Our contribution is threefold: the chemical shift prediction model GeqShift, an innovative data augmentation method inspired by the dynamic movement of molecules in a fluid, and a compiled carbohydrate chemical shift dataset suitable for machine learning applications. By making this dataset public, we hope to stimulate further research in data-driven automated chemical shift analysis.
Our experiments demonstrate that our model and training approach achieve precise predictions, especially in intricate stereochemistry cases. Notably, for the carbohydrate dataset, our network reaches mean absolute errors (MAEs) of 0.31 ppm for δC and 0.032 ppm for δH.
A molecule is inherently dynamic, continuously changing its conformation. The likelihood of these conformations follows the Boltzmann distribution, p(R) ∝ exp(−E(R)/kBT), where E is the molecular energy function and R its conformation. Conventionally, data-driven models alleviate this by selecting the conformation with the lowest energy, i.e., the highest probability, typically determined through methods like density functional theory (DFT) simulation.
We take a different approach by considering the molecule conformation as dynamic, with not just one but an ensemble of conformations. The predicted NMR chemical shift varies depending on the conformation, resulting in an ensemble of predictions per molecule. The final prediction is the average. We use this technique during both training and testing.
In machine learning terms, this is a data augmentation technique. We hypothesize that this will enhance the generalization capacity of the model, especially given the limited size of the training dataset. As a result, our final model, GeqShift, does not rely on a specific low-energy conformation as input, enabling effective generalization to molecules not seen during training. Fig. 4 presents an overview of the model.
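As a minimal sketch of this ensemble-averaging idea (the `predict_shift` argument is a hypothetical stand-in for the trained network, and the toy model below is purely illustrative, not the paper's model):

```python
import numpy as np

def ensemble_predict(predict_shift, conformations):
    """Average per-atom chemical-shift predictions over a conformation ensemble.

    predict_shift: hypothetical model call mapping one conformation
    (an array of atom coordinates) to per-atom shift predictions.
    """
    preds = np.stack([predict_shift(R) for R in conformations])
    mean = preds.mean(axis=0)   # final prediction per atom
    std = preds.std(axis=0)     # spread across conformations = rough uncertainty
    return mean, std

# Toy stand-in model: "shift" depends on each atom's distance to the centroid.
def toy_model(R):
    return np.linalg.norm(R - R.mean(axis=0), axis=1)

rng = np.random.default_rng(0)
base = rng.normal(size=(6, 3))  # one conformation of a 6-atom toy molecule
confs = [base + 0.05 * rng.normal(size=(6, 3)) for _ in range(100)]
mean, std = ensemble_predict(toy_model, confs)
```

The per-atom standard deviation is the same quantity used later to estimate prediction uncertainty from the ensemble.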
To establish a baseline, we compare our model with the scalable GNN by Han et al.,15 referred to as SG-IMP-IR, which achieves state-of-the-art results on the NMRShiftDB2 dataset.25 Additionally, we conducted six ablations to assess the effectiveness of various components of our model, as summarized in Table 1. These include an invariant version (inv) of the model, equivalent to setting the maximum degree of the irreducible representations of the hidden layers to zero (explained further in Section 4.1). Furthermore, we examined the impact of training and testing on an ensemble of conformations by evaluating the model on only a single conformation (1T) and by both training and testing on a single conformation (1TT). It is important to note that the train/test splits are consistent across all models, with data augmentation achieved by sampling multiple conformations per molecule.
Fig. 5 presents an overview of the performance of the model using violin plots, which combine a box plot with a density plot.26 Furthermore, Table 2 provides a detailed comparison of the models, emphasizing prediction accuracy for different types of carbohydrates, including mono-, di-, and trisaccharides.
| Model | Monosaccharides 13C MAE | RMSE | Disaccharides 13C MAE | RMSE | Trisaccharides 13C MAE | RMSE |
|---|---|---|---|---|---|---|
CSDB | 1.23 (1.00) | 3.40 (3.94) | ||||
NMRDB | 2.02 (0.29) | 2.87 (0.41) | ||||
SG-IMP-IR | 1.18 (0.20) | 1.61 (0.30) | 1.02 (0.17) | 1.53 (0.37) | 1.13 (0.16) | 1.61 (0.20) |
GeqShift_1TT_inv | 0.54 (0.12) | 0.86 (0.23) | 0.44 (0.07) | 0.73 (0.15) | 0.65 (0.11) | 1.06 (0.21) |
GeqShift_1TT | 0.55 (0.15) | 0.90 (0.37) | 0.47 (0.08) | 0.75 (0.17) | 0.63 (0.11) | 1.05 (0.24) |
GeqShift_1T_inv | 0.39 (0.11) | 0.69 (0.23) | 0.28 (0.06) | 0.51 (0.16) | 0.37 (0.10) | 0.64 (0.21) |
GeqShift_1T | 0.34 (0.08) | 0.61 (0.19) | 0.25 (0.06) | 0.48 (0.18) | 0.33 (0.09) | 0.57 (0.20) |
GeqShift_inv | 0.37 (0.11) | 0.66 (0.23) | 0.26 (0.06) | 0.49 (0.16) | 0.33 (0.08) | 0.59 (0.14) |
GeqShift | 0.31 (0.08) | 0.58 (0.18) | 0.23 (0.06) | 0.46 (0.19) | 0.30 (0.09) | 0.53 (0.16) |
| Model | Monosaccharides 1H MAE | RMSE | Disaccharides 1H MAE | RMSE | Trisaccharides 1H MAE | RMSE |
|---|---|---|---|---|---|---|
CSDB | 0.11 (0.032) | 0.19 (0.083) | ||||
NMRDB | 0.30 (0.036) | 0.37 (0.039) | ||||
SG-IMP-IR | 0.071 (0.026) | 0.110 (0.039) | 0.045 (0.007) | 0.075 (0.014) | 0.055 (0.011) | 0.087 (0.020) |
GeqShift_1TT_inv | 0.064 (0.011) | 0.100 (0.022) | 0.049 (0.009) | 0.076 (0.017) | 0.067 (0.009) | 0.102 (0.015) |
GeqShift_1TT | 0.061 (0.016) | 0.115 (0.053) | 0.041 (0.006) | 0.061 (0.012) | 0.060 (0.010) | 0.103 (0.030) |
GeqShift_1T_inv | 0.046 (0.014) | 0.078 (0.040) | 0.034 (0.006) | 0.053 (0.012) | 0.050 (0.010) | 0.079 (0.018) |
GeqShift_1T | 0.037 (0.009) | 0.062 (0.020) | 0.028 (0.003) | 0.046 (0.010) | 0.038 (0.009) | 0.057 (0.017) |
GeqShift_inv | 0.044 (0.015) | 0.077 (0.041) | 0.030 (0.004) | 0.048 (0.011) | 0.043 (0.009) | 0.069 (0.015) |
GeqShift | 0.035 (0.009) | 0.057 (0.018) | 0.026 (0.003) | 0.044 (0.011) | 0.033 (0.009) | 0.052 (0.016) |
Among our models, GeqShift emerges as the top-performing model, closely followed by GeqShift_inv. Compared to using just one conformation per molecule for training, we observe a significant performance improvement when using an ensemble of 100 conformations. For instance, in the case of monosaccharides, the mean absolute error (MAE) notably decreases from 0.55 to 0.37 when trained with 100 conformations. Subsequently, it further drops slightly to 0.31 when also predicting 100 conformations. These results underscore the advantage of incorporating multiple conformations in our training and prediction processes.
GeqShift surpasses the CSDB and NMRDB simulation tools in predicting carbon and proton chemical shifts. Although this comparison is not entirely straightforward, since the CSDB database contains molecules that are part of the testing distribution but does not include all molecules from the training dataset, it still highlights GeqShift's superior generalization capability.
In Fig. 6, we delve deeper into the prediction accuracy of our best-performing method, GeqShift. The figures within this plot illustrate histograms of prediction errors and scatter plots depicting the relationship between the actual and predicted values for both 13C and 1H nuclei. We combined the test sets' prediction results across all ten cross-validation folds to create these visualizations. Notably, the distributions of prediction errors are approximately zero-centered, with a standard deviation of 0.39 for 13C and 0.052 for 1H.
Fig. 7 visualizes the predictions from the whole ensemble of conformations for the monosaccharide α-L-lyxopyranose. The figure displays histograms representing the predictions for each 13C atom in the molecule, the ensemble mean, and the actual NMR peaks. These histograms showcase the distribution of predicted values, allowing for a comparison with a real NMR spectrum (refer to Fig. 1). Furthermore, the ensemble of predictions per chemical shift enables an estimation of prediction uncertainty by examining the standard deviation.
| Name | Structures left out | No. removed |
|---|---|---|
Xyl | All with a Xyl residue | 10 |
Qui | All with a Qui residue | 7 |
Ur_acid | All with a uronic acid (GlcA, ManA and GalA left out) | 14 |
Ur_acid/GlcA | All with a uronic acid but keep GlcA (ManA and GalA left out) | 8 |
Ac | All acetylated compounds | 19 |
34Ac | All compounds acetylated at carbon 3 or 4 | 10 |
Fig. 8 compares the prediction accuracy of GeqShift with SG-IMP-IR, where GeqShift outperforms SG-IMP-IR on a majority of the substructures. This experiment underscores the importance of including structurally similar molecules in the training data for accurate machine learning predictions. Specifically, when the model is trained on the Ur_acid dataset with all uronic acids excluded, it performs poorly in predicting the NMR spectra of molecules containing uronic acids. However, when GlcA, a specific uronic acid, is included in the training data, the model's performance significantly improves for the excluded uronic acid molecules, ManA and GalA. This result suggests that similar structural motifs in the training data enhance the model's ability to generalize to new, unseen molecules within the same chemical family. Furthermore, it demonstrates the model's capability to extrapolate structural information from one molecule (GlcA) to different but related molecules (ManA and GalA).
Fig. 8 Prediction performance for chemical shifts (13C and 1H) in terms of mean absolute error (MAE) for the out-of-distribution evaluation. The specific structures that were excluded from the training data and then used as a test set are listed in Table 3. |
Fig. 9 Prediction performance for chemical shifts (13C and 1H) in terms of mean absolute error (MAE) within the context of the two polysaccharides introduced in Fig. 10. In this evaluation, the models employ an average prediction derived from the ten models trained during ten-fold cross-validation. |
Fig. 10 The figure illustrates the prediction errors for the 13C and 1H chemical shifts of two E. coli O-antigen polysaccharides, each composed of tetrasaccharide repeating units, from serogroup O77 (upper) and serogroup O176 (lower).27,28 The structures are visualized using symbols from the SNFG standard.29 The repeating units are enclosed in square brackets. The box plots visually represent the prediction errors Δδ on a per-atom basis.
To enhance accuracy, we employed data augmentation techniques that replicate the dynamic behavior of molecules. Instead of restricting each molecule to a single conformation, we utilized an ensemble of conformations for both the training and testing datasets. To sample the conformations, we prioritized simplicity and speed. Therefore, we opted for the RDKit open-source toolkit, which employs an energy force field technique (further details in Section 4.3). The results in Table 2 illustrate this approach, demonstrating a decrease in mean absolute error from 0.55 to 0.34 for the predicted 13C chemical shifts of monosaccharides when transitioning from training the model with one conformation per molecule to training on 100 conformations per molecule.
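A minimal sketch of this conformer-sampling step, assuming the standard RDKit ETKDG embedding followed by MMFF94 force-field relaxation (the exact embedding parameters used in the paper are not specified, so the settings below are assumptions):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def sample_conformations(smiles, n_confs=100, seed=42):
    """Embed and force-field-optimize an ensemble of conformations."""
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.randomSeed = seed
    AllChem.EmbedMultipleConfs(mol, numConfs=n_confs, params=params)
    AllChem.MMFFOptimizeMoleculeConfs(mol)  # MMFF94 force-field relaxation
    return mol

# beta-D-glucopyranose as an example input
mol = sample_conformations("OC[C@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1O",
                           n_confs=10)
```

Each conformer of the returned molecule then yields one graph (and one prediction) for the ensemble.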
As previously mentioned, this enhancement likely stems from two factors: a better representation of molecular reality and reduced sensitivity of the trained model to minor input variations. Relying solely on a single conformation, as done in previous attempts using 3D information in the input,16,17 for training poses a problem, as it leads to a less resilient model. Moreover, discovering a low-energy conformation through Density Functional Theory (DFT) is time-consuming and computationally intensive.
Because the training set includes various conformations, the model can make precise predictions when the input conformation is relatively similar to the correct one. However, there is room for improvement in conformation sampling. One potential approach for future research is to refine sampling techniques, such as those based on Gibbs free energy.
The obtained prediction errors exceeded our expectations. It must be emphasized that the ranges of chemical shifts are approximately 0–200 ppm for 13C and 0–10 ppm for 1H, so the achieved prediction errors approach the levels that qualify as error margins in measurements. However, for even better chemical shift predictions, additional developments, e.g., considering the temperature at which the NMR data are acquired, will be required to evaluate and train the GNN. To further put the results into perspective, one can compare the prediction errors to other works using similar techniques for different classes of compounds and alternative ways of calculating chemical shifts. The main results are those detailed in Table 2, where our model is compared to a state-of-the-art neural network for chemical shift prediction, which has been retrained on our dataset.
The developed model has great potential for predicting chemical shifts for other organic molecules, particularly compounds with asymmetric centers. This includes, among many different classes, pharmacological compounds and proteins.
Furthermore, the ability of the model to accurately predict physical observables, i.e., chemical shifts based on the molecular structure, highly encourages future application of similar methodology for other analytical techniques, e.g., X-ray photoelectron spectroscopy and X-ray absorption spectroscopy and potentially for predicting other physical parameters.
Most, if not all, studies of prediction methods for NMR chemical shifts are focused on predicting chemical shifts from molecular structure. The inverse problem, where a molecular structure is generated from chemical shifts, is more compelling for experimental practice. At the same time, it is more complex. However, making proper chemical shift predictions builds a solid ground for tackling the inverse problem and a natural segue for future research. The implications are far-reaching and go beyond an advanced understanding of carbohydrate structures and spectral interpretation. For example, it could accelerate research in pharmaceutical applications, biochemistry, and structural biology, offering a faster and more reliable analysis of molecular structures. Furthermore, our approach is a key step towards a new data-driven era in spectroscopy, potentially influencing spectroscopic techniques beyond NMR.
m_ij^k = f_m(x_i^k, x_j^k, e_ij^k), | (1)
m_i^k = Σ_{j∈N(i)} m_ij^k, | (2)
x_i^{k+1} = f_u(x_i^k, m_i^k), | (3)
Equivariance is an essential factor when considering NMR chemical shifts. In this study, we focus on predicting the isotropic part of the chemical shift tensor, denoted δiso, which is a scalar and remains unchanged under the Euclidean group E(3) (the group of rotations, translations, and mirroring) applied to the input locations of the atoms. The full chemical shift tensor, δ, is, however, a second-rank tensor with even parity, containing symmetric and antisymmetric components. While it is possible to predict the complete chemical shift tensors, as demonstrated by Venetos et al.,30 molecules in solution in a laboratory setting move around relative to the external magnetic field; consequently, it is the isotropic part of the chemical shift tensor that is observed in an NMR spectrum. Even though the isotropic chemical shift is a scalar quantity, the relationships governing it are intricate, so a model capable of accurately capturing these relationships is advantageous.
Euclidean neural networks can represent a comprehensive set of tensor properties and operations that obey the same symmetries as molecules. Formally, a function f: X → Y is equivariant to a group of transformations G if, for any input x ∈ X and group element g ∈ G that is well-defined in both X and Y, we have f(DX(g)x) = DY(g)f(x), where DX(g) and DY(g) are the transformation matrices parameterized by g in X and Y. In other words, the result is the same regardless of whether the transformation is applied before or after the function. For example, a function deriving the interatomic forces in a molecule should yield the same forces relative to the molecule's coordinates, independent of how the molecule is translated or rotated.
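This equivariance property can be checked numerically with a toy force function (a sketch for illustration, not the paper's model): pulling every atom toward the centroid is rotation-equivariant and translation-invariant, just like physical interatomic forces.

```python
import numpy as np

def toy_forces(X):
    """Toy 'force' field: pull every atom toward the centroid.

    Rotation-equivariant and translation-invariant, mimicking
    the symmetry of real interatomic forces."""
    return X.mean(axis=0) - X

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))  # coordinates of a 5-atom toy molecule

# Random proper rotation matrix via QR decomposition (det = +1).
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1

# Equivariance: rotating the molecule rotates the forces the same way.
assert np.allclose(toy_forces(X @ Q), toy_forces(X) @ Q)
# Translating the molecule leaves the force vectors unchanged.
assert np.allclose(toy_forces(X + 3.0), toy_forces(X))
```

Here DX(g) and DY(g) are both the rotation matrix Q, so the check f(XQ) = f(X)Q instantiates the definition above.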
The most fundamental aspect of Euclidean neural networks involves categorizing data based on how it transforms under the operations of the Euclidean group E(3), the group of translations, rotations, and mirroring in three-dimensional space. These data categories are called irreducible representations (irreps) and are labeled by their degree ℓ = 0, 1, 2, …, where ℓ = 0 corresponds to a scalar and ℓ = 1 corresponds to a three-dimensional vector. Irreps may also possess a parity, either even or odd, indicating whether the representation changes sign when inverted; odd irreps change sign upon inversion, while even irreps remain unchanged. An irreducible representation with ℓ = 1 and odd parity is termed a vector, representing entities like velocity or displacement vectors. In contrast, an irreducible representation with ℓ = 1 and even parity is referred to as a pseudovector, and it characterizes properties such as angular velocity, angular momentum, and magnetic fields. The input to a Euclidean neural network is a concatenation of tensors of different data types; for example, a scalar representing a mass is concatenated with a vector representing a velocity.
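The vector/pseudovector distinction can be verified numerically: the cross product of two vectors rotates like an ℓ = 1 quantity but does not change sign under inversion, i.e., it has even parity.

```python
import numpy as np

rng = np.random.default_rng(2)
a, b = rng.normal(size=3), rng.normal(size=3)

# Build a random proper rotation matrix Q (det = +1) via QR decomposition.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(Q) < 0:
    Q[:, 0] *= -1

# The cross product rotates like any l = 1 quantity ...
assert np.allclose(np.cross(Q @ a, Q @ b), Q @ np.cross(a, b))

# ... but under inversion (a -> -a, b -> -b) it does NOT change sign:
# l = 1 with even parity, i.e. a pseudovector (like angular momentum).
assert np.allclose(np.cross(-a, -b), np.cross(a, b))
```

A true (odd-parity) vector such as a displacement would instead flip sign under the same inversion.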
We call a tensor composed of various irreducible representations a geometric tensor. In our graph neural network, the equivariant version of vector multiplication involves two geometric tensors and is known as a tensor product x ⊗ wy. Here, w are learnable weights. Our approach employs these tensor products for equivariant message passing, departing from conventional linear operations. For a more in-depth exploration of Euclidean graph neural networks, we refer readers to the study by Geiger et al.31
x_i^0 = Emb(Z_i) ‖ Emb(N_i^H), | (4)
The layers in the network consist of E(3)-equivariant self-attention/transformer layers,22,32 built using the e3nn library.31 For layers k = 1, …, K, we derive messages by computing queries q^k, keys k^k, and values v^k as
q_i^k = Linear(x_i^k), | (5)
k_ij^k = x_j^k ⊗_w Y(r̂_ij), | (6)
v_ij^k = x_j^k ⊗_w Y(r̂_ij), | (7)
α_ij^k = softmax_j(q_i^k · k_ij^k), | (8)
m_i^k = Σ_{j∈N(i)} α_ij^k v_ij^k, | (9)
where Y(r̂_ij) denotes the spherical harmonics of the normalized edge direction r̂_ij, and the tensor products for keys and values use separate learnable weights w.
In between the self-attention layers, the geometric tensors are updated with equivariant Layer Normalization (LN)22 and an equivariant neural network (NN) as
x_i^{k+1} = LN(NN(x_i^{k′}) + x_i^k), | (10)
We train the model by minimizing the mean absolute error,
ℒ = (1/N) Σ_{i=1}^{N} |x̂_i − x_i|, | (11)
where x_i is the experimental chemical shift and x̂_i its prediction.
We train the model with multiple conformations and, thereby, multiple graphs for each chemical shift x_i. This results in an ensemble of predictions x̂_i^0, …, x̂_i^j, …, x̂_i^{N_i} for every output x_i. We want the average of this ensemble to equal the experimentally determined chemical shift, such that x_i = (1/N_i) Σ_j x̂_i^j. Thus, we aim at minimizing |x_i − (1/N_i) Σ_j x̂_i^j|. It follows from the triangle inequality that
|x_i − (1/N_i) Σ_j x̂_i^j| ≤ (1/N_i) Σ_j |x_i − x̂_i^j|, | (12)
so minimizing the absolute error of each per-conformation prediction also minimizes an upper bound on the error of the ensemble mean.
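A quick numerical check of the bound in eq. (12), with synthetic per-conformation predictions scattered around a hypothetical experimental shift value:

```python
import numpy as np

rng = np.random.default_rng(3)
x_true = 72.4  # hypothetical experimental shift (ppm)
# 100 per-conformation predictions scattered around the true value.
preds = x_true + rng.normal(scale=0.5, size=100)

ensemble_error = abs(x_true - preds.mean())     # |x_i - (1/N_i) sum_j xhat_ij|
per_conf_error = np.abs(x_true - preds).mean()  # (1/N_i) sum_j |x_i - xhat_ij|

# Triangle inequality: the ensemble-mean error is bounded by the average
# per-conformation error, so minimizing the latter controls the former.
assert ensemble_error <= per_conf_error + 1e-12
```

This is why training on the per-conformation loss of eq. (11) still drives the averaged prediction toward the experimental shift.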
The batch size is 32 for all models except SG-IMP-IR, where the recommended batch size of 128 is used. The models are optimized using the Adam optimizer,33 starting with a learning rate of 3 × 10−4. For the models trained on only one conformation per molecule, we used a small validation set of five percent of the training data, and the learning rate was decreased during training using PyTorch's ReduceLROnPlateau scheduler, which reduces the learning rate when the validation error stops decreasing; a patience of 20 epochs and a reduction factor of 0.1 were used. For models trained on multiple conformations, we did not use a scheduler: these models were trained for three epochs, with the learning rate multiplied by 0.1 at each new epoch.
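The multi-conformation schedule described above amounts to a simple stepwise decay; a sketch (base learning rate and decay factor taken from the text, the helper name is ours):

```python
def stepwise_lr(epoch, base_lr=3e-4, factor=0.1):
    """Multi-conformation runs: multiply the learning rate by 0.1 each epoch."""
    return base_lr * factor ** epoch

# Three training epochs, as described in the text.
lrs = [stepwise_lr(e) for e in range(3)]
```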
The model is implemented using Python 3.9.13, PyTorch version 2.0.0, Cuda version 11.7, PyTorch geometric version 2.3.0, e3nn version 0.5.1, RDKit version 2022.09.5, and GlyLES version 0.5.11. The models are trained using one NVIDIA A100 GPU. The training time per model takes around 30 minutes to an hour.
The CSDB predictions are simulated at http://csdb.glycoscience.ru/. The NMR spectrum assignment was done with the help of the chemical shift reference collection and simulation tool for 13C43,44 and 1H44 nuclei at the Carbohydrate Structure Database (CSDB).45 To refine a set of structural hypotheses, the CSDB structural ranking tool46 and empirical chemical shift simulation47 were used. We use the hybrid carbon chemical shift simulation.
The NMRDB predictions for 13C48–50 and 1H48,49,51 are simulated at https://www.nmrdb.org/.
This journal is © The Royal Society of Chemistry 2024 |