Ulrich
Sternberg
*ab and
Christophe
Farès
c
aResearch Partner of Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany. E-mail: ulrich.sternberg@partner.kit.edu
bCOSMOS-Software, Jena, Germany Web: http://www.cosmos-software.de
cMax-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm Platz 1, 45470 Mülheim an der Ruhr, Germany. E-mail: fares@mpi-muelheim.mpg.de
First published on 11th April 2022
A new probability score—named χ-probability—is introduced for evaluating the fit of mixed NMR datasets to calculate molecular model ensembles, in order to answer challenging structural questions such as the determination of stereochemical configurations. Similar to the DP4 parameter, the χ-probability is based on Bayes theorem and expresses the probability that an experimental NMR dataset fits to a given individual within a finite set of candidate structures or configurations. Here, the χ-probability is applied to single out the correct configuration in four example cases, with increasing complexity and conformational mobility. The NMR data (which include RDCs, NOE distances and 3J couplings) are calculated from MDOC (Molecular Dynamics with Orientational Constraints) trajectories and are investigated against experimentally measured data. It is demonstrated that this approach singles out the correct stereochemical configuration with probabilities more than 98%, even for highly mobile molecules. In more demanding cases, a decisive χ-probability test requires that the datasets include high-quality NOE distances in addition to RDC values.
Thankfully, the last decades have witnessed the combination of spectroscopic and computational methods, which offers high potential to better streamline configurational analysis and conformational calculations and to circumvent interpretation pitfalls. The general strategy consists of computationally generating conformational models, back-calculating short-range (NOE, J and CS) or long-range (RDCs, RCSAs) NMR parameters and evaluating them against those measured experimentally. Density functional theory (DFT), the most frequently employed quantum mechanics (QM) modelling method, solves the Schrödinger equation on all nuclei and bonding electrons and is ideally suited to accurately predict chemical shifts and J-couplings on individual conformers.10–13 The many examples where this approach was used leave no doubt on the efficiency of this method. For molecules with large structural fluctuations due to dynamics and chemical interconversion, the use of DFT can quickly become prohibitively costly. The molecular mechanics (MM) approaches, on the other hand, make use of classical potential energy equations applied to all bonds and atoms and represent a robust strategy to efficiently sample vast energy landscapes even for larger molecular systems.
The advent of efficient methods to simulate or calculate NMR parameters from molecular models has prompted the need to develop better tools and metrics to evaluate the quality of the fit between data and simulation. In addition to the traditional metrics such as RMS (root mean square) deviation or MAE (mean absolute error), many methods have placed the emphasis on clearly delineating different models and in assisting the decision-making process, especially for choosing the correct stereochemical configuration.
Interesting are the scoring parameters used for evaluating orientation constraints such as RDCs used within specialized programs such as PALES,14 MSpin15 and MDOC.16,17 Here, special scoring parameters have been used, as for instance the Cornilescu's Q factor:18
![]() | (1) |
This score approaches zero if the differences between the calculated and experimental properties (qexp − qcalc) become small. However, because of the denominator, the Q-factor depends on the magnitude of the NMR parameter. This means, for instance, that parameter sets from stronger alignment media (i.e. larger RDCs) will generally score better.
A more appropriate score, n/χ2, based on the well-known statistical parameter χ2,
![]() | (2) |
![]() | (3) |
In recent papers from Navarro-Vázquez and colleagues,7,20–24 the χ2 parameter was also introduced into the computer-assisted 3D structural elucidation (CASE-3D) as a decision-making tool, which is derived from the Akaike Information Criterion (AIC).25 The aim of the AIC in CASE-3D is to maximize the fit while precluding the overfitting data by penalizing the use of large number of models. With k representing the number of independent variables (including the number of conformations), the AIC is used in the following form:
AIC = χ2 + 2k | (4) |
In the CASE-3D protocol, the conformation model with the smallest AIC is thus selected.
In the study by Tzvetkova et al.,26 the successful MDOC simulation of the correct configuration is based on the more stringent criterion derived from the parameter χ2, applied to RDC datasets:
![]() | (5) |
Building upon the criterion n/χ2 ≥ 1, which requires that the simulated RDC data lie within the experimental error on average, this criterion (eqn (5)), which focusses on the data point with the largest deviation, precludes any outlier among the simulated data. In other words, if this criterion is larger than unity, then the simulation is a perfect representation of the experimental situation. For this approach, the datasets must be rigorously assessed: the dataset must be of high quality and experimental errors must be conservatively determined to represent maximum errors. On the other hand, the data must be extensive enough and the errors reasonably limited in order to ensure that only the correct configuration is fully consistent with the experimental RDC data.
In the case of flexible molecules, one must cope with molecules that populate several conformers and extensive rotamers, even including transient structures, which together contribute to the measured NMR parameters. It becomes evident that the basic set of orientational constraints (RDCs) alone, with their inherent ambiguities, cannot uniquely describe the complete conformational space, and would need to be complemented with other structural information, such as NOE distances and torsional angles from J-couplings. In these more complex cases with larger and mixed datasets, a small number of outliers cannot be completely avoided so that the “outlier test” eqn (5) would leave a user with no definitive answer at all. The problem of the outlier test comes from its “yes-or-no” binary outcome. It would instead be more desirable to numerically evaluate the probability for each candidate within a finite set of model structures to represent the correct answer, thus better guiding a configurational assignment.
A very successful approach in this regard is the statistical test of the calculated chemical shift data of molecules against experiments. The so-called DP4 probability as developed by Smith and Goodman27 has proved to be a valuable tool in assigning the stereochemistry of many small molecules, based only on 1H and 13C chemical shifts and J-couplings.28–30
Along the lines of statistical scoring schemes such as DP4, we propose here a Bayesian probability tool aimed at evaluating MDOC simulation models towards chiral assignments of flexible molecules and based for the first time on the parameter χ applied to large sets of mixed NMR parameters, including RDCs, NOE distances and 3J-couplings. The new method is supported by four cases of varying complexity.
In contrast, the RDCs act as a much better evaluation parameter since in MDOC simulations no extra approximations are introduced into their calculation. It has been demonstrated in several papers (for review see ref. 20) that RDCs are well suited for configuration determination of molecules with a low level of internal flexibility. One-bond residual dipolar couplings such as 1H–13C RDC depend nearly exclusively on the angle θ between the vector connecting the two nuclei and the z-axis of the external magnetic field. Still, this angular dependence is deemed ambiguous since all one-bond 1H–13C vectors along a double-cone with a semi-angle θ opening share the same RDC value. In an investigation of α- and β-D-cellobiose, Di Pietro et al.32 singled out the right configuration among 128 possible diastereomers in MDOC simulations using RDC values only – but in this case the authors extended their constraint dataset with13C–13C RDC values in addition to the 1H–13C RDC. In most cases, one-bond RDCs are not sufficient to conclusively determine a chiral configuration using the criteria in eqn (2) and (5). One example for this conundrum is oidolactone B studied by Tzvetkova et al.26 – this molecule contains two flexible six-membered lactone rings and a methoxy group with rotational freedom. An MDOC analysis with 14 1H–13C RDC as constraints provided two high-score configurations and the correct one could only be determined by comparing the experimental NOE distance with the mean NOE distance calculated from MDOC trajectories.
A number of publications underline the central role of NOE distances in determining the structure and configuration of molecules with internal mobility (see e.g. Kolmer et al.).33 Manifestly, it is possible to measure NOE intensities with high precision and to use these as distance restraints in MDOC simulations34 (for the methods see Vögeli35). Thus, meticulously interpreted NOE distances can be essential parameters in the assignment of chiral structures of mobile molecules. Since tensorial constraints (RDCs) are responsible for driving the rotational and conformational changes in MDOC simulations, distance parameters should be used at first to score the simulations.
The use of 3J-couplings characterizing torsion angles also represents an easily accessible and complementary constraining and evaluating parameter in MDOC simulations. The calculations of 3J-couplings using DFT are even more demanding than CS calculations but nevertheless, it has already been used in a DP4 like framework.36 In MDOC simulations, we used the much faster Altona-equation (Haasnoot et al.37) for 3JHH couplings and the equation of Palermo et al.38 for 3JCH couplings. These methods reached an RMS deviation to DFT calculations of about 0.6 Hz but in some cases, the differences can get higher. This uncertainty has to be added to the error of the experimental determination of 3J-couplings.
(i) First for every measured qexp and calculated value qtheo, a χi value is calculated: χi = (qexp − qtheo)/(Δqexp + Δqtheo). The larger the |χi| value, the more the calculated value deviates from the experimental value. We assume that the χi values follow a normal distribution around zero and calculate a standard deviation σ from all χ values.
(ii) We then calculate for every datapoint i the probability pi of obtaining value equal or larger than χi. This probability is expressed by pi = [1 − Φσ (|χi|)], where Φ is the cumulative density function of the normal distribution (CDF(Nσ), see also Fig. 1):
![]() | (6) |
![]() | ||
Fig. 1 The blue curve is the probability density of normal distribution (PDF(Nσ), in this example with a standard deviation of σ = 1). The red area defines the probability for obtaining a |χ| value as large as |χi| or larger (in the example values above 1.5). The white area under PDF(Nσ) defines the cumulative distribution function CDF of the normal distribution Φ (see eqn (6)). |
(iii) The probability that a calculated set of data for one model fits to the experiment is product of all pi values Pm. To get a relative probability we have to divide product Pm by the sum of all Pm values.
This procedure can be condensed into eqn (7) for the probability P that a set of N data fits to one configuration (denoted with i) from a set of m structures or diastereomers:
![]() | (7) |
In eqn (7), CDF is the cumulative density function of the normal distribution Nσ(|χ|) (also see eqn 6) with the standard deviation σ and the argument . In contrast to the DP4 method of Smith and Goodman,27 which uses the Student's t distribution mostly for practical reasons, standard deviation σ is calculated for all χ values of the m candidate structures. The numerator product of eqn (7) runs over all χ values of the structure i to be tested and this product is divided by the sum of the products over all candidate structures.
In this study, the χ values are calculated from theoretical MDOC simulations (qtheoi), omitting the first nanosecond of the simulation, during which the simulation reaches a steady state. RDCs, NOE distances and 3J-couplings are analyzed with conservatively estimated uncertainties such that the experimental values are with high probability within the error bounds. In Δqtotal we add up the experimental error and the uncertainty of the calculation procedure.
![]() | ||
Fig. 2 (a) Stereoisomers 1-SS (upper panel) and 1-SR (below) of 1,4-diketone. (b) Analysis of the four MDOC trajectories of the 1,4-diketone forms 1-SR and 1-SS using the two NMR datasets, 1A and 1B. In the lower panel the n/χ2 score for the RDC is given and in the upper panel the χ-probability according to eqn (6) is displayed. |
The MDOC trajectories have been analyzed by calculating the mean values over the simulated RDCs, NOE distances and 3J couplings omitting the first nanosecond. We calculated the total n/χ2 quality parameter for all constraints and the results for the RDCs are presented in Fig. 2b (bottom). A higher n/χ2 parameter for the 1-SS and 1-SR configurations to datasets 1A and 1B, respectively, validate the original assignment. For the correct configuration, the n/χ2 criterion is fulfilled, but not the outlier criterion (χmin−2, eqn (5)). In the case of data set 1A, the 24 RDCs, 13 NOE distances and three 3J couplings, were carefully evaluated and the estimated errors were conservatively set. Still, the simulation resulted in one outlier in NOE distances as well as to the four RDC outliers. Revisiting these outliers could not justify removing them or any other from the dataset. We believe that there are two major situations which could lead to inevitable outliers even in the case of well-assigned experimental data in flexible molecules: (i) the complexity of the molecular systems prevents a proper adjustment of the equilibria between the conformers or (ii) one set of NMR parameters interferes with another because their NMR mean values reflect very different time scales.
Do these outliers preclude a conclusive decision about the configuration based on MDOC results? Given that only two configurations are possible, we can at least calculate a statistical probability for each based on these results. The χ-probability parameter introduced in eqn (7) was evaluated for both datasets (see Fig. 2b). Remarkably, the χ-probability that each dataset corresponds to its assigned configuration (i.e., 1A to 1-SS and 1B to 1-SR) is about 99.99%, despite the presence of outliers. In other words, the χ-probability is able to provide an unambiguous result even in the case of an “imperfect” MDOC simulation with outliers. This unequivocal result can be rationalized by the fact that even a few larger deviations contribute strongly towards low probabilities thanks to the product series Π of eqn (7), and consequently, they are not “masked” by the large number of smaller deviations, as they would in the simple summation for χ2 in eqn (2).
The macrolide system is much more demanding than 1,4-diketone, not only due to its size, but also because the energy barriers for conformational changes in the large ring system are much higher, leading to a large ensemble of possible twist states of the ring. Additionally, to a structure investigation of the ring system, a cross-validation of 2p and 11-epi-2p with 2r and 11-epi-2r structures was performed by running MDOC simulations with the 2p and 11-epi-2p NMR datasets applied to all four configurations. In Fig. 4a, the final n/χ2 quality results for the RDCs, NOE distances and 3J-couplings are displayed individually and in combination.
![]() | ||
Fig. 4 (a) Quality criteria n/χ2 for the four mandelalide structures 2p, 11-epi-2p, 2r and 11-epi-2r for different NMR data subsets (RDCs, NOEs, 3JHH and combined). The upper panel displays the MDOC simulated values for the NMR dataset 2p and the lower panel, those for the 11-epi-2p dataset. (b) The χ-probability for the mandelalide structures 2p, 11-epi-2p, 2r and 11-epi-2r are presented. In the upper panel the χ probabilities (see eqn (6)) are calculated from the MDOC simulation with the 2p dataset and the lower panel is obtained from the 11-epi-2p dataset. Left (blue) shows the probability for the RDC data only and right (red) the probability for RDC plus NOE distance data. |
Upon inspection of the n/χ2 values of the 2p-dataset (Fig. 4a, upper), it can be observed that all configurations have an overall good fit to the data (n/χ2 > 1). The NOEs (yellow), the J-couplings (pink) and the combined data (red) only marginally prefer the correct structure, whereas the RDC data alone (blue) even points to the incorrect structure. Still, a marked difference can be observed between the 2p and 2r (2.7 vs. 1.0) and less so between the 2p and 11-epi-2p (2.8 vs. 2.9) configurations. After combining all constraints, a logical score sequence between configurations is apparent but the differences between the values are still too subtle to warrant a definitive assignment. The outlier criterion χmin−2 does not improve the overall outcome of the demarcation between 2p and 11-epi-2p since their MDOC simulations feature four RDC outliers with χmin−2 = 0.216 and 0.256 when compared to the 2p-dataset. The cross-validation with the 11-epi-2p data (see lower part of Fig. 3a) is similarly misleading, since the RDC n/χ2 values show a small preference for the wrong structure. Only the sums over all constraints give the right sequence but the differences are again marginal. Overall, since the differences in n/χ2 and in χmin−2 are small, it is fair to say that this criterion allows only weak differentiation of the chiral assignment, at best.
In Fig. 4b, the dataset evaluation is presented using χ-probability according to eqn (7). The upper left part displays the probability using the RDC values only (blue columns). While the probability for the RDC values clearly differentiates between the 2p and 2r structures, it does not offer a significant delineation between the closely related diastereomers 2p and 11-epi-2p, even favoring the latter incorrect structure (2p with 35.9% vs.11-epi-2p with 64.1%). Even more surprising is the probability for the RDCs (blue columns) for the dataset 11-epi-2p. In this case, the widely different 11-epi-2r structure has a higher probability (65.8%) than the correct 11-epi-2p structure (33.5%). As a result, the RDC data alone do not allow reliable chiral differentiation.
The situation changes starkly if the evaluation dataset is complemented with the NOE distances. The product over k in eqn (7) runs then over the χ values calculated from the RDC and NOE distances combined. In the right part of Fig. 4b, these combined probabilities are presented as red columns. In the case of the 11-epi-2p dataset, the probability for the correct configuration reaches nearly exact 100% and in the case of the data set 2p a value of 97.9% is reached, whereas a residual probability of 2.1% is calculated for the 11-epi-2p structure. If the χ values of the 3J couplings are added to the probability the score for the 2p configuration rises to nearly exact 100%. The empty columns in Fig. 3 indicate probabilities of <0.01%, making a strong case for the correct assignment. Clearly in more demanding cases, the NOE distances are of crucial assistance to single out the right configuration.
Using all NMR data (RDC, NOE distances and 3J-couplings) provided by Schuetz et al.34 as constraints, an MDOC simulation was performed by Sternberg et al.44 Since only data from the central part of the molecule were available, no conclusions could be drawn on conformers or motion of the two chains R1 and R2 (see Fig. 5).
Fig. 5 presents the results concerning the quality n/χ2 and the χ-probability of five MDOC simulations to the NMR dataset. In Fig. 5(a), the red columns show that RDC values fulfill the requirement n/χ2 > 1 in all cases. Each of the configurations a, d and e also exhibit a single RDC outlier, which does not advance the accuracy of the assignment. If all three constraints RDC, NOE distances and 3J couplings are taken together into consideration (blue columns), a clearer tendency for configuration a emerges, offering the highest total quality n/χ2 and the lowest total number outliers (Fig. 5(a)). Still, the configurations c and d also show qualities n/χ2 > 1 and cannot be rejected based on this criterion alone.
The perspective changes from the χ-probability standpoint as revealed in Fig. 6(b). Considering the RDC χ values alone, a probability of nearly 82% is obtained for the configuration a, followed by configuration d with 13.7%. If the NOE distances are included (green columns in Fig. 6(b)), the probability for configuration a increases to 96% providing an even clearer assignment for this configuration. This score slightly improves if the 3JHH and 3JCHχ values are added (98%, blue columns in Fig. 6(b)).
![]() | ||
Fig. 6 Quality n/χ2 and χ-probability for the MDOC simulations of sagittamide A configurations with the NMR dataset from Schuetz et al.34 (a) Quality n/χ2 for the five configurations a to e. Red: RDC data only, green: RDC and NOE distances and blue: RDC, NOE distances and 3J-couplings. (b) χ-probability for the configurations with the same color scheme as in (a). |
The derivate of the belizentrin A precursor with the substituents R1, R2 and R3 as shown in Fig. 7 was investigated using NMR in solution and in oriented media. In addition to the 44 backbone NOEs, 48 NOEs involving CH2 and CH3 groups were quantitatively evaluated. From a 5 mg-sample in CDCl3 dispersed in a polysterol gel 24 CH RDCs and 8 HH RDCs of the backbone could be measured. To account for the motions of the substituents R1, R2 and R3 in MDOC simulations 27 RDC in the CH2 and CH3 groups of the substituents were analyzed. Additionally, 6 3JHH couplings were measured along the linker. RDCs provided tensorial constraints to drive whole-molecule and internal reorientations in MDOC simulations and the NOE distances and 3J couplings served as traditional scalar constraints.
Using the modeling tools of the COSMOS Frontend 8 molecular models of the belizentrin A precursor (GRX570, see Fig. 7) were constructed with all eight possible combinations of the linker configuration C8–C9–C10. In addition, two CH2 sites at C11 and C17 were not amenable to a clear prochiral assignment and since their data (RDC, NOE, 3J) may influence the conformational outcome in the simulation, we decided to additionally include all possible prochiral combinations at each of these sites. With the eight R/S combinations of three linker atoms C8, C9 and C10 and the two prochiral assignments at C11 and C17, a final number of 32 MDOC simulations were necessary. Since these simulations are independent, they could be run in parallel.
Preliminary 10ns-MDOC simulations of the RRR linker configuration are performed to adjust the pseudo-force strengths and other simulation parameters (see ESI†). Then 20-ns MDOC simulations of all 32 possibilities are performed including all RDC tensors as orientational constraints as well as the NOE and 3JHH as internuclear distance and torsional constraints. These initial MDOC simulations revealed the following issues: (i) no agreeable RDC results (n/χ2 > 1) could be obtained giving a maximum n/χ2 of 0.72 for all 32 simulations and (ii) the quality of the linker was even worse giving a maximum n/χ2 of 0.33. The large RDC outliers of the glucopyranose 6-ring and especially of the linker carbons indicated that the motion of these molecular parts could not be adequately described in these 20 ns MDOC simulations. The molecular models reveal that the silylated side chains R1, R2 and R3 are entangled, and their interactions represent barriers for conformational changes of the linker and the glucopyranose ring. The reorientation of the crowded side chains consumes many MD steps and impedes conformational changes of the molecular backbone.
As in the sagittamide A example, we decided to concentrate our efforts on the description of the linker and performing the MDOC simulations only with constraints connected to this part of the molecule with the carbon atoms C8 to C11 and the three R2 side chains (see Fig. 7). The 20 ns MDOC simulations are performed with the same parameters as those used in the former cases (see supplementary information) and included 34 RDC values, 84 NOE distances and 6 3JHH-couplings as constraints. Because the linker contained only the C11 carbon atom with prochiral hydrogen atoms, the total number of simulations reduced to 16.
The statistical results of the 16 simulations are displayed in Fig. 8. The quality n/χ2 (Fig. 8(a)) for the RDC simulations (red columns) reaches in all cases values larger than 4.5 and, in all cases, at least one outlier was observed. The inclusion of NOE distances and J-couplings not only reduces the level of overall quality, but also evens out the outcome (4.0 < n/χ2 < 7.8 for all 16 cases). Clearly, in complex cases with a large dataset combined with large potential flexibility, the weight of key defining constraints is washed out by a large majority constraint that can easily be fulfilled within the simulation.
Fig. 8(b) displays the χ-probability according to eqn (7) for the same subset combinations: red columns for RDC values, green columns RDC + NOE distances and finally blue ones for adding 3JHH-couplings to the former two data types. Strikingly, the RDC dataset alone (red columns) exhibits strong preference for the incorrect SRR configuration with a χ-probability of 78%; the next best configurations RSS and SSS, χ-probability values of 14% and 6% (all data are given in the supplementary information), are equally incorrect. However, the addition of NOE distances (84) and of 3JHH-coupling (6) constraints is the key to the assignment success; indeed, the χ-probabilities for the combined datasets reach almost 100% for the correct RRS configuration (Fig. 8(b), green and blue). It should be noted here that “a” assignment of the C11 prochiral methylene 1H is supported by all three NMR data subset: the χ-probability for the alternative assignment “b” is in all cases smaller than one.
In the case of mandelalide (see Fig. 4) the χ-probability calculated only from RDC values gave a wrong sequence for the preferred configuration but in the present case of the belizentrin A precursor the χ-probability calculated from RDC values for the preferred RRS configuration was far below 1% (see ESI†). In both cases the addition of NOE distances changed the picture and produced high probabilities for the right structures. This could be explained by the fact that the chiral centers in the linker are weekly coupled by C–C single bonds with large rotational freedom. As already stated in the methods section measured 1H–13C couplings describe only the directions of isolated CH bonds and not the relations to neighbor carbons. For the determination of its relative configuration, additional interactions between the chiral centers are necessary, and this requirement seems to be crucially mediated by long-range constraints like NOE distances or 3J couplings.
The new criterion was tested on four examples with increasing complexity. In the first and simplest case we obtain a χ-probability of 99.99% for the right configuration calculated only from the RDC values. In more complex cases, it is recommended to calculate the joined χ-probability from the RDC values and NOE distances to obtain a conclusive answer. If a larger number of 3J couplings are available these values can be used to support the results obtained with the RDC and NOE distances; this was done in the case of third example sagittamide A. The 3J couplings are clearly the weakest aid to the χ-probability mainly because their calculation introduces additional uncertainty to the experiment.
In this paper the χ-probability was exclusively used in connection to NMR data simulations using MDOC. This does not mean that the method can only be used in connection with MDOC simulations but can be used in many other methods of NMR data predictions as well. Equipped with the χ-probability NMR data simulations attain an excellent fidelity in discriminating chiral configurations or elucidating assignments or structural differences.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d2cp00330a |
This journal is © the Owner Societies 2022 |