Alexander
Howarth
,
Kristaps
Ermanis
* and
Jonathan M.
Goodman
*
Centre for Molecular Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK. E-mail: ke291@cam.ac.uk; jmg11@cam.ac.uk
First published on 6th March 2020
A robust system for automatic processing and assignment of raw 13C and 1H NMR data DP4-AI has been developed and integrated into our computational organic molecule structure elucidation workflow. Starting from a molecular structure with undefined stereochemistry or other structural uncertainty, this system allows for completely automated structure elucidation. Methods for NMR peak picking using objective model selection and algorithms for matching the calculated 13C and 1H NMR shifts to peaks in noisy experimental NMR data were developed. DP4-AI achieved a 60-fold increase in processing speed, and near-elimination of the need for scientist time, when rigorously evaluated using a challenging test set of molecules. DP4-AI represents a leap forward in NMR structure elucidation and a step-change in the functionality of DP4. It enables high-throughput analyses of databases and large sets of molecules, which were previously impossible, and paves the way for the discovery of new structural information through machine learning. This new functionality has been coupled with an intuitive GUI and is available as open-source software at https://github.com/KristapsE/DP4-AI.
An attractive and now established alternative1,2 is to use computational NMR prediction. This process uses DFT to calculate NMR shifts for all the diastereomers of the uncertain structure and compare these predications with the published spectra using parameters such as, correlation coefficient, mean absolute error (MAE), corrected mean absolute error (CMAE).3 DP4 analysis is particularly powerful as it not only predicts the relative stereochemistry and other variations of the molecule, but also using Bayes theorem gives a probability that each candidate molecule is the correct one (assuming one of the provided or generated structures is correct).3,4 DP4 has been successfully used to determine the stereochemistry of many natural product like molecules, synthetic intermediates, natural product fragments and also pharmaceutical compounds.5–10 and has been explored further in DP4+ and J-DP4 analyses.11,12
Since its publication, the calculation of DP4 has been streamlined and user input minimized as all calculations are now automatically managed by the Python program PyDP4.11–13 Only a structure of the molecule with undefined stereochemistry and assigned experimental 1D 13C and 1H NMR spectra are required as inputs from the user. The most user intensive part of relative stereochemistry elucidation using this program is now the assignment of the NMR spectra. This is not only very time consuming but also often laborious and error prone.14 Automated interpretation of NMR spectra has been a major goal of analytical chemistry for many years.15 Much of this work has been focused on developing CASE (Computer Aided Structure Elucidation) software16–18 for automated 2D structure determination and dereplication rather than automated assignment of atoms in a known structure. Typically a number of 2D NMR spectra in addition to the 1D NMR spectra must be provided.19
A small number of commercial software packages offer expert-guided NMR assignment algorithms for 1H NMR spectra, notably Mestrelab Mnova.20 This software has focused on aiding a user to interpret NMR spectra as opposed to automated processing and assignment of raw NMR data.
In this work a system for fully automatic robust processing and assignment of both 1H and 13C NMR spectra is presented (Fig. 1). A schematic of this program is displayed in Fig. 2. It provides automated relative stereochemistry and structural ambiguity prediction using 1D 1H and 13C NMR spectra. Chemical shift values are calculated using the DFT GIAO method. Shift prediction by this method can also be performed using only free and open source software such as NWChem21 and Tinker22 making this method more accessible than any other software currently available.
The automation DP4-AI affords is exciting as it will allow high-throughput DP4 analysis of databases and large sets of molecules, which was previously impossible. In addition, automatic processing and assignment of NMR spectra will reduce the time constraints of synthesis, allowing for more opportunities in chemical discovery. Moreover, this system will also provide a framework for the development of automated interpretation of more complex NMR experiments in the near future and could be used in conjunction with CASE software to solve structural elucidation problems from analytical data.
Quantum mechanical calculations were carried out using Gaussian09. NMR shielding constants were found using the GIAO method.31–33 The functional mPW1PW91 (ref. 34) was chosen with the 6-311G(d) basis set35 for NMR shift prediction as this has been shown to be optimal for DP4 calculation.12 For molecules containing iodine, the basis set def2-SVP36,37 was chosen. All DFT calculations were performed using the implicit PCM solvent model.38 The molecular geometries were also optimized at the DFT level of theory, this was performed using the B3LYP functional39,40 with the 6-31G(d) basis set. Finally, single-point energies were separately calculated using M06-2X functional and def2-TZVP basis set.
The calculations were managed by the PyDP4 Python script written in Python 3.7 which is now part of DP4-AI. DP4-AI is available from http://www-jmg.ch.cam.ac.uk/tools/nmr/ and GitHub https://github.com/KristapsE/DP4-AI/. Some elements of NMR processing was performed using the package NMRglue.41
After performing a Fourier transform the spectrum may display phasing errors which must be corrected prior to further processing. Unfortunately, none of the existing methods phased the test set of spectra as reliably as required. To alleviate this issue, a hybrid method combining, the signal classification method developed by Wang et al.,42 the entropy based objective function of the phasing algorithm ACME43 and the robust framework of weighted linear regression approach (WLR) developed by Zorin et al.44 was employed.
Many spectra also display baseline distortions which must be removed. A modified version of the algorithm developed by Wang et al.42 was incorporated into the final program (ESI Section S2.1.4†). For 1H spectra, initially peak picking is performed using first and second derivatives of the spectrum. Potential peaks are found as points that are simultaneously zero in the first derivative and minima in the second derivative. These candidate peaks are picked if they are both, above an amplitude threshold and below a second threshold in the second derivative. These threshold values are adaptive as they are set to multiples of the noise standard deviation values. Peak-picking in this manner allows both threshold values to be set very low, screening out as much noise as possible whilst missing as few signals as possible. In addition, the use of derivative ensures baseline independence. This process is summarized in Fig. 3.
In 1H spectra signal peaks must be grouped together to establish where the multiplet centers are located. The maximum coupling constant expected to be seen between protons in 1H spectra is around 18 Hz. Any peaks <18 Hz apart can be grouped together as multiplets. To avoid missing any signal peaks, the peak picking threshold for signal to noise ratio is deliberately set very low. However, this increases the probability of noise peaks being mistaken for signal peaks and can cause over grouping of peaks (ESI Section S2.1.6†).
To mitigate this issue an algorithm for removing noise utilizing objective model selection was developed. Picked peaks separated by less than 18 Hz are grouped together to define signal containing regions. For each region a line shape model is constructed with multiple generalized Lorentzian line shape functions.45 The parameters in the model of each region are varied iteratively until the integral of the model converges to within 1% of the corresponding region of the spectrum. To avoid overfitting, the groups of parameters describing each peak are then tested for their information content. A new model is constructed without each line shape function in turn. If the Bayesian Information Criterion (ESI Section S2.1.6†) of the model is lowered by more than a threshold value, these parameters are assumed to describe a noise peak (as they do not increase the information content of the model) and are deleted. Once all of the peaks have been tested, the remaining signals are regrouped to produce the final multiplets.46 An example of this modelling process is displayed in Fig. 4.
Fig. 4 An example multiplet (blue) and deconvolved model (orange). The signal peaks are highlighted in cyan, the peaks determined to be noise are highlighted in red. |
Using this modelling process, solvent peaks and other contaminants can also be selectively removed. The solvent used is defined by the user to adjust DFT solvent model. To identify the solvent multiplet in the experimental data, each peak in the region of the spectrum expected to contain the solvent is given a score. This score takes into account how closely the pattern of peak locations and amplitudes around each peak match that of the expected solvent multiplet and also the distance from the expected solvent location. The peaks that most closely match those of the simulated solvent multiplet are removed from the model and the spectrum is referenced (see ESI Section S2.1.9†).
Finally, the multiplets in 1H spectra must be integrated. Due to the 100% abundance of the 1H isotope of hydrogen the integrals of multiplets in the spectra are proportional to the number of protons in each chemical environment. If this constant of proportionality can be estimated, the assignment algorithm (AA) can be told explicitly how many protons can be assigned to each multiplet.
The algorithm for estimating this constant of proportionality for 1H spectra incorporated into the program has been developed from previous work in this area.20 The premise of this algorithm is to iterate this constant k from the minimum possible number of protons in the spectrum (the number of protons in the structure minus the number of labile protons) to a maximum value (which has been set to twice the total number of protons in the ambiguous structure) and calculate a score based on how integer like the corresponding set of integrals are (the integrals of the multiplets are calculated using the model spectrum as described by Schoenberger et al.45). The value of k producing the highest score is taken as the constant of proportionality, and is used to normalize the integrals (ESI Section S2.1.10†). This scoring method is particularly advantageous as it accounts for deviations from integer integral values that are often observed due to, the choice of shimming parameters or incomplete relaxation for example. Peak-picking of 13C spectra is performed using a similar algorithm. The most intense peak in the spectrum is picked and simple Lorentzian function is fitted to it to create an initial model, this is repeated for the next most intense peak. This process continues until all the unpicked peaks fall within three times the standard deviation of the noise of the fitted model. This algorithm has been chosen as it effectively discards noise peaks whilst identifying low intensity signal peaks such as quaternary carbons.
The core of the AA calculates the assignment probability matrix M. The elements of this matrix Mij give the probability of calculated shift i corresponding to experimental peak j. The matrix M is used to find the most probable assignment by the Hungarian linear sum minimization47 method as shown in Fig. 5.
The value M is calculated using a statistical model (ESI Section S2.2†) that takes into account the distribution of DFT prediction errors observed for the chosen computational conditions and, in the case of 13C NMR, also the amplitudes of the experimental peaks.
GIAO shift predictions are subject to systematic errors that vary depending on position within the spectrum and the computational conditions.12,48 These systematic errors must be corrected prior to calculation of M. Classical DP4 alleviates this problem by performing an internal scaling process.3 It is not possible to use this method in this program as the assignments are unknown.
To mitigate this issue, the assignment process is performed in three stages. In the first round of assignment, prior to calculation of M a linear scaling is performed using known external scaling factors (ESI Section S2.2.1†). After the first assignment has been completed, the assigned shifts and peaks are used to calculate internal linear scaling factors in a similar fashion to DP4. The calculated shifts are then rescaled and the assignment repeated.
In 13C the number of experimental peaks may not be equal to the number of carbon atoms in the molecule. The GIAO shift predictions may also not reflect the degeneracy seen in the spectrum. The 13C is provided with additional flexibility to assign peaks in the spectrum multiple times, using a penalty system given by eqn (1).
(1) |
The multiple assignment penalty for experimental peak i ki depends on the amplitude KDE group peak i is in. A value of k = 1 is given to the group containing the most intense peaks, then k = 2 to the group with the second most intense peak etc. The value of t represents the number of times the peak has already been assigned.
(2) |
The bias for shift i is given above. Where is a vector containing the amplitudes of all unassigned peaks within ±10 ppm of the peak assigned to calculated shift i and Aunassignedi is the amplitude weight of the peak assigned to calculated shift i
The 13C algorithm also takes into account the amplitudes of experimental peaks. Each element of M, Mij is multiplied by a weight Aj derived from the amplitude of experimental peak j. This has been incorporated to prioritise the assignment of more intense peaks over those more likely to be noise. The peaks in 13C spectra typically fall into three groups which can be distinguished by amplitude: noise, 1-atom signals and signals corresponding to multiple equivalent carbon atoms. In order to capture this variation the probability density function of peak amplitudes in the spectra is estimated,49 the peaks are grouped by which minima in the second derivative of this function their amplitudes fall between. The amplitude weights are then calculated using the number of peaks in each group and the expected number of carbon atoms in the structure as shown in Fig. 6.
The 13C AA is also able to bias the assignment towards position or amplitude information (ESI Section S2.2.2†) by considering the distribution of peak intensities and positions in the local environment around each calculated shift. After the second round of assignment, the unassigned peaks within 10 ppm of the experimental peak assigned to each calculated shift are analyzed. The bias for calculated shift i is given by eqn (2). Any shifts with biases above a value of one are reassigned in order of bias to unassigned experimental peaks within 10 ppm in order of amplitude.
The role of the bias is to assess whether any signal peaks have been missed during the initial assignment. This is particularly useful in spectra where a large amount of noise has been carried through, as the AA typically favors assigning close noise peaks rather than more distant intense signal peaks in the first pass.
In contrast the 1H AA does not require amplitude weighting, biasing or the multiple assignment penalty as this AA can be told explicitly how many times each peak may be assigned using the integral information. The 1H AA also has an additional stage for the assignment of methyl protons. Protons in methyl groups consistently appear as equivalent in 1H NMR spectra and hence should be assigned to the same peak. The 1H AA assigns these protons in groups to peaks with sufficient integrals prior to the assignment of the remaining protons.
Fig. 7 Figure illustrating the 47 molecules utilized to evaluate the performance of DP4-AI. Molecules, AT3, TS3A, TS4 and NL1A have only have corresponding 1H NMR data, all other molecules have both 1H and 13C NMR data. The spectra for molecules JB7, JB11, JB5 and JB8 were taken in solvents methanol, benzene, DMSO and methanol respectively, whilst all others were taken in CDCl3. Sources for the spectral data: AT1-3,50,51 BYH1-2,52 JB1-13B,53,54 TP1-4 (personal correspondence), TS1-4 (personal correspondence), OD1 (personal correspondence). |
To predict the relative stereochemistry of a molecule in the current release of DP4, the user must provide an NMR description. The minimum amount of information required in the NMR description is, the experimental peak locations and either a description of which atoms in the molecule are chemically equivalent or the number of times each peak can be assigned. With this information DP4 assigns the atoms in the molecule in order of chemical shift to the peaks in the NMR description. We call this approach “the pairwise AA” and it is used as the benchmark for comparison with DP4-AI.
The pairwise AA was performed for all the molecules in the test set. This was very hard work, as it required manual analysis of all of the NMR spectra in order to break the signal into individual peaks and their multiplicities. This is the most time-consuming part of classical DP4, and also has the potential for subjectivity and the introduction of errors. DP4 probabilities were calculated using three different sets of computational conditions. The first level of theory tested was MM derived geometries with GIAO shift predictions utilizing the mPW1PW91 functional, 6-311G(d) basis set (def2-SVP was used for molecules containing iodine) and PCM solvent model as recommend in previous work.11 DP4 calculations were also performed after optimizing the geometries at the DFT level using the B3LYP functional prior to GIAO NMR shift predictions. The highest level of theory tested utilized the same DFT optimized geometries, with single point energies calculated using the M06-2X functional and def2-TZVP basis set.
DP4 also requires a statistical model describing the NMR shift prediction error probabilities. As the prediction error distribution is expected to change with computational conditions, a different model is required for each set of conditions used. Four different statistical models were tested (ESI Section S3.1†), it was found that the highest performance was obtained utilizing a single region 3 Gaussian model fitted to an empirical prediction error distribution derived from the test set. As this statistical model was constructed using the molecules in the test set and also used to calculate DP4 probabilities for the same test set, a cross validation study was also completed to assess if any overfitting was occurring. This cross-validation study was performed in a leave-one-class-out fashion for each group of molecules denoted in Fig. 7 by their initials.
DP4-AI was tested at all three levels of theory described with each statistical model (ESI Section S3†). A comparison of DP4-AI and the pairwise AA for the highest level of theory and most reliable statistical model is presented in Fig. 8.
Fig. 8 The correct prediction rates for DP4-AI (orange) and the pairwise AA (blue) at the three levels of theory tested for the compounds in Fig. 7 (average number of stereocentres equal to 3.49). These predictions were produced using the fitted 3 Gaussian cross validated statistical model. |
The performance of DP4-AI, relative to pairwise AA, increases with the level of theory (Fig. 8). As in previous work13 shows that as the level of theory is increased in the DP4 calculation, the correct prediction rate of the pairwise AA also increases. DP4-AI shows a greater sensitivity to the level of theory. This is because both the assignment and the DP4 calculation are dependent on the accuracy of the NMR shift calculations. Therefore, it can be concluded that when using DP4-AI, the conditions that produce the most accurate shift predictions should always be used.
DP4-AIs performance could be improved even further by robustly addressing some of the remaining challenges in the GIAO NMR prediction, including conformational flexibility, specific solvent interactions and the presence of heavy atoms. The performance may be improved further by adding explicit support for spectra containing mixtures of compounds (such as IP2 see ESI Section S3.2†). These issues will be addressed in future developments of DP4-AI. An example of a spectrum assigned by DP4-AI is given in Fig. 9 (All the processed and assigned spectra are provided in the ESI, Section S4†).
Footnote |
† Electronic supplementary information (ESI) available: Detailed description of the program, the full results from the program evaluation and discussion, DP4-AI processed and assigned spectra and NMR shift calculations and experimental values. All DFT and MM calculations, as well as the raw NMR data at https://doi.org/10.17863/CAM.47721. See DOI: 10.1039/d0sc00442a |
This journal is © The Royal Society of Chemistry 2020 |