Kristaps
Ermanis
a,
Kevin E. B.
Parkes
b,
Tatiana
Agback
b and
Jonathan M.
Goodman
*c
aCentre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK. E-mail: ke291@cam.ac.uk
bMedivir AB, PO Box 1086, SE-141 22 Huddinge, Sweden
cCentre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK. E-mail: jmg11@cam.ac.uk; Tel: +44 (0)1223 336434
First published on 23rd March 2016
The DP4 parameter, which provides a confidence level for NMR assignment, has been widely used to help assign the structures of many stereochemically-rich molecules. We present an improved version of the procedure, which can be downloaded as Python script instead of running within a web-browser, and which analyses output from open-source molecular modelling programs (TINKER and NWChem) in addition to being able to use output from commercial packages (Schrodinger's Macromodel and Jaguar; Gaussian). The new open-source workflow incorporates a method for the automatic generation of diastereomers using InChI strings and has been tested on a range of new structures. This improved workflow permits the rapid and convenient computational elucidation of structure and relative stereochemistry.
A typical NMR prediction process starts with a conformational search on all of the isomers. DFT GIAO calculations are then run on all low-energy conformers, the predicted shifts are combined using Boltzmann weighting and finally, the DP4 analysis is employed to decide which structure is the most likely to be correct. DP4 analysis achieves this by first empirically scaling all the predicted chemical shifts to remove any systematic errors. Errors are then calculated for each of the signals. Then, assuming that each error follows a normal or t distribution with known parameters, the probability of encountering each error in a correct structure is calculated. Next, all of the probabilities for signals in a candidate structure are multiplied to give the overall absolute probability, that the candidate structure is the correct one. Finally, these resulting probabilities are converted to a set of relative probabilities that each candidate structure is the correct one using Bayes’ theorem.
DP4 was originally developed for the elucidation of the relative stereochemistry of natural products. Drug compounds is another other class of compounds that also exhibit diverse stereochemistry. They inhabit a distinct chemical space characterized by higher number of heteroatoms and aromatic systems. While DP4 should in principle be applicable to any organic compound, its performance in drug compounds is all but untested.
So far in our investigations we have been using MacroModel6 for conformational searches and Jaguar7 or Gaussian8 for DFT and GIAO calculations. However, there exist open-source alternatives for both conformational search and for DFT calculations. Incorporation and validation of this software in the DP4 process would significantly increase its accessibility.
The compounds and their diastereomers were submitted to MacroModel conformational search, all conformers with energy less than 10 kJ mol−1 relative to the global minimum were then submitted to DFT single point calculation and NMR GIAO calculation10 using Gaussian quantum chemistry software. Our earlier studies demonstrated that this protocol provides an effective balance of precision and speed.3 Finally, the predicted NMR shifts were submitted to DP4 analysis. For mometasone, 4, and testosterone, 3, only the structures diastereomeric at the peripheral stereocentres were considered. In the case of tiotropium, 7, the epoxide chiral centres were varied in concert, as the trans isomers would be too strained to exist. For the rest of the compounds all possible diastereomers were considered. The results of this study are shown in Fig. 2.
![]() | ||
Fig. 2 Overall DP4 probabilities assigned to drug compound diastereomers using MacroModel and Gaussian. Only diastereomers with probabilities greater than 0.1% shown. |
Ideally, the DP4 parameter will give 100% confidence for the correct diastereoisomer, but all certainty beyond a random selection is useful, and certainty on some stereocentres is helpful. In the figure, the colours correspond to the utility of the assignment. Dark blue is the probability of the completely correct structure. The other colours relate to the utility – six out of seven stereocentres correct is a better result than three out of four. We note that all of the structures considered are C1 symmetry, and so the number of diastereoisomers that need to be considered is always 2N−1.
Early in our investigations it was found that inclusion of the solvent in the DFT calculations reduced the errors of the chemical shift prediction and improved the identification of the correct diastereomer. Therefore the PCM solvent model11 was used for all NMR GIAO calculations. In most cases DP4 is able to correctly determine the relative stereochemistry of these compounds with good confidence and shows the generality and robustness of the DP4 approach. Even highly flexible structures like rosuvastatin, 2, and formoterol, 1, were correctly identified. In the case of ezetimibe, 8, however DP4 favoured the structure epimeric at the remote OH centre as the most likely with 98% confidence, with the correct structure as a remote second most likely candidate. Similarly in the case of darunavir, 6, the most likely structure was identified as the one with epimeric OH centre. However, the relative stereochemistry of the remaining four stereocentres was identified correctly, despite the flexible nature of the molecule. The case of simvastatin, 5, was particularly challenging as this compound is quite flexible and has seven stereocentres and 64 possible diastereomers. Unfortunately this particular problem appears to be too complex for the current DP4 methodology and the diastereomers identified as likely candidates were completely different from the correct structure. However, the DP4 conclusion that four diastereomers have significant probability may suggest the challenge of this highly flexible structure to the user and thus would likely not lead to incorrect conclusions.
Several of our previous studies have shown that optimization of the geometries at the DFT level provides only limited improvement at high computational cost. We decided to test if this still holds true on the compounds in the current dataset and we chose two compounds with the most potential for improvement – tiotropium, 7, and ezetimibe, 8. The geometries were optimized at the same DFT level as used for shift calculation and the DP4 analysis was repeated at a cost of approximately 30-fold increase in the computational time. For tiotropium, DFT optimization reduced the average corrected NMR shift error by 0.27 ppm for carbon and by 0.02 ppm for proton, and left the separate DP4 confidences based on carbon and proton NMR essentially unchanged. However, the overall confidence was increased from 14.7% to 68.2% because the carbon and proton were now in conflict as to which is the most likely structure and this let the correct structure emerge as the overall top result. For ezetimibe, DFT optimization reduced the average corrected NMR shift error by 0.77 ppm for carbon, but increased the error by 0.01 ppm for proton. The overall confidence improved from 2.8% to 72.0%. Therefore improvement can be gained by optimizing the geometries at the DFT level for these anomalous examples. However, the computational cost makes this approach impractical in many cases: to repeat the DP4 analysis with DFT optimization on darunavir and simvastatin would require about 12 and 36 CPU months on our desktop setup, respectively.
Overall the results from this study are very encouraging. In majority of cases DP4 is capable of elucidating the relative stereochemistry of the drug compounds despite the fact that the statistical model was developed for natural products and their fragments.
One well-known open-source DFT package with NMR GIAO implementation is NWChem.12 We have previously shown that the differences between Gaussian and Jaguar DFT packages do not significantly affect the outcome of DP4 calculations, and, therefore, we hoped that the same would be true for NWChem. Probably the most significant difference between Gaussian and NWChem are the available solvent models. While Gaussian has several versions of PCM models implemented, the only solvent model available in NWChem is COSMO.13 It was hoped that the different solvent models would not cause large differences to the chemical shift calculations. To test this, NWChem was used for DP4 workflow and the process was repeated for six drug molecules (Fig. 3). The results were very similar to those achieved with Gaussian, with the mean absolute difference in the predicted shifts being 0.34 ppm for carbon and 0.08 ppm for proton NMR. This shows that NWChem can be used in place of Jaguar or Gaussian with little effect on DP4 performance, using the standard workflow, which uses molecular geometries generated by force-fields.
![]() | ||
Fig. 3 Overall DP4 probabilities assigned to drug compound diastereomers using MacroModel and NWChem. Only diastereomers with probabilities greater than 0.1% shown. |
One significant difference was in the case of formoterol, 1. While the MacroModel/Gaussian combination gives 81.9% overall DP4 probability for the correct diastereomer, MacroModel/NWChem combination gives only 0.5% for the correct structure. This arises from conflicting conclusions from carbon and proton spectra in both cases. The proton data gives high (>95%) confidence in the correct structure, while carbon data gives high (>95%) confidence in the incorrect one. The confidence in the correct proton assignment is slightly lower in the NWChem case, probably because of the different solvent model. This causes the confidence in the carbon NMR based assignment to rise and also causes a switch in the overall assignment. In all other cases the overall DP4 probabilities are very similar with the Gaussian version of the workflow.
TINKER is an open-source molecular mechanics package14 which we tested as an alternative for MacroModel in perfoming conformational searches. The performance of TINKER/Gaussian workflow was tested by repeating the calculations on the drug molecules (Fig. 4). In our experience TINKER conformational searches were slower than the same searches run using MacroModel and generated many more conformations. As a result, we were not able to complete the calculations for darunavir, 6, and simvastatin, 5. For the rest of the molecules, however, the DP4 results are quite similar to the MacroModel workflow and in general seem equally good at identifying the correct diastereomer. For tiotropium, 7, the DP4 performance improved the outcome and the confidence in the correct structure increased from 14.7% when using MacroModel to 60.9% with TINKER. In contrast, the confidences for the correct diastereomers of tadalafil, 10, and solifenacin, 9, were reduced. Both structures can only have two diastereomers and in the case of tadalafil the confidence in the correct one is only 50.5%, essentially having no predictive value. In the case of solifenacin, the confidence was reduced even further to 42.8%. The cause for these differences is most likely the differences in the search algorithm implementations. On average, however, the DP4 results appear to be of similar quality regardless of the molecular mechanics software used.
![]() | ||
Fig. 4 Overall DP4 probabilities assigned to drug compound diastereomers using Tinker and Gaussian. Only diastereomers with probabilities greater than 0.1% shown. |
Finally, the fully open-source version of the process was tested using TINKER and NWChem. Calculations were run on the same set of molecules as in the MacroModel/NWChem case (Fig. 5). Overall the results appear to be very similar to the previous results. As in the case of TINKER/Gaussian workflow, results for solifenacin, 9, and tadalafil, 10, were indecisive. In all other cases, however, the correct diastereomer was identified with high confidence. The average performance of this workflow once again appears to be similar to the other three investigated. This demonstrates a fully open-source version of the DP4 workflow and should significantly improve its accessibility to chemists in all areas of organic chemistry.
![]() | ||
Fig. 5 Overall DP4 probabilities assigned to drug compound diastereomers using Tinker and NWChem. Only diastereomers with probabilities greater than 0.1% shown. |
To generate diastereomers, the base candidate 3D structure is first converted into an InChI string. This can be done using one of several utilities and frameworks freely available online, including IUPAC utilities,16 OpenBabel17 and Marvin MolConverter.18 From this, InChI strings for diastereomeric structures are generated by modifying the relative stereochemistry layer to represent all possible combinations of the configurations. Finally, InChIs are converted back into 3D structures. Several utilities offer this service, and we found that only Marvin MolConverter, which is available as free software from ChemAxon,18 produced good 3D structures. After this, structures could be immediately used in the DP4 process without any further modification. Both the generation of all diastereomers and diastereomers at particular chiral centres have been implemented as a Python script with Marvin Molconverter as the backend for the generation of InChI strings from structures and the generation of structures from InChI strings. This process was successfully used in the drug study to generate all the candidate structures. The only exception was tiotropium, 7, because some of the automatically generated structures would be too strained to be viable and manually drawn structures were analyzed in this case.
The script also alleviates the shortcomings in TINKER conformation pruning. TINKER only removes duplicate conformations based on energy; therefore the number of conformations generated is much larger than in the case of MacroModel. To keep the number of conformations submitted to the DFT calculations down, we implemented RMSD pruning of conformations. RMSD pruning works by calculating the differences in atom positions in different conformers and if the RMSD value exceeds a chosen cut-off value, the conformations are considered similar and one of the conformations is discarded. This procedure requires the molecules to be aligned and to achieve this, QTRFIT alignment algorithm was also implemented in Python.19
All of these improvements in combination mean that DP4 structure elucidation and verification can now be done quickly and easily. In our experience, a full elucidation of the relative stereochemistry could be done with as little as one hour of active user effort. In principle, new developments in this area, such as Sarotti's recent modification of DP4,22 could be incorporated into a similar workflow.
Quantum mechanical calculations were carried out using Gaussian '098 and NWChem 6.5,12 B3LYP functional23 and 6-31G(d,p) basis set.24 PCM and COSMO solvent models were used in the case of Gaussian and NWChem, respectively.11,13 NMR shielding constant calculations used the GIAO method.10 Calculation setup, data extraction and DP4 analysis were done using the PyDP4 script written in Python 2.7. The script is available on the group website (http://www-jmg.ch.cam.ac.uk/tools/nmr), as well as on GitHub (https://github.com/KristapsE/PyDP4) and in the ESI† under the MIT license. TINKER, NWChem and Marvin Molconverter can be obtained directly from their authors.
A novel and general approach to automatic generation of diastereomers was developed and relies on the use of InChI strings. Finally, a full automation of the DP4 workflow has reduced user interaction to a minimum – verification or elucidation of the stereochemical structure can now be done in as little as one hour of active effort.
Footnote |
† Electronic supplementary information (ESI) available: PyDP4 Python script. See DOI: 10.1039/c6ob00015k |
This journal is © The Royal Society of Chemistry 2016 |