The DP5 probability, quantification and visualisation of structural uncertainty in single molecules

Whenever a new molecule is made, a chemist will justify the proposed structure by analysing the NMR spectra. The widely-used DP4 algorithm will choose the best match from a series of possibilities, but draws no conclusions from a single candidate structure. Here we present the DP5 probability, a step-change in the quantification of molecular uncertainty: given one structure and one 13C NMR spectra, DP5 gives the probability of the structure being correct. We show the DP5 probability can rapidly differentiate between structure proposals indistinguishable by NMR to an expert chemist. We also show in a number of challenging examples the DP5 probability may prevent incorrect structures being published and later reassigned. DP5 will prove extremely valuable in fields such as discovery-driven automated chemical synthesis and drug development. Alongside the DP4-AI package, DP5 can help guide synthetic chemists when resolving the most subtle structural uncertainty. The DP5 system is available at https://github.com/Goodman-lab/DP5.

: The DP5 GUI can utilised to set up and run DP5 calculations particular DP5 probability has been calculated and suggest potential regions of the structure that are likely to be correct and incorrect.
Our automatic NMR processing software, NMR-AI has also been integrated into DP5, as a result, DP5 probabilities can be calculated from raw NMR data. By utilising the GUI, users can also explore the NMR assignments made by NMR-AI and thus further understand how a DP5 probability has been calculated.    . displays a roadmap of the DP5 probability calculation. The basic premiss of the calculation is as follows. For each atom in the candidate structure a chemical shift is calculated using DFT. By subtracting this value from the experimentally observed shift, a prediction error can be calculated for that atom. Each atom is treated as a random variable that can take one of two states, correct or incorrect. Using a statistical model, the probability of observing a prediction error of this size can be found. This probability is equated to the probability of the atom in the candidate structure being incorrect. The atomic probabilities are then combined to yield an overall molecular probability. The DP5 probability is then calculated from this value using a Bayesian correction function. Each step in this calculation is described in more detail in the following sections.

NMR Shift calculation
DP5 calculates NMR shifts for the atoms in the candidate structure utilising the highly optimised and well established method developed in previous works. [1][2][3][4] First the a molecular mechanics search is performed to obtain a representative set of conformer geometries, all conformational searches are performed in the gas phase utilising the MMFF force field and a mixture of Low Mode following and Monte Carlo search algorithms. The step count for MacroModel 5 is set so that all low energy conformers were found at least 5 times.
The conformer geometries are then optimised at the DFT level of theory (B3LYP/6-311g(d)). 6,7 DP5 includes quantum mechanical calculation support for both the commercial program Gaussian 8 and the free software Tinker. 9 For each conformer a single point energy calculation is completed (M062x/def2-TZVP) [10][11][12] and NMR shielding constants are found using the GIAO method. 13 The functional mPW1PW91 14 and 6-311G(d) basis set was chosen for NMR shift prediction as this has been shown to be optimal for DP4 calculations.
All calculations are managed by the DP5 Python script written in Python 3.7. DP5 is available from https://github.com/orgs/Goodman-lab/

Bespoke Atomic Prediction Error Probability Function Generation
The backbone of the DP5 probability calculation is generating and integrating a bespoke DFT-NMR prediction error probability function for each atom in every conformer of the candidate structure.
It is well known that the expected magnitude and variance of DFT prediction errors for different functionals show strong complex, nonlinear dependencies on atomic environment. 4,15 A number of different statistical models have been explored in previous works to help alleviate this issue such as, multi-gaussian, multiregion and kernel density estimation (KDE) models. 4 However, these models all display the same type of issue. The underlying assumption here is that the probability of a prediction error of a specific magnitude being observed in an external dataset is equal to the probability of observing the same prediction error within a given structure, this is not always the case. For example, some atomic environments may be expected to produce large prediction errors, these types of environments are typically underrepresented in the datasets used to develop the prediction error statistical models, as a result the corresponding prediction error probabilities will be close to zero. This is not a particular problem for a DP4 calculation, due to its comparative nature, these same systematic errors are likely to be present in all the structures being compared and tend to cancel out. However, this is a significant problem when developing a standalone probability for a single candidate structure. To solve this issue, DP5 produces a bespoke prediction error probability distribution for every atomic environment in every conformer of the candidate structure. This probability distribution takes into account the molecular geometry around each atom and the surrounding atom types by utilising the FCHL representation. 16 The bespoke prediction error probability distribution for a given atom in a given conformer of the candidate structure is found as follows. The FCHL representation for the atom is calculated (this calculation is performed utilising the python package qml). This calculation has similarly been performed for 63542 carbon atoms present in a dataset of 5140 molecules from NMRShiftDB. 17,18 The l2 distance between this representation and all those in the dataset are then calculated (also performed using qml). These distances are then mapped onto a similarity value using a gaussian kernel, equation 1. This yields a similarity between the test atomic environment and those in the dataset. A kernel density estimation is then performed on the prediction errors from the atoms 63542 in the external dataset, with each equation for the similarity between two atomic environments utilising a gaussian kernel, where ||Ai − Aj ||2 describes the l2 distance between the FCHL 16 representation for atomic environment i and atomic environment j and σ is an optimised parameter (see section 2.4) point weighted by the corresponding atomic environments similarity to the test atoms environment. Using this methodology a bespoke prediction error distribution is produced for the test atom taking into account the chemical environment the atom is in. Systems were also tested with weights multiplied by the corresponding regression coefficients (see section 4) found by solving KRR models for DFT NMR error prediction.
In order to simplify the integration process the external error dataset is made symmetrical around 0, this is achieved by taking the absolute value of each error in the dataset followed by concatenating these values and the same values multiplied by -1 into the same dataset (with twice the size) the previously calculated weights are used for both the positive and negative error values.

Atomic Representation Generation
The FCHL representation 16 was chosen due to the similarity between this calculation and kernel ridge regression (KRR). The FCHL representation of atomic environments have been used very successfully in KRR models to predict chemical shifts. 19 Similar KRR models were also constructed and evaluated in this study, see section 4. This illustrates that the FCHL representation effectively encodes the information required to define the properties of an atomic environment and that this encoding can be used to calculate the similarity between different atomic environments. As this is exactly the property required for this work and due to the ease of use, the FCHL representation was implemented in the DP5 calculation. A number of other representations were tested in this work including coulomb matrices and Morgan fingerprints of circular fragments around each atom, however, FCHL was found to be the best performing representation. It may be possible to improve the performance of DP5 by testing more atomic representations and potentially by generating a bespoke representation for this task.
In order to calculate the covariance of two FCHL representations, they must have the same vector length. This places an upper limit on the number of atoms in a molecule that the model can compare. In this study this value is set to 83, the maximum number of atoms in the NMRShiftDB dataset. For any molecules with more than 83 atoms DP5 produces molecular fragments with radius 3 around each atom in the molecule, these fragments must have fewer than 53 atoms (assuming a maximum valence of four) and a separate model based on representations of this size is used instead. No significant loss in accuracy was seen when using this fragmentation method for KRR NMR shift prediction tasks, demonstrating this is a reasonable approximation for larger molecules. This is likely to be due to the radial cutoff incorporated into the FCHL representation.

Parameter selection in Gaussian Kernel and FCHL representation generation
In order to generate an FCHL 16 representation a radial cutoff value must be defined, in this work this was set to 4.53Å. This value was found to be the optimum value for kernel ridge regression models for NMR shifts prediction using the FCHL representation (see section 4). The optimum cutoff value of 4.53Å was found using bayesian optimisation by training KRR models on a smaller dataset of 500 molecules 19 and using the mean absolution prediction error of the model across the dataset as the loss function. Due to the similarity between the use of the FCHL representation in these KRR models and in the DP5 calculation, it was concluded that this cutoff value would be similarly applicable in this context.
The second important parameter in the DP5 calculation is the sigma (σ) value used in the gaussian kernel (equation 1) during the similarity calculation described in section 2.2. This parameter changes the distance in chemical space (FCHL representation space) over which two atomic environments will be described as similar. Similarly to the radial cutoff value, this parameter was also investigated when building KRR regression models for NMR shift prediction and optimised using bayesian optimisation. Due to the sensitivity of the DP5 probability calculation to changes in this parameter, the full DP5 system was later evaluated for multiple sigma values.
Histograms of atomic similarities between the atomic environments in the 5140 NMRShiftDB molecules were also plotted for multiple values of σ, the results of this study are presented in Figure 7. These histograms illustrate as the value of σ is decreased the average similarity between atomic environments in this dataset decreases. If σ is set at a large value, all the atomic environments become very similar to each other, in this limit the DP5 probability recovers the issue where prediction error probabilities are no longer dependent on the atomic environment they occur in. Whilst if σ is set too low, all environments will share no similarity, making it impossible to perform a weighted kernel density estimation as the resulting weights cannot be normalised. A number σ values were investigated during the main evaluation of the system as described in Section 3.2. Allowing σ value to vary with atomic environment was also investigated, this methodology is essentially the same as k-nearest-neighbours, which was also tested separately for different values of k. Utilising the k nearest neighbours method ensures each atomic environment has the same number of data-points contributing to its corresponding error kernel density estimation. This is also discussed in Section 3.2.
where MAE is the mean absolute error in the external database where MAEw is the weighted mean absolute error in the external database, the weights used for each datapoint are the similarity values calculated in section 2.2

Atomic Prediction Error Probability Function Integration
Once the prediction error probability distribution has been generated for an atom in a conformer of the candidate structure, this function must then be integrated to produce a prediction error probability for that atom. Three different sets of integration limits were tested in this work, they are shown in equations. 2,3 and 4. All three integrals represent reasonable definitions of the required probability, this choice was found to have a relatively large impact on the results of the DP5 calculation.
Initially equation 2 was investigated. Integrating with these limits means that atoms with large prediction errors will be assigned values close to one, whilst atoms with smaller errors will be assigned smaller probabilities. This method has the advantage that if by chance the DFT calculated shifts are very accurate this will be reflected in the atoms probability value. However, this method overlooks the fact that DFT NMR predictions have inherent error. The MAE for the DFT calculations run on the NMRShiftDB dataset is 1.57ppm, this implies that observing errors smaller than this value can be just as unlikely as observing large errors. Moreover, if an atom has a very small associated prediction error, it is perhaps more probable that the atom is incorrect than it is that the DFT calculation is very accurate by chance. Equations 3 and 4 take this into account, penalising errors either side of the dataset MAE value equally.

Equation 4
is the only method tested where the integration limits also depend on the atomic environment. In this approach, in the same way that points in the DFT-NMR prediction error KDE are weighted by their environments similarity to the test environment, M AE w in equation. 4 is found by performing a weighted average. This method accounts for the fact that the mean error is also likely to change depending on the region of the representation space the test atomic environment is in. The differences between these integration limits are highlighted by Figure 7.
It was found that equations 3 and 4 were more effective than 2. Equation 2 typically assigns lower probabilities to atoms. This is because most atoms will have errors around the mean of the dataset, equation 2 will thus typically assign probabilities of 0.5 to these atoms. Equations 3 and 4 treat atoms differently, atoms with errors closer to the mean of the dataset (or closer to the mean observed in the corresponding region of chemical space in the case of equation 4) will be assigned probabilities closer to zero. This has the benefit that atoms with errors in the expected range will be assigned high confidence. Equation 3 and 4 do however show the disadvantage that if the DFT calculation predicts a shift with a lower error than expected, this prediction, similarly to one with a larger than expected error, will be assigned a low confidence. Equation 4 mitigates this issue as much as possible by allowing the mean value to vary with the region of chemical space the test atomic environment is in, making it much less likely for this situation to arise, as in environments where the DFT predictions are likely to be most accurate, the M AE w in the integration limits will be close to zero.
where i is the atom index and n is the total number of carbon atoms in the molecule where i is the atom index and n is the total number of carbon atoms in the molecule where i is the atom index and n is the total number of carbon atoms in the molecule where i is the atom index and n is the total number of carbon atoms in the molecule

Atomic Prediction Error Probabilities from Multiple Conformers
To make the DP5 calculation as accurate as possible, probabilities for each atom are calculated not just for a single geometry, but for every atom in every conformer of the structure found during the conformational search. This is necessary as the probability assigned to each atom depends on the exact geometry of the atomic environment and this will change between conformers.
Once a DFT prediction error probability has been calculated for each atom in each conformer, these values are combined in a Boltzmann weighting process similar to the NMR shift calculation, to produce overall probabilities for each atom in the molecule. This Boltzmann weighting process utilises the same single point energy for each conformer found during the NMR Shift calculation stage (see section 2.1).

Molecular Probability From Atomic Probabilities
Having found the prediction error probabilities for each atom in the molecule, these must now be combined in order to produce a probability for the whole structure. This part of the DP5 calculation was explored extensively, and was found to have a large effect on the final results. A number of different methods for combining the atomic probabilities were tested, the most relevant of these are displayed in equations 5 to 8.
Equations 5 to 8 all combine the atomic probabilities in mathematically reasonable ways. Consider each atom as an independent random variable, each atom can have two states, correct and incorrect. The probability of the atom being in the incorrect is described by the atomic probability p i . If these probabilities are multiplied together as in equation 5, the resulting value describes the probability of at least on of the atoms being correct. If the atomic probabilities are combined by equation 6, the resulting value instead describes the probability that all of the atoms in the structure are correct. In order to decide upon the most effective formulation of the DP5 probability, the most useful definition of when a structure should be classed as correct must first be decided upon.
It was found that when combining atomic probabilities by equation 5 the resulting molecular probabilities were often clustered around zero. This is due to a property of the underlying prediction error distribution, these distributions are typically sharply peaked with very wide tails. As a result even in incorrect structures it is not uncommon to have number of atoms with large prediction errors, in these cases, when combining probabilities by equation 5 the molecular probability will be forced close to zero by these atoms. Incorrect structures are often highlighted by a small number of large errors, this leads to equation 6 seeming like a more useful choice. equation 6 performs much better than equation 5 however it still displays a tendency to force molecular probabilities to zero. The molecular probability should reflect the real world probability of a structure being correct rather than acting as a binary classifier of correct and incorrect structures.
To help alleviate this forcing behaviour, equations, 5 and 6 were modified into equations, 7 and 8 by incorporating a geometric mean. The geometric mean was found to greatly reduce the forcing behaviour displayed by equations 5 and 6, giving a more balanced estimate of the structure being correct based upon all of the atoms. Equations including an arithmetic mean and median were also tested, however, these did not show any greater performance.

Bayesian Molecular DP5 probability
The final stage in the calculation applies Bayes theorem to the molecular probability to yield the over all DP5 probability. The purpose of this stage is to ensure the final DP5 probability assigned to a molecular structure is as close as possible to the real world probability as possible given the data available. In turn, this ensures that the DP5 probability is both easy to interpret and as useful as possible.
First a probability density function for the molecular probabilities assigned to the 5140 molecules in the dataset is found using a KDE. Similarly a weighted KDE is performed on the incorrect combinations of structures and spectra as described in section 3.2. Using these PDF functions the probabilities of the structure being correct and incorrect given the assigned molecular probability can be calculated. Finally using the information that a proposed structure must be either correct or incorrect we can define the Bayesian DP5 probability using equation 9. In the ideal case, this function (defined between zero and one) would fall exactly on the line y=x, meaning the assigned DP5 probabilities match the real world probabilities exactly. In order to ensure the final probabilities fall on the y=x line the DP5 probability calculated for the molecule is then scalded by equation 9.

Combinatorial Study
A major challenge in the development of DP5 involved constructing a method to assess the efficacy of the system. As the DP5 probability is not a tangible physical property that can be measured, it is not straight forward to compare the DP5 probability assigned to a molecule with an experimental value. In order to solve this problem, a comprehensive cross validation methodology was developed.
This study was performed utilising the database of 5140 molecules from NMRShiftDB. 17,18 The NMR-ShiftDB ID for each molecule used is given in section 6. For each of these molecules, this database contains, a DFT optimised geometry, a DFT predicted carbon NMR spectrum and accompanying experimental NMR spectrum. Structure proposals are simulated by all forming pairs (or combinations) of experimental spectra and structures (with the same number of carbon atoms) in the dataset. This produces 5140 N combinations where a structure is paired with the correct experimental spectrum (correct combinations), and on the order ofÑ 2 − N combinations where an incorrect structure has been proposed (incorrect combinations).
In the first study, DP5 probabilities for all of the correct combinations are calculated. DP5 probabilities for incorrect combinations are calculated if the maximum error between the experimental spectrum and the paired DFT prediction spectrum is less than 10 ppm. This simulates correct and incorrect structure proposals that experienced chemists should be able to distinguish based upon the prediction errors alone.
It should be noted that the DP5 calculation relies upon the representations of the atomic environments in the same dataset of molecules. In order to make this study as robust as possible, all DP5 calculations are performed in a leave-one-out cross validation style where the representations for the atomic environments from the test structure are removed from the set used by DP5. The distributions of the resulting DP5 probabilities for the correct and incorrect combinations are plotted utilising a kernel density estimation.
All the results of this study are presented in section 5.
If all incorrect combinations with a maximum error less than 10ppm are considered, the mean absolute error distribution (MAE) for the correct and incorrect combinations are expected to be different. Typically, the incorrect combinations have larger MAEs than the correct combinations. The DP5 probability would prove even more effective if it could be used to distinguish between correct and incorrect structure proposals that belong to the same MAE distribution. In the second stage of this evaluation the methodology is modified to account for this. The MAE for each incorrect combination is calculated, the probability of a correct combination having this MAE is found using the corresponding empirical PDF function. This probability is assigned to the incorrect combination as a weight, all subsequent frequency plots are performed using weighted kernel density estimations, this process is equivalent to directed sampling of the incorrect combinations. The effect of weighting the incorrect combinations in this way is the MAE distribution of the weighted incorrect combinations now approximates (as closely as possible) the MAE distribution of the correct combinations. As a result when these weights are considered the resulting plots and statistics simulate the case where both the correct and incorrect combinations belong to the same MAE distribution. Another effect of this modification is there is no longer an integer number of incorrect combinations, but rather, the sum of the weights assigned to the incorrect combinations gives an equivalent expected number of incorrect combinations. The results of this study evaluate the performance of the DP5 probability in the limit where the correct and incorrect combinations are indistinguishable by their prediction errors to even an experienced chemist. All the results of this study are presented in section 5.
This combination style analysis is particularly powerful as negative examples of incorrect structure proposals could be synthesised from real world data, avoiding more unreliable methods involving generating fake experimental or calculated spectra.

Kernel Ridge Regression
The KRR model for some property y for some system with the vector representationÃ is defined by equation 10 The kernel matrix K is found by equation 11, where A i is the vector representation of datapoint i The regression coefficients α in equation 10 can then be found through kernel matrix inversion and multiplication with the corresponding reference values y, equation 12 Where λ is the regularisation constant Prior to the development of the DP5 probability, kernel ridge regression (KRR) models for NMR shift prediction were investigated. All KRR calculations have been performed utilising the python package qml. 20 It has been previously shown that KRR models can be trained to reproduce NMR shielding constants predicted by DFT calculations. 19 In this study, this possibility has been pushed further utilising KRR models to predict experimentally observed NMR shifts for atoms in a molecule.
KRR models in this work were constructed and evaluated utilising the database of 5140 molecules from NMRShiftDB. The molecules are first partitioned randomly into a training and test set at a ratio of 1:20.
For each carbon atom in the training set an atomic FCHL representation is then generated. The symmetric kernel matrix K for the training representations is then found utilising a gaussian kernel by equation 11. Sigma values σ of 0.5, 0.3, 0.1, 0.075, 0.05 and 0.025 were tested in this study. Once the kernel matrix has been found, equation 12 can then be solved utilising the experimental shift values corresponding to the training representations to yield the regression coefficients α. The cross kernel matrix (describing the covariances of the test and training representations) is similarly found by equation 12. Finally the experimental shifts for the test set atoms can be found by equation 10. Models were constructed and evaluated for each of these sigma values in a 20 fold cross validation study, the results are presented in Figure 8.
In order to utilise equation 11 to calculate the kernel similarity between a training and test representation, these representations must have the same dimensions. The FCHL representation relies on a matrix where the first dimension must be greater or equal to the number of atoms being represented. When generating representations for a dataset of molecules, this dimension is set as the number of atoms in the largest molecule in the dataset. However, by setting this dimension a hard limit is placed upon the number of atoms a test molecule can have for atomic property prediction. To mitigate this issue KRR models were also constructed utilising a fragmentation routine. Atomic representations are generated by first creating a molecular fragment around the central atom with a specific radius, the atomic FCHL 16 representation of the central atom is then calculated utilising this fragment. Importantly by assuming the maximum valence of the atoms in the training molecules is 4, the molecular fragments have a maximum size. When applying the model constructed in this manner, test molecule can be similarly fragmented allowing the test and training representations to have equal sizes, this allows molecules with any number of atoms to be evaluated by the model. All models were evaluated using a random 20 fold cross validation process.

Stereochemistry Elucidation Results
DP5 was evaluated against a test set of 42 stereochemistry elucidation problems. This dataset of example was originally used as benchmark to evaluate DP4-AI. The molecules in this dataset are displayed in figure 12.
The full DP4 and DP5 results are displayed in figures, 13, 14 and 15

Combinatorial Study Results
Below are plots of the results from the combinatorial study described in 3.