Yao Zhang^ab and Alpha A. Lee*^a

^a Cavendish Laboratory, University of Cambridge, Cambridge CB3 0HE, UK. E-mail: aal44@cam.ac.uk
^b Department of Applied Mathematics and Theoretical Physics, University of Cambridge, Cambridge CB3 0WA, UK
First published on 10th July 2019
Predicting the bioactivity and physical properties of small molecules is a central challenge in drug discovery. Deep learning is becoming the method of choice, but studies to date focus on mean accuracy as the main metric. However, to replace costly and mission-critical experiments with models, a high mean accuracy is not enough: outliers can derail a discovery campaign, so a model needs to reliably predict when it will fail, even when the training data are biased; and experiments are expensive, so models need to be data-efficient and able to suggest informative training sets using active learning. We show that uncertainty quantification and active learning can be achieved by Bayesian semi-supervised graph convolutional neural networks. The Bayesian approach estimates uncertainty in a statistically principled way through sampling from the posterior distribution. Semi-supervised learning disentangles representation learning from regression, keeping uncertainty estimates accurate in the low-data limit and allowing the model to start active learning from a small initial pool of training data. Our study highlights the promise of Bayesian deep learning for chemistry.
Nonetheless, graph neural networks are usually developed using frequentist maximum likelihood inference, with the benchmark being the mean error on a test set. However, if the goal of QSPR is to replace mission-critical but expensive experiments, a low mean error is insufficient: the user needs an estimate of uncertainty and needs to know when the model is expected to fail. This is because typically only a small number of top-ranked predictions are selected for experimental testing, so outliers can ruin a discovery campaign. Moreover, cost limits the number of experiments that can be run, so an approach that judiciously designs the training set to maximise the information gained is needed.
Uncertainty quantification, or domain applicability, has been extensively considered in the QSPR literature, but not in the context of graph neural networks and not in a statistically complete way. Previous works estimate the uncertainty of a prediction as the distance in descriptor space between the input molecule and the training set, or by training an ensemble of models and evaluating the variance.11–14 More recent works consider conformal regression,15,16 which trains two models, one for the molecular property and one for the error. However, there are two sources of uncertainty: epistemic uncertainty arises from insufficient data in the region of chemical space where the model is asked to make predictions, whereas aleatoric uncertainty arises from noise in the measurements themselves (e.g. noisy biochemical assays).17 Distance to the training set and variance within a model ensemble approximately capture epistemic uncertainty, whilst employing an ancillary model for prediction error approximately captures aleatoric uncertainty. We will show that the Bayesian statistical framework captures both sources of uncertainty in a unified and statistically principled manner.
Active learning strategies have been considered in the drug discovery literature.18,19 However, those pioneering works considered a priori defined molecular descriptors, and estimated uncertainty via the variance within an ensemble of models. Notwithstanding the incomplete modelling of uncertainty discussed above, employing graph neural networks in active learning presents unique opportunities and challenges: high model accuracy in the big-data limit comes at the cost of being data-hungry. Because the descriptor is fully data-driven, the model cannot estimate how "far" a compound is from the training set in the low-data limit, leading to poor uncertainty estimates and breaking the active learning cycle. Low-data drug discovery has been considered in the context of one-shot learning,20 which estimates distance in chemical space by pulling data from related tasks. Nonetheless, this approach requires a priori knowledge of which tasks are related. Works on generative molecular design overcome this problem21 by starting the active learning cycle with <1000 quantitative measurements, which imposes a significant upfront experimental cost.
In this paper, we combine Bayesian statistics – a principled framework for uncertainty estimation – with semi-supervised learning, which learns the representation from unlabelled data. We show that Bayesian semi-supervised graph convolutional neural networks can robustly estimate uncertainty even in the low-data limit, drive an active learning cycle, and overcome bias in the training set. Further, we demonstrate that the quality of posterior sampling is directly related to the accuracy of the uncertainty estimates. As different Bayesian inference methods can be mixed and matched with different models, our study opens up a new dimension in the design space of uncertainty-calibrated QSPR models.
h_v^{t+1} = \sigma\left( H_t \left( h_v^t + \sum_{u \in N(v)} h_u^t \right) \right)   (1)

\phi = \sum_{t=0}^{T} \sum_{v} \mathrm{softmax}\left( W_t h_v^t \right)   (2)

y = f(\phi)   (3)
We use the implementation reported in the repository.† In all experiments, we take T = 3, the hidden layer at each level has 128 units, the fingerprint length is 256, and f(·) is a two-layer neural network with 128 units per layer and ReLU activations.
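As a concrete illustration of this architecture, the following is a minimal NumPy sketch of a Duvenaud-style neural fingerprint in the spirit of eqn (1)–(3); the toy graph, feature sizes and random weights are illustrative stand-ins, not the trained model from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def neural_fingerprint(A, h0, Ws_hidden, Ws_readout, fp_len):
    """Duvenaud-style fingerprint: T rounds of neighbour summation
    (in the spirit of eqn (1)) with a softmax readout at every level
    (eqn (2)); each atom at each level adds a soft "bit" to the fingerprint."""
    h = h0
    fp = np.zeros(fp_len)
    for W_h, W_o in zip(Ws_hidden, Ws_readout):
        for v in range(h.shape[0]):
            fp += softmax(W_o @ h[v])          # readout of level t
        h = np.tanh((A + np.eye(A.shape[0])) @ h @ W_h)  # mix with neighbours
    return fp

# toy 3-atom "molecule": a path graph v0 - v1 - v2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
h0 = rng.normal(size=(3, 8))                               # initial atom features
Ws_hidden = [rng.normal(size=(8, 8)) * 0.1 for _ in range(3)]   # T = 3 levels
Ws_readout = [rng.normal(size=(16, 8)) * 0.1 for _ in range(3)]
fp = neural_fingerprint(A, h0, Ws_hidden, Ws_readout, fp_len=16)
```

The fingerprint `fp` would then be passed to the regression head f(·), as in eqn (3).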
The insight behind the semi-supervised approach is that a significant amount of chemical knowledge is contained within the molecular structures themselves, without any associated molecular properties (i.e. unlabelled data). Thermodynamic stability places constraints on which bonds are possible, and tends to put certain bonds near each other, forming persistent chemical motifs. For example, just by looking at drug molecules, one would immediately spot ubiquitous motifs such as amide groups, benzene rings etc., and some motifs often occur together as scaffolds.23 The key assumption is that these persistent chemical motifs contribute to the molecular property that we want to predict. We can make mathematical progress by constructing a descriptor akin to eqn (1)–(3). However, the objective is no longer to fit a particular property. Rather, the hidden states h_v^t, which summarise the atomic environment around atom v within radius t, are constructed such that they are predictable from the hidden states of the surrounding atoms. Therefore, the model learns a descriptor that clusters similar environments.
Specifically, we use the semi-supervised approach developed by Nguyen et al.,24 which builds on the paragraph vector approach in natural language processing.25 Given a set of molecular structures, the hidden states h_v^t maximise the log-likelihood
\mathcal{L} = \sum_{\text{molecules}} \sum_{t, v} \log p\left( h_v^t \,\middle|\, \{ h_u^t : u \in N(v) \} \right)   (4)

p\left( h_v^t \,\middle|\, \{ h_u^t : u \in N(v) \} \right) = \frac{\exp\left( h_v^t \cdot \sum_{u \in N(v)} h_u^t \right)}{\sum_{w} \exp\left( h_w^t \cdot \sum_{u \in N(v)} h_u^t \right)}   (5)
After finding the parameters that maximise objective (4), the hidden states {h_v^t} are passed to a neural network, eqn (3). The parameters of this neural network, as well as the readout matrices W_t, are learned in a supervised manner. Note that this formalism infers descriptors using unsupervised learning and uses supervised learning to relate the descriptors to molecular properties.
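The idea that hidden states should be predictable from their neighbours can be sketched in a few lines. The toy objective below is an illustrative analogue of objective (4): each atom's state is scored against the summed states of its neighbours via a softmax over all atoms. The scoring scheme and the matrix W are assumptions made for the example, not the implementation of ref. 24.

```python
import numpy as np

def context_log_likelihood(H, A, W):
    """Toy analogue of the semi-supervised objective: the log-probability of
    each atom's hidden state given its neighbours' summed states, with a
    softmax over all atom states acting as the normaliser. Maximising this
    makes similar environments mutually predictable."""
    context = A @ H @ W                        # neighbour summary per atom
    scores = context @ H.T                     # similarity to every atom state
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return np.trace(log_probs)                 # sum of log p(h_v | neighbours)

rng = np.random.default_rng(0)
A = np.array([[0., 1., 0.],                    # path graph v0 - v1 - v2
              [1., 0., 1.],
              [0., 1., 0.]])
H = rng.normal(size=(3, 4))                    # hidden states to be learned
W = np.eye(4)
ll = context_log_likelihood(H, A, W)
```

In the real model, H (and W) would be optimised by gradient ascent on this log-likelihood over a large set of unlabelled molecules.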
We use the implementation reported in the Github repository‡ accompanying ref. 24. In all experiments, we take T = 3, the hidden layer at each level has 128 units, the fingerprint length is 256, and f(·) is a two-layer neural network with 128 units per layer and ReLU activations.
y_i = F(x_i, \theta) + \varepsilon_i   (6)

p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta)\, p(\theta)   (7)

\hat{y}(x) = \int F(x, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta   (8)
The uncertainty of model predictions can be readily derived from this Bayesian formalism. There are two types of uncertainty.17 First, the epistemic uncertainty is given by the variance of the prediction with respect to the posterior
\sigma_{\text{epistemic}}^2(x) = \int F(x, \theta)^2\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta - \left( \int F(x, \theta)\, p(\theta \mid \mathcal{D})\, \mathrm{d}\theta \right)^2   (9)
Second, the aleatoric uncertainty is the intrinsic noise of the measurement, σ_i^2. This aleatoric noise can depend on the input, σ_i^2 = σ(x_i)^2, as certain areas of chemical space can be intrinsically more variable.
We note that the log posterior is, up to a constant,
\log p(\theta \mid \mathcal{D}) = -\sum_i \left[ \frac{\left( y_i - F(x_i, \theta) \right)^2}{2 \sigma_i^2} + \frac{1}{2} \log \sigma_i^2 \right] + \log p(\theta)   (10)
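In practice the negative of this log posterior is minimised as a heteroscedastic loss. A minimal sketch under a Gaussian noise model, with toy arrays standing in for the network outputs F(x_i, θ) and the predicted noise σ(x_i)^2 (the prior term is omitted):

```python
import numpy as np

def neg_log_posterior(y, mean, log_var):
    """Heteroscedastic Gaussian negative log-likelihood: the squared error is
    down-weighted where the predicted noise is large, and the log sigma^2
    penalty stops the model from inflating the noise everywhere."""
    return np.sum(0.5 * (y - mean) ** 2 / np.exp(log_var) + 0.5 * log_var)

# toy targets and toy network outputs (mean and log predicted variance)
y = np.array([1.0, 2.0, 3.0])
mean = np.array([1.1, 1.8, 3.5])
log_var = np.log(np.array([0.1, 0.1, 0.5]))
loss = neg_log_posterior(y, mean, log_var)
```

Predicting log σ^2 rather than σ^2 directly is a common numerical-stability choice, since it keeps the variance positive without constraints.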
The Bayesian formalism is easy to state but computationally expensive. The bottleneck is the numerical evaluation of the high-dimensional integrals (8) and (9). A plethora of approximate methods has been developed in the literature to overcome this bottleneck. However, there is no free lunch: methods that approximate the posterior well are usually computationally expensive. In this paper, we consider two approximate methods spanning the cost–accuracy spectrum.
For a neural network with M units, ref. 17 and 26 show that the above algorithm is approximately equivalent to finding parameters θ = (Θ_1, Θ_2, ⋯, Θ_M) that fit the distribution
q(\theta) = \prod_{m=1}^{M} \left[ p\, \delta(\Theta_m) + (1 - p)\, \delta(\Theta_m - \hat{\Theta}_m) \right]   (11)
Distribution (11), although not the same as the true posterior distribution, is significantly easier to sample: in the prediction phase, the model is run N times, and, as in the training phase, each unit is set to 0 with probability p. The final prediction and total uncertainty are taken as the mean over the N predictions of the dependent variable and variance, {y_n, σ_n^2}_{n=1}^N,
\hat{y} = \frac{1}{N} \sum_{n=1}^{N} y_n, \qquad \sigma^2 = \left[ \frac{1}{N} \sum_{n=1}^{N} y_n^2 - \left( \frac{1}{N} \sum_{n=1}^{N} y_n \right)^2 \right] + \frac{1}{N} \sum_{n=1}^{N} \sigma_n^2   (12)
The first term in eqn (12) is the epistemic uncertainty and the second term is the aleatoric uncertainty.
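Once the N stochastic forward passes are collected, the decomposition in eqn (12) is a one-liner; a minimal sketch with invented numbers standing in for the dropout passes:

```python
import numpy as np

def mc_dropout_uncertainty(means, variances):
    """Combine N stochastic forward passes {y_n, sigma_n^2} as in eqn (12):
    the epistemic term is the variance of the predicted means, the aleatoric
    term is the average of the predicted noise variances."""
    means = np.asarray(means)
    variances = np.asarray(variances)
    prediction = means.mean()
    epistemic = means.var()            # first term of eqn (12)
    aleatoric = variances.mean()       # second term of eqn (12)
    return prediction, epistemic + aleatoric

# toy outputs of N = 4 dropout passes for one molecule
pred, total_var = mc_dropout_uncertainty([2.0, 2.2, 1.8, 2.0],
                                         [0.10, 0.12, 0.08, 0.10])
```

Keeping the two terms separate, rather than returning only their sum, is what allows the epistemic/aleatoric comparison discussed later.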
In our numerical experiments, dropout is applied to every unit that is trained using supervised learning: every unit in the supervised graph convolutional neural network is subjected to dropout, whereas in the semi-supervised case only the layers on top of the hidden states are trained with dropout.
\theta_i^{t+1} = \theta_i^t + \eta\, \phi(\theta_i^t)   (13)

\phi(\theta) = \frac{1}{N} \sum_{j=1}^{N} \left[ k(\theta_j, \theta)\, \nabla_{\theta_j} \log p(\theta_j \mid \mathcal{D}) + \nabla_{\theta_j} k(\theta_j, \theta) \right]   (14)
The key advantage of eqn (13) and (14) is that it is a well-defined approximation: frequentist inference (c.f. eqn (10)) is recovered when N = 1, whereas as N → ∞ the system samples exactly from the posterior. Therefore, for finite N, the algorithm interpolates between frequentist and fully Bayesian inference. The computational cost and memory demands increase with N; in this paper we use N = 50.
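A minimal NumPy sketch of the SVGD update of eqn (13) and (14), with an RBF kernel on a toy one-dimensional standard-normal posterior. The bandwidth, step size and particle count here are illustrative choices, not the settings used for the networks in this paper.

```python
import numpy as np

def svgd_step(theta, grad_logp, eta=0.1, h=1.0):
    """One SVGD update (eqn (13)-(14)) with an RBF kernel: particles move
    along kernel-smoothed posterior gradients (attraction) plus the kernel
    gradient term (repulsion), which keeps the particles spread out."""
    D = theta[:, None] - theta[None, :]        # D[j, i] = theta_j - theta_i
    K = np.exp(-D ** 2 / (2 * h))              # k(theta_j, theta_i)
    g = grad_logp(theta)                       # grad log p at each particle
    phi = (K * g[:, None]).mean(axis=0) - (D / h * K).mean(axis=0)
    return theta + eta * phi

# toy example: pull 50 particles towards a standard normal posterior
rng = np.random.default_rng(1)
theta = rng.normal(3.0, 0.5, size=50)          # start far from the target
for _ in range(1000):
    theta = svgd_step(theta, grad_logp=lambda t: -t)  # log p = -t^2/2 + const
```

With a single particle the repulsive term vanishes and the update reduces to gradient ascent on the log posterior, which is the N = 1 frequentist limit described above.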
To illustrate the computational demands of SVGD, Fig. 1 shows the wall clock time, on a Nvidia P100 GPU, as a function of the number of gradient update steps for graph convolution with dropout, semi-supervised with dropout, and semi-supervised with SVGD on the melting point dataset discussed below. Both SVGD and stochastic gradient descent use back-propagation to optimise the neural network parameters. For models trained using stochastic gradient descent, the computational complexity of back-propagation at each iteration is O(BM), where B is the number of training samples per iteration and M is the number of parameters in the model. The semi-supervised model has fewer parameters than the fully supervised model, so its wall-clock time per iteration is lower. In SVGD, we need to update N Stein particles per iteration, so the wall-clock time per iteration scales as O(BMN).
To give the reader a sense of how “hard” the different datasets are, we consider a simple baseline model of XGBoost33 on ECFP6 fingerprints.34 We split the data into 80/10/10 (training/validation/testing). We take MaxDepth = 5, LearningRate = 0.01, and optimise the number of estimators (nEstimators = [50, 100, 150, 200, 250]) using the validation set. Table 1 shows the coefficient of determination R2 and root mean squared error (RMSE). Judging from the coefficient of determination, the dataset difficulty is (from easiest to hardest) FreeSolv < Melting < ESOL < CatS < Malaria < p450.
Dataset | Number of data points | R² | RMSE
---|---|---|---
FreeSolv | 643 | 0.765 | 1.90 |
ESOL | 1128 | 0.704 | 1.14 |
CatS | 595 | 0.654 | 0.367 |
MeltingPoint | 3025 | 0.725 | 50.2 |
p450 | 8817 | 0.291 | 0.996 |
Malaria | 13417 | 0.499 | 0.655 |
Fig. 2 shows that semi-supervised learning with SVGD accurately estimates uncertainty and significantly outperforms the baseline on every dataset. The plots show how the test set error varies as a function of confidence percentile – i.e. the error if we only consider the top n% of compounds in the test set ranked by confidence (note that confidence is the inverse of the uncertainty quantified by eqn (12)); the shaded region is one standard deviation, estimated by analysing 5 random partitions of the data into training and test sets. In every case, the error is a decreasing function of model confidence, thus the model successfully estimates which predictions are likely to be correct and which are outliers.
Another metric that we can evaluate is the shape of the confidence–error curve. For ESOL and FreeSolv, the error decreases steeply at the low-confidence limit before plateauing, suggesting that most predictions are accurate save for a few outliers, which the Bayesian method can identify. The situation is different for MeltingPoint, Malaria and p450: the error decreases slowly at the low-confidence limit before dropping sharply as it approaches the 100% confidence percentile (see also the insets of Fig. 2). This suggests that a few predictions are very accurate, and the Bayesian method can pick out those accurate predictions amid many less accurate ones. We note that our Bayesian model is well-suited for virtual screening applications, where the challenge is ensuring that the top-ranked actives picked out by the model are indeed actives, since only a very small proportion of the compounds ranked will actually be screened experimentally (the "early recognition problem").35,36
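Computing a confidence–error curve requires only sorting by predicted variance; a short sketch with synthetic errors standing in for real test-set predictions:

```python
import numpy as np

def confidence_error_curve(errors, variances, percentiles):
    """Mean absolute error over the top-n% most confident predictions,
    where confidence is the inverse of the predicted variance."""
    order = np.argsort(variances)              # most confident first
    errors = np.asarray(errors)[order]
    n = len(errors)
    return [errors[: max(1, int(n * p / 100))].mean() for p in percentiles]

# synthetic data: error grows with predicted variance, as for a calibrated model
rng = np.random.default_rng(0)
var = rng.uniform(0.1, 1.0, size=1000)
err = np.abs(rng.normal(0.0, np.sqrt(var)))
curve = confidence_error_curve(err, var, percentiles=[10, 50, 100])
```

For a well-calibrated model the curve increases with the percentile: restricting to the most confident predictions lowers the mean error.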
A lingering question is whether the quality of the uncertainty estimate is due to a set of good descriptors (obtained via semi-supervised learning) or to the accuracy of the Bayesian methodology. Fig. 2 also shows that replacing SVGD with dropout significantly reduces model performance. At the same confidence percentile, SVGD consistently outperforms dropout. This suggests that the quality of posterior sampling drastically impacts the quality of uncertainty estimation.
The quality of uncertainty estimates can also be gauged by the correlation between the predicted uncertainty on test data points and the error that the model incurs. Table 2 shows the Spearman correlation coefficient between the predicted variance and the model error. As expected, combining semi-supervised learning with SVGD yields the highest rank correlation. This result is consistent with Fig. 2, which shows that semi-supervised learning with SVGD has the lowest confidence–error curve. Moreover, the rank correlation between predicted uncertainty and error broadly (although not exactly) follows the "difficulty" of the data (c.f. Table 1) – FreeSolv and ESOL have the highest rank correlation, and p450 has the lowest.
Dataset | Graph convolution with dropout | Semi-supervised with dropout | Semi-supervised with SVGD
---|---|---|---
FreeSolv | 0.531 ± 0.061 | 0.439 ± 0.093 | 0.688 ± 0.053 |
ESOL | 0.112 ± 0.035 | 0.306 ± 0.079 | 0.553 ± 0.026 |
CatS | 0.049 ± 0.036 | 0.066 ± 0.044 | 0.310 ± 0.019 |
MeltingPoint | 0.192 ± 0.016 | 0.284 ± 0.035 | 0.337 ± 0.013 |
p450 | 0.167 ± 0.015 | 0.185 ± 0.049 | 0.213 ± 0.010 |
Malaria | 0.315 ± 0.028 | 0.317 ± 0.031 | 0.378 ± 0.019 |
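The rank correlations in Table 2 are plain Spearman coefficients between predicted variance and absolute error. A self-contained sketch (ranks computed by hand so only NumPy is needed; this simple version does not handle ties):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: the Pearson correlation of the ranks.
    Double argsort converts distinct values to their ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

# predicted variance vs. absolute test error for five toy molecules;
# here the error ordering matches the variance ordering exactly
variance = np.array([0.1, 0.4, 0.2, 0.9, 0.3])
abs_error = np.array([0.05, 0.30, 0.10, 0.80, 0.20])
rho = spearman(variance, abs_error)
```

A coefficient near 1 means the predicted uncertainty ranks the errors well, which is the behaviour the table compares across inference methods.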
Previous works on domain applicability have focused on either building an auxiliary model to predict the error,15,16 or estimating the uncertainty of a prediction via the distance of the input to the training set.11,12,14 The former models aleatoric uncertainty whereas the latter approximately captures epistemic uncertainty. Our Bayesian method captures both sources of uncertainty in a statistically principled manner. However, our model also provides independent estimates of epistemic and aleatoric uncertainties. As such, we can ask the question: is knowing epistemic or aleatoric uncertainty alone sufficient to estimate whether a prediction is accurate?
Fig. 3 shows that the confidence–error curve for semi-supervised learning with SVGD obtained by considering both epistemic and aleatoric uncertainty is below (or matches) that obtained by considering epistemic or aleatoric uncertainty alone. Considering both sources of uncertainty leads to much more accurate predictions at the high confidence limit for ESOL and p450. Moreover, there is no consistent trend as to whether epistemic or aleatoric uncertainty is more important – for ESOL, epistemic uncertainty is a better estimate of error than aleatoric uncertainty, whereas the opposite is true for p450 and CatS. As such, one cannot overlook epistemic or aleatoric uncertainty a priori, and our approach of combining both sources of uncertainty leads to an accurate uncertainty estimate.
To show that Bayesian uncertainty estimation overcomes dataset bias, we consider a toy problem where we know the ground truth and deliberately introduce bias: we consider the problem of predicting octanol–water partition coefficient (logP) values, and use computed ACD logP values as a surrogate. We construct a dataset comprising all molecules on ChEMBL with either a beta-lactam or a benzodiazepine scaffold. The dataset is obviously very biased, as it contains only two scaffolds. We then train a model using SVGD with semi-supervised learning, with the standard 8:2 split between training and test set on the biased dataset. Fig. 4 (left) shows that the model is reasonably accurate on the biased test set. We now simulate how a user might unwittingly fall foul of dataset bias: suppose we use the model to predict the logP of all molecules with a steroid scaffold on ChEMBL. Fig. 4 (middle) shows that the model performance, perhaps unsurprisingly, is poor. Steroids are not part of the training set, so the model cannot predict their physicochemical properties. Bayesian uncertainty estimation provides a way out of this quandary: Fig. 4 (right) shows that the estimated uncertainty of logP predictions on steroids is significantly greater than on the test set of beta-lactams and benzodiazepines. In other words, the model can inform the user when it is inaccurate, thus mitigating the impact of dataset bias.
Fig. 5 shows that semi-supervised learning significantly outperforms fully supervised learning in the low-data limit. The mean learning curves and error bars are obtained by analysing 20 active learning runs starting from random dataset splits. Moreover, in the case of fully supervised learning, active learning is unable to deliver a better learning curve than random sampling, whereas for semi-supervised learning there is a sizeable gap between the learning curves of random sampling and active learning. This is because the fully supervised method generates molecular descriptors directly from the labelled data. Therefore, in the low-data limit, it is unable to learn descriptors that capture the structure of chemical space and the chemical similarity between compounds, and thus cannot generate meaningful uncertainty estimates to drive active learning.
The importance of choosing diverse compounds in the initial screen has been discussed extensively in the literature,40–42 and the performance of our active learning method also depends on the chemical diversity of the initial screen. Fig. 6 shows that active learning does not outperform random sampling when the initial training set is biased and contains only a small number of scaffolds. We model scaffold bias by splitting the data using the scaffold splitting implemented in DeepChem,10 and consider the Malaria example, where active learning most clearly outperforms random sampling in Fig. 5. The underperformance of active learning is perhaps unsurprising: if the initial screen consists of only one scaffold, the model's knowledge of the other scaffolds is minimal, i.e. the model is equally ignorant about all the other scaffolds. As such, randomly sampling the other scaffolds becomes a reasonable strategy.
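The active learning loop itself is simple: retrain, score the unlabelled pool by predicted uncertainty, and acquire the most uncertain compounds. A hedged sketch on a toy one-dimensional problem, with a noise-perturbed linear ensemble standing in for the Bayesian model (the acquisition rule is greedy maximum variance; the ensemble and data are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_predict(X_train, y_train, X_pool, n_models=10):
    """Toy stand-in for the Bayesian model: an ensemble of linear fits on
    noise-perturbed labels; the ensemble variance plays the role of the
    epistemic uncertainty that drives acquisition."""
    preds = []
    for _ in range(n_models):
        y_pert = y_train + rng.normal(0.0, 0.5, size=len(y_train))
        w, b = np.polyfit(X_train, y_pert, 1)
        preds.append(w * X_pool + b)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.var(axis=0)

# toy 1-D "chemical space" with a linear property
X = np.linspace(0.0, 10.0, 200)
y = 2.0 * X + rng.normal(0.0, 1.0, size=200)

labelled = list(range(5))                      # small, biased initial screen
pool = [i for i in range(200) if i not in labelled]
for _ in range(10):                            # ten acquisition rounds
    _, var = fit_predict(X[labelled], y[labelled], X[pool])
    pick = pool[int(np.argmax(var))]           # query the most uncertain point
    labelled.append(pick)
    pool.remove(pick)
```

Because the initial screen covers only one corner of the space, the acquisition immediately reaches into the unexplored region, which is the behaviour that separates active learning from random sampling in Fig. 5.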
Our observation that the choice of Bayesian inference methodology significantly impacts the quality of the uncertainty estimate suggests an evident follow-up that probes the mathematical limit of Bayesian inference, i.e. benchmarking approximate inference techniques against sampling of the posterior using Markov chain Monte Carlo run to convergence, which is computationally expensive but asymptotically exact. Moreover, we note that most approximate inference techniques in the literature have been benchmarked in terms of RMSE or log-likelihood,43 rather than explicitly considering the quality of the uncertainty estimate in a manner relevant to chemoinformatics, such as the confidence–error curve. An open question is the design of approximate inference techniques for graph convolutional neural networks that resolve the trilemma between computational cost, model accuracy, and the quality of the uncertainty estimate.
Another open question is whether the model has accurately disentangled aleatoric and epistemic uncertainty. Answering this question would require estimates of the ground truth aleatoric uncertainty, which is obtainable via repeating the experimental measurement and reporting the variance. Benchmark datasets which provide accurate experimental uncertainty estimates will be invaluable to advancing the Bayesian methodology.
Finally, our active learning methodology performs well when the initial screen covers diverse compounds. To successfully perform active learning on a scaffold-biased initial set, the model needs information on the bioactivity of the unseen scaffolds. We speculate that strategies such as multitask learning,44,45 which pool information from other cognate assays that have explored the unseen scaffolds, will be a fruitful avenue.
Footnotes
† https://github.com/debbiemarkslab/neural-fingerprint-theano |
‡ https://github.com/pfnet-research/hierarchical-molecular-learning |
This journal is © The Royal Society of Chemistry 2019