Bayesian semi-supervised learning for uncertainty-calibrated prediction of molecular properties and active learning

We report a statistically principled method to quantify the uncertainty of machine learning models for molecular property prediction. We show that this uncertainty estimate can be used to judiciously design experiments.


Introduction
Predicting physiological properties and bioactivity from molecular structure, known as quantitative structure-property relationships (QSPR), underpins a large class of problems in drug discovery. Classical QSPR workflows 1 separate descriptor generation, which maps a 2D 2-5 or 3D molecular structure 6,7 into a vector of real numbers using handcrafted rules, from the machine learning method that connects descriptors to a property. Pioneering advances in machine learning such as graph neural networks directly take a molecular graph as input and infer the optimal structure-to-descriptor map from data, 8,9 outperforming classical machine learning methodologies with handcrafted descriptors. 10 Nonetheless, graph neural networks are usually developed using frequentist maximum likelihood inference, with the benchmark being the mean error on a test set. However, if the goal of QSPR is to replace mission-critical but expensive experiments, a low mean error is insufficient: the user needs an estimate of uncertainty and must know when the model is expected to fail. Typically only a small number of top-ranked predictions are selected for experimental testing, so outliers can ruin a discovery campaign. Moreover, cost limits the number of experiments that can be run, so an approach that judiciously designs the training set to maximise the information gained is needed.
Uncertainty quantication, or domain applicability, has been extensively considered in the QSPR literature but not in the context of graph neural networks and not in a statistically complete way. Previous works estimate uncertainty of prediction as the distance in descriptor space between the input molecule and the training set, or training an ensemble of models and evaluating the variance. [11][12][13][14] More recent works consider conformal regression, 15,16 which trains two models, one for the molecular property and one for the error. However, there are two sources of uncertainty: epistemic uncertainty arises due to insufficient data in the region of chemical space that the model is asked to make predictions on. Aleatoric uncertainty arises due to noise in the measurements themselves (e.g. noisy biochemical assays). 17 Distance to the training set and variance within a model ensemble approximately capture epistemic uncertainty, whilst employing an ancillary model for prediction error approximately captures aleatoric uncertainty. We will show that the Bayesian statistical framework captures both sources of uncertainty in a unied and statistically principled manner.
Active learning strategies have been considered in the drug discovery literature. 18,19 However, those pioneering works considered a priori defined molecular descriptors and estimate uncertainty via the variance within an ensemble of models. Notwithstanding the incomplete modelling of uncertainty discussed above, employing graph neural networks in active learning presents unique opportunities and challenges: high model accuracy in the big-data limit comes at the cost of being data-hungry. As the descriptor is fully data-driven, the model cannot estimate how "far" a compound is from the training set in the low-data limit, leading to poor uncertainty estimates and breaking the active learning cycle. Low-data drug discovery has been considered in the context of one-shot learning, 20 which estimates distance in chemical space by pulling data from related tasks. Nonetheless, this approach requires a priori knowledge of which tasks are related. Works on generative molecular design overcome this problem 21 by starting the active learning cycle with <1000 quantitative measurements, which imposes a significant upfront experimental cost.
In this paper, we combine Bayesian statistics, a principled framework for uncertainty estimation, with semi-supervised learning, which learns the representation from unlabelled data. We show that Bayesian semi-supervised graph convolutional neural networks can robustly estimate uncertainty even in the low-data limit, drive an active learning cycle, and overcome bias in the training set. Further, we demonstrate that the quality of posterior sampling is directly related to the accuracy of the uncertainty estimates. As different Bayesian inference methods can be mixed and matched with different models, our study opens up a new dimension in the design space of uncertainty-calibrated QSPR models.

Methods and data
A machine learning method has two independent components: model and inference. The model is a function with parameters that relate the input to the output. Inference pertains to the methodology by which the model parameters are inferred from data. In terms of model, we focus on graph convolutional neural network models that take molecular graphs as input. In terms of inference, we focus on the Bayesian methodology.

Supervised graph convolutional neural network
Our baseline model is the graph convolutional fingerprint model. 9 The salient idea is the message passing operation, 22 which creates a vector that summarises the local atomic environment around each atom while respecting invariance with respect to atom relabelling. A molecule is described by a graph, where the nodes are atoms and the edges are bonds. Atom v is described by a vector of atomic properties x_v, and a bond connecting atoms v and w is described by bond properties e_{vw}. The algorithm is iterative: at step t, each atom has a hidden state h_v^t, which depends on "messages" m_v^t received from surrounding atoms as well as h_v^{t-1}. The hidden states can be interpreted as descriptors of the local atomic environment, and the messages allow adjacent atoms to comprehend the environment of their neighbours. Each atom is initialised to its atomic features,

h_v^0 = x_v,   (1)

and the hidden states are updated as

m_v^t = \sum_{w \in N(v)} h_w^{t-1},   h_v^t = \sigma(H_t^N (h_v^{t-1} + m_v^t)),   (2)

where N(v) denotes the set of atoms bonded to atom v, σ(·) is the sigmoid function, and H_t^N is a learned matrix for each step t and vertex degree N. The algorithm is run T times, with T being a hyperparameter. In the final step, the output is given by a multilayer neural network f(·) that takes a weighted average of the hidden states at each step as input and returns a prediction,

ŷ = f( \sum_t \sum_v softmax(W_t h_v^t) ),   (3)

where W_t are learned readout matrices, one for each step t.
We use the implementation reported in the repository. † In all experiments, we consider T = 3, the hidden layer at each level has 128 units, the fingerprint length is 256 (i.e. W_t ∈ ℝ^{128×256}), and f(·) is a two-layer neural network with 128 units per layer and ReLU activations.
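As a concrete illustration, the message passing of eqn (1)-(3) can be sketched in a few lines of NumPy. This is a simplified sketch rather than the implementation used in the paper: it learns one matrix per step (not one per step and vertex degree) and omits the bond features e_{vw} and the softmax readout.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def message_passing(x, adjacency, H, T=3):
    """Simplified sketch of the message passing of eqn (1)-(2).

    x         : (n_atoms, d) initial atom features, h_v^0 = x_v
    adjacency : (n_atoms, n_atoms) binary bond matrix
    H         : list of T learned (d, d) matrices, one per step
    Returns the hidden states at every step.
    """
    h = x.copy()
    states = [h]
    for t in range(T):
        # message m_v^t: sum of the neighbours' hidden states
        m = adjacency @ h
        # update: combine own state and messages, apply learned matrix, squash
        h = sigmoid((h + m) @ H[t])
        states.append(h)
    return states

# toy "molecule": 3 atoms in a chain, 4 features each
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = [rng.normal(scale=0.1, size=(4, 4)) for _ in range(3)]
states = message_passing(x, adj, H)
print(len(states), states[-1].shape)  # 4 (3, 4)
```

The fingerprint of eqn (3) would then pool the per-step hidden states into a fixed-length vector before the final neural network.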

Semi-supervised graph convolutional neural network
The fully supervised approach learns molecular descriptors directly from data. This is an advantage when one has abundant data, but a disadvantage in data-limited settings such as active learning applications, where the objective is to design informative experiments starting from a small pool of initial training data.
The insight behind the semi-supervised approach is that a significant amount of chemical knowledge is contained within the molecular structures themselves, without any associated molecular properties (i.e. unlabelled data). Thermodynamic stability puts constraints on what bonds are possible, and tends to place certain bonds near each other, forming persistent chemical motifs. For example, just by looking at drug molecules, one would immediately spot ubiquitous motifs such as amide groups and benzene rings, and some motifs often occur together as scaffolds. 23 The key assumption is that those persistent chemical motifs contribute to the molecular property that we want to predict. We can make mathematical progress by constructing a descriptor akin to eqn (1)-(3). However, the objective is no longer to fit a particular property. Rather, the hidden states h_v^t, which summarise the atomic environment around atom v within radius t, are constructed such that they are predictable from the hidden states of the surrounding atoms. Therefore, the model learns a descriptor that clusters similar environments.
Specically, we use the semi-supervised approach developed by Nguyen et al., 24 which builds on the paragraph vector approach in natural language processing. 25 Given a set of molecular structures M , the hidden states h t v maximise the loglikelihood where u n is the molecular identier, obtained by maximising eqn (4), with h t v dened by eqn (1) and (2). We can interpret u n as a vector that describes the "type" of molecule, and the objective encourages the hidden states h t v to take values such that similar molecules have similar atomic environments.
Aer nding parameters that maximise the objective (4), {h t v } are then passed to a neural network, eqn (3). The parameters of the neural network as well as the readout matrices W t are learned in a supervised manner. Note that this formalism infers descriptors using unsupervised learning and uses supervised learning to relate descriptors to molecular properties.
We use the implementation reported in the GitHub repository ‡ accompanying ref. 24. In all experiments, we consider T = 3, the hidden layer at each level has 128 units, the fingerprint length is 256 (i.e. W_t ∈ ℝ^{128×256}), and f(·) is a two-layer neural network with 128 units per layer and ReLU activations.

Bayesian deep learning
In Bayesian inference, the aim is to determine the distribution of model parameters that conforms to the data, the so-called posterior distribution. Let θ be the model parameters, x_i the independent variables and y_i the dependent variable, such that

P(θ | {x_i}, {y_i}) = P({y_i} | {x_i}, θ) P(θ) / Z,

where Z is a normalising constant. The prediction for an unseen input x̂ is obtained by averaging over the posterior,

ŷ = ∫ f(x̂; θ) P(θ | {x_i}, {y_i}) dθ.   (8)

The uncertainty of model predictions can be readily derived from this Bayesian formalism. There are two types of uncertainty. 17 First, the epistemic uncertainty is given by the variance of the prediction with respect to the posterior,

Var[ŷ] = ∫ f(x̂; θ)² P(θ | {x_i}, {y_i}) dθ − ŷ².   (9)

Second, the aleatoric uncertainty is the intrinsic noise of the measurement, σ_i². This aleatoric noise can depend on the input, σ_i² = σ(x_i)², as certain areas of chemical space can be intrinsically more variable. We note that the log posterior is, up to a constant,

log P(θ | {x_i}, {y_i}) = −Σ_i [ (y_i − f(x_i; θ))² / (2σ_i²) + (1/2) log σ_i² ] + log P(θ),   (10)

which is exactly the mean-squared loss if σ_i is constant, with log P(θ) acting as the regulariser. Therefore, maximum likelihood inference is a special case of Bayesian inference.
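In practice, the integrals of eqn (8) and (9) are estimated by Monte Carlo over samples from (an approximation to) the posterior. A minimal sketch, with hypothetical toy inputs in place of a trained model:

```python
import numpy as np

def bayesian_predict(preds, aleatoric_vars):
    """Monte Carlo estimate of eqn (8)-(9) from N posterior samples.

    preds          : (N,) predictions y_m = f(x; theta_m), one per
                     posterior sample theta_m
    aleatoric_vars : (N,) predicted noise variances sigma_m^2
    """
    mean = preds.mean()                 # eqn (8): average over the posterior
    epistemic = preds.var()             # eqn (9): spread across posterior samples
    aleatoric = aleatoric_vars.mean()   # intrinsic measurement noise
    return mean, epistemic, aleatoric, epistemic + aleatoric

# toy posterior: 50 samples around 2.0, constant predicted noise
rng = np.random.default_rng(1)
preds = 2.0 + 0.3 * rng.normal(size=50)
avars = np.full(50, 0.05)
mean, epi, ale, total = bayesian_predict(preds, avars)
print(round(ale, 3))  # 0.05
```

The total uncertainty is the sum of the two terms, which is how the variance in eqn (12) below is assembled.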
The Bayesian formalism is easy to state but computationally expensive. The bottleneck is the numerical evaluation of the high-dimensional integrals (8) and (9). A plethora of approximate numerical methods have been developed in the literature to overcome this bottleneck. However, there is no free lunch, and methods that approximate the posterior well are usually computationally expensive. In this paper, we consider two approximate methods spanning the cost-accuracy spectrum.
2.3.1 Dropout variational inference. Variational inference seeks to approximate the posterior distribution by a distribution that is much easier to sample from. Ref. 17 and 26 show that a popular way to regularise neural networks, dropout, is equivalent to approximate Bayesian inference. The algorithm is simple: the neural network is forked at the last layer to have two outputs, the predicted aleatoric uncertainty σ_i² and the dependent variable y_i, and is trained to minimise the loss (10). However, during training, each unit has a probability p of being set to 0. For a neural network with M units, ref. 17 and 26 show that the above algorithm is approximately equal to fitting a variational distribution of the form

q(θ) = ∏_{m=1}^{M} [ p δ(Θ_m) + (1 − p) δ(Θ_m − θ_m*) ],   (11)

to the posterior distribution P(θ|{x_i},{y_i}), where Θ_m is the parameter vector associated with the m-th unit and θ_m* its learned value. Distribution (11), although not the same as the true posterior distribution, is significantly easier to sample: in the prediction phase the model is run N times, and, akin to the training phase, each unit has probability p of being set to 0. The final prediction and total uncertainty are taken to be the mean over the N predictions of the dependent variable and of the variance,

ŷ = (1/N) Σ_{m=1}^{N} y_m,   σ² = [ (1/N) Σ_{m=1}^{N} y_m² − ŷ² ] + (1/N) Σ_{m=1}^{N} σ_m².   (12)

The first term in the variance of eqn (12) is the epistemic uncertainty and the second term is the aleatoric uncertainty.
In our numerical experiments, dropout is applied to every unit that is trained using supervised learning: every unit in the supervised graph convolutional neural network is subjected to dropout, whereas in the semi-supervised case only the layers on top of the hidden states are trained with dropout.
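The prediction phase of dropout variational inference can be sketched as follows. The one-unit "network" below is a hypothetical stand-in for a real graph convolutional model; the point is the stochastic forward passes and the variance decomposition of eqn (12).

```python
import numpy as np

def mc_dropout_predict(forward, x, n_samples=100, rng=None):
    """Monte Carlo dropout prediction (eqn (12)).

    forward : callable (x, rng) -> (y, sigma2); keeps dropout *active*
              at prediction time, so each call is stochastic.
    """
    rng = rng or np.random.default_rng()
    ys, s2s = zip(*(forward(x, rng) for _ in range(n_samples)))
    ys, s2s = np.array(ys), np.array(s2s)
    mean = ys.mean()
    epistemic = ys.var()     # first term of the variance in eqn (12)
    aleatoric = s2s.mean()   # second term of the variance in eqn (12)
    return mean, epistemic + aleatoric

# hypothetical toy "network": a single hidden unit dropped with p = 0.5
def toy_forward(x, rng, p=0.5):
    keep = rng.random() > p
    y = (x * 1.5) if keep else 0.0
    return y, 0.1             # constant predicted aleatoric variance

mean, var = mc_dropout_predict(toy_forward, 2.0, n_samples=2000,
                               rng=np.random.default_rng(0))
print(var > 0.1)  # True: epistemic spread adds to the aleatoric floor
```

Because dropout stays active at prediction time, the spread of the N forward passes directly measures the epistemic term.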
2.3.2 Stein variational gradient descent (SVGD). Rather than fitting a distribution to the posterior, Stein Variational Gradient Descent (SVGD) 27 directly draws samples from the posterior via gradient descent. Specifically, let {θ_i^0}_{i=1}^N be parameters randomly and independently initialised in parameter space. We want to evolve the parameters such that, after T steps, {θ_i^T}_{i=1}^N are N independent samples drawn from P(θ|{x_i},{y_i}). Ref. 27 shows that the following dynamical system does the trick:

θ_i^{t+1} = θ_i^t + η φ(θ_i^t),   (13)

where

φ(θ) = (1/N) Σ_{j=1}^{N} [ k(θ_j^t, θ) ∇_{θ_j^t} log P(θ_j^t | {x_i}, {y_i}) + ∇_{θ_j^t} k(θ_j^t, θ) ],   (14)

k(·,·) is a generic kernel function and η is the learning rate. Eqn (13) and (14) can be interpreted as free energy minimisation of an interacting particle system: a "particle" (parameter vector) is subjected to a "force" φ(θ), which drives the particles to regions of low energy (low loss) whilst forcing them apart to maximise entropy. The total uncertainty is again evaluated with eqn (12), except that {y_m}_{m=1}^N are predictions from the different model parameters {θ_i}_{i=1}^N. The key advantage of eqn (13) and (14) is that they form a well-defined approximation: frequentist inference (c.f. eqn (10)) is recovered if N = 1, whereas as N → ∞ the system exactly samples from the posterior. Therefore, for finite N, the algorithm interpolates between frequentist and full Bayesian inference. The computational cost and memory demands increase with N; in this paper we use N = 50.
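The update of eqn (13) and (14) can be sketched for a scalar parameter with an RBF kernel. The median-heuristic bandwidth below is a common choice from the SVGD literature, not specified in the text, and the Gaussian toy posterior is illustrative.

```python
import numpy as np

def rbf_kernel(theta):
    """RBF kernel matrix and its gradient, median-heuristic bandwidth."""
    diff = theta[:, None] - theta[None, :]              # (N, N)
    h = np.median(np.abs(diff)) ** 2 / np.log(len(theta) + 1) + 1e-8
    K = np.exp(-diff ** 2 / h)
    gradK = (-2.0 * diff / h) * K                       # d k(theta_j, .)/d theta_j
    return K, gradK

def svgd(grad_log_post, theta, n_steps=500, lr=0.05):
    """SVGD (eqn (13)-(14)) for a scalar parameter: particles follow the
    kernelised gradient of the log posterior plus a repulsive term."""
    theta = theta.copy()
    for _ in range(n_steps):
        K, gradK = rbf_kernel(theta)
        phi = (K @ grad_log_post(theta) + gradK.sum(axis=0)) / len(theta)
        theta += lr * phi
    return theta

# toy posterior: N(3, 1), so grad log P(theta) = -(theta - 3)
rng = np.random.default_rng(0)
particles = svgd(lambda t: -(t - 3.0), rng.normal(size=50))
print(round(particles.mean(), 1))  # close to 3.0
```

The first term of φ drags every particle towards high posterior density; the kernel-gradient term pushes particles apart, which is what distinguishes SVGD from N independent gradient descents.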
To illustrate the computational demands of SVGD, Fig. 1 shows the wall-clock time, on an Nvidia P100 GPU, as a function of the number of gradient update steps for graph convolution with dropout, semi-supervised with dropout, and semi-supervised with SVGD on the melting point dataset discussed below. Both SVGD and stochastic gradient descent use backpropagation to optimise the neural network parameters. For models trained using stochastic gradient descent, the computational complexity of backpropagation at each iteration is O(BM), where B is the number of training samples at each iteration and M is the number of parameters in the model. The semi-supervised model has fewer parameters than the fully supervised model, thus the wall-clock time per iteration is lower. In SVGD, we need to update N Stein particles per iteration, thus the wall-clock time per iteration scales as O(BMN).

Architectures and hyperparameters
As the objective of this paper is to demonstrate the types of chemical problems that Bayesian deep learning can tackle, we adopt common parameters for graph convolutional neural networks taken from the literature rather than performing extensive hyperparameter optimisation and neural architecture search. For both supervised and semi-supervised graph convolutional neural networks, we keep the number of hidden layers the same as in the original implementations in the GitHub repositories cited above. Following those implementations, we keep the dimension of the fingerprint at twice n_h, the number of neurons in the hidden layers, and only optimise n_h by a grid search over {32, 64, 128, 256}, choosing the value of n_h with the best 5-fold cross-validation root mean squared error averaged over all the datasets. This leads to an architecture with T = 3, N = 50, two hidden layers, and n_h = 128 hidden units.

Datasets
We consider a set of common regression benchmarks for physical property prediction and bioactivity prediction. The melting point dataset is a collection of 3025 melting point measurements of drug-like molecules used in a benchmark study. 28 The ESOL dataset is a set of 1128 measured aqueous solubilities of organic molecules, 29 and the FreeSolv dataset is a set of 643 hydration free energy measurements of small molecules in water. 30 The ESOL and FreeSolv datasets are used in the MoleculeNet benchmark. 10 The CatS dataset comprises half-maximal inhibitory concentration (log IC50) measurements of 595 molecules against Cathepsin S, taken from the D3R Grand Challenge 3 and 4. 31 The Malaria dataset is a set of in vitro half-maximal effective concentration (log EC50) measurements of 13 417 molecules against a sulfide-resistant strain of P. falciparum, the parasite that causes malaria, 32 used in a previous benchmark. 9 The p450 dataset is a set of half-maximal effective concentration measurements of 8817 molecules against Cytochrome P450 3A4, a key enzyme for the metabolism and clearance of xenobiotics, taken from the PubChem assay AID 1851.
To give the reader a sense of how "hard" the different datasets are, we consider a simple baseline model of XGBoost 33 on ECFP6 fingerprints. 34 We split the data 80/10/10 (training/validation/testing). We take MaxDepth = 5, LearningRate = 0.01, and optimise the number of estimators (nEstimators ∈ {50, 100, 150, 200, 250}) using the validation set. Table 1 shows the coefficient of determination R² and root mean squared error (RMSE). Judging from the coefficient of determination, the datasets span a wide range of difficulty.

Uncertainty quantication
We rst consider how well can the model estimate its own uncertainty given the full dataset, split into training (80%) and test (20%) sets. The quality of the uncertainty estimate is operationalised by asking what is the model accuracy when the most uncertain predictions are removed, with uncertainty quantied by the variance computed from eqn (12). Our baseline method is graph convolution with dropout, which has recently been implemented in the DeepChem package as a feature, 10 although to our knowledge this is the rst study that benchmarks Bayesian graph convolutional neural networks in terms of uncertainty quantication. Fig. 2 shows that semi-supervised learning with SVGD accurately estimates uncertainty and signicantly outperforms the baseline on every dataset. The plots show how the test set error varies as a function of condence percentilei.e. what is the error if we only consider the top n% of compounds in the test set ranked by condence (note that condence is inverse of the uncertainty, quantied by eqn (12)); the shaded region is one standard deviation, estimated by analysing 5 random partitions of the data into training and test sets. In every case, the error is a decreasing function of model condence, thus the model successfully estimates which predictions are likely to be correct and which predictions are outliers.
Another metric that we can evaluate is the shape of the confidence-error curve. For ESOL and FreeSolv, the error decreases steeply in the low-confidence limit before plateauing, suggesting that most predictions are accurate but for a few outliers, which the Bayesian method can identify. The situation is different for MeltingPoint, Malaria and p450: the error decreases slowly in the low-confidence limit before dropping sharply as it approaches the 100% confidence percentile (see also the insets of Fig. 2). This suggests that a few predictions are very accurate, and the Bayesian method can pick out those accurate predictions amid many less accurate ones. We note that our Bayesian model is well-suited for virtual screening applications, where the challenge is ensuring that the top-ranked actives picked out by the model are indeed actives, since only a very small proportion of the compounds ranked will actually be screened experimentally (the "early recognition problem"). 35,36 A lingering question is whether the quality of the uncertainty estimate is due to a set of good descriptors (obtained via semi-supervised learning) or to the accuracy of the Bayesian methodology. Fig. 2 also shows that replacing SVGD with dropout significantly reduces model performance. At the same confidence percentile, SVGD consistently outperforms dropout. This suggests that the quality of posterior sampling drastically impacts the quality of uncertainty estimation.
The quality of uncertainty estimates can also be gauged by the correlation between the predicted uncertainty on test data points and the error that the model incurs. Table 2 shows the Spearman correlation coefficient between the predicted variance and the model error. As expected, combining semi-supervised learning with SVGD leads to the method with the highest rank correlation. This result is consistent with Fig. 2, which shows that semi-supervised learning with SVGD has the lowest confidence-error curve. Moreover, the rank correlation between predicted uncertainty and error broadly (although not exactly) follows the "difficulty" of the data (c.f. Table 1): FreeSolv and ESOL have the highest rank correlation, and p450 has the lowest.
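The Spearman rank correlation between predicted variance and model error can be computed directly; the sketch below assumes no ties, which holds for continuous predictions, and uses hypothetical synthetic data.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks
    (valid in this simple form when there are no ties)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return (ra @ rb) / np.sqrt((ra @ ra) * (rb @ rb))

# an uncertainty estimate that tracks the error ranks should give a
# correlation close to 1
rng = np.random.default_rng(0)
err = np.abs(rng.normal(size=100))
pred_var = err + 0.1 * rng.normal(size=100)   # noisy but informative
print(spearman(pred_var, err) > 0.8)  # True
```

Rank correlation is the natural metric here because only the ordering of predictions by confidence matters for triaging which predictions to trust.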
Previous works on domain applicability have focused either on building an auxiliary model to predict the error, 15,16 or on estimating the uncertainty of a prediction via the distance of the input to the training set. 11,12,14 The former models aleatoric uncertainty, whereas the latter approximately captures epistemic uncertainty. Our Bayesian method captures both sources of uncertainty in a statistically principled manner, and also provides independent estimates of the epistemic and aleatoric uncertainties. As such, we can ask: is knowing the epistemic or aleatoric uncertainty alone sufficient to estimate whether a prediction is accurate? Fig. 3 shows that the confidence-error curve for semi-supervised learning with SVGD obtained by considering both epistemic and aleatoric uncertainty is below (or matches) that obtained by considering epistemic or aleatoric uncertainty alone. Considering both sources of uncertainty leads to much more accurate predictions in the high-confidence limit for ESOL and p450. Moreover, there is no consistent trend as to whether epistemic or aleatoric uncertainty is more important: for ESOL, epistemic uncertainty is a better estimate of the error than aleatoric uncertainty, whereas the opposite is true for p450 and CatS. As such, one cannot a priori overlook either epistemic or aleatoric uncertainty, and our approach of combining both sources leads to an accurate uncertainty estimate.

Overcoming dataset bias
Our Bayesian methodology also overcomes dataset bias, which has been noted in the recent literature as the leading cause of overly optimistic results on benchmarks. 37 Most ligand-based benchmarks are biased in the sense that the reported molecules are tightly clustered around a few important chemical scaffolds, such that when the dataset is randomly split into training and test sets, the molecules in the test set are structurally very similar to the training set. Therefore, a model that only memorises the training set will still achieve high accuracy on the test set yet cannot generalise to other regions of chemical space. Methods such as scaffold splitting 10 and attribution 38 attempt to estimate what the true performance of the model would be if the dataset were not biased. However, bias is fundamental to chemical data: a "uniform distribution" in chemical space does not exist because chemical space does not have a well-defined metric. Regardless of how one preprocesses the dataset or trains the model, model predictions will always be awry for scaffolds that are not represented in the dataset. As such, rather than "unbiasing" the data, the practical question is whether the model can estimate whether it is likely to make a correct prediction for an unseen molecule given a biased training set.

Fig. 3 Epistemic uncertainty and aleatoric uncertainty are distinct sources of uncertainty, and a combination of them is needed to obtain a good estimate of the model error. We plot the confidence-error curve for semi-supervised learning with Stein Variational Gradient Descent where the confidence is estimated from combining epistemic and aleatoric uncertainty, from epistemic uncertainty alone, and from aleatoric uncertainty alone.
To show that Bayesian uncertainty estimation overcomes dataset bias, we consider a toy problem where we know the ground truth and deliberately introduce bias: we consider the problem of predicting octanol-water partition coefficient (log P) values, and use computed ACD log P values as a surrogate. We construct a dataset comprising all molecules in ChEMBL with either a beta-lactam or a benzodiazepine scaffold. The dataset is obviously very biased, as it contains only 2 scaffolds. We then train a model using SVGD with semi-supervised learning, with the standard 8 : 2 split between training and test sets on the biased dataset. Fig. 4 (left) shows that the model is reasonably accurate on the biased test set. We now simulate how a user might unwittingly fall foul of dataset bias: suppose we use the model to predict the log P of all molecules with a steroid scaffold in ChEMBL. Fig. 4 (middle) shows that the model performance, perhaps unsurprisingly, is poor. Steroids are not part of the training set, thus the model cannot predict their physicochemical properties. Bayesian uncertainty estimation provides a way out of this quandary: Fig. 4 (right) shows that the estimated uncertainty of the log P predictions on steroids is significantly greater than that of the log P predictions on the test set of beta-lactams and benzodiazepines. In other words, the model can inform the user when it is inaccurate, thus mitigating the impact of dataset bias.

Low data active learning
Having considered the quality of the uncertainty estimates in the data-abundant limit, our next question is whether we can estimate uncertainty in the low-data limit and drive an active learning cycle. We consider the objective of obtaining a low model error with a small training set. The model is first trained on a small initial pool of data (25% of the full training set, picked randomly); the model then selects the batch of molecules (2.5% of the full training set) with the largest epistemic uncertainty to add to the training set, the model is retrained to suggest further additions, and the cycle continues. The test set is always 20% of the full dataset, held out at the beginning of the experiment. We note that other acquisition functions have been suggested in the literature, 39 and the objective function is problem-dependent. 18,19 Nonetheless, the goal of our experiment is to evaluate the quality of the uncertainty estimate, thus we focus on a simple objective and acquisition function. Further, as active learning requires constant retraining of the model, and SVGD is significantly more computationally intensive than dropout, we only consider dropout variational inference. Fig. 5 shows that semi-supervised learning significantly outperforms fully supervised learning in the low-data limit. The mean learning curves and error bars are obtained by analysing 20 active learning runs starting from random dataset splits. Moreover, in the case of fully supervised learning, active learning is unable to deliver a better learning curve than random sampling, whereas for semi-supervised learning there is a sizeable gap between the learning curves of random sampling and active learning. This is because the fully supervised method generates molecular descriptors directly from data. Therefore, in the low-data limit, it is unable to learn descriptors that describe the structure of chemical space and the chemical similarity between compounds, and thus cannot generate meaningful uncertainty estimates to drive active learning.
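The active learning protocol above can be sketched as a short loop. The nearest-neighbour stand-in model and the 1D data are hypothetical; only the 25% initial pool, 2.5% batches, and largest-epistemic-uncertainty acquisition mirror the text.

```python
import numpy as np

def active_learning(X, y, train_model, init_frac=0.25, batch_frac=0.025,
                    n_rounds=10, rng=None):
    """Sketch of the uncertainty-driven active learning cycle.

    train_model(X, y) must return an object with
    .predict(X) -> (mean, epistemic_variance).
    """
    rng = rng or np.random.default_rng()
    n = len(X)
    labelled = list(rng.choice(n, size=int(init_frac * n), replace=False))
    for _ in range(n_rounds):
        model = train_model(X[labelled], y[labelled])
        pool = np.setdiff1d(np.arange(n), labelled)
        _, epi = model.predict(X[pool])
        # acquire the batch with the largest epistemic uncertainty
        batch = pool[np.argsort(epi)[::-1][:int(batch_frac * n)]]
        labelled.extend(batch.tolist())
    return train_model(X[labelled], y[labelled]), labelled

class NearestNeighbourModel:
    """Hypothetical stand-in: epistemic uncertainty is the distance to
    the nearest training point."""
    def __init__(self, X, y):
        self.X, self.y = X, y
    def predict(self, Xq):
        d = np.abs(Xq[:, None] - self.X[None, :])
        return self.y[d.argmin(axis=1)], d.min(axis=1)

X = np.linspace(0, 10, 400)
y = np.sin(X)
model, labelled = active_learning(
    X, y, lambda A, b: NearestNeighbourModel(A, b),
    rng=np.random.default_rng(0))
print(len(labelled))  # 100 initial + 10 rounds x 10 per batch = 200
```

Replacing the stand-in with a Bayesian graph convolutional model recovers the experiment in the text, with epistemic variance from eqn (12) as the acquisition score.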
The importance of choosing diverse compounds in the initial screen has been discussed extensively in the literature, [40][41][42] and the performance of our active learning method also depends on the chemical diversity of the initial screen. Fig. 6 shows that active learning does not outperform random sampling when the initial training set is biased and contains only a small number of scaffolds. We model scaffold bias by splitting the data using the scaffold splitting implemented in DeepChem, 10 and consider the Malaria example, where active learning most clearly outperforms random sampling in Fig. 5. The underperformance of active learning is perhaps unsurprising: if the initial screen only consists of one scaffold, the knowledge that the model has of the other scaffolds is minimal, i.e. the model is equally ignorant about all the other scaffolds. As such, randomly sampling the other scaffolds becomes a reasonable strategy.

Fig. 4 Dataset bias can be mitigated with Bayesian uncertainty estimation. We consider a toy problem of predicting computationally calculated log P using Stein Variational Gradient Descent and semi-supervised learning, with a biased dataset comprising all beta-lactams and benzodiazepines from ChEMBL. (Left) The model performance when the test set is also drawn from beta-lactams and benzodiazepines. (Middle) The model performance when the test set is all steroids from ChEMBL. (Right) The distribution of predicted uncertainty for the model applied to steroids and the model applied to beta-lactams and benzodiazepines.
Conclusions

We propose a novel method to quantify uncertainty in molecular property prediction. We show that our methodology significantly outperforms the baseline on a range of benchmark problems, both in terms of model accuracy and in terms of uncertainty estimates. Our method also overcomes dataset bias by returning a large uncertainty estimate when the test set is drawn from a different region of chemical space than the training set. Moreover, our methodology can drive an active learning cycle, maximising model performance while minimising the size of the training set. The key to the success of our method is the combination of semi-supervised learning and Bayesian deep learning. Semi-supervised learning allows us to learn an informative molecular representation in the low-data limit. Bayesian deep learning allows us to estimate aleatoric and epistemic uncertainty in a statistically principled manner. We exemplified our methodology on regression, as it is generally more challenging than classification, although it can be readily extended to classification problems.
Our observation that the choice of Bayesian inference methodology significantly impacts the quality of the uncertainty estimate suggests an evident follow-up that probes the mathematical limit of Bayesian inference, i.e. benchmarking approximate inference techniques against sampling of the posterior using Markov Chain Monte Carlo run until convergence, which is computationally expensive but mathematically exact. Moreover, we note that most approximate inference techniques in the literature have been benchmarked in terms of RMSE or log-likelihood, 43 rather than by explicitly considering the quality of the uncertainty estimate in a manner relevant to chemoinformatics, such as the confidence-error curve. An open question is the design of approximate inference techniques for graph convolutional neural networks that solve the trilemma between computational cost, model accuracy, and the quality of the uncertainty estimate.
Another open question is whether the model has accurately disentangled aleatoric and epistemic uncertainty. Answering this question would require estimates of the ground-truth aleatoric uncertainty, which is obtainable by repeating the experimental measurement and reporting the variance. Benchmark datasets which provide accurate experimental uncertainty estimates will be invaluable to advancing the Bayesian methodology.

Fig. 5 Semi-supervised learning significantly outperforms fully supervised learning in active learning. The model starts with 25% of the full training set, selected randomly, and at each iteration 2.5% of the full training set is added to the training set. The molecules added are picked randomly (random sampling) or picked because they have the largest predicted epistemic uncertainty (active learning). The curves show the mean model error and the standard error of the mean, averaged over 20 active learning runs starting from random dataset splits, as a function of iteration. The insets focus on the performance of semi-supervised learning with SVGD.

Fig. 6 Choosing diverse compounds in the initial screen is crucial to successful active learning. We first randomly split the Malaria dataset into training (80%) and test (20%) sets. We then scaffold-split the training set to obtain a biased initial set (25% of the total training set), and at each iteration 2.5% of the training set is given to the model, selected randomly (random sampling) or based on the highest epistemic uncertainty (active learning).
Finally, our active learning methodology performs well when the initial screen covers diverse compounds. To successfully perform active learning on a scaffold-biased initial set, the model needs information on the bioactivity of those unseen scaffolds. We speculate that strategies such as multitask learning, 44,45 which pools information from other cognate assays which have explored the unseen scaffolds, will be a fruitful avenue.

Conflicts of interest
There are no conicts to declare.