Accelerating high-throughput virtual screening through molecular pool-based active learning

Structure-based virtual screening is an important tool in early stage drug discovery that scores the interactions between a target protein and candidate ligands. As virtual libraries continue to grow (in excess of 10^8 molecules), so too do the resources necessary to conduct exhaustive virtual screening campaigns on these libraries. However, Bayesian optimization techniques, previously employed in other scientific discovery problems, can aid in their exploration: a surrogate structure–property relationship model trained on the predicted affinities of a subset of the library can be applied to the remaining library members, allowing the least promising compounds to be excluded from evaluation. In this study, we explore the application of these techniques to computational docking datasets and assess the impact of surrogate model architecture, acquisition function, and acquisition batch size on optimization performance. We observe significant reductions in computational costs; for example, using a directed-message passing neural network we can identify 94.8% or 89.3% of the top-50,000 ligands in a 100M member library after testing only 2.4% of candidate ligands using an upper confidence bound or greedy acquisition strategy, respectively. Such model-guided searches mitigate the increasing computational costs of screening increasingly large virtual libraries and can accelerate high-throughput virtual screening campaigns with applications beyond docking.


Introduction
Computer-aided drug design techniques are widely used in early stage discovery to identify small molecule ligands with affinity to a protein of interest. 1,2 Broadly speaking, these techniques fall into one of two domains: ligand-based or structure-based. Ligand-based techniques often rely on either a quantitative structure-activity relationship (QSAR) or similarity model to screen possible ligands. Both techniques require one or more previously labeled active/inactive compounds that are typically acquired through physical experimentation.
In contrast to ligand-based techniques, structure-based techniques, such as computational docking and molecular dynamics, try to simulate the physical process of a ligand binding to a protein active site and assign a quantitative score intended to correlate with the free energy of binding. 3 These techniques require a three-dimensional structure of the target protein but do not require target-specific bioactivity data. Scoring functions used in structure-based techniques are typically parameterized functions describing the intra- and intermolecular interactions at play in protein-ligand binding. 3 As a result, structure-based methods are in principle more able to generalize to unseen protein and ligand structures compared to ligand-based methods. This advantage has been leveraged to discover novel ligand scaffolds in a number of recent virtual screening campaigns. 4 A typical virtual screening workflow will exhaustively predict the performance of ligands in a virtual chemical library. However, over the past decade, these libraries have grown so large that the computational cost of screening cannot be ignored. For example, ZINC, a popular database of commercially available compounds for virtual screening, grew from 700k to 120M structures between 2005 and 2015 and, at the time of writing, now contains roughly 1 billion molecules. 5,6 ZINC is not unique in its gargantuan size; other enumerated virtual libraries exist that number well over one billion compounds. 7 Non-enumerated libraries contain an implicit definition of accessible molecules and can be much larger, containing anywhere from 10^10 to 10^20 possible compounds. 8-10 Despite some debate around whether "bigger is better" in virtual screening, 11 such large virtual libraries are now being used for screening in structure-based drug design workflows.
12-15 These studies required computational resources that are inaccessible to many academic researchers (e.g., 475 CPU-years in the case of Gorgulla et al. 12 ). Moreover, this high computational cost makes such a strategy impractical to apply to many distinct protein targets. As virtual libraries grow ever larger, new strategies must be developed to mitigate the computational cost of these exhaustive screening campaigns.
The goal in any virtual screening approach is to find a set of high-performing compounds (herein, computational "hits" with the most negative docking scores) within a significantly larger search space. To restate this formally, we are attempting to solve for the set of top-k molecules {x_i*}_{i=1}^k from a virtual library X that maximize some black-box function of molecular identity f : x ∈ X → R, i.e.,

{x_i*}_{i=1}^k = argmax_{{x_1, ..., x_k} ⊂ X} Σ_{i=1}^k f(x_i)    (1)

In this study, the black-box function f(x) is the negative docking score of a candidate ligand, but other possibilities include the HOMO-LUMO gap of a candidate organic semiconductor, the extinction coefficient of a candidate dye, the turnover number of a candidate catalyst, etc. Brute-force searching (screening every molecule in a library indiscriminately) is a straightforward and common strategy employed to solve this type of problem, but it necessarily spends a significant fraction of its time evaluating relatively low-performing compounds (Figure 1A).
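The brute-force baseline of Figure 1A amounts to solving Equation 1 by scoring everything; a minimal sketch, with hypothetical molecule names and scores standing in for docking results:

```python
import heapq

def top_k_brute_force(library, f, k):
    """Exhaustively evaluate f on every molecule and keep the k best.

    This is the baseline solver for Eq. 1: every library member is
    scored, so cost grows linearly with library size.
    """
    return heapq.nlargest(k, library, key=f)

# Toy example with hypothetical (negated) docking scores; higher is better.
scores = {"mol_a": 7.2, "mol_b": 9.1, "mol_c": 3.3, "mol_d": 8.5}
hits = top_k_brute_force(list(scores), scores.get, k=2)
```

Model-guided search aims to recover (most of) the same `hits` while calling `f` on only a small fraction of the library.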
However, algorithmic frameworks exist that aim to solve Equation 1 with the fewest number of required evaluations. Bayesian optimization is one such framework that uses a surrogate model trained on previously acquired data to guide the selection of subsequent experiments.
We describe Bayesian optimization in more detail in the Methods section below, but we refer the reader to ref. 16 for an in-depth review on the subject. The application of Bayesian optimization to physical problems is well-precedented, e.g., with applications to materials design, 17-20 bioactivity screening, 21-23 and molecular simulations, 24-26 but few examples exist of its application in virtual screening for drug discovery. Furthermore, the design space is both large and discrete but not necessarily defined combinatorially, which is a relatively unexplored setting for Bayesian optimization.
Two recent examples of work that used active learning for computational drug discovery can be found in Deep Docking 27 and a study from Pyzer-Knapp. 28 Deep Docking uses a greedy, batched optimization approach and treats docking scores as a binary classification problem during the QSAR modelling step with a fingerprint-based feed forward neural network surrogate model. Pyzer-Knapp also utilized a batched optimization approach with a Gaussian process (GP) surrogate model using a parallel and distributed Thompson sampling acquisition strategy; 29 this approach is well-suited to small design spaces (e.g., thousands of compounds) but GP training does not scale well to millions of acquired data points due to its O(N^3) complexity. 30 In contrast to Deep Docking, our work treats docking score as a continuous variable, so our surrogate model follows a regression formulation. Lyu et al. observed correlations between the success rates of experimental validation and computed docking scores, 13 suggesting that preserving this information during model training will improve our ability to prioritize molecules that are more likely to be validated as active.
In this work, we leverage Bayesian optimization algorithms for docking simulations in a manner that preserves the fidelity of these structure-based methods while decreasing the computational cost needed to explore the structure-activity landscape of virtual libraries by over an order of magnitude ( Figure 1B). We demonstrate that surrogate machine learning models can prioritize the screening of molecules that are associated with better docking scores, which are presumed to be more likely to validate experimentally. 13

Small virtual libraries
As an initial evaluation of the experimental efficiency Bayesian optimization can provide, we turned to a dataset containing scores from 10,540 compounds (Enamine's smaller Discovery Diversity Collection, "Enamine 10k") docked against thymidylate kinase (PDB ID: 4UNN 31 ) using AutoDock Vina. 32 Data acquisition was simulated as the iterative selection of 1% of the library (ca. 100 molecules) repeated five times after initialization with a random 1% subset for a total acquisition of 6% of the library. We first looked at a random forest (RF) operating on molecular fingerprints as our surrogate model along with a greedy acquisition strategy.
This combination yields clear improvement over the random baseline, representative of a brute-force search, when looking at the percentage of top-100 (ca. top-1%) scores in the full dataset found by MolPAL (Figure 2, left panel). The greedy strategy finds, on average, 51.6% (±5.9 standard deviation over five runs) of the top-100 scores in the full dataset when exploring only 6% of the pool.
We define the enrichment factor (EF) as the ratio of the percentage of top-k scores found by a given strategy to the percentage found by the random baseline at the same number of objective function evaluations.

Figure 2: Bayesian optimization performance on Enamine 10k screening data as measured by the percentage of top-100 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Each experiment began with a random 1% acquisition (ca. 100 samples) and acquired 1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs. RF, random forest; NN, neural network; MPN, message-passing neural network; UCB, upper confidence bound; TS, Thompson sampling; EI, expected improvement; PI, probability of improvement.

The best MPN result (EF = 12.0), obtained with greedy acquisition, is statistically identical to the best NN result.
These analyses were repeated for Enamine's larger Discovery Diversity Collection of 50,240 molecules ("Enamine 50k") also against 4UNN with the same docking parameters ( Figure 3). We again took 1% of the library as our initialization with five subsequent exploration batches of 1% each. All of the trends remained largely the same across acquisition metrics and model architectures; we observed comparable quantitative performance for every surrogate model/acquisition metric combination as compared to the smaller library. For example, the RF model with a greedy acquisition strategy now finds 59.1% (±2.9) of the top-500 scores (ca. top-1%) using the Enamine 50k library vs. the 51.6% of the top-100 scores (ca. top-1%) when using the Enamine 10k library. There was little difference between the results of the NN and MPN models on the Enamine 50k results, which find 74.8% and 72.9% of the top-500 scores after exploring just 6% of the library, respectively. These values represent enrichment factors of 11.3 and 11.0, respectively, over the random baseline.
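The enrichment factors quoted above reduce to a simple ratio; a minimal sketch, where the denominator is the fraction of top-k scores recovered by the empirically measured random baseline at the same number of objective function evaluations (the function name and example values are illustrative, not taken from the study):

```python
def enrichment_factor(frac_topk_found, frac_topk_found_random):
    """EF: ratio of the top-k fraction recovered by a model-guided
    search to the fraction recovered by the random baseline after the
    same number of objective function evaluations."""
    return frac_topk_found / frac_topk_found_random

# Hypothetical example: a strategy recovers 60% of the top-k scores
# while the random baseline recovers 5% at the same budget.
ef = enrichment_factor(0.60, 0.05)
```

Note that the random baseline need not recover exactly the fraction of the library explored, which is why EF values in the text are not simply the recovered fraction divided by the explored fraction.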

Enamine HTS Collection
Encouraged by the significant enrichment observed with the 10k and 50k libraries, we next tested Enamine's 2.1 million member HTS Collection ("Enamine HTS"), a size more typical of a high-throughput virtual screen. We again use 4UNN and AutoDock Vina to define the objective function values. With this larger design space, acquisitions of 1% represent a significant number of molecules (ca. 21,000); therefore, we also sought to reduce the exploration size. Given the strong performance of the greedy acquisition metric and its simplicity (i.e., no requirement for predicted variances), we focus our analysis on this metric alone.
We tested three different batch sizes for initialization and exploration, with five exploration batches, as in our above experiments. Using a batch size of 0.4% for a total of 2.4% of the pool, we observed strong improvement over random exploration for all three models (Tables S3-S5).

Figure 3: Bayesian optimization performance on Enamine 50k screening data as measured by the percentage of top-500 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Each experiment began with a random 1% acquisition (ca. 500 samples) and acquired 1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.

Ultra-large library
One goal of the Bayesian optimization framework in our software, MolPAL, is to scale to even larger chemical spaces. A two million member library is indeed a large collection of molecules, but screens of this size are compatible with standard high-performance computing clusters.
Our final evaluation sought to demonstrate that MolPAL can make virtual screens of ≥10^8-member libraries accessible to researchers using modest, shared clusters. We turned to a recent study by Lyu et al. that screened 99 million readily accessible molecules against AmpC β-lactamase (PDB ID: 12LS) using DOCK3.7. 13 The full dataset of 99.5 million molecules that were successfully docked against 12LS ("AmpC") is used as the ground truth. 34 We measure our algorithm's performance as the fraction of top-50,000 (top-0.05%) scores found.

Figure 5: Bayesian optimization performance on AmpC screening data as measured by the percentage of top-50,000 scores found for all three models using acquisition sizes of 0.4%, 0.2%, or 0.1%. Each trace represents the performance of the given model with a greedy acquisition strategy. Chart labels represent the fraction of the library acquired in the random initial batch and the five subsequent exploration batches. Error bars reflect ± one standard deviation across five runs.
We see a similar qualitative trend for the AmpC dataset as for all previous experiments: namely, the use of Bayesian optimization enables us to identify many of the top-performing compounds after exploring a small fraction of the virtual library, even when using a greedy acquisition metric (Figure 5). For the 0.4% batch size experiments, the MPN model finds 87.9% of the top-50,000 (ca. top-0.05%) scores after exploring 2.4% of the total pool (EF = 36.6). Decreasing the batch size to 0.2% led to a recovery of 67.1% (EF = 55.9), and further decreasing the batch size to 0.1% resulted in 52.0% recovery (EF = 86.7). The NN and RF surrogate models were not as effective as the MPN but still led to significant enrichment above the baseline. Results for additional acquisition functions can be found in Tables S6-S8. The UCB acquisition metric led to notable increases in the performance of the MPN model for both a 0.4% and 0.2% batch size, finding 94.8% (EF = 39.5) and 75.5% (EF = 62.9) of the top-50,000 scores, respectively; however, UCB was not consistently superior across models and batch sizes.

It is worth commenting on the differences in quantitative enrichment factors reported for the AmpC data and the Enamine HTS data. There are at least two differences that preclude a direct comparison: (1) the top-k docking scores in the AmpC data were generated by DOCK and span a range of -118.83 to -73.99, whereas the top-k docking scores from the Enamine HTS collection, calculated with AutoDock Vina, range from -12.7 to -11.0. The lower precision of AutoDock Vina scores makes the top-k score metric more forgiving (discussed later in the Limitations of evaluation metrics subsection). (2) The chemical spaces canvassed by the two libraries are different, which necessarily impacts model performance and, by extension, optimization performance.

Single-iteration active learning
A critical question with these experiments is the importance of the active learning strategy.
From the Enamine HTS data (Figure 4), we observe a sharp increase in the percentage of top-1000 scores found after the first exploration batch (e.g., from 0.4% to 70.3% for the MPN 0.4% acquisition), suggesting that the randomly selected initial batch is quite informative.
We look at "single-iteration" experiments, where the first batch is selected randomly and the second (and final) batch is selected according to our acquisition function ( Figure 6).
Selecting all 42,000 molecules at once after training on the initial 8,400 molecules results in finding 94.9% (± 0.8) of the top-1000 scores with an MPN model after exploring 2.4% of the library (EF = 36.5). This is slightly (but significantly) lower than the active learning case with an MPN model, which finds 97.7% of the top-1000 scores (EF = 37.6). Using an NN or an RF model, the differences in active learning versus the single-iteration case are more pronounced. We also test the setting where the initial batch consists of 42,000 molecules and the single active learning iteration only selects 8,400. The smaller acquisition size for the second batch results in considerably lower MPN performance, finding only 77.4% of the top-1000 scores (EF = 29.8). This is worse than any model we tested with an active learning-based strategy at the same number of function evaluations. The less flexible NN and RF models suffer even more pronounced drops in performance with a large initial batch and small exploration batch.

Dynamic convergence criterion
Our evaluations so far have demonstrated reliable performance of MolPAL using a fixed exploration strategy (i.e., number of iterations). However, we will typically not know a priori what an appropriate total exploration size is. We therefore define a convergence criterion for the Enamine HTS dataset that is satisfied when the fractional difference between the current average of the top-1000 scores and the rolling average of the top-1000 scores from the previous three epochs, corresponding to the top 0.05% of compounds, is less than 0.01. Figure 7 illustrates the use of this convergence criterion using a 0.1% batch size (ca. 2,100 molecules) with a greedy acquisition metric. We find that not only do the higher capacity models converge sooner (MPN > NN > RF), but they also converge to a higher percentage of top-1000 scores found (88.7%, 85.4%, and 75.8%, respectively).
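The convergence criterion described above can be sketched as follows (the function name and example values are hypothetical; scores are negated docking scores, so higher is better):

```python
def has_converged(topk_avg_history, window=3, threshold=0.01):
    """Stop when the current top-k average score differs from the
    rolling average of the previous `window` iterations by less than
    `threshold`, as a fraction of that rolling average."""
    if len(topk_avg_history) < window + 1:
        return False  # not enough iterations to form a rolling average
    current = topk_avg_history[-1]
    rolling = sum(topk_avg_history[-(window + 1):-1]) / window
    return abs(current - rolling) / abs(rolling) < threshold

# Hypothetical per-iteration averages of the top-1000 scores.
history = [9.0, 9.8, 10.1, 10.2, 10.25]
```

With these values the search continues, since the latest average still improves on the three-iteration rolling average by more than 1%.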

Chemical space visualization
To visualize the set of molecules acquired during the Bayesian optimization routine, we projected the 2048-bit atom-pair fingerprints of the Enamine HTS library into two dimensions using UMAP 35 (Figures 8 and S16). The embedding was trained on a random 10% subset of the full library and then applied to the full library. Comparing the location of the top-1000 molecules (Figure 8A) to the density of molecules in the entire 2M library (Figure 8B) reveals several clusters (black ellipses) of high-performing compounds located in sparse regions of chemical space. To observe how the three separate surrogate models cope with this mismatch, we plot the embedded fingerprints of the molecules acquired in the zeroth (random), first, third, and fifth iterations of a 0.1% batch size greedy search (Figure 8C). While all surrogate models select candidates in these areas, there are differences in how each model distributes its acquisitions among them.

Effect of acquisition strategy on performance
An interesting result from these experiments was the consistently strong performance of the greedy acquisition metric. This is surprising given that the greedy metric is, in principle, purely exploitative and vulnerable to limiting its search to a single region of the given library's chemical space. Prior work in batched Bayesian optimization has focused on developing strategies to prevent batches from being homogeneous, including the use of an inner optimization loop to construct batches one candidate at a time. 36,37 Despite this potential limitation of the greedy acquisition metric, it still leads to adequate exploration of these libraries and superior performance to metrics that combine some notion of exploration along with exploitation (i.e., UCB, TS, EI, PI). One confounding factor in this analysis is that methods used for uncertainty quantification in regression models are often unreliable, 38 which may explain the poorer empirical results when acquisition depends on their predictions.

Figure 8: Embedded fingerprints of the molecules acquired in the given iteration using a greedy acquisition metric, 0.1% batch size, and specified surrogate model architecture. Circled regions indicate clusters of high-scoring compounds in sparse regions of chemical space. Color scale corresponds to the negative docking score (higher is better). X- and y-axes are the first and second components of the 2D UMAP embedding and range from -7.5 to 17.5.

Effect of library size
The principal takeaway from our results on different library sizes is that Bayesian optimization is not simply a viable technique but an effective one in all of these cases. Though it is difficult to quantitatively compare algorithm performance on each dataset due to their differing chemical spaces, the impact of library size on the optimization is still worth commenting on. We observe the general trend in our data that, as library size increases, so too does top-k performance given a constant fractional value for k, even when decreasing the relative exploration size. We anticipate that this is due in part to the greater amount of training data that the surrogate model is exposed to over the course of the optimization. Despite the relative batch size decreasing, the absolute number of molecules taken in each batch, and thus the number of data points to train on, increases from roughly 500 to nearly 8,400 when moving from the Enamine 50k dataset (1% batch size) to the Enamine HTS dataset (0.4% batch size). We analyzed the mean-squared error (MSE) of MPN model predictions on the entire 10k, 50k, and HTS libraries after initialization with random 1%, 1%, and 0.4% batches, respectively; the MSE decreased with increasing library size: 0.3504, 0.2617, and 0.1520 (Spearman's ρ = 0.6699, 0.7826, 0.9094). This trend suggests that the overall "diversity" of the chemical library is not increasing at the same rate as its size, i.e., the scores of molecules in the larger chemical spaces are more easily predicted with a proportional increase in the training set size. As a result, the surrogate model is better able to aid in the acquisition of high-performing molecules.
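For readers reproducing this kind of surrogate-quality analysis, Spearman's ρ can be computed without external dependencies; a minimal sketch assuming no tied values (the function name is illustrative):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation for sequences without ties: Pearson
    correlation of the integer ranks of xs and ys."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=vs.__getitem__)
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2  # mean of ranks 0..n-1
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # equals var of ry when no ties
    return cov / var
```

With ties (common for low-precision docking scores), average ranks would be needed; a full implementation such as scipy.stats.spearmanr handles that case.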

Consistency across repeated trials
Bayesian optimization can be prone to large deviations across repeated trials, but our results showed consistent performance across all datasets and configurations (Tables S1-S8).
To investigate whether the consistency in performance is a result of consistency in the exact molecular species selected, we calculate the total number of unique SMILES strings acquired across all repeated experiments as a function of optimization iteration ( Figures S6 and S7).
Comparing these results to both the theoretical maximum (each trial acquiring a completely unique subset of molecules at each iteration) and the theoretical minimum (each trial acquiring an identical subset of molecules at each iteration after the initialization) approximates the degree to which each repeat explores the same or different regions of chemical space.
Traces closer to the maximum would indicate that each experiment is exploring a unique subset of highly performing molecules, and traces closer to the minimum signal the opposite: that each experiment is converging towards the same regions of chemical space. Our data are most consistent with the latter, suggesting that each trial is converging towards the same optimal subset of the library regardless of its initialization. We hypothesize that this is due to relatively smooth structure-property landscapes present in these datasets and lack of statistical uncertainty.
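The comparison to the theoretical extremes can be expressed compactly; a minimal sketch in which `runs` holds the per-trial sets of acquired SMILES after initialization (names and toy SMILES are illustrative):

```python
def acquisition_overlap(runs):
    """Compare the number of unique molecules acquired across repeated
    trials to the theoretical extremes: fully disjoint trials reach the
    maximum, identical trials reach the minimum.

    `runs` is a list of sets of SMILES strings, one set per trial.
    Returns (observed unique count, theoretical min, theoretical max).
    """
    unique = len(set().union(*runs))
    maximum = sum(len(r) for r in runs)  # every trial acquires a disjoint subset
    minimum = max(len(r) for r in runs)  # every trial acquires the same subset
    return unique, minimum, maximum

# Three toy trials that mostly converge on the same molecules.
runs = [{"C", "CC", "CCO"}, {"C", "CC", "CCN"}, {"C", "CC", "CCO"}]
```

An observed count near the minimum, as in this toy example, is the signature of trials converging to the same region of chemical space.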

Limitations of evaluation metrics
In this study, three separate evaluation criteria were used to assess performance: the average top-k docking score identified, the fraction of top-k SMILES identified, and the fraction of top-k scores identified throughout the optimization campaign (calculation details are provided in the Methods section below). The average metric is sensitive to the scale and distribution of scores, making direct comparison between datasets challenging. The top-k SMILES metric asks whether a specific set of molecules is selected by the algorithm, which can be overly strict if multiple molecules have identical performance (i.e., the same score) but are not within the top-k due to arbitrary data sorting ( Figures S2-S4). The top-k score metric overcomes this limitation by assigning equal weight to each molecule with the same score regardless of its specific position in the sorted dataset. As a result, this makes the scores metric more forgiving for datasets with smaller ranges and lower precision (e.g., those calculated using AutoDock Vina with a precision of 0.1) than those with larger ranges and higher precision (e.g., those calculated using DOCK with a precision of 0.01). In contrast to the average top-k score, however, this metric does not reward "near-misses", for example, identifying the k + 1 ranked molecule with a nearly identical score to the k-th molecule.

Optimal batch size for active learning
The number of molecules selected at each iteration represents an additional hyperparameter for Bayesian optimization. In one limit, Bayesian optimization can be conducted in a purely sequential fashion, acquiring the performance of a single molecule each iteration.
Fully sequential learning would offer the most up-to-date surrogate model for the acquisition of each new point, but it would also be extremely costly to continually retrain the model and perform inference on the entire candidate pool. In the other limit, all molecules would be selected in a single iteration, which can lead to suboptimal performance depending on the acquisition size (Figure 6). Finding a principled balance between these two without resorting to empirical hyperparameter optimization is an ongoing challenge. In each of our experiments, the relative batch size was held constant at one sixth of the total exploration size. Future work will examine the effects of dynamic batch sizes in batched optimization. Note that overall batch diversity is another consideration in batched Bayesian optimization. While selected batches in this study did not appear to suffer from homogeneity, previous approaches to improve batch diversity could be explored as well. 29,37,39

Cost of surrogate model (re)training

The question of optimal batch size cannot be decoupled from the computational cost of model retraining and inference. Throughout these studies, we have focused only on the number of objective function calculations necessary to achieve a given level of performance. While objective function calculation is significantly more expensive than model training and inference, inference costs scale linearly with the size of the candidate pool and contribute to the overall cost of our algorithm.
The MPN model was shown to be superior in the largest datasets (Enamine HTS and AmpC), but its costs are markedly higher than those of the fingerprint-based NN model. The tradeoff between sample efficiency (number of objective function calculations) and surrogate model costs (training and inference) should be balanced when selecting a model architecture.
In our evaluation, the costs of the MPN and NN cannot be directly compared due to differences in their implementation and extent of both parallelization and precalculation. For more details, see the software design subsection in the supplementary text. An additional choice when seeking to limit surrogate model costs is whether to train the model online with newly acquired data or fully retrain the model at each iteration with all acquired data. We examined this possibility, but online learning led to consistently lower performance in our experiments (Tables S1-S8).

Conclusion
In this work, we have demonstrated the application of Bayesian optimization to the prioritization of compounds for structure-based virtual screening using chemical libraries ranging in size from 10k to 100M ligands. A thorough evaluation of different acquisition metrics and surrogate model architectures illustrates the surprisingly strong performance of a greedy acquisition strategy and the superiority of a message passing neural network over fingerprint-based feed forward neural network or random forest models. In the largest library tested, the 100M member library screened against 12LS by Lyu et al., we identify 87.9% of the top-50,000 scoring ligands with a >40-fold reduction in the number of docking calculations using a greedy acquisition metric and 94.8% using the UCB acquisition metric.
We believe that this model-guided approach to compound prioritization should become standard practice as a drop-in replacement for exhaustive high-throughput virtual screening when the exact set of top-k compounds is not needed. Moreover, this approach is also relevant to experimental high-throughput screening, an expensive and important tool for challenging drug discovery problems.

Batched Bayesian optimization
Bayesian optimization is an active learning strategy that iteratively selects experiments to perform according to a surrogate model's predictions, often using machine learning (ML) models. In the context of this work, the Bayesian optimization was performed on a discrete set of candidate molecules, herein referred to as a "pool" or virtual library, and points were acquired in batches rather than one point at a time.
At each iteration, the surrogate model f̂(x) is retrained on the acquired dataset D, and the batch of candidates S ⊂ X that scores highest under the acquisition function α(x, f̂, f*), where f* is the best objective value observed so far, is then selected, or "acquired". The set of objective function values corresponding to these points is calculated, {(x, f(x)) : x ∈ S}, and used to update the dataset D. This process is repeated iteratively until a stopping criterion is met (e.g., a fixed number of iterations or a lack of sufficient improvement).

Algorithm 1: Batched Bayesian Optimization
Input: objective function f(x), acquisition function α, surrogate model f̂(x), batch size b, candidate set X
Select random batch S ⊂ X
D ← {(x, f(x)) : x ∈ S}
while stopping criterion not met do
    train f̂ on D
    f* ← max {f(x) : (x, f(x)) ∈ D}
    S ← top-b unexplored candidates x ∈ X ranked by α(x, f̂, f*)
    D ← D ∪ {(x, f(x)) : x ∈ S}
end while
Output: top-k pairs (x, f(x)) ∈ D
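A minimal Python rendering of this loop with greedy acquisition may make the structure concrete; the surrogate class and all names here are illustrative stand-ins, not the MolPAL API:

```python
import random

class MeanSurrogate:
    """Toy stand-in surrogate: predicts the mean of its training labels
    for every candidate (a real run would use an RF, NN, or MPN)."""
    def fit(self, xs, ys):
        self.mean = sum(ys) / len(ys)
        return self

    def predict(self, xs):
        return [self.mean for _ in xs]

def batched_bayes_opt(pool, f, surrogate, batch_size, n_iters, seed=0):
    """Sketch of batched Bayesian optimization with greedy acquisition:
    random initial batch, then retrain on all acquired data and acquire
    the batch with the best predicted scores each iteration."""
    rng = random.Random(seed)
    explored = {x: f(x) for x in rng.sample(pool, batch_size)}
    for _ in range(n_iters):
        surrogate.fit(list(explored), list(explored.values()))
        candidates = [x for x in pool if x not in explored]
        preds = surrogate.predict(candidates)
        ranked = sorted(zip(candidates, preds), key=lambda t: t[1], reverse=True)
        explored.update((x, f(x)) for x, _ in ranked[:batch_size])
    return explored
```

Swapping `MeanSurrogate` for a real regressor and the greedy ranking for any of the acquisition metrics below recovers the full algorithm.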

Acquisition metrics
The following acquisition functions were tested in this study:

Greedy(x) = μ̂(x)
UCB(x) = μ̂(x) + β·σ̂(x)
TS(x) = ŷ(x), where ŷ(x) ~ N(μ̂(x), σ̂²(x))
PI(x) = Φ(Z), where Z = (μ̂(x) − f* − ξ) / σ̂(x)
EI(x) = (μ̂(x) − f* − ξ)·Φ(Z) + σ̂(x)·φ(Z)
Random(x) ~ U(0, 1)

Here μ̂(x) and σ̂²(x) are the surrogate model's predicted mean and variance for candidate x, f* is the best objective value observed so far, and Φ and φ are the standard normal CDF and PDF, respectively. For the experiments reported in the paper, we used β = 2 and ξ = 0.01.
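Under their standard definitions (with μ̂ and σ̂ the surrogate's predicted mean and standard deviation and f* the best score observed so far), these metrics can be sketched as follows; the function names are illustrative:

```python
import math
import random

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_pdf(z):
    """Standard normal PDF."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def greedy(mu, sigma, f_best):
    return mu  # purely exploitative: rank by predicted mean

def ucb(mu, sigma, f_best, beta=2.0):
    return mu + beta * sigma  # optimism bonus for uncertain candidates

def thompson(mu, sigma, f_best):
    return random.gauss(mu, sigma)  # sample from the predictive posterior

def pi(mu, sigma, f_best, xi=0.01):
    return normal_cdf((mu - f_best - xi) / sigma)

def ei(mu, sigma, f_best, xi=0.01):
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * normal_cdf(z) + sigma * normal_pdf(z)
```

Each function maps a candidate's predicted mean and uncertainty to a utility; the batch of unexplored candidates with the highest utilities is acquired.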

Surrogate models
In the context of our study, the surrogate modelling step involved training a machine learning model to predict docking scores from molecular structure. The mean-variance estimation (MVE) models were trained with the Gaussian negative log-likelihood loss

L(y, ŷ, σ̂²) = log(2π)/2 + log(σ̂²)/2 + (y − ŷ)²/(2σ̂²)    (2)

All of the surrogate models were used exactly as described above without additional hyperparameter optimization. The models were fully retrained from scratch with all acquired data at the beginning of each iteration.
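Equation 2 is the per-sample negative log-likelihood of a Gaussian with predicted mean ŷ and predicted variance σ̂²; a minimal sketch (the function name is illustrative):

```python
import math

def mve_nll(y, y_hat, var):
    """Per-sample negative log-likelihood (Eq. 2) for mean-variance
    estimation: the model outputs both a predicted mean `y_hat` and a
    predicted variance `var` for each input."""
    return (
        0.5 * math.log(2.0 * math.pi)
        + 0.5 * math.log(var)
        + (y - y_hat) ** 2 / (2.0 * var)
    )
```

Minimizing this loss pushes the model to be accurate while also calibrating its variance: large errors are cheap only if the model also predicts a large variance.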

Datasets
The datasets generated for these studies were produced from docking the compounds contained in both Discovery Diversity sets and the HTS collection from Enamine against thymidylate kinase (PDB ID: 4UNN). The docking was performed using AutoDock Vina with the following command line arguments:

--receptor=4UNN.pdbqt --ligand=<ligand_pdbqt> --center_x=9 --center_y=20 --center_z=-5 --size_x=20 --size_y=20 --size_z=17

All other default arguments were left as-is. The ligands were prepared from SDFs available from Enamine, 50,51 parsed into SMILES strings using RDKit, 52 and processed into PDBQT files using OpenBabel 53 with the --gen3d flag. The receptor was prepared with PDBFixer 54 using the PDB ID 4UNN, selecting only chain A, deleting all heterogens, adding all suggested missing residues and heavy atoms, and adding hydrogens for pH 7.0. The AmpC screening dataset is the publicly available dataset published by Lyu et al. and was used as-is. 34 The score distribution of each dataset may be found in Figures S2-S5.

Evaluation metrics
MolPAL performance was judged through three evaluation metrics: (1) the average top-k docking score identified ("Average"), (2) the fraction of top-k SMILES identified ("SMILES"), and (3) the fraction of top-k scores identified ("Scores"). For (1), the average of the top-k scores of molecules explored by MolPAL was divided by the average of the true top-k scores based on the full dataset.
(2) was calculated by taking the intersection of the set of SMILES strings in the true top-k molecules and the found top-k molecules and dividing its size by k.
(3) was calculated as the size of the intersection of the list of true top-k scores and the list of observed top-k scores divided by k.
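A minimal sketch of metrics (2) and (3) illustrates why the scores metric forgives ties while the SMILES metric does not (function names and toy data are illustrative):

```python
from collections import Counter

def top_k(dataset, k):
    """(smiles, score) pairs sorted by descending score; higher is better."""
    return sorted(dataset, key=lambda t: t[1], reverse=True)[:k]

def smiles_fraction(found, true, k):
    """Metric (2): fraction of the exact true top-k molecules recovered."""
    true_smis = {s for s, _ in top_k(true, k)}
    found_smis = {s for s, _ in top_k(found, k)}
    return len(true_smis & found_smis) / k

def scores_fraction(found, true, k):
    """Metric (3): fraction of the true top-k scores recovered, counted
    with multiplicity, so molecules with identical scores are
    interchangeable regardless of sort order."""
    true_scores = Counter(s for _, s in top_k(true, k))
    found_scores = Counter(s for _, s in top_k(found, k))
    return sum((true_scores & found_scores).values()) / k

# A tie at score 4.0: "c" substitutes for "b" under the scores metric.
true = [("a", 5.0), ("b", 4.0), ("c", 4.0), ("d", 1.0)]
found = [("a", 5.0), ("c", 4.0), ("d", 1.0)]
```

Here the SMILES metric penalizes the search for acquiring "c" instead of the arbitrarily sorted "b", while the scores metric counts both equally.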

Hyperparameter optimization
The Explorer's stopping criterion is checked and, if satisfied, the Explorer terminates and outputs the top-k evaluated molecules. A schematic of both the design and workflow of MolPAL may be seen in Figure S1. The experiments performed in this study were all performed retrospectively using the LookupObjective subclass of the Objective with a fully generated dataset as a lookup table.
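A minimal sketch of what such a lookup-based objective might look like, assuming precomputed scores are held in an in-memory dictionary (the class body here is illustrative only; the actual LookupObjective reads the fully generated dataset from disk):

```python
class LookupObjective:
    """Toy stand-in for a lookup-table objective: 'scores' each SMILES string
    by retrieving its precomputed docking score, enabling fully retrospective
    experiments without re-running the docking program."""

    def __init__(self, scores: dict, minimize: bool = True):
        # An optimizer that maximizes its objective must negate docking
        # scores, for which lower (more negative) values are better
        self.c = -1.0 if minimize else 1.0
        self.scores = scores

    def __call__(self, smis):
        # molecules absent from the table (e.g., failed docking) map to None
        return {smi: self.c * self.scores[smi] if smi in self.scores else None
                for smi in smis}
```

Because each "evaluation" is a dictionary lookup, the full optimization loop can be benchmarked cheaply against a dataset for which every score is already known.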

Supporting Information Available
Additional methods and results can be found in the supporting information. All code and data needed to reproduce this study and the figures herein can be found at https://github.

Figure S1: Overview of the MolPAL software structure and workflow.

Alternative surrogate models
Feed forward neural network models Two alternative NN models were defined for confidence estimation purposes: an ensemble model and a mean-variance estimation (MVE) model. The ensemble model was the same as the base model, with the only difference being that an ensemble of five models was trained. Each of the five trained models was used for inference; the five separate predictions were averaged to produce the mean predicted value, and their variance was taken as the uncertainty estimate. The mean-variance estimation model used an output layer of size two and a learning rate of 0.05 (increased from 0.01), and the same loss function as the MPN-MVE model was used (Equation 2). Neither of these alternate models was used for experiments due to their lower performance compared to the dropout model.
Directed-message passing neural network models An MPN dropout model was also defined for confidence estimation purposes. This model was built similarly to the NN dropout model, with the key difference being that the dropout layer was prepended to the hidden layer. Again, a dropout probability of 0.2 was used and dropout was performed during model inference. Mean predicted values were calculated by averaging 10 forward passes through the model, and the variance of these predictions was used as the predicted uncertainty.
This alternate model was not used in experiments due to its significantly higher inference costs.
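Both the ensemble and dropout approaches reduce to the same post-processing: collect several stochastic predictions and summarize them by their mean and variance. A stdlib-only sketch with a toy linear model (the function names, the toy model, and the inverted-dropout rescaling are illustrative assumptions, not MolPAL's implementation):

```python
import random
import statistics

def predict_with_uncertainty(predictions):
    """Collapse multiple stochastic predictions (ensemble members or
    MC-dropout forward passes) into a mean value and a variance-based
    uncertainty estimate."""
    mean = statistics.fmean(predictions)
    var = statistics.pvariance(predictions, mean)
    return mean, var

def mc_dropout_predict(weights, x, p=0.2, n_passes=10, seed=0):
    """Toy MC-dropout inference for a linear model y = w . x: dropout stays
    active at inference, so each forward pass randomly zeroes each weight
    with probability p, and the n_passes predictions are summarized."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_passes):
        y = sum(w * xi for w, xi in zip(weights, x) if rng.random() >= p)
        preds.append(y / (1 - p))  # rescale so the expected prediction
                                   # matches the deterministic model
    return predict_with_uncertainty(preds)
```

The spread of the stochastic predictions is what drives uncertainty-aware acquisition functions such as the upper confidence bound.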

Retraining strategy
In addition to fully retraining the surrogate model from scratch using all acquired data, we tested an online training strategy. For online training, the trained surrogate model from the previous iteration was trained only on the newly acquired data. Note that online training applies only to the NN and MPN models, as the RF model is reinitialized each time it is fit.
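The distinction can be sketched with a toy one-parameter regression model trained by gradient descent (purely illustrative; the actual surrogates are the NN and MPN models described above):

```python
def sgd_fit(w, data, lr=0.01, epochs=100):
    """A few epochs of per-sample gradient descent on squared error, y ~ w*x."""
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

def full_retrain(all_batches):
    """Full retraining: reinitialize the model and fit on every acquired point."""
    data = [pt for batch in all_batches for pt in batch]
    return sgd_fit(0.0, data)

def online_update(w_prev, new_batch):
    """Online training: continue from the previous model, new data only."""
    return sgd_fit(w_prev, new_batch)
```

Full retraining pays the cost of refitting on an ever-growing dataset each iteration, whereas online updates are cheap but risk drifting toward the most recently acquired (and typically highest-scoring) region of the library.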

Additional Results
Dataset score distributions

Figure S9: Bayesian optimization performance on Enamine 50k screening data as measured by the percentage of top-500 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 1% acquisition (ca. 500 samples) and acquired 1% more each iteration for five iterations. Error bars reflect ± one standard deviation across five runs.

Figure S13: Bayesian optimization performance on AmpC screening data as measured by the percentage of top-50 000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.4% acquisition (ca. 40 000 samples) and acquired 0.4% more each iteration for five iterations. Error bars reflect ± one standard deviation across three runs.

Figure S15: Bayesian optimization performance on AmpC screening data as measured by the percentage of top-50 000 scores found as a function of the number of ligands evaluated. Each trace represents the performance of the given acquisition metric with the surrogate model architecture corresponding to the chart label. Full opacity: online model training. Faded: full model retraining. Each experiment began with a random 0.1% acquisition (ca. 10 000 samples) and acquired 0.1% more each iteration for five iterations. Error bars reflect ± one standard deviation across three runs.

Figure S16: Visualization of the chemical space searched in the Enamine HTS library at the given iteration using a greedy acquisition metric, 0.1% batch size, and the specified surrogate model architecture. Points represent the 2-D UMAP embedding of the given molecule's 2048-bit atom-pair fingerprint. The embedding was trained on a random 10% subset of the full library. Circled regions indicate clusters of high-scoring compounds in sparse regions of chemical space. The x- and y-axes are the first and second components of the 2-D UMAP embedding and range from -7.5 to 17.5. The color scale corresponds to the negative docking score (higher is better).