 Open Access Article
 Open Access Article
      
        
          
            Michael 
            Dodds
          
        
       , 
      
        
          
            Jeff 
            Guo
, 
      
        
          
            Jeff 
            Guo
          
        
       , 
      
        
          
            Thomas 
            Löhr
, 
      
        
          
            Thomas 
            Löhr
          
        
       , 
      
        
          
            Alessandro 
            Tibo
, 
      
        
          
            Alessandro 
            Tibo
          
        
       , 
      
        
          
            Ola 
            Engkvist
, 
      
        
          
            Ola 
            Engkvist
          
        
       and 
      
        
          
            Jon Paul 
            Janet
 and 
      
        
          
            Jon Paul 
            Janet
          
        
       *
*
      
Molecular AI, Discovery Sciences, R&D, AstraZeneca, 431 50, Gothenburg, Sweden. E-mail: jonpaul.janet@astrazeneca.com
    
First published on 8th February 2024
Reinforcement learning (RL) is a powerful and flexible paradigm for searching for solutions in high-dimensional action spaces. However, bridging the gap between playing computer games with thousands of simulated episodes and solving real scientific problems with complex and involved environments (up to actual laboratory experiments) requires improvements in terms of sample efficiency to make the most of expensive information. The discovery of new drugs is a major commercial application of RL, motivated by the very large nature of the chemical space and the need to perform multiparameter optimization (MPO) across different properties. In silico methods, such as virtual library screening (VS) and de novo molecular generation with RL, show great promise in accelerating this search. However, incorporation of increasingly complex computational models in these workflows requires increasing sample efficiency. Here, we introduce an active learning system linked with an RL model (RL–AL) for molecular design, which aims to improve the sample-efficiency of the optimization process. We identity and characterize unique challenges combining RL and AL, investigate the interplay between the systems, and develop a novel AL approach to solve the MPO problem. Our approach greatly expedites the search for novel solutions relative to baseline-RL for simple ligand- and structure-based oracle functions, with a 5–66-fold increase in hits generated for a fixed oracle budget and a 4–64-fold reduction in computational time to find a specific number of hits. Furthermore, compounds discovered through RL–AL display substantial enrichment of a multi-parameter scoring objective, indicating superior efficacy in curating high-scoring compounds, without a reduction in output diversity. This significant acceleration improves the feasibility of oracle functions that have largely been overlooked in RL due to high computational costs, for example free energy perturbation methods, and in principle is applicable to any RL domain.
A variety of computational models is available to assess hits in VS, from simple data-driven methods (quantitative-structure–activity relationships, QSAR) to physics-based computation via pharmacophore matching methods or molecular docking, whereby a putative binding pose of the molecule is generated and scored7 for compatibility with a target protein. Such methods have been successfully applied to VS efforts,6,8,9 although the incorporation of docking already imposes a substantial computational burden when screening large libraries. For the largest virtual screening efforts reported, the computational effort expended can be extreme, for example, Gorgulla et al.5 report expending 100 s of CPU-years to dock 1.4 billion molecules, while Acharya et al.6 describe a pipeline for conducting billion-molecule scale virtual screening with docking on the Summit supercomputer. Recent developments of high-accuracy, high-computational-cost binding affinity prediction methods with molecular dynamics such as free-energy perturbation10,11 (FEP) or non-equilibrium switching,12 have become the new gold-standard for affinity prediction,13 but are prohibitively computationally expensive and cannot be directly applied to large VS libraries.
Although an old idea,14,15 active learning (AL) methods have recently gained increasing attention for accelerating VS,16 either to enable screening very large libraries with docking,17 or screening smaller libraries with binding energy prediction18–20 or quantum chemical methods.21–23 VS–AL methods generally sample a small subset of compounds to evaluate with a desired scoring function, or oracle, and construct a surrogate model to predict the oracle score of as-yet unevaluated candidates. This model is used to select new candidates to screen, based on an acquisition function which might depend on the surrogate predictions, and their associated uncertainties. Evaluated molecules are used to retrain the surrogate model and the approach is iteratively repeated. Such approaches regularly claim twenty-fold or more accelerations over brute-force VS in terms of oracle calls needed to retrieve top-scoring hits.17–23
As an alternative to traditional VS, deep generative methods have transitioned from research protypes to practical and powerful tools for computational drug design.24–26 Such models are responsible for the design of multiple experimentally validated hits, including potent small molecule inhibitors for a variety of targets27–32 and PROTACs.33 Rather than screening a fixed, finite library, generative chemical models34 propose novel molecules based on probabilistic principles, allowing them to address very large chemical design spaces.35 These de novo design models consist of a generative component responsible for sampling molecules and a mechanism for steering the design to molecules with target properties. Existing methodologies for the generative component could include text- or graph-based variational autoencoders,36–38 generative adversarial networks,39,40 sequence/recurrent models,41–44 transformers45–48 or diffusion models,49,50 while the steering mechanism typically involves either conditional generation (i.e. on a target or profile) or an optimization method such as reinforcement learning (RL).
Here, we consider REINVENT,41,51 a SMILES52-based (Simplified Molecular Input Line Entry System) recurrent-neural network (RNN) molecule generator that utilizes policy-gradient RL to iteratively improve suggested molecules according to a flexible scoring function that can be composed of a variety of scoring components including docking53,54 and ROCS.55 The MPO score is calculated from the average of the normalized scores (between 0–1) of all scoring components. The relative simplicity and flexibility of REINVENT makes it a popular testbed for experiments with molecular RL.56–58
One major concern with RL methods in the real-world is sample efficiency,59i.e. the number of oracle calls needed to reliably learn the desired output. REINVENT has been identified as one of the most sample-efficient generative chemical models; both in benchmarks which do not consider compound chemistry relative to the pre-training data60 as well as benchmarks which do,60 however the model still requires thousands of oracle evaluations to learn to produce favorable molecules. While this may compare favorably with the cost of brute-force VS on large libraries, the incorporation of higher-cost simulations remains prohibitive. We recently introduced a curriculum-learning approach whereby a simpler, physically-motivated oracle function is learned first, for example learning a ROCS query before starting docking, which can substantially reduce the number of expensive oracle evaluations.61 However, this approach depends on identifying physically-motivated intermediate objectives that are correlated with the desired oracle.
Here, we instead investigate accelerating RL for molecular design in an oracle agnostic manner using AL (RL–AL). The RL–AL setting poses unique challenges for surrogate-model based AL relating to the inherent feedback loops in the generative RL setting. We begin with a motivating example that illustrates some of these unique difficulties and general implications of RL–AL compared to traditional AL. We then systematically examine the components of the RL–AL system and design a strategy that can accelerate generative molecular design by a ∼5–66-fold increase in hits and leads to a ∼4–64-fold reduction in CPU time. Next, we introduce a new acquisition strategy that is compatible with the MPO nature of the RL process. Finally, we demonstrate the transferability of our approach across oracle functions and quantify the computational- and wall-time saving of our method.
Here, we study and judge generative models exclusively by their ability to satisfy their reward functions efficiently, but consideration of the types of chemistry generated and the eventual synthetic accessibility of the proposed molecules is also crucial.62 We attempt to capture information about the types of chemistry generated through use of multiparameter scoring functions, i.e. we do not only consider the oracle score in the reward, and also provide some samples of generated molecules in the ESI.†
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 molecules from the REINVENT prior, which is trained to mimic the ChEMBL63 database. We call this set our library and evaluate the this library with two distinct computational models/oracles (methods):
000 molecules from the REINVENT prior, which is trained to mimic the ChEMBL63 database. We call this set our library and evaluate the this library with two distinct computational models/oracles (methods):
        (1) A shape and color pharmacophore query for Cyclooxygenase-2 (COX2) (PDB ID: 1CX2)64 based on the native SC-558 ligand implemented using ROCs.65
(2) A docking protocol for the Retinoic Acid Receptor Alpha (RXRα) (PDB ID: 7B88),66 implemented using AutoDock-Vina.67
We judge hits for each oracle based on having a greater predicted affinity than obtained for their respective native ligands (0.6 for ROCs and −11.4 kcal mol−1 for the docking oracle respectively, methods). With exhaustive screening of the library, we identify 30 unique Murcko scaffolds68 and 41 unique hit SMILES for ROCS (364 and 369 for docking with ADV), for an oracle-call-efficiency of 0.03%/0.04% (0.36%/0.37% for docking).
Initially we test active learning for virtual screening (VS–AL). In agreement with previous studies17 we obtain a substantial increase in oracle efficiency compared to brute-force screening, recovering 42 ± 4.0% and 35 ± 4% of hit SMILES with only 5000 oracle calls (Fig. 1), for a oracle-call-efficiency of 0.25% and 2.54% for ROCS and docking respectively, a 7–11 fold improvement. The VS–AL recovers 100% of hits after 8927 ± 1750 oracle calls for ROCS and 96% of hits (∼356) after 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls for docking.
000 oracle calls for docking.
Next, we compared VS–AL performance to an RL approach with REINVENT.41,51 We used a standard RL configuration with a batch size of 128 (methods) where we sought to improve the oracle score of the generated compounds along with a few commonly used metrics for molecule quality (methods) to provide a realistic MPO setting. We performed each experiment in triplicate.
Initially, the hit rate of the RL agent is comparable to VS. However, after ∼15![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls, the RL agent learns to reliably generate hits, after which time the productivity of the generative model grows rapidly, and by 30
000 oracle calls, the RL agent learns to reliably generate hits, after which time the productivity of the generative model grows rapidly, and by 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 calls, the RL method has produced more unique hits (79 ± 18) and scaffolds (35 ± 7) for ROCs than contained in the virtual library (Fig. 1). The final overall oracle-call-efficiency is 0.12% for ROCs. For docking, 37 ± 8 hits are generated (all unique scaffolds), a lower number than contained in the virtual library, for a docking oracle-call-efficiency of 0.12% (ESI Fig. S1†). The compounds generated via RL exhibit higher average MPO scores compared to VS (ESI Fig. S2†).
000 calls, the RL method has produced more unique hits (79 ± 18) and scaffolds (35 ± 7) for ROCs than contained in the virtual library (Fig. 1). The final overall oracle-call-efficiency is 0.12% for ROCs. For docking, 37 ± 8 hits are generated (all unique scaffolds), a lower number than contained in the virtual library, for a docking oracle-call-efficiency of 0.12% (ESI Fig. S1†). The compounds generated via RL exhibit higher average MPO scores compared to VS (ESI Fig. S2†).
We added a naïve AL component into the RL loop based on the VS–AL model (Reinforcement Learning with Active Learning in Methods). In contrast to the immediate, rapid increase in hit rate observed in VS–AL, the RL–AL approach barely improves the hit rate obtained in the early phase and leads to moderate increases in total productivity by 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 epoch calls, resulting in 134 ± 23 unique hits and 49 ± 5 unique scaffolds for ROCS (48 ± 7 and 47 ± 5 scaffolds, for docking) yielding an oracle-call-efficiency of 0.16% for both ROCs and docking respectively. While an increase over the standard RL case, this is far from realizing the benefits observed in VS–AL.
000 epoch calls, resulting in 134 ± 23 unique hits and 49 ± 5 unique scaffolds for ROCS (48 ± 7 and 47 ± 5 scaffolds, for docking) yielding an oracle-call-efficiency of 0.16% for both ROCs and docking respectively. While an increase over the standard RL case, this is far from realizing the benefits observed in VS–AL.
We identify some key factors that make RL–AL uniquely challenging:
(1) Non-stationary distribution: as RL proceeds, the underlying generative model is updated, and the scores and distribution of molecules generated in later epochs are distinct from preceding steps (ESI Fig. S3†). We observed that a surrogate model trained on fixed data will consistently lose predictive accuracy on later epochs (ESI Fig. S4†), limiting the utility of persisting data collected during the run.
(2) Feedback loops and robustness: because the scores produced by the surrogate model directly influence the molecules generated in the next epoch, incorrect scores in the early epochs interfere with the learning of the RL agent. To illustrate this, we ran RL but added Gaussian noise to the oracle and observed a noise-level-dependent decrease in oracle-call-efficiency (ESI Fig. S5†). Furthermore, REINVENT is already an iterative process of RL updates, and previous studies have established the sensitivity of the RL process to this update frequency.58 Introducing AL creates a second internal loop, with additional hyperparameters relating to the relative frequency of the RL and AL updates.
In the following sections, we derive experiments to investigate various strategies for optimally leveraging the benefits of AL to accelerate RL and obtain drastically improved oracle call efficiency.
|  | ||
| Fig. 2 Schematic of an RL–AL system for drug design. REINVENT51,57,69 generates drug-like compounds encoded as SMILES strings.52 The generated SMILES are input to the surrogate model which predicts the oracle scores for each compound. Based on the specified acquisition function, a subset of compounds is sent for ground-truth label acquisition using the oracle function, while the non-acquired compounds use the surrogate-predicted labels. The oracle-labelled compounds are pooled and used to retrain the surrogate model. The predict, split, label, and train cycle is repeated for Nloops (inner loop). The combined set is then passed to back REINVENT for computing the appropriate RL update (the outer loop). | ||
(1) REINVENT generates Nbatch compounds based on the current agent state.
(2) The current surrogate model predicts oracle labels for each generated compound.
(3) An acquisition function is used to select a subset of the generated batch XA ⊂ X, such that 
(4) The surrogate model is updated based on the oracle scores (labels) for the Ntrainingpool most recent compounds in a sliding-window scheme.
(5) Steps 3–4 are repeated Nloops times to acquire exactly Nacquired compounds per epoch.
(6) The RL agent is updated based on the oracle assigned labels where they exist and the surrogate predicted values otherwise, potentially with a different weighting.
Full details of this process are provided in methods. Here, we investigate the impact of varying factors related to how the RL policy update is performed and how AL is conducted to balance between RL and AL updates.
As in the naïve RL–AL experiment, we use Nbatch of 256 and Nacquired of 128 and a single inner loop with UCB acquisition (methods) as a baseline case. We use the ROCS oracle for experimentation as it is computationally cheaper than docking.65 A summary of all configuration parameters is provided in the ESI (Table S1†). We calculate the oracle-call-efficiency of each configuration from the perspective of number of unique hit scaffolds, which are compounds that have unique Murcko scaffold with a predicted affinity/overlay score greater than the native ligand, generated and acquired per oracle call over 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls and the diversity of the resulting generated hit scaffolds (Fig. 3).
000 oracle calls and the diversity of the resulting generated hit scaffolds (Fig. 3).
Compared to RL–AL with equal weights, we observe a substantial increase in the number of hit scaffolds generated when down-weighting surrogate model predictions. Despite reasonable predictive performance of the surrogate model (look-ahead mean average error, MAE, of 0.046 for the AL ROCS case), the RL-process with zero weights for surrogate prediction results in 147 ± 14 hits after 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 epochs vs. 48 ± 6 for the equal-weights RL–AL model (Fig. 3A). The RL agent is not updated based on the surrogate predictions at all in this case; instead, the AL-subprocess is effectively curating high-scoring compounds for the RL update.
000 epochs vs. 48 ± 6 for the equal-weights RL–AL model (Fig. 3A). The RL agent is not updated based on the surrogate predictions at all in this case; instead, the AL-subprocess is effectively curating high-scoring compounds for the RL update.
All RL–AL interventions show marginally lower hit scaffold diversity (Fig. 3B), with an average pairwise similarity of 0.212 ± 0.017 for the zero-weight update scheme vs. 0.178 ± 0.003 for the equal weights RL run. Since the impact of updating with zero-weights for surrogate compounds is so drastic, we perform all future experiments with this updating scheme (defined as RL–AL baseline, indicated by asterisks in Fig. 3).
 the closer the result will be to baseline RL (ratio = 1) and the lower the potential for reducing oracle calls.
 the closer the result will be to baseline RL (ratio = 1) and the lower the potential for reducing oracle calls.
        (1) First, with a fixed Nbatch = 256 compounds and with a fixed total Nacquired = 128, we vary Nloops from 1 to 8, iteratively extending the training set and corresponding to a fixed AL/RL ratio  (Fig. 3C and D).
 (Fig. 3C and D).
(2) Second, we increase the size of Nbatch up to 2048 compounds with Nacquired = 128 and Nloops = 1, representing AL/RL ratios from 0.0625 to 0.5 (Fig. 3E and F).
(3) Next, we vary Nacquired from 32 to 128 at fixed Nbatch = 256 and Nloops = 1, corresponding to AL/RL ratios from 0.125 to 0.5 (Fig. 3G and H).
(4) Finally, we test the cross-dependence of these factors, by varying Nacquired in [32, 64, 128], Nbatch in [128, 256, 512, 1024, 2048], and Nloops in [1, 2, 4, 8] (Fig. 4) full results available in (ESI Table S3†).
Comparing the impact of these parameters on the number of hit scaffolds generated (Fig. 3C, E and G), we observe that simply increasing Nbatch has a dramatic impact on oracle-call-efficiency, resulting in up to 537 ± 67 hits for the largest batch size of 2048, a 3.83-fold increase over the RL–AL baseline case and 15.34-fold increase over pure RL (Fig. 3E). Importantly, runs with larger Nbatch also become productive much earlier, requiring only approximately 5000 oracle calls to become productive. The diversity of generated hits at the largest batch size (2048) is slightly reduced to 0.252 ± 0.032 compared with 0.199 ± 0.016 and 0.178 ± 0.028 for RL–AL- and RL-baseline respectively (Fig. 3F), but the significantly larger number of generated hits likely offsets this in practice. Reducing Nacquired has the same effect as increasing Nbatch, resulting in faster lift-off and up to 582 ± 253 hits identified for the smallest Nacquired of 32 (Fig. 3G). The resulting hit diversity for this case is 0.255 ± 0.022 (Fig. 3H). The similar behavior of these extreme cases can be rationalized by their similar, low AL/RL ratios – 0.0625 and 0.125 respectively. Variation of Nloops at a fixed AL/RL ratio has a much more modest impact, with a larger number of loops leading to modest improvements in terms of hits found (167 ± 7 for 8 loops vs. 140 ± 13 for 1 loop) (Fig. 3C) and unchanged hit diversity (0.19 ± 0.003 vs. 0.208 ± 0.019) (Fig. 3D).
Following systematic experimentation of the RL–AL parameters Nacquired, Nbatch, and Nloops, it was determined that a synergistic benefit is not achieved by simultaneously decreasing Nacquired to its lowest extreme (32) and increasing Nbatch to its highest extreme (2048). While individually increasing both parameters enhances hit efficiency, Nacquired of 64 averages the greatest yield when considering all conditions, outperforming values of 32 or 128 after 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls (Fig. 4). The relative diversity of each condition is largely unaffected by the selected settings. The highest average Tanimoto similarity was recorded as 0.28 ± 0.026, with the lowest at 0.23 ± 0.018. Detailed results for each condition can be found in ESI Table S3.†
000 oracle calls (Fig. 4). The relative diversity of each condition is largely unaffected by the selected settings. The highest average Tanimoto similarity was recorded as 0.28 ± 0.026, with the lowest at 0.23 ± 0.018. Detailed results for each condition can be found in ESI Table S3.†
Drawing from these experimental findings, we suggest an optimized RL–AL configuration: zero weight updates for surrogate-predicted compounds, Nloops set to 2, Nacquired set to 64, and Nbatch adjusted to 512, for a final AL/RL ratio of 0.125.
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls, for ROCs and docking respectively (Fig. 5A and C). The solution diversity is slightly reduced, 0.251 ± 0.018 and 0.286 ± 0.005 for RL–AL compared to 0.178 ± 0.003 and 0.162 ± 0.002 for the baseline, which is more than compensated by the higher number of hit scaffolds found.
000 oracle calls, for ROCs and docking respectively (Fig. 5A and C). The solution diversity is slightly reduced, 0.251 ± 0.018 and 0.286 ± 0.005 for RL–AL compared to 0.178 ± 0.003 and 0.162 ± 0.002 for the baseline, which is more than compensated by the higher number of hit scaffolds found.
        In addition to greater numbers of hit scaffolds found over 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls, the number of oracle calls before the agent becomes productive (i.e., wasted calls) is greatly reduced, with the docking RL–AL system producing 10 hits after only 2990 ± 251 oracle calls compared to 13
000 oracle calls, the number of oracle calls before the agent becomes productive (i.e., wasted calls) is greatly reduced, with the docking RL–AL system producing 10 hits after only 2990 ± 251 oracle calls compared to 13![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 532 ± 4459 oracle calls for the baseline RL case.
532 ± 4459 oracle calls for the baseline RL case.
However, inspecting the evolution of the MPO score for generated compounds reveals differences in the oracle behavior that are not present in the RL-only baseline (Fig. 5B and D). Without RL–AL, the average MPO score for runs with both oracles increases steadily as a function of oracle calls to a final value between 0.5 and 0.75. In the case of the ROCs oracle, the MPO score increases rapidly and levels off near 0.9 after ∼10![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls. However, in the case of RL–AL with ADV, the MPO score rapidly increases to ∼0.7 and levels off, comparable to the baseline RL state. Investigating this difference, we determined that, while the docking score was rapidly optimized there was a commensurate loss of other components, such as QED, that prevented effective improvement of the MPO score. We hypothesize that this is due to conflicts in the scoring function, for example the addition of hydrogen bond donors might improve docking sores by providing more interactions with the target but push the compound out of the suitable range defined in the QED component.
000 oracle calls. However, in the case of RL–AL with ADV, the MPO score rapidly increases to ∼0.7 and levels off, comparable to the baseline RL state. Investigating this difference, we determined that, while the docking score was rapidly optimized there was a commensurate loss of other components, such as QED, that prevented effective improvement of the MPO score. We hypothesize that this is due to conflicts in the scoring function, for example the addition of hydrogen bond donors might improve docking sores by providing more interactions with the target but push the compound out of the suitable range defined in the QED component.
While this is expected, the goal of the RL process is to improve the MPO score, rather than any specific component. In the case of the zero-weights applied to surrogate predictions, the AL component is effectively selecting which compounds to use in RL likelihood updates. Recognizing this interplay, we propose a final alternative strategy that interprets the MPO score as a probabilistic function of random-valued scoring components and uses this score for acquisition.
For common acquisition functions (UCB, expected improvement15etc.) we require access to both the expectation and variance of the quantity of interest. This motivates considering the MPO scoring function as a non-deterministic aggregation of scoring components that themselves are random variables. By using the surrogate score as a proxy for the oracle score, we generate a distribution of MPO scores by Monte-Carlo sampling of all score components (methods).
We explore the effect of MPO acquisition on generated compounds (Fig. 5). In the RL–AL MPO case the fold enrichment compared to RL baseline was 19.94 and 5.49 for ROCS and docking and 20.00 and 66.46 for the RL–AL optimized conditions. The average MPO score for all compounds generated and scored across 30![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 oracle calls were 0.83 ± 0.1 and 0.7 ± 0.14 for the RL–AL optimized and 0.7 ± 0.14 and 0.53 ± 0.27 for the RL baseline for ROCS and ADV, respectively (Fig. 5B and D). In the RL–AL MPO case for ROCS and ADV respectively, we yield a 1.87 and 1.56-fold improvement in the MPO score, relative to baseline, and 1.06 and 1.20-fold improvement in the MPO score relative to RL–AL optimized conditions. To ascertain if the increase in average MPO score results in an increase in the MPO score for hits, we plot the cumulative density of MPO scores for hits (Fig. 5E). We observe that RL–AL optimized produces a higher density of hits at a lower MPO score for ADV, a pattern not observed for ROCS. For both oracles in the RL–AL MPO case all generated hits are found between the MPO score range [0.85, 1], leading to enrichment of high scoring hits relative to the RL-baseline and optimized case where hits are distributed between [0, 1].
000 oracle calls were 0.83 ± 0.1 and 0.7 ± 0.14 for the RL–AL optimized and 0.7 ± 0.14 and 0.53 ± 0.27 for the RL baseline for ROCS and ADV, respectively (Fig. 5B and D). In the RL–AL MPO case for ROCS and ADV respectively, we yield a 1.87 and 1.56-fold improvement in the MPO score, relative to baseline, and 1.06 and 1.20-fold improvement in the MPO score relative to RL–AL optimized conditions. To ascertain if the increase in average MPO score results in an increase in the MPO score for hits, we plot the cumulative density of MPO scores for hits (Fig. 5E). We observe that RL–AL optimized produces a higher density of hits at a lower MPO score for ADV, a pattern not observed for ROCS. For both oracles in the RL–AL MPO case all generated hits are found between the MPO score range [0.85, 1], leading to enrichment of high scoring hits relative to the RL-baseline and optimized case where hits are distributed between [0, 1].
To visualize the chemical space coverage of generated compounds we compute a UMAP (Uniform Manifold Approximation and Projection),70 for all hits, using a UMAP model trained on the reference library (100![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) 000 compounds sampled from REINVENT's prior) (ESI Fig. S10†). With ADV as the oracle, using RL–AL MPO reduces sample diversity relative to RL-baseline, however, we see a compensatory uptick in the sampling of compounds with enriched MPO scores. For ROCS we observe that the MPO strategy increases both the MPO score and the sample diversity of hits compared to baseline.
000 compounds sampled from REINVENT's prior) (ESI Fig. S10†). With ADV as the oracle, using RL–AL MPO reduces sample diversity relative to RL-baseline, however, we see a compensatory uptick in the sampling of compounds with enriched MPO scores. For ROCS we observe that the MPO strategy increases both the MPO score and the sample diversity of hits compared to baseline.
For both oracles, MPO based acquisition leads to a higher number of hits found per second of computation time compared to baseline. RL–AL MPO (ROCS) was ∼3-fold faster than RL baseline at generating 10 and 35 hits. RL–AL MPO (ADV) was ∼1.1-fold and 1.7-fold faster than RL baseline when generating 10 and 35 hits.
The RL–AL baseline, in the ROCS and ADV case, finds a maximum of 35 and 37 scaffolds respectively, therefore the time to find more scaffolds cannot be compared directly. In the time it took the RL baseline to find 35 scaffolds RL–AL MPO identifies ∼416 & 172 scaffolds for a ∼12 and ∼5-fold enrichment (Fig. 5F), for ROCS and ADV respectively. For ROCS, selecting either MPO or optimized configurations did not significantly change the number of hits found per second. For RL optimized with ADV, there was a 64- & 12-fold improvement in hits found per second compared to the RL-baseline & RL–AL MPO (computed from the assumption ALCPUtime − MPOtime ≈ ALCPUtime, see ESI Text S1† for hardware information).
All runs have been performed with diversity filters and experience replay (inception) enabled as we believe these reflect the practical recommendations and general use case of REINVENT. However, as a test case we also applied the RL–AL optimized policy without diversity filters, or without diversity filters and without experience replay, and find decreased model sample-efficiency in all cases. In particular, we note an increased learning rate in the early epochs, followed by a sharp decline in the sample-efficiency in later epochs, indicating the importance of the diversity filter for generating unique scaffolds beyond the initial hits (ESI Fig. S11†). Experience replay also increases the productivity of the RL–AL system, although the impact is less pronounced.
In addition, to check the chemical feasibility of the generated compounds we visualize some of the generated molecules (ESI Fig. S12†) and docked poses/ROCs overlays (ESI Fig. S13†), observing that (1) plausible binding modes and shape overlays are produced and (2) MPO-based sampling results in less lipophilic compounds for docking and fewer terminal halogens for ROCs.
When optimizing parameters for this process, we identified that the speed of RL optimization is highly sensitive to corruption of the oracle with random noise, which was also reflected in decreased performance when using surrogate-predicted values in place of oracle labels for the RL update. Indeed, we showed increased efficiency by down-weighting surrogate predictions all the way to zero, meaning that we found no benefit from incorporating these predictions into the RL update at all.
Despite not directly using the surrogate model for RL updates, our RL–AL system still demonstrates enormous acceleration in oracle-call-efficiency. We believe that the reason for this improvement is related to curation of the designs that are screened by the oracle and therefore used for the RL update – the inner AL loop is serving as a filter, using only the most promising ideas to update the RL agent. Since the only mechanism REINVENT-type systems can use to steer molecular generation is to increase the likelihood of generating favorable sequences, it is reasonable that increasing the proportion of positive examples improves convergence, as has been demonstrated in so-called “double loop” reinforcement learning,58 augmented memory,71 and augmented hill climbing.57 This intervention did not significantly reduce diversity of the generated leads, possibly due to the relatively permissive nature of the oracle functions. While direct comparisons in our current study are complicated by the variation in scoring functions and experimental design across different works, future research should aim to employ a unified benchmark for comparing various interventions that improve RL by maximizing exposure to positive examples. This would enable a more systematic assessment of the most effective techniques and explore the potential for these methods to complement each other.
This complex system depends on several hyperparameters related to the relative size of the RL and AL loops (Nbatch and Nacquired), and our survey of these options demonstrated interdependence between these factors. Generally, the lower the ratio of  the more scope the AL system has to improve upon baseline RL, in rough analogy to the size of virtual library in VS–AL methods.17 While our optimal configuration provided good performance on both oracle functions tested, it is likely that bespoke optimization of the configuration for the oracle at hand would lead to even better results.
 the more scope the AL system has to improve upon baseline RL, in rough analogy to the size of virtual library in VS–AL methods.17 While our optimal configuration provided good performance on both oracle functions tested, it is likely that bespoke optimization of the configuration for the oracle at hand would lead to even better results.
We did observe different behavior in terms of MPO score for the ROCS and ADV oracle, with the ROCS oracle behaving cooperatively with other MPO components. Improving the ROCS score via RL–AL led to commensurate improvements in the MPO score. ADV demonstrated competitive behavior with the other MPO components – anecdotally a common occurrence when using docking-based scoring. In this case, the RL–AL system's acceleration of learning high-scoring components according to the oracle did not lead to any improvements in average MPO score relative to the baseline (although many more “hits” were found).
To address this issue, and incorporate our understanding of the role of the AL loop as a curator of “good examples”, we introduced a novel MPO acquisition function and probabilistic formulation of the REINVENT score aggregation. This system was able to provide massive improvements in MPO optimization, nearly doubling the average MPO score compared with the RL baseline in the ADV case, at the expense of the raw number of hits produced relative to using only the oracle in the acquisition strategy. For the ROCS oracle, the MPO-acquisition strategy performed interchangeably with the oracle-only acquisition strategy.
While simple, our probabilistic reformulation of the REINVENT MPO criteria is highly flexible and we believe it provides a principled way to handle multiple scoring components with varying degrees of uncertainty in RL scoring functions, whether they come from machine learning model confidence (as in this case) or another source (for example, Bennet error72 in free energy prediction). Although the total enrichment of predicted active compounds with MPO compared to the non-MPO case was lower or similar (19.94 and 5.49 for ROCS and ADV respectively), we would recommend utilizing the MPO approach due to its capacity to better satisfy the more balanced MPO profile. Unstable or unsynthesizable suggestions do not add much value in practice.
Overall, RL–AL provides a self-contained method for accelerating de novo molecular design with RL methods and will provide a substantial reduction in compute resources required to produce the same number and quality of hits. Hopefully this improved sample-efficiency will permit the incorporation of even more accurate and expensive physics-based methods such as alchemical binding free energy predictions, which have shown great promise in VS-based methods.16,17,19,20 Connecting RL to these methods would enable on-the-fly generation of molecules according to state-of-the-art simulation workflows.
While we have focused on the well-studied application of RL to molecular design, the framework developed here does not explicitly depend on the application setting and offers a promising method to accelerate RL in other domains where oracle experiments or simulations are costly or time-consuming.
| log ![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) Paug(x) = log ![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) Pprior(x) + αMPO(x) | (1) | 
![[thin space (1/6-em)]](https://www.rsc.org/images/entities/char_2009.gif) Pprior(x) is the likelihood of the generated sequence conditioned on the prior model's parameters, i.e. the initial state of the agent, which serves a chemical regularizer since the prior is trained to reproduce real molecules from ChEMBL. S(x) is the MPO score assigned to the sequence in (1), computed according to eqn (3), and α is a constant (here, 128) that controls the balance between optimization and retention of the prior. At each epoch, we update the weights of the REINVENT agent to minimize the following modified loss function, termed the weighted difference between augment and prior (wDAP) loss,82 averaged over all molecules x in the batch X:
Pprior(x) is the likelihood of the generated sequence conditioned on the prior model's parameters, i.e. the initial state of the agent, which serves a chemical regularizer since the prior is trained to reproduce real molecules from ChEMBL. S(x) is the MPO score assigned to the sequence in (1), computed according to eqn (3), and α is a constant (here, 128) that controls the balance between optimization and retention of the prior. At each epoch, we update the weights of the REINVENT agent to minimize the following modified loss function, termed the weighted difference between augment and prior (wDAP) loss,82 averaged over all molecules x in the batch X:|  | (2) | 
REINVENT includes two non-standard elements that modify the RL process and are used here without modification. Firstly, REINENT uses a memory mechanism called a “diversity filter”84 to encourage exploration of the chemical space. A record of molecules with scores greater than a user defined minimum is maintained during RL and scores of new molecules in excess of “bucket size” that are too similar to existing ideas are set to zero. Here, we use identical Murcko scaffolds,68 a bucket size of 100 and minimum score of 0.2.
Secondly, REINVENT uses an experience replay mechanism whereby a buffer of the top scoring 100 SMILES is maintained and a random sample of 10 SMILES are sampled from this buffer in each RL step and added to the current batch of compounds when computing the loss function and weight updates.
|  | (3) | 
Note than penalty components return values between 0 and 1 so are not transformed. In this case, “custom alerts” is the only penalty component, and we use wi = 1 for all other components.
(1) Quantitative estimate of drug-likeness (QED):85 a simple metric for drug likeness. QED is a floating point number between 0–1 computed as an average of several common molecular properties.85 QED score was implemented via RDKit and used without transformation in RL.
(2) Molecular weight is used to constrain the size of the generated molecules in the range 200 to 500 Da. The molecular weight for compounds was computed using RDKit and was transformed using a double sigmoid in REINVENT with parameters “low” = 200, “high” = 550, “divisor coefficient” = 550, “sigmoid steepness coefficient” = 20.
(3) The number of hydrogen bond donors (HBD) is limited to be less than 6 to curtail exploitation of the oracle by undesirable compounds (i.e., adding donors generally improves docking score53). The number of hydrogen bond donors (HBD) were calculated using RDKit and transformed using a reverse sigmoid with parameters “low” = 2, “high” = 6 and “k” = 0.5.
(4) A set of “custom alerts” predefined in REINVENT, which prevent generation of unphysical ring sizes and unstable reaction groups (ESI Table S4†).
(5) The oracle function (docking or ROCS), with an appropriate score transformation to convert the result to a range between 0 and 1. We transform the ROCS score using a sigmoid function in REINVENT with parameters “low” = 0.3, “high” = 0.65 and “k” = 0.3, chosen such that the reference ligand score 0.6 receives a score of 0.92. Because desirable docking scores are negative numbers (indicating increasing free energy of binding), we transformed raw docking scores for use in RL using a reverse sigmoid function in REINVENT with parameters “low” = −13.5, “high” = −6, and “k” = 0.2. This is chosen so the reference ligand with docking score −11.4 receives a transformed score of 0.73.
 molecules at a time to be screened by the oracle scoring component (docking or ROCS). Each batch of compounds scored by the oracle is added to the training pool and a surrogate regression model (see ESI Text S3† for description of models) is trained to predict the oracle scores for all molecules in the training pool. This process is repeated Nloops times per RL step to evaluate exactly Nacquired compounds from generated Nbatch designs with the oracle. Since the acquisition functions depend in general on the surrogate model, we retrain the model on each update. For the first iteration, where there is no data in the training pool, the random acquisition function is used.
 molecules at a time to be screened by the oracle scoring component (docking or ROCS). Each batch of compounds scored by the oracle is added to the training pool and a surrogate regression model (see ESI Text S3† for description of models) is trained to predict the oracle scores for all molecules in the training pool. This process is repeated Nloops times per RL step to evaluate exactly Nacquired compounds from generated Nbatch designs with the oracle. Since the acquisition functions depend in general on the surrogate model, we retrain the model on each update. For the first iteration, where there is no data in the training pool, the random acquisition function is used.
        The oracle scores (for the acquired compounds) and the predicted scores based on the surrogate model (for un-acquired compounds) are used to compute the MPO score for the compounds in the batch. These scores are used to update the model's policy in accordance with the original paper (methods, REINVENT generative model).
Construction of appropriate predictive models for molecules is richly studied,86,87 but not the focus of the current work. Based on retrospective testing on a standard RL run, we evaluated several classical (RF, support vector regression, gradient boosting, k-nearest neighbors, Gaussian Process regression) and deep learning approaches (ChemProp) and found limited impact of model choice on surrogate model accuracy (ESI Fig. S14†). Additionally, we observed limited impact for featurization method (ESI Fig. S15†). We use RF with RDKit physchem descriptors as a prototypical surrogate model with a 1000-compound sliding window training set based on the most recently sampled oracle results, with fixed hyperparameters, optimized by retrospective analysis (ESI Fig. S16†).
(1) Sample S realizations from the distributions of probabilistic scoring components conditioned on x.
(2) Compute the MPO score that would result for each of these samples.
(3) Average the MPO scores from the samples in 2.
We formulate this equation as follows
|  | (4) | 
We find adequate Monte-Carlo convergence with S = 1000 samples (ESI Fig. S17†). An identical approach is used to estimate the standard deviation of the MPO score for use with UCB acquisition.
![[f with combining tilde]](https://www.rsc.org/images/entities/i_char_0066_0303.gif) , over the acquired batch. With UCB, we linearly balance the expected score and compound-specific uncertainty of the surrogate model, σ(x), with constant β (here, 1)
, over the acquired batch. With UCB, we linearly balance the expected score and compound-specific uncertainty of the surrogate model, σ(x), with constant β (here, 1)|  | (5) | 
Note that in the case of docking we instead minimize (since lower docking scores are better), and that greedy strategy is recovered in the case β = 0.
 where u is the mean of the dataset and σ is one standard deviation. Features that are invariant across all compound vectors are removed prior to training and inference.
 where u is the mean of the dataset and σ is one standard deviation. Features that are invariant across all compound vectors are removed prior to training and inference.
        | Footnote | 
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3sc04653b | 
| This journal is © The Royal Society of Chemistry 2024 |