Open Access Article
Pablo Masab,
Bruno Filoche-Romméb,
Marc Bianciotto
*b and
Rodolphe Vuilleumier
*a
aChimie Physique et Chimie du Vivant, École Normale Supérieure, PSL Université, Sorbonne Université, CNRS, 75005 Paris, France. E-mail: rodolphe.vuilleumier@ens.psl.eu
bIntegrated Drug Discovery, R&D, Sanofi, 94400 Vitry-sur-Seine, France. E-mail: marc.bianciotto@sanofi.com
First published on 9th June 2026
Drug discovery requires traversing vast chemical spaces to identify compounds exhibiting favorable potency, selectivity and absorption–distribution–metabolism–excretion–toxicity (ADMET) profiles. During this process, the synthetic and assay throughput is generally markedly lower than the ideation of new propositions by the project team members, so that the prioritization of new syntheses is indispensable. We herein introduce a framework for simulating the outcome of multi-objective prioritization strategies during lead optimization. Based on the Design–Make–Test–Analyze (DMTA) paradigm, historical discovery programs are replayed round by round using user-defined compound selection strategies. We develop qualitative and quantitative tools to assess their performance in retrieving the best compounds and exploring the project's chemical space. We demonstrate our pipeline using four industrial datasets, each containing chemical structures, assay values and time stamps. Multiple selection strategies are implemented, including approaches inspired by active learning (AL), multi-criteria decision analysis (MCDA), and medicinal chemistry heuristics, that display distinct behavior in the selection of compounds. Retrospective analysis provides a rigorous, low-cost test bed for investigating selection strategies in lead optimization and could help reduce the cost, duration and risk of lead optimization projects.
Active learning (AL), a subfield of ML, uses an acquisition function to selectively query unlabeled samples from a pool for training in order to improve the model's performance with minimal labeling effort by focusing on the most informative examples. It has demonstrated significant utility in data-scarce settings across various domains.9–11 Consequently, AL has gained traction in early-stage drug discovery, particularly for hit identification. Most applications focus on leveraging AL to enhance Quantitative Structure Activity Relationship (QSAR) and Quantitative Structure Property Relationship (QSPR) modeling with minimal data, thereby improving the efficiency of virtual screening and high-throughput screening campaigns.12–18
The strong interest in applying AL to hit identification stems from its critical role in the drug discovery pipeline, as well as the availability of large screening datasets19–21 and commercial compound libraries. In these cases, docking software often serves as a low-cost oracle, aiding in the benchmarking of AL-based selection strategies. However, extending AL methodologies to later stages of drug discovery such as hit-to-lead and lead optimization poses significant challenges.
Unlike hit identification, where the molecular pool remains static, subsequent stages involve iterative hypothesis generation within the Design–Make–Test–Analyze (DMTA) cycle. This process causes shifts in the chemical distribution over time, requiring predictive models to adapt accordingly. Additionally, lead optimization involves multi-objective optimization, where potency, selectivity, and ADMET properties must all be taken into account, further complicating the application of AL.
To rigorously benchmark AL strategies in lead optimization, one would ideally conduct real DMTA cycles. In such cycles, the design of molecules, whether by medicinal chemists or AI, would be coupled with an acquisition function to guide which compounds to synthesize and test. Although automated synthesis platforms have emerged,22–24 this approach remains costly and difficult to implement, especially when attempting to benchmark multiple strategies in parallel.
An alternative is to conduct fully virtual DMTA scenarios. Such an approach has been tested but only with one binary objective, with molecules being labeled as active or inactive based on a simple oracle using various ratios between the number of carbons, oxygens and nitrogens.25 Beyond the fact that the realism of this oracle is limited, this way of evaluation does not take into account the multi-objective nature of lead optimization.
In this work, we investigated another approach that is the retrospective analysis of legacy drug discovery projects through virtual scenarios in which an acquisition function guides compound selection (in this paper, the terms acquisition function and selection strategy will be used interchangeably). Unfortunately, we are not aware of publicly available lead optimization datasets that document the temporal evolution of a project, making this kind of retrospective study difficult to conduct outside of industry.
In response to these challenges, we developed a multi-objective molecular prioritization toolkit that incorporates selection strategies inspired by AL algorithms, multi-criteria decision analysis (MCDA) methods, and classical medicinal chemistry approaches. We also devised a methodology for running retrospective studies on legacy drug discovery projects, along with analytical tools to assess the exploration and exploitation performance of these strategies. By combining the toolkit with this retrospective methodology, we can run virtual “what if” scenarios to evaluate how various selection strategies might have performed historically. When conducting these simulations, we only have access to experimental data for the compounds actually tested during the project (not for those that were designed but never tested, as that would require an oracle). Thus, each simulation involves selecting a fraction of the real project's molecules and comparing the outcomes with the project's actual results. Acquisition functions are evaluated for their ability to match the real project outcomes (both in terms of exploitation and exploration of the chemical space) while selecting less compounds.
The aim of this paper is to present the prioritization toolkit, retrospective methodology and analytical tools, based on four partial datasets from previous projects.26–28 It is not intended to show which selection strategy is best but to present tools to benchmark and compare them. Additional insights and conclusions gleaned from larger and more consistent datasets will be presented in future work.
In summary, this work makes the following contributions:
• We created a molecular prioritization toolkit, drawing on selection strategies from active learning, multi-criteria decision analysis, and classical medicinal chemistry approaches.
• We developed a methodology for conducting virtual scenarios on legacy drug discovery projects.
• We introduced an analysis framework to benchmark strategies for both exploiting and exploring the chemical space.
• We utilized datasets comprising previously disclosed molecular structures15,26–33 to illustrate our simulation procedures and analytical methodologies. In this work, we enrich these datasets by including experimental timestamps for each compound and extend one of them with data for two additional experimental endpoints beyond the primary potency assay.
The first dataset is derived from a project targeting Factor Xa (FXa),26–30 the activated form of coagulation factor X, which plays a pivotal role in blood clot formation within the coagulation cascade. This dataset, spanning just over six years, comprises 1015 compounds synthesized and evaluated for potency against FXa, as well as for solubility and LogD. It serves as the primary demonstration dataset for showcasing the analytical tools developed in the Results section.
The second dataset originates from a project focused on Matrix Metalloproteinase-8 (MMP-8), an enzyme involved in degrading extracellular matrix components, implicated in inflammatory disorders, cancer metastasis, and periodontal disease. This dataset spans four years and includes 430 compounds.
The third dataset pertains to a project targeting Peroxisome Proliferator-Activated Receptor delta (PPARδ), a nuclear receptor crucial for metabolic and inflammatory regulation, implicated in metabolic disorders such as diabetes and cardiovascular diseases. It covers a period of nearly six years, comprising 498 compounds.
Finally, the fourth dataset comes from a project targeting renin,31–33 an enzyme essential for blood pressure and electrolyte balance regulation, primarily associated with hypertension and related cardiovascular disorders. This dataset spans over two years, with 388 compounds being evaluated.
| Dataset | Endpoint | Trend | LTV | HTV | Weight |
|---|---|---|---|---|---|
| FXa | pKi | H | 8.5 | 9 | 3 |
| FXa | Solubility (pH 7.40) (µM) | H | 50 | 100 | 1 |
| FXa | LogD (pH 7.40) | V | 1.5 | 3.5 | 1 |
| Renin | pIC50 | H | 8 | 9 | 1 |
| PPARδ | pEC50 | H | 8 | 9 | 1 |
| MMP-8 | pIC50 | H | 8.5 | 9 | 1 |
• Optimization trend: specifies whether the property should be maximized (higher is better, denoted as H), minimized (lower is better, denoted as L), or maintained within a specified range (interval trend, denoted as V).
• Lowest threshold value (LTV): for the higher is better trend defines the minimum value for a compound to be acceptable. For the lower is better trend defines the minimum value below which compounds are not considered “more acceptable” (i.e. utility of 1).
• Highest threshold value (HTV): for the higher is better trend defines the maximum value after which compounds are not considered “more acceptable” (i.e. utility of 1). For the lower is better trend defines the minimum value below which compounds are considered acceptable.
• Weight: a numerical value indicating the relative importance of the property in comparison to others. Weights can be assigned arbitrarily by the project team or medicinal chemists, or determined using MCDA methods, such as the Analytical Hierarchy Process,34 by defining a priority order among different endpoints. A weight of 0 can be assigned to endpoints that are monitored but not actively optimized.
The simulation results presented in the following sections were made with this fixed blueprint. However, the blueprint can be dynamic and may evolve throughout the course of a project. Weights assigned to different endpoints can shift as priorities change over time, and new properties may be added as additional data become available or new challenges arise. The framework enables changing the blueprint during a simulation if needed.
• Starting date (t0): the simulation can begin when the conditions for building a QSAR model on the primary endpoint are met (see below): at least 100 compounds with experimental data for the primary endpoint have been synthesized and tested, with at least one of them having an activity beyond the threshold on that endpoint.
• Ending date (tf): the simulation terminates on the same calendar date as the corresponding real-world project, allowing a direct, time-aligned comparison between virtual and real scenarios.
• Timestep (Δt): Δt represents the standard duration of a DMTA cycle, defined as the interval between consecutive project meetings for compound prioritization. Typical values for Δt range from 1 to 3 months.
• Memory effect: to simulate the natural de-prioritization of untested hypotheses over time, we introduced a memory parameter. This mechanism controls the retention of candidates in the pool. It can be configured for infinite retention (where compounds remain available indefinitely) or assigned a specific expiration threshold (e.g., removing unselected compounds after 6 consecutive iterations).
• Batch size: the batch size can be expressed as either an integer or a fraction. If an integer is provided, exactly that many compounds are selected at each iteration, provided a sufficient pool of virtual candidates exists. If a fraction is supplied, a proportionate subsample of number of compounds synthesized and tested in the real project during each Δt is selected.
• QSAR/QSPR models: at each iteration, Quantitative Structure–Activity Relationship (QSAR) or Quantitative Structure–Property Relationship (QSPR) models are trained for every endpoint that satisfies two prerequisites: (i) a minimum of 100 data points, ensuring baseline model reliability and (ii) at least one compound within the endpoint's optimal range. Feature extraction is performed using Extended Connectivity FingerPrints35 (ECFP, 2048 bits, radius 2, with chirality and count) fed into the algorithms shown in Table 2. We deliberately utilize default hyperparameters to maintain a neutral baseline and reduce computational overhead, and the objective of this work is to explain the methodology and framework workflow. However, in a real-world application, model-specific fine-tuning would be required to optimize predictive performance.
| Model | Uncertainty estimation methods |
|---|---|
| Random forest | Standard deviation of trees or bootstrap |
| LightGBM | Quantile regression or bootstrap |
| XGBoost | Quantile regression or bootstrap |
| Gaussian process | Posterior variance |
• Acquisition function: the acquisition function governs compound selection at every iteration and is the principal variable under investigation in the simulations. It may favor exploitation, exploration, or a balance of both. In this study, each simulation employs a single, fixed acquisition function; neither mixtures of functions nor dynamic switching strategies are considered in this work. Further details about available strategies are provided in a following section and in the SI.
The initial pool of molecules available for selection consists of those appearing between t0 and t0 + Δt (Fig. 1A). Molecules that appear after t0 + Δt are not considered part of the current hypothesis space and are therefore unavailable for selection. From this pool, a batch of molecules is selected based on the specified acquisition function, after which the simulation proceeds to the next iteration.
For subsequent iterations, previously selected compounds are added to the training set and removed from the pool. The training set is also updated with any newly experimental data that appeared for its molecules. The pool is then updated to include compounds appearing between t and t + Δt, as well as any remaining unselected compounds (Fig. 1B), leaving apart the molecules appearing after t + Δt. The updated pool set is evaluated using the acquisition function, and a new set of compounds is selected. The training set is then updated and the simulation continues to the following iteration.
Upon reaching the final date, every molecule will have been made available for selection in at least one iteration (Fig. 1C). Note that a molecule can remain in the pool indefinitely if it is never selected by the acquisition function.
First, for each documented endpoint of a molecule, experimental assay values are normalized between 0.05 and 1 to define a utility value, thanks to a utility function d inspired by Cummins and Bell.37 Utility functions, illustrated in Fig. 2, are smooth sigmoidal functions related to the trend of optimization and parametrized by the lowest threshold value (LTV) and the highest threshold value (HTV), along with shift (k) and steepness (η) parameters, which default to 0.05 and 10, respectively. When passed into a utility function, an experimental assay value x from endpoint i is normalized according to whether it should be minimized (eqn (1), Fig. 2A), maximized (eqn (2), Fig. 2B), or kept within a designated interval (eqn (3), Fig. 2C).
![]() | (1) |
![]() | (2) |
![]() | (3) |
Next, after computing the utility for each documented endpoint, the values are aggregated into a final desirability score (Dscore) using a weighted geometric mean (eqn (4)), with weights ωi defined in the blueprint. Previously setting the shift to 0.05 instead of 0 ensures that a poor property does not reduce the final score to zero.
![]() | (4) |
With each molecule in the project assigned a desirability score, various “top categories” can be defined. One such category, “Fixed (t = 0.5)” includes those with a desirability score exceeding a fixed threshold of 0.5. While this threshold-based approach is useful for monitoring progress in live projects, it presents challenges when applied to legacy data. A strict threshold may exclude all molecules, whereas a lenient one could classify nearly every molecule as “good,” rendering the classification ineffective.
To address these limitations, we introduce additional top categories based on score distribution. For instance, the Top DScore (X%) category include compounds within the highest X% of desirability scores, offering a more nuanced approach to the analysis. However, this method always selects a fixed proportion of molecules, irrespective of their absolute scores. To enhance classification flexibility, we also define categories based on deviations from the mean: Top DScore (nσ) includes molecules with scores at or above µ + nσ, where µ is the mean desirability score, σ is the standard deviation and n is a number greater than 0. This approach provides a more adaptive and context-sensitive way to identify promising molecules.
| Acquisition function | Objective | Selection | Model |
|---|---|---|---|
| Coverage score39 | Exploration | Batch | No |
| Desirability37 | Exploitation | Independent | Yes |
| Desirability with AD40 | Exploitation | Independent | Yes |
| Desired coverage * | Trade-off | Batch | Yes |
| Desired spread * | Trade-off | Batch | Yes |
| Dissimilarity-to-known | Exploration | Independent | No |
| Dissimilarity-to-known-good | Exploration | Independent | No |
| GRA41 | Exploration | Independent | Yes |
| Greediverse42 | Trade-off | Batch | Yes |
| K-means43 | Exploration | Batch | No |
| K-medoids44 | Exploration | Batch | No |
| Random | Exploration | Independent | No |
| Similarity-to-known | Exploration | Independent | No |
| Similarity-to-known-good | Exploration | Independent | No |
| Spread45 | Exploration | Batch | No |
| TOPSIS46 | Exploitation | Independent | Yes |
| Uncertainty * | Exploration | Independent | Yes |
| U/D harmonic * | Trade-off | Independent | Yes |
| U/D ratio * | Trade-off | Independent | Yes |
First, acquisition functions can rely on predictive (QSAR/QSPR) models. For example, some strategies use predicted values or prediction uncertainty to select molecules, such as Desirability, Desired Spread or Uncertainty. Other strategies, such as Coverage or K-means do not require predicted values and only rely on structural information. As discussed previously, Table 2 lists machine learning models and the associated uncertainty estimation methods that are available to choose from in the framework for building predictive models.
Second, acquisition functions can be distinguished by their primary objective. Some are designed to retrieve promising molecules, hence prioritizing exploitation, and others tend to foster exploration by selecting compounds all around the chemical space. Trade-off strategies also exist, balancing more or less equally exploitation and exploration.
Finally, acquisition functions can be categorized by their operational approach: independent strategies, in which each compound is chosen independently, versus batch strategies, in which the choice of one compound is influenced by the others selected. Notably, some batch selection strategies can be effectively approximated through sequential selection.38
Visualization of the chemical space across iterative cycles facilitates qualitative assessment of different acquisition functions. Commonly used visualization techniques include Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), Uniform Manifold Approximation and Projection (UMAP),47 and Tree MAP (TMAP).48 In this study, TMAP was selected due to its tree-like structure, which effectively preserves both global and local similarity. TMAP produces a two-dimensional (2D) graph where each node corresponds to a molecule. Extending this methodology, we developed the Kernel Density Estimate TMAP (KDE TMAP). KDE TMAP applies a Gaussian kernel to the 2D embeddings, transforming the discrete chemical space into a continuous representation that highlights areas with varying molecular densities, making interpretation quicker and simpler.
Given a set of data points x1, x2, …, xn, the kernel density estimate at point x is defined by:
![]() | (5) |
![]() | (6) |
To illustrate how a virtual scenario populates the chemical space, we first compute the TMAP for all molecules in the project (Fig. 3A) and subsequently transform it into a KDE TMAP (Fig. 3B). We then derive the KDE TMAP envelope (Fig. 3C), serving as a reference chemical space progressively populated across simulation iterations.
Fig. 4 shows the projection of different sets of molecules onto the project's envelope across iterations. The figure offers an intuitive way to gauge how effectively the simulation balances exploration of new chemical regions and exploitation of high-value areas. After the final iteration, the plot provides a holistic view of the sampled chemical space. Zones populated by compounds in the final training set correspond to regions that are explored by the acquisition function, whereas areas still containing only molecules from the pool set mark the chemical space that remains unvisited. The arrangement of the plotted markers thus also reveals the exploitative capability of the strategy, highlighting how well it selected the most promising compounds.
![]() | (7) |
This monotonic metric can be computed at each iteration of a simulation to gauge how effectively an acquisition function selects the most desirable molecules. A PTMR of 1 indicates that a strategy has retrieved all top molecules, whereas a value of 0 indicates that it has not retrieved any.
![]() | (8) |
Despite its simplicity and popularity, Internal Diversity has notable limitations.52 High internal diversity may result from structurally disparate molecules that do not necessarily exhibit drug-like properties or align with specific biological targets. Additionally, it does not explicitly measure the coverage of chemical space or biological relevance, highlighting the necessity for complementary metrics.
![]() | (9) |
We define the Neighborhood Coverage (NC), a distance-based metric measuring the fraction of molecules in R that have at least one neighbor in S within a specified Tanimoto-distance threshold (τ). This metric employs an indicator function I:
![]() | (10) |
Then, Neighborhood Coverage is computed as:
![]() | (11) |
The proposed metric utilizes a distance threshold of 0.3 (corresponding to a Tanimoto similarity of 0.7) based on 2048 bit ECFP4 binary fingerprints. This value aligns with established benchmarks for structural similarity55,56 and has been shown to provide qualitatively consistent results for medicinal chemists.57 Nevertheless, threshold selection remains inherently subjective, as its appropriateness depends on the specific fingerprint architecture and the requirements of the domain expert.58
![]() | (12) |
In our implementation, the Neighborhood Coverage is computed at discrete thresholds (e.g., τ = 0, 0.05, 0.10, …, 1.0), followed by numerical integration (e.g., using the trapezoidal rule) to estimate the area under the curve.
To reflect the operational cycle of a typical drug discovery project, simulations were conducted with a one-month timestep. The memory effect was disabled by default, ensuring that unselected compounds remained within the candidate pool for subsequent iterations. For comparison, results for simulations incorporating a six-month memory decay are provided in the Supplementary Information.
The batch size is set to a ratio of 0.5 to demonstrate the potential for reducing experimental effort; this means that at each iteration, the acquisition function selects 50% of the number of new molecules synthesized in the real project (Fig. 6A). Over 60 total iterations, the simulation selects 58% of the total molecules evaluated in the historical project (15% via initialization and 43% via the acquisition function; Fig. 6B).
•Random: randomly selects compounds from the pool set. Often treated as a dummy baseline in active learning studies, a purely random selection is informative in our retrospective context. Because each simulated batch contains fewer molecules than the corresponding experimental batch, random sampling provides a realistic estimate of project performance at comparable batch size. To obtain reliable statistics, Random simulations are repeated 100 times. All the following acquisition functions are repeated 10 times.
• Desirability: a purely exploitative strategy in which, at every iteration, the algorithm selects the molecules with the highest predicted desirability scores.
• Desirability (with applicability domain): this strategy extends the Desirability approach by restricting selection to molecules within the predictive models' applicability domain (AD). This reflects standard practice, as predictions are generally considered reliable only within the validated chemical space. We employ a common AD definition based on a similarity criterion:57,59 a molecule from the pool set is included only if its similarity to at least one training set compound exceeds 0.7. This threshold ensures consistency with the value defined for Neighborhood Coverage (Section 3.3.3.3). We encourage people that will use this framework to implement other applicability domain definitions.
• Similarity-to-known-good: emulates a popular strategy in medicinal chemistry where compounds are selected based on their structural similarity to the best molecules from the training set. Here, the best molecules are defined as those whose experimental desirability score is above 0.5.
• Greediverse: a strategy originally developed in-house for de novo drug design42 which optimizes batch desirability, while imposing a penalty, scaled by the parameter λ, on each pair of molecules whose similarity surpasses a predefined similarity threshold in order to introduce some diversity in the selection.
• Uncertainty: a widely used active learning strategy which selects the molecules with the most uncertain predictions, adapted here to the multi-objective case by aggregating the predicted uncertainties of each individual endpoint. As in the Desirability approach, all scores are first normalized—here with MinMax scaling—and subsequently combined by taking their geometric mean to yield a single uncertainty score. Molecules with the highest resulting score are then prioritized for selection.
• Coverage: a model-agnostic strategy leveraging Bayesian statistics and information entropy to select informative subsets.39
• U/D harmonic: a trade-off approach that prioritizes molecules with the highest harmonic mean between desirability and uncertainty scores.
• Desired spread: a hybrid exploration–exploitation strategy inspired by the demerit spread design method.60 It integrates predicted desirability scores with a diversity criterion, where desirability acts as a scaling factor for the compound's distance to its nearest neighbor among the training set or previously selected molecules.
•Oracle: this baseline selects the compounds with the highest experimental desirability scores, providing an upper bound for pure exploitation.
If a tie occurs where multiple compounds share identical acquisition scores but exceed the remaining batch capacity, selection is prioritized by the best predicted value of the highest-weighted endpoint defined in the blueprint.
Fig. 7B overlays the main chemical series, revealing distinct yet partially overlapping domains. Fig. 7C displays some of the best structures and their location in the tree-graph layout, revealing patterns of structural similarity and diversity among them and providing insight into how features transition across the chemical space within a series or between two similar series. For instance, the close spatial proximity between the azole and indole series is explained by their high scaffold similarity, a relationship mirrored between the aminoquinoline and ketopiperazine series.
Exploration-focused strategies such as Coverage (Fig. 8B) and Uncertainty (Fig. 8C) effectively sample compounds across the entire chemical space, leaving few areas unrepresented. However, these strategies often underperform in terms of exploitation, recovering only a small number of top-tier compounds.
Conversely, exploitation-oriented strategies display an inverse pattern. The Desirability strategy (Fig. 8E), which targets molecules with the highest predicted desirability scores, successfully retrieves many top-performing compounds but neglects regions of the chemical space. When the applicability domain constraint is applied (Fig. 8F), exploration is even more limited, resulting in missed opportunities to capture promising compounds from series such as aminoquinoline or ketopiperazine. Likewise, the Similarity-to-known-good strategy (Fig. 8D) shows excellent exploitation within the azole and indole series, successfully identifying all top-ranked molecules in these series, but it fails to explore beyond them, overlooking entire chemical series.
Strategies where the exploration–exploitation balance is built-in offer a more favorable trade-off. Greediverse (Fig. 8G) moderately improves exploration, particularly in the β-amino acid and benzamidine series. Likewise, the Desired Spread (Fig. 8H) preserves the strong exploitative performance of Desirability while significantly enhancing the exploratory coverage, approaching that of exploration-focused methods.
Fig. 9 displays the evolution of these metrics over iterations and creates a coherent image. In early iterations, most acquisition functions cluster together, but Desirability (with AD) and Similarity-to-known-good quickly fall behind and remain the weakest throughout. Exploration-oriented Uncertainty and Coverage consistently top the curves but the trade-off Desired Spread strategy often matches or surpasses them, suggesting that judicious mixing of exploration and exploitation can pay off for exploration. Neighborhood Coverage and its AUC variant offer a very fine decomposition between the different scenarios. The non-monotonic nature of Internal Diversity enables highlighting phases when the chemical space is being focused with consistent drops such as the one between iterations 2 and 7. For #Circles, plateau phases obscure fine differences during simulations, but Desired Spread again secures the highest final score, followed closely by Uncertainty and U/D Harmonic.
Fig. 10 shows the exploitation performance of the different acquisition functions by monitoring the PTMR over iterations. All acquisition functions but Coverage perform similarly or outperform the Random baseline for both the top 5% and top 1% categories. Purely exploratory acquisition functions—Coverage and Uncertainty—have a PTMR of ≈0.55, mirroring Random. Desirability (with AD) and Similarity-to-known-good have similar performance. In contrast, exploitation-oriented methods excel. In the top 5% category (Fig. 10A), Desirability, Greediverse and Desired Spread stay close to the Oracle throughout optimization and finish with a PTMR ≈0.85. The distinction becomes sharper in the top 1% category (Fig. 10B). Desirability and Greediverse retrieve 90% of the top molecules and Desired Spread is almost indistinguishable from Oracle, ultimately retrieving all the top molecules. By contrast, Coverage and Uncertainty remain similar to random, with a final PTMR ≈0.5.
The tree-based TMAP visualization helps understand structural relationships among project compounds, thanks to its tree-based nature. The toolkit augments this view with KDE TMAP plots that qualitatively track the exploration–exploitation balance across iterations (see e.g., Fig. 8). TMAP visualization enable a quick and intuitive assessment of different strategy behaviors.
For exploration, the four proposed metrics are complementary. Neighborhood Coverage is a monotonic, reference-based measure well-suited to post-hoc benchmarking against the complete project set; with an appropriately chosen distance threshold, it finely discriminates among strategies (Fig. 9A). Its threshold-free variant, Neighborhood Coverage AUC, shares these advantages but tends to smooth inter-strategy differences (Fig. 9B). Internal Diversity and #Circles are reference-free, making them applicable to live projects in which the final chemical space is unknown. A drop in Internal Diversity signals a concentration on a limited region of chemical space (Fig. 9C), whereas #Circles provides an interpretable count of newly explored regions (Fig. 9D). Collectively, these metrics yield consistent, interpretable rankings of the tested strategies.
In Fig. 11, we plot an exploration metric (Neighborhood Coverage) against an exploitation metric (PTMR within the Top 5% DScore) to reveal the trade-off between the two objectives. On the FXa dataset, the Desired Spread strategy exhibits both strong exploration and strong exploitation performance.
Plotting only the final points for each strategy further streamlines comparisons. Fig. 12 illustrates this endpoint trade-off for all implemented strategies across the FXa, renin, PPARδ, and MMP-8 datasets. The first three datasets exhibit similar patterns, with each strategy occupying approximately the same region of chemical space relative to the Random baseline strategy. Conversely, the fourth dataset (MMP-8) demonstrates notably different outcomes, likely due to its distinct and sparser experimental timeline (see the MMP-8 section in Supplementary Information). Based on their positions relative to the Random baseline, the strategies we evaluated can be categorized as follows:
• Strategies demonstrating worse exploitation than random.
• Strategies with improved exploitation—which is the primary objective in lead optimization—but reduced exploration.
• Strategies showing superior performance in both exploitation and exploration; these represent the optimal ones.
First, the datasets utilized represent only a subset of real-world projects, particularly regarding compound diversity, readouts, and design blueprints. Consequently, these should be viewed primarily as demonstration datasets intended to illustrate our simulation methodology and analytical framework. The results presented here are not intended to establish universal rules or general principles of lead optimization. Larger-scale studies involving dozens of comprehensive datasets are currently underway.
Second, our simulations relied on a single combination of ECFP + Random Forest model using standard deviation for uncertainty estimation. While model selection can influence results, our sensitivity analyses (detailed in the SI) demonstrate that performance differences across various predictive architectures are minimal for a given acquisition strategy.
Third, the primary simulations presented in this manuscript do not explicitly model the “memory effect,” which is the practical tendency for untested hypotheses to be de-prioritized or discarded over time. This simplification appears to have a negligible impact on the study's conclusions. As detailed in Section 3.1, 4.8, 5.8 and 6.8 of the Supplementary Information, introducing a six-iteration memory constraint yields results comparable to those obtained in memory-free simulations.
Fourth, retrospective simulations assume that all compounds historically available at time t are immediately selectable, overlooking synthetic lineage. In practice, chemical synthesis is a dependency chain: molecule A may be a required intermediate for B, which in turn is necessary for C. If a simulation selects A and C while bypassing B, it artificially inflates exploitation metrics. Correcting this is difficult because the historical logs needed to reconstruct these dependencies are often buried in physical laboratory notebooks, making them nearly inaccessible for older projects.
All simulations in this study employed a fixed acquisition strategy throughout the iterations. In contrast, real-world projects often begin with broad exploration to identify promising regions, then gradually shift toward focused exploitation as structure–activity insights accumulate. Because our framework supports dynamic switching or blending of strategies, future investigations will explore adaptive acquisition policies that evolve over time, better mirroring the iterative reasoning used by project teams.
Finally, the blueprint used in this work only consider endpoint-related criteria. However, additional constraints could be incorporated such as synthetic accessibility,61–63 estimated synthesis cost, or computable molecular properties like molecular weight, topological polar surface area (TPSA), or cLogP.
Supplementary information: datasets information, simulations setup, detailed description of the acquisition functions, additional figures for the FXa dataset, and the full set of corresponding figures for the other datasets. See DOI: https://doi.org/10.1039/d5dd00387c.
| This journal is © The Royal Society of Chemistry 2026 |