Open Access Article

Satya Pratik Srivastava,a Rohan Gorantla,bc Sharath Krishna Chundru,a Claire J. R. Winkelman,c Antonia S. J. S. Mey*c and Rajeev Kumar Singh*a

aShiv Nadar University, Delhi-NCR, India. E-mail: rajeev.kumar@snu.edu.in
bSchool of Informatics, University of Edinburgh, Edinburgh EH8 9AB, UK
cEaStCHEM School of Chemistry, University of Edinburgh, Edinburgh EH9 3FJ, UK. E-mail: Antonia.mey@ed.ac.uk

First published on 23rd December 2025
Active learning (AL) prioritises which compounds to measure next for protein–ligand affinity when assay or simulation budgets are limited. We present an explainable AL framework built on Gaussian process regression and assess how molecular representations, covariance kernels, and acquisition policies affect enrichment across four drug-relevant targets. Using recall of the top active compounds, we find that dataset identity, that is, a target's chemical landscape, sets the performance ceiling, while method choices modulate outcomes rather than overturn them. Fingerprints with simple Gaussian process kernels provide robust, low-variance enrichment, whereas learned embeddings with non-linear kernels can reach higher peaks but with greater variability. Uncertainty-guided acquisition consistently outperforms random selection, yet no single policy is universally optimal; the best choice follows structure–activity relationship (SAR) complexity. To enhance interpretability beyond black-box selection, we integrate SHapley Additive exPlanations (SHAP) to link high-impact fingerprint bits to chemically meaningful fragments across AL cycles, illustrating how the model's attention progressively concentrates on SAR-relevant motifs. We additionally provide an interactive active learning analysis platform featuring SHAP traces to support reproducibility and target-specific decision-making.
Active learning (AL), a subset of machine learning, has emerged as a framework to address this challenge.2,9,15 By training a surrogate model, quantifying predictive uncertainty, and iteratively prioritising the next most informative compounds, AL maximises information gain from a limited number of experimental assay measurements or physics-based computations, enabling efficient enrichment without relying on brute-force screening.9,15–18 In practice, AL balances exploitation, which refines known high-activity scaffolds, against exploration, which probes novel chemotypes that may unlock new structure–activity relationships (SARs). This trade-off is controlled by the acquisition strategy.19,20 As a result, AL has been deployed for ligand binding affinity prediction and multi-property lead optimisation under assay- or simulation-constrained budgets.2,9,15,18,20 Notwithstanding its potential, AL is not a “one-size-fits-all” solution.15,21 Its performance depends on a complex interplay of methodological choices, including the underlying machine learning model, the molecular representation, the kernel function, and the acquisition protocol,19,21 and outcomes further vary with the chemical landscape of the library.15,19,21 Moreover, surrogate models and representations from deep learning models can behave as “black boxes,” limiting chemical intuition and trust in recommendations.22–24 AL has been applied successfully on individual targets2,25,26 and specific workflows,10,27,28 and recent efforts have begun to systematically explore different strategies and parameters.15,21 Open questions remain around when different AL designs are most effective, why performance varies across chemical spaces, and how to incorporate explainability into the selection process of AL cycles to guide design choices that can be experimentally verified.
In this work we combine explainability with a systematic exploration of seven acquisition protocols, five Gaussian-process kernels, and three molecular representations (ECFP4, MACCS, and ChemBERTa) in a fixed-budget setting for pharmaceutically relevant targets taken from the literature (TYK2, USP7, D2R, and MPro). We show that the inherent chemical landscape of each target substantially dictates achievable enrichment, and that the choice of representation–kernel combination presents a trade-off between robustness (e.g., fixed fingerprints with simple kernels) and peak performance (e.g., learned embeddings with non-linear kernels). To move beyond black-box selection, we integrate SHapley Additive exPlanations29 (SHAP) to map high-impact fingerprint bits to chemically interpretable fragments over AL cycles, revealing how model focus sharpens onto SAR-relevant motifs. To allow easy visualisation and analysis of the various AL strategies in combination with the SHAP analysis, we provide an active learning analysis platform. It visualises this comprehensive analysis across all settings and targets, integrates SHAP traces to support reproducibility and target-specific decision-making, and can easily be adapted to different protocols and targets for a comprehensive, interactive understanding of different AL strategies and their impact on chemical space. Our code is available at https://github.com/meyresearch/explainable_AL.
| y = f(x) + ε, ε ∼ N(0, σ2n) | (1) |
| f(x) ∼ GP(m(x),k(x, x′)) | (2) |
A GP models the unknown affinity function f(x) through a distribution over functions and can flexibly approximate the nonlinear relationships needed to traverse the vast chemical space.
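As a concrete illustration, combining the GP prior of eqn (2) with a Tanimoto covariance on binary fingerprints yields a closed-form posterior mean and variance. The sketch below uses plain NumPy with toy random bit vectors in place of real ECFP fingerprints; it illustrates the mechanics rather than reproducing the paper's implementation.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of two binary fingerprint matrices."""
    inter = A @ B.T                                      # |a AND b|
    union = A.sum(1)[:, None] + B.sum(1)[None, :] - inter
    return inter / np.maximum(union, 1e-12)

def gp_posterior(K_train, K_cross, K_test_diag, y, noise=0.1):
    """Closed-form GP posterior mean and variance for test points."""
    N = K_train.shape[0]
    L = np.linalg.cholesky(K_train + noise * np.eye(N))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = K_cross @ alpha                                 # posterior mean mu(x)
    v = np.linalg.solve(L, K_cross.T)
    var = K_test_diag - np.sum(v**2, axis=0)             # posterior variance sigma^2(x)
    return mu, var

# Toy binary "fingerprints" and affinities (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.integers(0, 2, size=(20, 64)).astype(float)
y_train = rng.normal(size=20)
X_test = rng.integers(0, 2, size=(5, 64)).astype(float)

mu, var = gp_posterior(
    tanimoto_kernel(X_train, X_train),
    tanimoto_kernel(X_test, X_train),
    np.ones(len(X_test)),          # Tanimoto self-similarity is 1
    y_train,
)
```

The posterior mean µ(x) and standard deviation σ(x) obtained this way are exactly the quantities consumed by the acquisition function in eqn (3).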
| sacq(x) = α µ(x) + β σ(x) | (3) |
The seven distinct active learning acquisition protocols in this study were designed to systematically probe the trade-off between exploration and exploitation. Each protocol began with an initial random batch of 60 compounds to seed the model, followed by 10 acquisition cycles of 30 compounds each. The exploration-exploitation balance was controlled by dynamically varying the α and β parameters in the generalized Upper Confidence Bound (UCB) acquisition function: sacq(x) = α µ(x) + β σ(x).
This framework allows for three primary modes: pure exploration (α = 0, β = 1), which prioritizes molecules with the highest uncertainty (σ(x)); pure exploitation (α = 1, β = 0), which selects the most promising estimated affinity (µ(x)); and a balanced strategy (α = 0.5, β = 0.5). The specific schedules for each protocol are summarised in Table 1.
| Protocol name | Acquisition schedule (10 cycles of 30 compounds) |
|---|---|
| Random baseline | [R(30)] × 10 |
| UCB-balanced | [B(30)] × 10 |
| UCB-alternate | [E(30), X(30)] × 5 |
| UCB-sandwich | [E(30)] × 2 + [X(30)] × 6 + [E(30)] × 2 |
| UCB-explore-heavy | [E(30)] × 7 + [X(30)] × 3 |
| UCB-exploit-heavy | [X(30)] × 7 + [E(30)] × 3 |
| UCB-gradual | [E(30)] × 3 + [B(30)] × 4 + [X(30)] × 3 |
Beyond simple baselines like the random and UCB-balanced protocols, we designed several dynamic strategies to model different discovery campaign philosophies. In the schedules of Table 1, R, E, X, and B denote random, exploration (α = 0, β = 1), exploitation (α = 1, β = 0), and balanced (α = 0.5, β = 0.5) batches, respectively.
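The schedules in Table 1 can be encoded compactly, assuming the E/X/B/R letters denote explore, exploit, balanced, and random batches. The dictionary below is an illustrative reconstruction rather than the authors' code; random batches bypass UCB scoring entirely.

```python
import numpy as np

# (alpha, beta) for each UCB mode of eq. (3); "R" batches are drawn at random.
MODES = {"E": (0.0, 1.0), "X": (1.0, 0.0), "B": (0.5, 0.5)}

PROTOCOLS = {
    "random-baseline":   ["R"] * 10,
    "ucb-balanced":      ["B"] * 10,
    "ucb-alternate":     ["E", "X"] * 5,
    "ucb-sandwich":      ["E"] * 2 + ["X"] * 6 + ["E"] * 2,
    "ucb-explore-heavy": ["E"] * 7 + ["X"] * 3,
    "ucb-exploit-heavy": ["X"] * 7 + ["E"] * 3,
    "ucb-gradual":       ["E"] * 3 + ["B"] * 4 + ["X"] * 3,
}

def ucb_score(mu, sigma, mode):
    """Generalised UCB acquisition score s_acq(x) = alpha*mu(x) + beta*sigma(x)."""
    alpha, beta = MODES[mode]
    return alpha * np.asarray(mu) + beta * np.asarray(sigma)

def select_batch(mu, sigma, mode, batch_size=30):
    """Indices of the batch_size highest-scoring unqueried compounds."""
    return np.argsort(-ucb_score(mu, sigma, mode))[:batch_size]
```

Each protocol spans 10 cycles of 30 compounds, so every schedule list has length 10.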
The choice of kernel function is fundamental to the GP's ability to model correlations between data points based on their similarity. We explore five distinct covariance kernel functions, viz., Tanimoto, linear, Radial Basis Function (RBF), Rational Quadratic (RQ), and Matérn (ν = 1.5). For all kernels that include hyperparameters (i.e., linear, RBF, RQ, and Matérn), these parameters (e.g., lengthscale ℓ, shape parameter α, outputscale s, and noise variance σ2n) were optimized by maximizing the marginal log-likelihood during model training.36,37 Further details on all kernels are provided in the SI.
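As a generic illustration of this hyperparameter fitting, scikit-learn's GP regressor optimises kernel parameters by maximising the log marginal likelihood during `fit`. The snippet below is a sketch with random stand-in continuous features (e.g., in place of ChemBERTa embeddings) and is not necessarily the authors' software stack.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

# Outputscale s (ConstantKernel), lengthscale l (Matern, nu=1.5), and noise
# variance (WhiteKernel) are all tuned by maximising the log marginal likelihood.
kernel = ConstantKernel(1.0) * Matern(length_scale=1.0, nu=1.5) \
         + WhiteKernel(noise_level=0.1)

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))                 # stand-in continuous embeddings
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=40)

gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)
mu, sigma = gp.predict(X[:5], return_std=True)
```

After fitting, `gp.kernel_` holds the optimised hyperparameters and `gp.log_marginal_likelihood_value_` the objective reached.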
| φi = ΣS⊆F∖{i} [|S|!(|F| − |S| − 1)!/|F|!] (fS∪{i}(xS∪{i}) − fS(xS)) | (4) |
| g(z′) = φ0 + Σi φi z′i | (5) |
For each AL cycle, SHAP values were evaluated on 100 test molecules randomly sampled from the unqueried pool, using a background of 50 randomly sampled compounds from the training set to initialise the shap.KernelExplainer. The top ten features ranked based on the mean absolute SHAP value were retained for detailed analysis. The stability and robustness of these feature attributions were validated through quantitative analysis across different acquisition protocols.
For models trained on ECFP fingerprints, selected features were mapped back to molecular fragments using RDKit. To address the ambiguity of mapping ECFP bits (due to bit collisions or multiple environments), we implemented an affinity-prioritized algorithm. Atom environments corresponding to top-ranked fingerprint bits were first identified in all molecules containing the bit. These molecules were then sorted by descending affinity. The environment from the highest-affinity compound was extracted using Chem.FindAtomEnvironmentOfRadiusN, canonicalised to a SMILES string, and used as the representative fragment. These fragments were then ranked by a combined score of frequency and SHAP magnitude. This procedure ensures that the identified chemical substructures are those most strongly associated with the high-potency predictive signal and allows for a mechanistic interpretation of how AL reshapes the model's representation of structure–activity relationships.
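The final ranking step, combining bit frequency with SHAP magnitude, might look like the sketch below. The equal-weight product of normalised terms is an assumption, since the paper does not specify the exact weighting, and `bit_to_fragment` stands in for the RDKit-derived mapping (via `Chem.FindAtomEnvironmentOfRadiusN`) described above.

```python
def rank_fragments(bit_to_fragment, bit_frequency, shap_magnitude, top_n=10):
    """Rank representative fragments by a combined frequency x |SHAP| score.

    bit_to_fragment: {bit_id: canonical fragment SMILES} from the RDKit mapping.
    bit_frequency:   {bit_id: number of pool molecules containing the bit}.
    shap_magnitude:  {bit_id: mean absolute SHAP value of the bit}.
    """
    max_f = max(bit_frequency.values()) or 1
    max_s = max(shap_magnitude.values()) or 1e-12
    # Assumed combination rule: product of min-max-free normalised terms.
    scores = {
        bit: (bit_frequency[bit] / max_f) * (shap_magnitude[bit] / max_s)
        for bit in bit_to_fragment
    }
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_n]
    return [(bit, bit_to_fragment[bit], scores[bit]) for bit in ranked]
```

A fragment that is both frequent and high-impact thus rises to the top, matching the intent of associating substructures with the high-potency predictive signal.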
This cycle of model updating and batch acquisition is then repeated identically for every parameter combination and dataset, ensuring an unbiased comparison across datasets.
In this work, a single “experiment” refers to one complete, 10-cycle active learning simulation for a specific combination of datasets, molecular representations, kernels, acquisition protocols, and random seeds.
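A single experiment of this kind can be sketched as the loop below. It uses a scikit-learn surrogate with fixed kernel hyperparameters for speed; this illustrates the loop structure only and is not the study's actual code.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def run_experiment(X, y, schedule, init=60, batch=30, seed=0):
    """One AL experiment: random seed batch of `init`, then one acquisition
    per cycle; `schedule` is a list of (alpha, beta) pairs, one per cycle."""
    rng = np.random.default_rng(seed)
    order = np.arange(len(X))
    rng.shuffle(order)
    acquired, pool = list(order[:init]), list(order[init:])
    for alpha, beta in schedule:
        # Fixed hyperparameters (optimizer=None) keep this sketch fast.
        gp = GaussianProcessRegressor(normalize_y=True, optimizer=None)
        gp.fit(X[acquired], y[acquired])
        mu, sigma = gp.predict(X[pool], return_std=True)
        picks = np.argsort(-(alpha * mu + beta * sigma))[:batch]   # eq. (3)
        acquired += [pool[i] for i in picks]
        pool = [p for j, p in enumerate(pool) if j not in set(picks)]
    return acquired

# Toy pool of 400 compounds; 10 balanced cycles -> 60 + 10*30 = 360 acquired.
X = np.random.default_rng(2).normal(size=(400, 5))
y = X.sum(axis=1)
acquired = run_experiment(X, y, [(0.5, 0.5)] * 10)
```

Swapping the `(alpha, beta)` schedule reproduces any of the seven protocols of Table 1 (random batches would replace the scored selection with a random draw).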
For each dataset–representation–kernel combination, all seven acquisition strategies were evaluated, resulting in a total of 4 × 3 × 5 × 7 = 420 distinct experiments. The vast scope of the experiments poses a challenge to visualise and evaluate these results.
The entire computational study, including the training of all GP models, required approximately 4 hours of wall-clock time on a single NVIDIA RTX 4090 GPU. This demonstrates the practical feasibility of applying our comprehensive benchmarking framework.
The recall of top compounds (Rk) metric quantifies the fraction of truly high-affinity compounds (top k%) that are successfully identified by the active learning process, relative to the total number of such compounds present in the entire dataset. It is calculated using the following eqn (6),
| Rk = Nfound,k/Ntotal,k | (6) |

where Nfound,k is the number of compounds in the acquired set that belong to the top k% class, and Ntotal,k is the total number of compounds that actually belong to the top k% most active ones based on the observed activity in the entire dataset. Recall was computed for the top 2% (R2) and 5% (R5) of compounds.
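A minimal implementation of Rk, assuming the acquired set is given as indices into the dataset, is:

```python
import numpy as np

def recall_top_k(y_all, acquired_idx, k=0.02):
    """R_k: fraction of the dataset's top-k% compounds present in the acquired set."""
    n_top = max(1, int(round(k * len(y_all))))
    # Indices of the n_top most active compounds in the whole dataset.
    top = set(np.argsort(-np.asarray(y_all))[:n_top])
    return len(top & set(acquired_idx)) / n_top

# Example: affinities 0..99; the top 5% are indices 95..99.
y = list(range(100))
r5 = recall_top_k(y, [99, 98, 97, 10, 11], k=0.05)   # 3 of the 5 hits found
```

With k = 0.02 and k = 0.05 this yields the R2 and R5 values reported throughout.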
To provide a more comprehensive and robust assessment of early enrichment performance, we also report two additional standard metrics. The Enrichment Factor (EFk) measures how many times more frequently active compounds are found within the top k% of a ranked list compared to a random selection. It is defined as given in eqn (7), where nactives,k is the number of actives in the top k% of the ranked list (Nk compounds) and nactives is the total number of actives among all Ntotal compounds:

| EFk = (nactives,k/Nk)/(nactives/Ntotal) | (7) |
An EFk of 1.0 corresponds to random performance. In this study, we report the EF at 1%, 2%, and 5%.
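EFk can be computed analogously. Here "actives" are taken to be the dataset's top-k% most potent compounds, mirroring the percentile-based definition used for Rk; this is an assumption where the paper's exact definition of an active is not restated.

```python
import numpy as np

def enrichment_factor(y_all, selected_idx, k=0.02):
    """EF_k: hit rate among the selected compounds relative to the overall
    hit rate, with 'hits' defined as the dataset's top-k% most active compounds."""
    y_all = np.asarray(y_all)
    n_top = max(1, int(round(k * len(y_all))))
    hits = set(np.argsort(-y_all)[:n_top])
    found = len(hits & set(selected_idx))
    return (found / len(selected_idx)) / (n_top / len(y_all))
```

Selecting exactly the top-k% hits gives the maximum EFk of 1/k (e.g., 20 at k = 5%), while an unenriched selection gives EFk = 1.0.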
To mitigate the sensitivity to a fixed cutoff k, we also report the Boltzmann-Enhanced Discrimination of ROC (BEDROC) score.39 BEDROC is a metric that preferentially rewards the identification of active compounds at the top of a ranked list without requiring an arbitrary cutoff. It applies an exponential weight to each compound based on its rank, such that hits at the beginning of the list contribute much more to the final score than those found later. Following common practice for virtual screening, we use an α parameter of 20.0, which heavily focuses the evaluation on the top portion of the ranked list. The score ranges from 0 (no enrichment over random) to 1 (perfect ranking).
| Property | TYK2 | USP7 | D2R | MPRO |
|---|---|---|---|---|
| Binding measure | pKi | pIC50 | pKi | pIC50 |
| Ligands (total) | 9997 | 1799 | 2502 | 2062 |
| Scaffolds (unique) | 104 | 770 | 1034 | 934 |
| Std dev (pKi/pIC50 values) | 1.36 | 1.31 | 1.44 | 0.91 |
| N/M ratio | 0.0104 | 0.428 | 0.413 | 0.452 |
We note that the datasets employ different affinity measures (pKi for TYK2 and D2R; pIC50 for USP7 and MPRO), as shown in Table 2. As these units are derived from different assay types and are not directly comparable, our study does not make direct, quantitative comparisons of the absolute affinity values across targets. Instead, our primary performance metric, recall of top compounds (Rk), is based on a relative, percentile-based threshold. For each dataset, the “top k%” active compounds are determined by internally ranking the molecules based on their specific affinity measure. This approach allows for a valid comparison of the enrichment efficiency of the AL strategies across the different chemical landscapes, without relying on a comparison of the raw activity scales.
Scaffold diversity, as determined by the ratio of unique scaffolds to total molecules (N/M), is the main differentiator between the datasets. TYK2 exhibits exceptionally low diversity (N/M ≈ 0.01), indicating a highly constrained chemical space dominated by a few structural motifs. On the other hand, USP7, D2R, and MPRO exhibit significantly greater diversity (N/M ≈ 0.41–0.45), indicative of more structurally diverse compound collections.
Scaffold diversity directly impacts molecular similarity patterns within each dataset, as shown in Fig. 2. TYK2's constrained chemical space is particularly evident with ECFP fingerprints, which show highly skewed similarity distributions in which the majority of compound pairs exhibit low Tanimoto similarities (Fig. 2A). ChemBERTa embeddings and MACCS keys, on the other hand, display broader distributions centered at higher similarity values (Fig. 2B and C), demonstrating how different representations highlight structural homogeneity differently. In contrast, USP7, D2R, and MPRO show wider and more diverse internal similarity distributions across all three molecular representations (Fig. 2A–C): ECFP fingerprints produce sharp peaks at low similarity values, whereas MACCS keys and ChemBERTa embeddings give more spread-out distributions because they capture molecular structure in different ways.
Dataset diversity patterns have direct implications for active learning performance. While the more expansive chemical landscapes of USP7, D2R, and MPRO offer greater scope for strategic compound selection, a constrained chemical space like TYK2's restricts the opportunity for diversified exploration. Further dataset diagnostics are provided in the SI.
Statistical analysis demonstrates that the intrinsic properties of the target dataset are the most dominant factor in determining achievable performance. To quantify the relative contributions of our methodological choices, we conducted a four-factor ANOVA (Type II Sums of Squares) on the final recall (Rk) values from all non-random protocols. The full model explained a substantial proportion of the variance in performance (R2 = 0.84; Adjusted R2 = 0.82).
To properly assess effect sizes, we computed omega-squared (ω2), an unbiased estimator of the population effect size, along with 95% bootstrap confidence intervals (1,000 iterations). Dataset identity exhibited the largest effect (ω2 = 0.31, 95% CI [0.28, 0.35]; F(3, 994) = 640.37, p < 0.001), confirming that the chemical landscape sets fundamental performance constraints. Notably, the dataset × kernel interaction showed a similarly large effect (ω2 = 0.31; F(12, 994) = 160.22, p < 0.001), demonstrating that kernel effectiveness is highly context-dependent.
Other factors made smaller but significant contributions: kernel choice (ω2 = 0.09, 95% CI [0.07, 0.11]; F(4, 994) = 135.27, p < 0.001), molecular representation (ω2 = 0.03, 95% CI [0.02, 0.04]; F(2, 994) = 91.47, p < 0.001), and the kernel × fingerprint interaction (ω2 = 0.04; F(8, 994) = 29.08, p < 0.001). The acquisition protocol, while statistically significant (F(5, 994) = 12.44, p < 0.001), had the smallest main effect (ω2 = 0.01, 95% CI [0.004, 0.019]), suggesting that its role is to modulate outcomes within the constraints imposed by the dataset and model architecture. This statistical evidence reinforces that optimal active learning strategies are highly context-dependent, with dataset characteristics and their interactions with methodological choices playing the dominant role.
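The ω2 values above follow the standard ANOVA effect-size formula, ω2 = (SSeffect − dfeffect · MSerror)/(SStotal + MSerror); a one-line helper for reference:

```python
def omega_squared(ss_effect, df_effect, ss_total, ms_error):
    """Unbiased ANOVA effect size:
    w2 = (SS_effect - df_effect * MS_error) / (SS_total + MS_error)."""
    return (ss_effect - df_effect * ms_error) / (ss_total + ms_error)

# Toy illustration: SS_effect=10, df=2, SS_total=100, MS_error=1 -> 8/101.
w2 = omega_squared(10.0, 2, 100.0, 1.0)
```

Unlike eta-squared, the df·MSerror correction prevents small effects from being overstated in finite samples.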
Post-hoc Tukey HSD analysis showed that all UCB-based protocols achieved significantly higher mean recall of top compounds Rk than random selection, with all adjusted p-values below 0.05. However, there was no significant difference between the UCB protocols themselves (all adjusted p-values greater than 0.05). The practical impact of these improvements, measured by Cohen's d effect sizes, was large, ranging from 0.934 (UCB-balanced vs. random) to 1.308 (UCB-explore-heavy vs. random), confirming that UCB strategies hold a strong advantage over random selection.
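Cohen's d with a pooled standard deviation, as used for these protocol comparisons, can be computed as:

```python
import math

def cohens_d(x, y):
    """Cohen's d between two samples using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)   # sample variances
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    pooled = math.sqrt(((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2))
    return (mx - my) / pooled
```

By the usual conventions, |d| ≈ 0.8 or larger is considered a large effect, so the 0.93–1.31 range reported here indicates a strong practical advantage.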
No single combination of kernel function, acquisition protocol, and molecular representation was optimal in every circumstance. The best configuration for each dataset highlights the range of achievable Rk values, from 0.5052 for TYK2 to 0.9942 for MPRO, indicating that different datasets require different optimal setups, as shown in Fig. 3.
This is consistent with the ANOVA analysis, which attributes large shares of variance to the Dataset:Kernel interaction (65.92%) and the Kernel:Fingerprint interaction (18.97%).
ChemBERTa embeddings exhibited a high-variance performance profile characterized by exceptional peaks and notable failures. When optimally paired with non-linear kernels (Matérn and RBF) on USP7 and MPRO, ChemBERTa achieved the highest individual Rk of 0.99 on MPRO. However, this representation proved susceptible to significant performance loss under suboptimal conditions: on the more challenging D2R and TYK2 datasets, identical kernel combinations yielded dramatically lower mean Rk values, some as low as 0.02 ± 0.01 with a mean BEDROC of 0.003 ± 0.01 for the Matérn kernel on TYK2, highlighting ChemBERTa's context-dependency and unpredictable efficacy.
MACCS fingerprints demonstrated the most consistent performance profile despite achieving the lowest overall mean Rk of 0.27 ± 0.18. This representation exhibited remarkably stable performance across datasets, with substantially lower inter-dataset variance than ECFP or ChemBERTa. While MACCS rarely reached peak performance, its consistency makes it a reliable baseline when predictable results are prioritized over maximum performance. Notably, MACCS achieved competitive performance on D2R (Rk = 0.61) when paired with the Tanimoto kernel, demonstrating its potential for specific dataset–kernel synergies.
The linear and Tanimoto kernels delivered consistent, moderate performance across all tested conditions. The linear kernel achieved a mean Rk of 0.35 ± 0.14 on D2R and 0.29 ± 0.13 on TYK2, with a mean enrichment factor at 2% (EF2) of 17.1 ± 8.2; this EF2, indicating that the top 2% of compounds were identified at over 17 times the rate of random selection, stands in stark contrast to the near-random performance of the non-linear kernels on the same dataset (EF2 ≈ 1.1). The Tanimoto kernel yielded mean Rk values of 0.30 ± 0.12 and 0.26 ± 0.12 on the same datasets, respectively. Both kernels maintained stable performance regardless of dataset difficulty or molecular representation. The Rational Quadratic (RQ) kernel consistently underperformed across all conditions, achieving a Rk as low as 0.12 ± 0.07 (with an EF2 of only 7.6 ± 4.0) on TYK2 and reaching only 0.26 ± 0.13 on MPRO. This demonstrates a trade-off: non-linear kernels can offer high rewards but with high variability, while linear kernels offer reliable, moderate performance suitable for risk-averse applications, as evident in Fig. 4.
Exploit-heavy strategies such as UCB-exploit-heavy, often designed for rapid prioritization, demonstrated effectiveness on USP7 and MPRO datasets, leading to rapid initial gains. Temporal SHAP analysis, which demonstrated top features for USP7 exploit-heavy strategies consistently peaking early in Cycles 2 or 3, indicates rapid initial SAR identification. In contrast, exploit-heavy strategies exhibited a noticeable ‘late spike’ in feature importance on datasets such as TYK2, suggesting that important SAR features are not immediately apparent, but are rather revealed after focused, persistent sampling in specific, high-reward regions of the chemical space. This ‘late spike’ reflects the model's attempt to progressively prioritize subtle features within a highly constrained or challenging SAR landscape as shown in Fig. 5.
On the other hand, explore-heavy strategies such as UCB-explore-heavy typically showed slower initial progress but could achieve higher long-term Rk on complex datasets like D2R, with more consistent improvement patterns. This reflects a broader sampling approach and a more distributed learning of features across the chemical space, as evidenced by less pronounced temporal shifts in SHAP feature importance. This approach is advantageous where targets have more diffuse SARs or where novel active regions need to be discovered beyond narrow, pre-defined areas. Balanced and adaptive protocols (e.g., UCB-balanced and UCB-gradual) frequently achieved competitive performance and demonstrated robustness across varied complexities, providing reliable options when optimal configurations are not immediately apparent.
The importance of protocol choice varied significantly depending on the dataset selected. High-performing combinations such as Matérn + ChemBERTa achieved high Rk across most protocols with rapid convergence on datasets such as MPRO and USP7. On the other hand, protocol selection was more crucial for difficult datasets such as TYK2 and D2R which had significant Rk variation and demonstrated slow improvement beyond 300 compounds. This emphasizes how AL strategy effectiveness is highly dependent on dataset characteristics and the chosen kernel–fingerprint combinations, influencing the initial trajectory and overall performance outcome.
SHAP analysis consistently identified specific, chemically interpretable molecular fragments that were highly predictive of binding affinity, validating the model's ability to learn genuine SARs.42,43 Importantly, compounds containing these top-ranked features consistently exhibited high binding affinities (Fig. 5).
Our analysis demonstrates that the model learns stable and genuine SAR drivers. For the USP7 target, the set of the top 5, most important features was identical between the ucb-exploit-heavy and ucb-explore-heavy protocols, yielding a Jaccard index of 1.00. This perfect stability indicates that the model rapidly and consistently identified the core SAR. For the more challenging, low-diversity TYK2 dataset, the analysis still showed good stability with a Jaccard index of 0.43. While different protocols explored different nuances of the constrained chemical space, a core set of features (e.g., bits corresponding to cF and cNc fragments) were consistently ranked as the most important. This provides strong evidence that our model is learning genuine SARs rather than stochastic noise.
For USP7, prominent features were consistently associated with carbonyl groups (Feature ID 2362, C=O) and nitrogen-rich heterocycles (Feature ID 3500, cnc) for both protocols. These features, with mean USP7 affinities of 9.33–9.66 pIC50, are chemically relevant for deubiquitinase active sites, often involved in hydrogen bonding and electrostatic interactions.45,46 The identification of a complex fragment (ID 875, nc1cncn(CC2(O)CCNCC2)c1=O) suggests the model's capability to prioritize intricate patterns. USP7's top fragments were 100% aromatic, 24% nitrogen-containing, and 0% halogen-containing, aligning with DUB modulator characteristics.
Core carbonyl motifs (C=O) maintained high ranks and consistent affinities across protocols. This robustness suggests that the identification of core binding motifs is stable, even if the sampling strategy influences the diversity of compounds explored around them.44 This consistency provides further confidence in the model's generalizability and its robust mechanistic understanding of binding, even when the underlying sampling strategies aim for different balances of exploration and exploitation within the chemical space.
Our analysis revealed important trade-offs between methodological choices. Simpler, explicit representations like ECFP fingerprints, paired with robust linear kernels, offer consistent and reliable performance across a wide range of dataset complexities. On the other hand, advanced, pre-trained embeddings like ChemBERTa, when combined with flexible non-linear kernels such as Matérn and RBF, can achieve state-of-the-art peak performance; however, they are prone to catastrophic failures on difficult or mismatched chemical landscapes. Similarly, we demonstrated that AL protocol selection is context-dependent: exploit-heavy methods are better suited for rapid lead optimization within well-defined SARs, whereas explore-heavy strategies are beneficial for novel chemotype discovery in more diverse chemical spaces. Mechanistic insights from our SHAP analysis offer a framework for understanding why these choices matter, linking them to the model's dynamic learning of SARs throughout the AL cycles.
These results confirm that there is no “one-size-fits-all” AL strategy. We instead propose a context-aware framework for AL in drug discovery. Practitioners should first analyse their dataset's chemical space, i.e., its scaffold diversity and internal similarity, to set reasonable expectations and select AL components accordingly. Challenging or unknown spaces may benefit from stable combinations such as ECFP with a linear kernel, while well-behaved SARs might justify riskier, high-reward methods like ChemBERTa with non-linear kernels.
While this study provides a robust framework, it has limitations, including its retrospective nature and the focus of the SHAP analysis on ECFP models. Future work can focus on the prospective validation of these findings in real-world drug discovery campaigns. The most promising future direction, however, lies in the development of adaptive active learning frameworks. Such systems could learn the characteristics of the chemical space in real time and automatically select or adjust the molecular representation, kernel, and acquisition strategy during the campaign, moving beyond static protocol choices. By balancing performance improvements in ligand binding affinity prediction with explainability built into the model from the start, we can fully utilize active learning to speed up the development of novel medicines. Further improvements could also be achieved by exploring more advanced surrogate models, such as warped Gaussian processes, which could allow the model to explicitly learn the non-Gaussian distribution of affinity data.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5dd00436e.
This journal is © The Royal Society of Chemistry 2026