Explainable active learning framework for ligand binding affinity prediction
Abstract
Active learning (AL) prioritises which compounds to measure next for protein–ligand affinity when assay or simulation budgets are limited. We present an explainable AL framework built on Gaussian process regression and assess how molecular representations, covariance kernels, and acquisition policies affect enrichment across four drug-relevant targets. Using recall of the top active compound, we find that dataset identity, that is, a target's chemical landscape, sets the performance ceiling, while method choices modulate outcomes rather than overturn them. Fingerprints with simple Gaussian process kernels provide robust, low-variance enrichment, whereas learned embeddings with non-linear kernels can reach higher peaks but with greater variability. Uncertainty-guided acquisition consistently outperforms random selection, yet no single policy is universally optimal; the best choice depends on the complexity of the structure–activity relationship (SAR). To enhance interpretability beyond black-box selection, we integrate SHapley Additive exPlanations (SHAP) to link high-impact fingerprint bits to chemically meaningful fragments across AL cycles, illustrating how the model's attention progressively concentrates on SAR-relevant motifs. We additionally provide an interactive active learning analysis platform featuring SHAP traces to support reproducibility and target-specific decision-making.
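
As an illustration of the kind of loop summarised above, the following is a minimal sketch and not the authors' implementation. It assumes binary fingerprints, a Tanimoto kernel, an upper-confidence-bound (UCB) acquisition rule, and affinities where larger values mean stronger binding; the function names (`tanimoto_kernel`, `gp_posterior`, `run_active_learning`) and all parameter defaults are hypothetical choices made for this example.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Tanimoto similarity between rows of two binary fingerprint matrices."""
    inter = A @ B.T
    na = A.sum(axis=1)[:, None]
    nb = B.sum(axis=1)[None, :]
    denom = na + nb - inter
    # Two all-zero fingerprints are treated as identical (similarity 1).
    return np.where(denom > 0, inter / np.maximum(denom, 1e-12), 1.0)

def gp_posterior(X_train, y_train, X_pool, noise=1e-2):
    """Posterior mean and standard deviation of a GP with a constant mean."""
    y_mean = y_train.mean()
    K = tanimoto_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = tanimoto_kernel(X_pool, X_train)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    mu = K_s @ alpha + y_mean
    v = np.linalg.solve(L, K_s.T)
    var = 1.0 - np.sum(v ** 2, axis=0)  # prior variance k(x, x) = 1
    return mu, np.sqrt(np.clip(var, 0.0, None))

def run_active_learning(X, y, n_init=10, batch=10, n_cycles=5,
                        beta=1.0, top_frac=0.05, seed=0):
    """Run UCB-guided AL cycles and track recall of the top-ranked compounds."""
    rng = np.random.default_rng(seed)
    n = len(y)
    labelled = rng.choice(n, size=n_init, replace=False).tolist()
    n_top = max(1, int(top_frac * n))
    top_set = set(np.argsort(y)[-n_top:].tolist())
    recall = []
    for _ in range(n_cycles):
        pool = np.setdiff1d(np.arange(n), labelled)
        mu, sigma = gp_posterior(X[labelled], y[labelled], X[pool])
        ucb = mu + beta * sigma  # trade off predicted affinity and uncertainty
        picked = pool[np.argsort(ucb)[-batch:]]
        labelled.extend(picked.tolist())  # "measure" the selected compounds
        recall.append(len(top_set & set(labelled)) / n_top)
    return labelled, recall
```

Replacing the `ucb` line with a random draw from `pool` would give the random-selection baseline, and swapping `tanimoto_kernel` for another covariance function would correspond to the kernel comparison described in the abstract.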
