Open Access Article
Miguel García-Ortegón†*abc, Srijit Seal†*cd, Emily Geddes d, Jenny L. Littler e, Collette S. Guy e, Jonathan Whiteside e, Carl Rasmussen b, Andreas Bender fcg and Sergio Bacallado *a
aStatistical Laboratory, University of Cambridge, Wilberforce Rd, Cambridge, CB3 0WA, UK. E-mail: sb2116@cam.ac.uk; miguel.garcia.ortegon@gmail.com
bDepartment of Engineering, University of Cambridge, Trumpington St, Cambridge, CB2 1PZ, UK
cDepartment of Chemistry, University of Cambridge, Lensfield Rd, Cambridge, CB2 1EW, UK. E-mail: srijit@understanding.bio
dBroad Institute of MIT and Harvard, 415 Main Street, Cambridge, MA 02142, USA
eWarwick Antimicrobial Screening Facility, University of Warwick, Coventry CV4 7AL, UK
fCollege of Medicine and Health Sciences, Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates
gSTAR-UBB Institute, Babeş-Bolyai University, Cluj-Napoca, Romania
First published on 16th September 2025
The rise of antimicrobial resistance, especially among Gram-negative ESKAPE pathogens, presents an urgent global health threat. However, the discovery of new antibiotics is hampered by sparse publicly available antibacterial data, complex bacterial defenses, and weak economic incentives. Here, we introduce a transfer learning framework using deep graph neural networks (DGNNs) to identify antibacterials from ultra-large chemical libraries. DGNNs were first pre-trained on large molecular datasets of protein–ligand simulations, binding affinities, and physicochemical properties to learn generalizable chemical features, and then fine-tuned on limited antibacterial screening data. Compared to classical methods, transfer learning significantly improved enrichment factors and predictive performance in cross-dataset benchmarks. Applying this strategy to the ChemDiv and Enamine libraries, we virtually screened over a billion compounds and prioritized 156 candidates. Experimental testing against Escherichia coli revealed that 54% of compounds exhibited antibacterial activity (MIC ≤ 64 μg mL−1), with several demonstrating sub-micromolar potency and broad-spectrum efficacy against Gram-positive and Gram-negative pathogens, including three ESKAPE species. Of 18 broad-spectrum candidates, 15 showed minimal cytotoxicity and no hemolytic activity. These results validate our approach for navigating underexplored chemical space and identifying potent, low-toxicity compounds with antibiotic activity. We release open-source models and a scalable workflow to accelerate antibacterial discovery in the face of data scarcity.
Machine learning models are increasingly being utilized to predict the bioactivity and toxicity of compounds.8–12 Deep neural networks (DNNs) have recently arisen as a powerful tool for virtual screening to discover structurally distinct compounds with antibacterial activity. Notable examples are the work by Stokes et al., who used ensembles of deep graph neural networks (DGNNs) to discover Halicin, an antibiotic with a new mechanism of action,13 and Wong et al., who used similar models to find a novel structural class of antibiotics.14 DNNs are appealing because, as deep overparameterized models, they can learn rich, continuous representations of various data types, including discrete ones. This ability makes them highly flexible and enables high predictive performance in several application domains, including image recognition,15 image generation,16 and natural language processing,17 among others. However, unlike these fields, where abundant public data is available (e.g., images, videos, and text), the availability of labeled positive examples in antibiotic datasets (i.e., compounds with antibiotic properties) remains exceedingly limited. This scarcity significantly constrains supervised training, particularly for high-capacity models. For example, the crowdsourced dataset COADD18 for E. coli ATCC 25922 contains just 159 active compounds (considering an 80% growth inhibition threshold, Table 1), and many of these are structural analogs, with approximately 15 unique antibiotic classes represented. This limited data set hinders the practical application of DNNs, which generally require large amounts of data to achieve high predictive accuracy. Further, these models suffer from overfitting in the low-data regime.
| Usage | Dataset | Description | Number of targets | Number of compounds | Number of scaffolds | Number of actives in antibacterial datasets | Completeness |
|---|---|---|---|---|---|---|---|
| Pre-training | RDKit | In silico physicochemical properties | 208 | 877k | 255k | n/a | 100% |
| Pre-training | ExCAPE | Binary labels against human targets | 1332 | 877k | 255k | n/a | 5.90% |
| Pre-training | DOCKSTRING | Docking scores against human targets | 58 | 260k | 100k | n/a | 100% |
| Training, fine-tuning | Stokes | Bacterial growth | 1 | 2220 | 1066 | 118 | 100% |
| Training, fine-tuning | COADD | Bacterial inhibition | 1 | 81 225 | 23 843 | 159 | 100% |
Transfer learning is a model training strategy that aims to increase the performance of machine learning models in the absence of sufficient training data and has proved effective in molecular property prediction tasks with graph neural networks.19 In this approach, models are not trained on the final task of interest from scratch, but rather, they are trained in two stages. First, they are pre-trained on separate tasks for which large amounts of data are available. Ideally, these tasks should be highly related to the final task of interest; however, this is not required.20 In the second step, the parameters learned during pre-training are adapted to the final task of interest, for which training data is scarce, in what is known as fine-tuning. Crucially, fine-tuning involves minor modifications of the pre-trained parameters to avoid overfitting to the limited dataset of the task of interest. This can be achieved by setting a low learning rate or a smaller number of epochs. In this way, transfer learning attempts to learn general representations during pre-training and subsequently adapts them to maximize performance on downstream, specific tasks.21
Recent advances in transfer learning have leveraged knowledge gained from large-scale, often general-purpose datasets to improve performance on specific downstream tasks. Li et al. demonstrated the potential of MolPMoFiT, a fine-tuned language model for molecular activity prediction, marking a pivotal step toward next-generation QSAR modeling through inductive transfer learning.22 Similarly, King-Smith et al. introduced a foundational model for chemistry, showing how large pre-trained architectures can be adapted across a wide spectrum of chemical tasks.23 In the domain of chemical reactivity, Keto et al. leveraged chemically aware pre-training to improve reaction prediction performance,24 while Noto et al. successfully transferred learned representations across different photocatalytic reaction classes.25 Further developments have combined chemical and biological data;8 Liu et al. trained InfoAlign to learn molecular representations by aligning molecules with cellular responses through an information bottleneck, improving property prediction and enabling zero-shot molecule-morphology matching.26 Overall, these studies show that domain-specific transfer learning strategies can improve models. In this work, we used transfer learning to enhance virtual screens for sub-micromolar inhibitors of ESKAPE pathogens and then experimentally validated the model predictions.
Here, we take a transfer-learning approach to train an ensemble of DGNNs for virtual screening in order to identify compounds with activity against the Gram-negative bacterial species Escherichia coli. Our training workflow (Fig. 1a) comprised two stages: during pre-training, models were optimized to learn general molecular labels that were not specific to bacteria, such as docking scores, binding affinities, and physicochemical properties. During fine-tuning, models were optimized on small public antibacterial datasets measured on E. coli. Our virtual screening protocol (Fig. 1b) prioritized the top predictions by the trained DGNN ensemble while maximizing the diversity of the final subset. First, we selected the highest-ranking compounds from two large commercial libraries, ChemDiv and Enamine, using clustering based on fingerprints and grouping based on antibiotic class or functional group to increase chemical diversity. Second, we validated our protocol by testing the antibacterial activity of the hit compounds against E. coli, finding high enrichment of actives. Finally, we also tested against a panel of Gram-positive and Gram-negative species, including multidrug-resistant strains of ESKAPE pathogens. Our protocol identified several compounds that were structurally novel, nontoxic and demonstrated broad-spectrum activity.
For pre-training, we curated a dataset of RDKit molecular descriptors (calculated using RDKit v2020.09.5), ExCAPE binding affinity annotations (v2 2019), and DOCKSTRING docking scores.28,29 These label types (docking scores, binding affinities, and physicochemical properties) were selected as pre-training targets because they represent general, transferable molecular features relevant across many bioactivity domains and are well represented in public datasets. Such properties have previously been shown to provide a chemically informed feature space for transfer learning prior to fine-tuning on scarce data.30,31 We used ExCAPE, which aggregates affinity assays from PubChem and ChEMBL,32 focusing on binary affinity labels against mammalian proteins from physical experiments; this dataset is highly sparse. All molecular structures were stored as SMILES strings. We used DOCKSTRING, which contains AutoDock Vina docking scores of a subset of 260k ExCAPE molecules against human protein targets.28 For all datasets above, stereochemical information was discarded for consistency since not all datasets included it. During hyperparameter optimization experiments, SMILES were standardized using the same pipeline as DOCKSTRING,28 which included canonicalization with RDKit,33 discarding SMILES with unnatural charges (formal charges on atoms other than N or O), explicit hydrogens, radicals or multiple molecular fragments, and finally, protonating SMILES at pH 7.4 with OpenBabel.34 We calculated RDKit chemical descriptors33 for all molecules in the ExCAPE dataset. Descriptors, as implemented in the rdkit.Chem.Descriptors module,33 included physicochemical and topological properties such as molecular weight, number of valence electrons, maximum and minimum partial charges, Bertz complexity index, log P, and number of rings (Fig. 1a). Finally, this resulted in a dataset, RED, a concatenation of RDKit, ExCAPE, and DOCKSTRING labels for the 260k overlapping molecules from DOCKSTRING.
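As an illustration of this step, the short sketch below (not the exact curation code) computes the full set of RDKit descriptors for a list of SMILES; the function name and example molecules are ours, and the number of descriptors returned depends on the RDKit version.

```python
# A minimal sketch of computing RDKit descriptor targets from SMILES.
# Assumes only that RDKit and pandas are installed; the length of
# Descriptors.descList depends on the RDKit version.
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def rdkit_descriptor_targets(smiles_list):
    records, kept = [], []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:               # skip unparsable SMILES
            continue
        records.append({name: fn(mol) for name, fn in Descriptors.descList})
        kept.append(smi)
    return pd.DataFrame(records, index=kept)

targets = rdkit_descriptor_targets(["CCO", "c1ccccc1C(=O)O"])
print(targets[["MolWt", "MolLogP", "RingCount"]])
```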
For fine-tuning and training, we used the antibacterial datasets by Stokes et al. (hereafter referred to as the ‘Stokes dataset’)13 and the Community for Open Antimicrobial Drug Discovery (COADD).7,18 Both datasets were generated in wild-type E. coli strains, with Stokes et al. using E. coli BW25113 and COADD using E. coli ATCC 25922. We focused on E. coli ATCC 25922 for this study. Antibacterial activity can be expressed as potency, using concentration units, or as growth inhibition, using values between 0 (no inhibition) and 1 (full inhibition). Growth inhibition is standard in high-throughput experiments exposing bacteria to a fixed compound concentration. The Stokes dataset reported final growth rather than inhibition (i.e., in the raw data, zero indicates the highest level of inhibition and one the lowest), while COADD expressed values as percentages. We processed the Stokes and COADD values to make their notation consistent (Fig. 2). Occasionally, values outside the usual 0–1 range were observed: negative values indicated that the compound promoted bacterial growth rather than halting it, whereas values slightly above one could result from experimental variability or error. For classification, inhibition values were binarized using an activity threshold of 0.8 (active if inhibition >0.8), consistent with the one previously employed by Stokes et al. All datasets are released via https://github.com/mgarort/dockbiotic/tree/main/data/ for public use.
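As a schematic of this harmonization, the sketch below maps both datasets to a common inhibition scale and applies the 0.8 threshold; the column names are assumptions rather than the actual field names in the released files.

```python
# Sketch of the label harmonisation described above; column names are assumptions.
# Stokes et al. report residual growth (0 = strongest inhibition), whereas COADD
# reports percentage inhibition; both are mapped to an inhibition scale and
# binarised at the 0.8 activity threshold.
import pandas as pd

def harmonise_labels(stokes: pd.DataFrame, coadd: pd.DataFrame, threshold: float = 0.8):
    stokes = stokes.assign(inhibition=1.0 - stokes["mean_growth"])        # growth -> inhibition
    coadd = coadd.assign(inhibition=coadd["inhibition_percent"] / 100.0)  # percent -> fraction
    for df in (stokes, coadd):
        # values slightly outside [0, 1] (growth promotion, assay noise) are kept as-is
        df["active"] = (df["inhibition"] > threshold).astype(int)
    return stokes, coadd
```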
XGBoost (eXtreme Gradient Boosting) is a popular algorithm for training tree ensembles for regression or classification. The key idea of tree ensembling is to combine several weak learners (simple models slightly better than random guessing) into a single strong learner (a model with high predictive performance) that takes a majority vote. Boosting means that the ensemble is trained by adding one tree at a time, with each new tree giving more weight to the training points that the ensemble so far classifies incorrectly. Thus, difficult examples become more influential as training progresses. We employed the original XGBoost implementation,35 taking binary Morgan fingerprints of length 2048 or 4096 as inputs.33,36,37
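A minimal sketch of this baseline is shown below, assuming a Morgan radius of 2 and illustrative XGBoost hyperparameters (neither is specified above).

```python
# Sketch: XGBoost on binary Morgan fingerprints. The fingerprint radius and the
# XGBoost hyperparameters shown here are illustrative assumptions, not tuned values.
import numpy as np
import xgboost as xgb
from rdkit import Chem
from rdkit.Chem import AllChem

def morgan_fp(smiles, n_bits=2048, radius=2):
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.uint8)

# train_smiles and train_active are assumed to hold SMILES and binary labels
X = np.stack([morgan_fp(s) for s in train_smiles])
y = np.array(train_active)

model = xgb.XGBClassifier(n_estimators=500, max_depth=6, learning_rate=0.1,
                          eval_metric="logloss")
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]   # probabilities used as ranking scores
```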
AttentiveFP,38 a DGNN, generates a vector representation of a molecule in two steps. First, it produces an embedding of each atom in an iterative procedure of message passing with attention. Each atom embedding is updated at each iteration through a transformation that takes in that atom's current embedding and its neighbors' current embeddings (message passing). The influence of the neighbors in the transformation is not equal; rather, they are weighted by attention coefficients. Second, it produces an embedding of the entire molecule in another iterative procedure with attention. An initial embedding of the molecule is produced by summing all the atoms' embeddings. Then, for several iterations, the molecular embedding is updated through a transformation that takes in each atom's current embedding and the current molecular embedding. Attention coefficients weigh the influence of each atom in the transformation. Finally, once the molecule embedding has been generated, a prediction can be performed with a final linear layer. We used the DeepChem39 implementation of AttentiveFP. The input for AttentiveFP was DeepChem graph representations of type MolGraphConvFeaturizer using edges, i.e., providing the argument use_edges = True. Doing so incorporated bond features by concatenating one-hot vectors for bond type (single, double, triple, or aromatic), same-ring membership, conjugation status, and stereo configuration. These edge features modulate the attention-weighted aggregation of neighbor atom embeddings during message passing, allowing the model to distinguish between different bond types when computing molecular representations.
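The following sketch shows how such a model can be set up with DeepChem's featurizer and AttentiveFP implementation; the hyperparameters are illustrative rather than the tuned values reported later, and train_smiles/train_inhibition are assumed inputs.

```python
# Sketch of featurisation and model fitting with DeepChem's AttentiveFP.
import numpy as np
import deepchem as dc

featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)   # include bond features
X = featurizer.featurize(train_smiles)
dataset = dc.data.NumpyDataset(X=X, y=np.asarray(train_inhibition).reshape(-1, 1))

model = dc.models.AttentiveFPModel(n_tasks=1, mode="regression",
                                   batch_size=64, learning_rate=1e-4)
model.fit(dataset, nb_epoch=500)            # epoch count is illustrative
predicted_inhibition = model.predict(dataset)
```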
All models used a batch size of 64 and were trained with Adam optimization.40 When training with transfer learning, we used a learning rate of 10−3 for pre-training and a learning rate of 10−4 for fine-tuning. The fine-tuned model started with the last layer initialized randomly and all other layers initialized with the weights from the pre-trained model. All layers were trainable during fine-tuning, allowing the full network to adapt to the antibacterial task. The 10× learning rate reduction during fine-tuning encouraged staying in the vicinity of the pre-trained weights. When trained without transfer learning, training was consistent with the fine-tuning phase of transfer learning to facilitate comparisons. Therefore, models not pre-trained were also trained with a learning rate of 10−4 throughout.
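Schematically, and expressed in generic PyTorch terms rather than our exact training code, the recipe looks as follows; the constructor, output-layer attribute and data loader are placeholders.

```python
# Schematic of the transfer-learning recipe (DeepChem's AttentiveFP wraps a
# PyTorch model); a sketch only, not the exact code used.
import torch

PRETRAIN_LR, FINETUNE_LR = 1e-3, 1e-4          # 10x smaller learning rate for fine-tuning

model = build_attentivefp()                     # hypothetical constructor
model.load_state_dict(torch.load("pretrained_weights.pt"))

# randomly re-initialise the task-specific output layer only
in_features = model.output_layer.in_features    # attribute name is an assumption
model.output_layer = torch.nn.Linear(in_features, 1)

# all layers remain trainable during fine-tuning
optimizer = torch.optim.Adam(model.parameters(), lr=FINETUNE_LR)
for epoch in range(500):                        # fine-tuning epoch count varies
    for x_batch, y_batch in finetune_loader:    # batch size 64
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x_batch), y_batch)
        loss.backward()
        optimizer.step()
```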
Inhibition values in the processed COADD and Stokes datasets were real numbers ranging between 0 and 1. The prediction of these values could be framed as regression or binary classification after binarizing according to some threshold. However, both of these settings are simplifications and carry potential disadvantages. For example, consider a molecule with a measured inhibition of 0.7. Predictions of 0.4 and 1.0 would be assigned the same quantitative error by a regression loss even though 1.0 is qualitatively very different because it indicates perfect inhibition. Similarly, consider the binary classification setting and a binarization threshold of 0.8. Molecules with measured inhibition of 0.1 or 0.7 would both be binarized as inactive, even though these values indicate very different levels of growth. For these reasons, in addition to losses for regression (MSE) and binary classification (cross-entropy), we defined a custom loss function to predict inhibition values between 0 and 1, which we call the inhibition loss. The inhibition loss combined regression and classification properties by using a custom sigmoid-based squashing function centered at a threshold (e.g., 0.8) to approximate a binary classification loss. It penalized false positives and false negatives differently using adjustable scaling factors c+ and c−, ensuring flexibility in prioritizing errors (for further details, see SI S1).
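For illustration only, a loss of this general shape could be written as below; the exact formulation, including the definitions of c+ and c−, is given in SI S1.

```python
# Illustrative sketch only (the exact formulation is in SI S1): a soft,
# sigmoid-squashed error centred at the activity threshold, with separate weights
# c_plus / c_minus for false-positive-like and false-negative-like errors.
import torch

def inhibition_loss(pred, target, threshold=0.8, temperature=0.05,
                    c_plus=1.0, c_minus=1.0):
    squash = lambda x: torch.sigmoid((x - threshold) / temperature)  # soft step at 0.8
    p, t = squash(pred), squash(target)
    fp_like = torch.relu(p - t)   # prediction "more active" than the measurement
    fn_like = torch.relu(t - p)   # prediction "less active" than the measurement
    return (c_plus * fp_like ** 2 + c_minus * fn_like ** 2).mean()
```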
In virtual screening, molecules from an extensive library are ranked computationally, and the top subset is selected and tested. In our benchmarking experiments, we trained all models on the Stokes dataset and tested them on a reduced version of COADD, selecting the top-ranked compounds from COADD to emulate a virtual screening workflow. COADD was reserved for testing because it was more extensive and diverse (with respect to the number of scaffolds, see Table 1) than Stokes, having been aggregated from compounds suggested by numerous independent research groups over several years. Therefore, it is expected to better represent the screening of an extensive chemical library. We removed all compounds from COADD that were analogs or structurally similar to those in Stokes, as determined by Tanimoto similarity computed on Morgan fingerprints and RDKit path fingerprints. Compounds with a Tanimoto similarity higher than 0.9 on either of the two fingerprint types were removed.
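A sketch of this analogue filter is shown below, using the 0.9 cut-off on both fingerprint types; stokes_smiles and coadd_smiles are assumed inputs.

```python
# Sketch of the analogue filter: a COADD compound is removed if its highest Tanimoto
# similarity to any Stokes compound exceeds 0.9 on either Morgan fingerprints or
# RDKit path fingerprints.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprints(smiles_list):
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    morgan = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    path = [Chem.RDKFingerprint(m, maxPath=6, fpSize=2048) for m in mols]
    return morgan, path

stokes_morgan, stokes_path = fingerprints(stokes_smiles)
coadd_morgan, coadd_path = fingerprints(coadd_smiles)

kept = []
for smi, fm, fp in zip(coadd_smiles, coadd_morgan, coadd_path):
    sim_morgan = max(DataStructs.BulkTanimotoSimilarity(fm, stokes_morgan))
    sim_path = max(DataStructs.BulkTanimotoSimilarity(fp, stokes_path))
    if max(sim_morgan, sim_path) <= 0.9:
        kept.append(smi)
```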
Model performance for virtual screening was quantified with the enrichment factor (EF), which is defined as the ratio of actives in the top selected subset over the ratio of actives in the initial library:

$$\mathrm{EF} = \frac{n_{\mathrm{active}}^{\mathrm{selected}} / N_{\mathrm{selected}}}{n_{\mathrm{active}}^{\mathrm{library}} / N_{\mathrm{library}}}$$
We computed the EF on the top 200 compounds from COADD. Specifically, we computed the average EF over three random repetitions with different initializations for each hyperparameter combination.
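In code, the EF on the top-k compounds can be computed as follows.

```python
# Enrichment factor following the definition above: the hit rate among the k
# top-ranked compounds divided by the hit rate in the whole screened library.
import numpy as np

def enrichment_factor(scores, labels, k=200):
    scores, labels = np.asarray(scores), np.asarray(labels, dtype=float)
    top_labels = labels[np.argsort(-scores)[:k]]
    return top_labels.mean() / labels.mean()
```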
We trained ensembles of 6 models using the two best transfer-learning hyperparameter combinations (3 models with different random seeds for each combination). Two such ensembles were trained: one with fully standardized SMILES for ChemDiv and one with minimally standardized SMILES for Enamine, since Enamine SMILES were too numerous (5.52b) to undergo complete standardization. Here, fully standardizing SMILES involved removing isomeric information, protonating at pH 7.4, and validating the structure to obtain a consistent, chemically valid representation, while minimal standardization only removed isomeric information and canonicalized the structure without additional processing. The latter was used for larger datasets (like Enamine) to save computational resources.
| Library | Version | # Compounds | Number of Bemis–Murcko scaffolds |
|---|---|---|---|
| ChemDiv | 1566731 | 1.57m | 277k |
| Enamine | 2022q1-2 | 5.52b | 2.94b (estimated) |
For ChemDiv (1.57m compounds), SMILES were fully standardized using the same protocol as DOCKSTRING. The large size of Enamine (5.52b compounds) precluded charge checking and protonation, so, for this ensemble, SMILES in pre-training, training, and screening underwent a minimal standardization protocol of canonicalization with RDKit.
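A minimal sketch of this reduced standardization is shown below.

```python
# Sketch of the "minimal" standardization applied to the Enamine library:
# canonicalization with RDKit while dropping stereochemical (isomeric) information.
from rdkit import Chem

def minimally_standardize(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=False)   # canonical, stereo removed

print(minimally_standardize("C[C@H](N)C(=O)O"))          # e.g. 'CC(N)C(=O)O'
```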
XGBoost employed a fingerprint molecular representation computed with RDKit, and AttentiveFP used a graph representation by DeepChem.39 Molecular fingerprints were also used to calculate Tanimoto similarity to assess closeness to the training set. We computed binary Morgan fingerprints (2048 bits) and RDKit path fingerprints (maximum path length six, 2048 bits), calculated the Tanimoto similarity on each fingerprint type, and used the higher of the two as the similarity score.
We reported the number of Bemis-Murcko (BM) scaffolds in each dataset using RDKit (Table 2). The Enamine dataset was too large to decompose every molecule, so we estimated the number of BM scaffolds on a random sample of 2 million compounds.
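A sketch of the scaffold counting is shown below.

```python
# Counting unique Bemis-Murcko scaffolds with RDKit. For Enamine this calculation
# was run on a random sample of 2 million compounds rather than the full library.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def count_murcko_scaffolds(smiles_iter):
    scaffolds = set()
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        scaffolds.add(MurckoScaffold.MurckoScaffoldSmiles(mol=mol))
    return len(scaffolds)
```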
E. coli ATCC 25922 was obtained from the National Collection of Type Cultures (NCTC). A subset of compounds was assessed in a panel of strains, including a uropathogenic strain of E. coli (ECU) 13400, P. mirabilis (PM) 432002, K. pneumoniae (KP) 13442, A. baumannii (AB) 19606 and S. aureus (SA) 29213. The strains EC 13400, AB 19606, KP 13442, and SA 29213 were obtained from the NCTC, and the strain PM 432002 was obtained from the American Type Culture Collection (ATCC).
In brief, a 2-fold serial dilution of each molecule in DMSO was prepared from 256 μg mL−1 down to 0.000122 μg mL−1 across two 96-well plates. A bacterial culture was prepared following the McFarland 0.5 standard and added to each well. Plates were incubated for 18 hours at 37 °C without shaking. The MIC was the concentration of the last well with complete inhibition (i.e., a completely clear well). For the MBC, 10 μl of each well was pipetted onto agar plates and further incubated for 24 hours at 37 °C without shaking. The formation of colonies was observed, and the concentration of the last well from which no colonies were formed was taken as the MBC.
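The dilution arithmetic can be checked with a short snippet: 21 two-fold dilutions starting from 256 μg mL−1 give the stated lowest concentration.

```python
# Two-fold serial dilution series used for MIC/MBC determination: starting at
# 256 ug/mL, 21 successive halvings reach about 0.000122 ug/mL, i.e. 22
# concentrations spread across two 96-well plates.
concentrations = [256 / 2 ** i for i in range(22)]
print(len(concentrations), round(concentrations[-1], 6))   # 22 0.000122
```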
First, we compared two model classes: a tree-based model on hand-engineered molecular features (XGBoost on Morgan fingerprints) and a DGNN (AttentiveFP) trained with and without transfer learning. Performing this comparison was important because classical models on fingerprints can be highly performant relative to deep neural networks. Fig. 3 shows the results of our benchmarking experiment. The poorest level of enrichment was achieved by AttentiveFP without pre-training, confirming our hypothesis that the antibacterial training datasets were too small to train DGNNs from scratch; XGBoost on molecular fingerprints achieved fair enrichment, with 2048-bit fingerprints being the most successful. AttentiveFP obtained the highest level of enrichment with pre-training; pre-training was better than training from scratch for all pre-training datasets, and a pre-trained AttentiveFP model was better than XGBoost on fingerprints for almost all pre-training datasets. These results supported our proposed transfer-learning approach for virtual screening.
Second, we benchmarked different training loss functions. The inhibition values in Stokes and COADD were real numbers ranging primarily between 0 and 1. Prediction of antibacterial activity could, therefore, be framed as regression of raw values, with a mean squared error (MSE) loss, or as classification of binarized values, with a cross-entropy loss. In addition, we tried a custom inhibition loss (IL) that we designed specifically for inhibition, since both regression and classification presented disadvantages for predicting inhibition values. This loss was derived from a hard binary classification loss, modified by swapping its hard step functions for soft sigmoid-like functions and augmented with hyperparameters that controlled the relative weight of false positives and false negatives. Again, we found that pre-training with any of the three losses was superior to training from scratch (SI Table S2). However, we did not observe a significant difference between the three losses, whose EFs overlapped considerably within one standard deviation. Therefore, for simplicity, we decided to frame the prediction of inhibition values as regression with MSE loss.
Finally, we optimized the hyperparameters related to the transfer-learning protocol: the choice of a pre-training dataset, the number of pre-training epochs, and the number of fine-tuning epochs (the EFs obtained are shown in SI Fig. S1). To increase the robustness of our screening protocol, we trained an ensemble of 6 final models for virtual screening. To improve the diversity of models in the ensemble, we selected the two highest-performing hyperparameter combinations and trained three AttentiveFP models with each combination, starting with different random initializations. Since overparameterized neural models are data-intensive, random initialization performs better than bootstrapping the training data when creating ensembles of deep neural models.50 The best EF was achieved by pre-training on ExCAPE for 20 epochs and fine-tuning for 500 epochs and by pre-training on RDKit for 10 epochs and fine-tuning for 1000 epochs.
Overall, physicochemical properties demonstrated the most reliable transfer learning effectiveness, with RDKit reaching an enrichment factor of 75.2 (10 pre-training epochs, 1000 fine-tuning epochs), compared to binding affinity predictions (ExCAPE: maximum 79.6) and docking scores (DOCKSTRING: maximum 55.6) in the test set (SI Fig. S1). While ExCAPE shows the single highest peak performance, RDKit demonstrates more consistent high performance across multiple training configurations (64.3–75.2 range), indicating greater robustness and reliability for transfer learning applications. This could be due to the biological relevance of fundamental molecular descriptors (lipophilicity, hydrogen bonding capacity, molecular size) that influence activity across diverse targets, unlike docking scores, which encode highly target-specific geometric complementarity and show the most limited performance ceiling. These results also suggest a hierarchy of molecular representations in which physicochemical properties provide the most transferable foundation, binding affinity offers intermediate specificity, and docking scores represent highly specialized but perhaps less transferable features.
The model-based pre-selection yielded a few hundred compounds in ChemDiv and over 10 000 in Enamine (Table 3), with the caveat that inherent redundancy in the Enamine combinatorial library could be responsible for multiple hits.
To reduce the number of selected compounds while maintaining chemical diversity, we clustered the Enamine pre-selection using molecular fingerprints and selected top-ranking compounds from each cluster to avoid redundancy, particularly among overlapping quinolone amides (as described in the Selection protocol in Methods). For both ChemDiv and Enamine, we further grouped compounds into known antibiotic classes or those featuring recurrent functional groups (e.g., nitro moieties), selecting top candidates from each group. The final selection comprised 54 ChemDiv and 140 Enamine compounds, of which 53 and 103, respectively, were available for experimental testing (Table 3), totaling 156 compounds. This included one known polyketide antibiotic retained as a positive control.
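A sketch of the fingerprint-based clustering is given below; the specific clustering algorithm is not stated above, so Butina clustering is used here only as an illustrative choice, and preselected_smiles is an assumed input.

```python
# Sketch of fingerprint-based clustering of the Enamine pre-selection using
# Butina clustering on Tanimoto distances (illustrative choice of algorithm).
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

def butina_clusters(smiles_list, cutoff=0.4):
    fps = [AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(s), 2, nBits=2048)
           for s in smiles_list]
    # condensed lower-triangular distance list (1 - Tanimoto) expected by ClusterData
    dists = []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        dists.extend(1.0 - s for s in sims)
    return Butina.ClusterData(dists, len(fps), cutoff, isDistData=True)

clusters = butina_clusters(preselected_smiles)
representative_idx = [cluster[0] for cluster in clusters]   # centroid index of each cluster
```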
| Step | ChemDiv (rate per million compounds in parentheses) | Enamine (rate per million compounds in parentheses) |
|---|---|---|
| Initial size | 1.57m (10^6) | 5.52b (10^6) |
| Predicted inhibition ≥0.5 | 369 (236) | 10 791 (1.97) |
| Tanimoto similarity ≤0.8 | 151 (96.6) | 10 059 (1.85) |
| Clustering | — | 5699 (1.03) |
| Antibiotic diversity | 54 (34.5) | 140 (2.53 × 10−2) |
| Delivered by provider | 53 (33.9) | 103 (1.86 × 10−2) |
Fig. 6 Distribution of MIC and MBC of the (a) 53 molecules selected from ChemDiv and (b) the 103 molecules selected from Enamine; only compounds with MIC values of ≤64 μg mL−1 are shown.
| Significance | Threshold | ChemDiv | Enamine |
|---|---|---|---|
| MIC hit | MIC ≤ 64 μg mL−1 | 32 | 52 |
| MIC similar to ampicillin | MIC ≤ 8 μg mL−1 | 10 | 22 |
| MIC better than ampicillin | MIC < 4 μg mL−1 | 4 | 6 |
| MBC hit | MBC ≤ 64 μg mL−1 | 22 | 38 |
| MBC similar to ampicillin | MBC ≤ 16 μg mL−1 | 11 | 23 |
| MBC better than ampicillin | MBC < 8 μg mL−1 | 4 | 7 |
We next evaluated whether a subset of compounds with activity against E. coli had activity against other bacterial species. Though broad-spectrum antibiotics are highly valuable, it is much more straightforward to train models to predict compound activity against a single bacterial strain than against multiple species. However, given the shared homology of some essential targets and the large proportion of broad-spectrum antibiotics in the training data, it is reasonable to expect that a subset of compounds identified using this model may have broad-spectrum activity.
Given the high prevalence of quinolone derivatives and nitrofurans in the validated compound set, structural diversity was considered in addition to MIC in the selection of molecules to test for broad-spectrum activity. Compounds were grouped by scaffold or functional group, such as nitrofurans or halogen-containing molecules, and examples from each structural category are shown in Fig. 5. A subset of compounds was selected from each category with an MIC requirement of ≤16 μg mL−1 against E. coli (optimising potency, availability and cost among other factors). 18 compounds were evaluated for broad-spectrum activity against the following panel of strains, including three ESKAPE pathogens: a uropathogenic strain of E. coli (ECU) 13400, P. mirabilis (PM) 432002, K. pneumoniae (KP) 13442, A. baumannii (AB) 19606 and S. aureus (SA) 29213 (for detailed results see SI Table S3).
Another limitation of this study is the exclusion of stereochemical information from molecular representations to ensure dataset consistency, as not all publicly available datasets contain complete stereochemical annotations, and where annotations exist, they are often of poor quality or the absolute configuration is unknown.10 While this standardization step is important for fair comparisons, it may reduce model performance, since stereochemistry plays a key role in determining biological activity (enantiomers, for instance, can have vastly different pharmacological effects, toxicities, and target affinities). That said, prior research has shown that 2D molecular representations can often suffice for bioactivity prediction and may even outperform 3D descriptors, as they reduce noise from irrelevant conformational variability and benefit from analogue bias in benchmarks that favor 2D structural similarity over complex 3D features.54,55
Our results also highlight the limitations of virtual screening based on existing chemical libraries. Our DGNN ensemble primarily identified compounds structurally related to known antibiotics, such as quinolones, cephalosporins, and penicillins, consistent with prior annotations in the Stokes and COADD training sets. As shown in Fig. 5, clusters in the AttentiveFP representation space are aligned with known antibacterial classes. While this validation confirms the model's learned representation is meaningful, it also implies the screen is biased toward chemical space already covered by known active compounds.
We believe the primary driver was the limited diversity in the fine-tuning data, which overwhelmingly consist of known antibiotic scaffolds. The custom loss function was designed to optimize classification performance, but it did not explicitly penalize structural redundancy or encourage exploration of underrepresented chemical space. Thus, while the loss function may have contributed indirectly to the lack of diversity, the dominant factor appears to be the training data's chemical composition. Many top Enamine hits were quinolone amides, and although they were active, their structural redundancy necessitated clustering and the selection of representatives. In contrast, ChemDiv yielded more chemically diverse hits, possibly due to its inclusion of natural-product-like scaffolds, despite being ∼3500 times smaller than Enamine. The inherent redundancy in combinatorial libraries, such as Enamine, may limit their utility unless this diversity is accounted for.
Some hits did deviate substantially from the training space (Fig. 6), and these are arguably the most valuable. Finding novel analogs in a fast, automated way with virtual screening could in itself be useful; for example, norfloxacin and ciprofloxacin are both approved for clinical use as antibiotics even though they differ minimally in a single substituent (the former has an ethyl group where the latter has a cyclopropyl). The most valuable hits, however, are those that differ significantly from known antibacterials. Structurally novel antibacterials are more likely to evade known resistance mechanisms, thereby opening up opportunities for new structure–activity relationship (SAR) exploration. They may represent samples from relatively unexplored regions of chemical space. However, our current models only predicted high inhibition for compounds somewhat related to known antibacterials (Fig. 5), which is a limitation shared by many ligand-based virtual screening approaches. To identify relatively novel hits, we needed to consider all candidates with predicted inhibition values higher than 0.5. Other library candidates with high activity but distant from the training set were likely missed in our virtual screening.
The limited structural diversity of known antimicrobials constrains machine learning approaches. Yet, as described by Tommasi et al.,56 identifying compounds with legitimate activity against wild-type Gram-negative bacteria is exceedingly difficult. For example, AstraZeneca screened millions of compounds but was unable to identify any tractable hits against Gram-negative bacteria.56
In this work, we did not filter molecules by predicted toxicity or accumulation, and yet our model seems to identify active molecules that are not generally cytotoxic and have some broad-spectrum activity. Although our ensemble approach demonstrated success in the low-data regime for antibacterial prediction, extending this methodology to build reliable predictive models for toxicity profiles would face additional challenges. The heterogeneity in experimental conditions, cell lines, and assay protocols across different toxicity datasets introduces noise.10 Furthermore, the integration of multiple toxicity endpoints (cytotoxicity, hemolytic activity, organ-specific toxicity) would require careful consideration of endpoint relationships and potential conflicts between different safety profiles, as well as the development of new approach methodologies to predict them.57 In the future, a promising direction would be to utilize high-quality data on toxicity and Gram-negative accumulation to refine our virtual screening algorithm.
Overall, we demonstrate that transfer learning with deep graph neural networks significantly enhances virtual screening performance in the data-sparse regime of antibacterial discovery. By pre-training on large, general molecular datasets and fine-tuning on limited E. coli data, our AttentiveFP ensemble achieved high enrichment factors and identified structurally novel, sub-micromolar compounds active against Gram-positive and Gram-negative ESKAPE pathogens. Experimental validation confirmed a 54% hit rate, with broad-spectrum efficacy and minimal cytotoxicity. The open-source models and scalable workflow developed in this study demonstrate that deep learning models for antibacterial screening can be effectively trained using transfer learning, even when the amount of antibacterial data is limited and the pre-training features are unrelated to antibacterial activity.
Supplementary information (SI): S1 provides details of the inhibition loss function. Fig. S1: enrichment factors obtained in the test set for the various models during optimisation of hyperparameters related to the transfer-learning protocol. Table S1: benchmarking of model classes; all AttentiveFP models were pre-trained for ten epochs and fine-tuned (or trained) for ten epochs. Table S2: benchmarking of loss functions; values indicate the enrichment factor (EF), and all AttentiveFP models were pre-trained for 10 epochs and fine-tuned (or trained) for 10 epochs. “MSE” refers to regression models trained with an MSE loss, “Cross-entropy” refers to classification models, and IL denotes the custom inhibition loss. Table S3: minimum inhibitory concentration (MIC) and minimum bactericidal concentration (MBC) of 18 compounds evaluated for broad-spectrum activity against the panel of strains, including three ESKAPE pathogens: a uropathogenic strain of E. coli (ECU) 13400, P. mirabilis (PM) 432002, K. pneumoniae (KP) 13442, A. baumannii (AB) 19606 and S. aureus (SA) 29213. See DOI: https://doi.org/10.1039/d5sc03055b.
Footnote
† These authors contributed equally.