Ben
Cree‡
*,
Mateusz K.
Bieniek‡
,
Siddique
Amin
,
Akane
Kawamura
and
Daniel J.
Cole
*
School of Natural and Environmental Sciences, Newcastle University, Newcastle Upon Tyne, NE1 7RU, UK. E-mail: daniel.cole@ncl.ac.uk
First published on 8th January 2025
FEgrow is an open-source software package for building congeneric series of compounds in protein binding pockets. For a given ligand core and receptor structure, it employs hybrid machine learning/molecular mechanics potential energy functions to optimise the bioactive conformers of supplied linkers and functional groups. Here, we introduce significant new functionality to automate, parallelise and accelerate the building and scoring of compound suggestions, such that it can be used for automated de novo design. We interface the workflow with active learning to improve the efficiency of searching the combinatorial space of possible linkers and functional groups, make use of interactions formed by crystallographic fragments in scoring compound designs, and introduce the option to seed the chemical space with molecules available from on-demand chemical libraries. As a test case, we target the main protease (Mpro) of SARS-CoV-2, identifying several small molecules with high similarity to molecules discovered by the COVID moonshot effort, using only structural information from a fragment screen in a fully automated fashion. Finally, we order and test 19 compound designs, of which three show weak activity in a fluorescence-based Mpro assay, but work is needed to further optimise the prioritisation of compounds for purchase. The FEgrow package and full tutorials demonstrating the active learning workflow are available at https://github.com/cole-group/FEgrow.
As such, attention is turning to cheminformatics and machine learning based algorithms for structure-based de novo hit expansion, linking and merging.8 A wide range of approaches are available to build from initial structural biology data, including DeepFrag9 that identifies promising fragments for addition to an input bound ligand, using a deep convolutional neural network, and DEVELOP10 that combines 3D pharmacophoric constraints from the binding pocket with a graph-based deep generative model for R-group and linker design. The SILVR method enables an equivariant diffusion model to be conditioned to generate molecules based on a reference structure, such as a fragment from a crystallographic screen.11 The V-SYNTHES approach makes use of on-demand libraries for hit-finding by decomposing compounds from purchasable databases into reactive scaffolds and synthons, and using the highest scoring docked fragments as seeds for further growth.12 One particularly noteworthy example is the use of fragment merging to design hits against the nonstructural protein 3 (NSP3) of the severe acute respiratory syndrome-coronavirus-2 (SARS-CoV-2).13 Fragments from a crystallographic screen were merged using the Fragmenstein package,14 ensuring placement of molecular substructures onto the original fragments, and subsequently used as templates for searching on-demand chemical space. In this way, fragments were rapidly elaborated into a 0.4 μM hit (representing a >400-fold improvement in affinity).
While extremely promising, all of the above de novo design approaches suffer from some combination of the following issues: (i) reliance on an approximate classical molecular mechanics force field or knowledge-based algorithm for generating and optimising binding poses, (ii) use of an approximate objective function (usually a docking score) as a surrogate measure of binding affinity, (iii) approximation of a rigid target receptor structure, and (iv) limited synthetic tractability of the designed compounds. We therefore developed the FEgrow software as an open-source, interactive Jupyter notebook based workflow for building user-defined congeneric series of ligands in protein binding pockets to start to address some of these open questions (Fig. 1A).15 FEgrow grows user-defined functional groups (R-groups) off a constrained core of a known hit compound, thus incorporating input from structural biology and the expertise of the user in selecting synthetically tractable elaborations. Since publication, we have added functionality for connecting R-groups to the core via a flexible linker, which can be chosen from a library of those common to bioactive molecules.16 In this way, users can choose from 1 M+ combinations of linker and R-group from our distributed libraries (or upload their own R-group modifications). The modular workflow allows for the incorporation of state-of-the-art molecular modelling algorithms, such as the use of hybrid machine learning/molecular mechanics (ML/MM) potential energy functions to optimise the ligand binding pose,17,18 and the gnina convolutional neural network scoring function to predict the binding affinity.19 We plan to expand the range of available optimisation algorithms and scoring functions as they become available (see Methods section).
![]() | ||
Fig. 1 (A) Example building and scoring of a SARS-CoV-2 inhibitor20 using the interactive FEgrow workflow.15 The fixed core (grey) is extended using a user-defined, flexible linker (pink) and R-group (yellow), and scored using gnina.19 (B) Compound libraries with substructures that match the rigid core can now be automatically grown and scored, treating the rest of the molecule as fully flexible. (C) Proposed active learning cycle. Compounds are grown, built in the binding pocket and scored with FEgrow. The outputs are used to train a machine learning model, which is used to select the next batch of compounds. Optionally, the chemical space can be seeded using compounds available from on-demand chemical libraries. |
While interactive work is useful for small-scale studies, we have found it useful to automate the workflow for use on high performance computing (HPC) clusters, and since publication have added an application programming interface (API) to FEgrow (Fig. 1B). This enables us to build virtual libraries with a common core, for example, using reaction-based generative scaffold decoration with LibInvent21 or substructure searching of compound libraries,22 and then rapidly build the compounds into the protein binding pocket with FEgrow. However, unless the libraries are designed using information from the binding pocket, time is wasted building and scoring compounds that are unlikely to be beneficial and it is still not feasible to routinely scan all possibilities.
Hence, rather than exhaustive or random searches of chemical space, we investigate here the use of active learning to elaborate compound design with FEgrow. The general idea behind this approach is that a subset of compounds is evaluated using an expensive design objective function (in this case the molecular growing and scoring algorithms in FEgrow) and used to train a machine learning model (Fig. 1C).23 The machine learning model then predicts the objective function for the remainder of the chemical space, and the next subset of molecules is picked for evaluation (for example, in order to optimise the objective or further explore the chemical space). By cycling through this procedure, the algorithm can iteratively make up for any lack of diversity in the initial training subset, and it is has been found previously that the most promising compounds can be identified by evaluating only a fraction of the total chemical space.
Several studies have investigated the effects of choices such as machine learning algorithm, sample selection protocol and total dataset size on active learning efficiency for experimental and computational affinity predictions.24–28 In general, active learning has been shown to increase enrichment of hits compared to either random or one-shot training of a machine learning model, at low additional cost, and to be relatively insensitive to choices of molecular representation, model hyperparameters and initial training subsets. Active learning has shown practical utility in prioritising compounds based on objective functions from docking29–31 or free energy calculations.25,26,32,33
Here, we interface FEgrow with active learning to efficiently search the chemical space of linkers and R-groups from a user-defined vector. As well as using a docking score to guide optimisation, we also experiment with functions that combine other molecular properties, such as molecular weight, and 3D structural information, such as protein–ligand interaction profiles (PLIP).34 To address the issue of synthetic tractability of the compound designs, we combine the workflow with regular searches of the Enamine REAL database to ‘seed’ the chemical search space with promising purchasable compounds. After testing and optimising the hyperparameters of the active learning models, we apply the algorithm to the prospective design of inhibitors of the main protease (MPro) of SARS-CoV-2, the virus responsible for the COVID-19 pandemic. This target has undergone extensive study in recent years. The COVID moonshot consortium used open science crowd-sourced designs, in combination with high-throughput structural biology and assays, free energy calculations, and machine learning driven synthetic route predictions, to generate a series of potent inhibitors.4 Other notable approaches that include biological confirmation of hits have employed, for example, structure-based design starting from a drug repurposing study,20 virtual screening of a curated collection of commercially available compounds,35 a deep reinforcement learning model using pharmacophore and substructure matches with known inhibitors,36 and a deep generative framework using only target sequence information as input (along with prioritisation based on factors such as docking score and retrosynthetic feasibility).37 Here we employ active learning to prioritise compounds for purchase and testing from the Enamine REAL database based only on early fragment hits. We suggest several novel designs that show activity in a fluorescence-based Mpro assay, as well as automatically generating several compounds that show high similarity to known moonshot hits.
The lowest energy structures are then output for scoring. In the first iteration of FEgrow, we used the gnina convolutional neural network (CNN), which has been jointly trained on binding pose and affinity prediction.19,43,44 We showed that the gnina ‘CNNaffinity’ scores (predicted pK) correlated reasonably well with experiment for ten series of congeneric inhibitors built using FEgrow.15 Here, we add further options for scoring molecules based on protein–ligand interaction profile (PLIP),34 molecular properties, or a combination thereof. For construction of the PLIP score, interactions formed in the available protein-fragment complex crystal structures were one-hot encoded to form a reference vector of desired interactions (here, hydrophobic, hydrogen-bonding, π-stacking, and salt bridge were all identified). A similar vector was constructed for the designed de novo compound, and its Tanimoto similarity to the reference vector used as the objective for optimisation. It has been argued that combining information from various properties can also be advantageous,8 for example by using pharmacophore constraints in combination with docking scores, and we make use here of a simple, combined score (CS):
![]() | (1) |
The first approach is gradient boosting machine (GBM), which is a random forest based technique, utilising ensembles of decision trees. These trees are created from random subsets of features (fingerprints), that are then used to make predictions. GBMs expand on traditional random forests by using the gradient of the error to construct trees specifically designed to minimise this error. Gradually increasing the number of relatively poor individual trees additively increases their predictive power (hence ‘gradient boosted’). The second model is Gaussian process (GP) regression, which is a Bayesian approach that makes predictions by assuming observations can be modelled by the probability distribution over possible reasonable (Gaussian) functions.45 These Gaussian distributions are iteratively refined by the observation of new samples. Because model prediction is performed via a probability distribution, it natively incorporates uncertainty and other useful quantities, such as estimates of expected improvement of a given new sample.46
The acquisition function defines the method by which new molecules are picked at the start of each active learning cycle, with the simplest example being a ‘greedy’ approach, which directly selects the (currently predicted) highest scoring molecules. However, an acquisition function has to balance picking the best compounds, with the need to further refine the accuracy of the machine learning model. Picking the best scoring candidates in descending order might initially increase the objective function, but the algorithm will have the propensity to get stuck in local maxima and to be sensitive to the initial selection of training molecules.
There are a variety of alternatives that aim to avoid the problems of a simple greedy approach, and the approach used here is the upper confidence bound (UCB) uncertainty-based acquisition function.47 UCB considers not just the value of the objective function, but also the variance of the prediction (model uncertainty), effectively biasing towards the selection of molecules about which the model is the least certain of the predicted score. The UCB function is defined by:
UCB(x) = μ(x) + βσ(x), | (2) |
In detail, the Enamine REAL database of 4.5 B compounds was searched for similarity to designed molecules through the public interface to SmallWorld https://sw.docking.org, using a graph-edit-distance space search.22 At each cycle, 100 new, top-scoring compounds were searched, and up to 100 of the most similar compounds from the REAL database were extracted per search query (using a maximum distance of 5 steps). This 10 K compound set was filtered for substructure match with the core using RDKit,38 and those compounds that passed were added to the active learning search space. Active learning then selects compounds for scoring following Enamine enrichment, as usual, but there is no explicit bias to select compounds from the on-demand catalogue.
For testing of the active learning protocols, the chemical space was assembled by combining the pyridyl moiety with 508 R-groups51 and 100 of the most common linkers16 from the FEgrow library. A total of 47710 unique molecules were successfully grown into the binding pocket and scored using the gnina CNN scoring function.19 A further 1656 molecules were assigned a penalty score of pK = 0 as they could not be embedded due to steric clash with the protein. In cases where rare errors occurred, such as a failure to assign force field parameters, the molecules were discarded completely.
The previously tested FEgrow molecule building protocol was applied throughout.15 The ETKDG algorithm39 was used to generate 50 conformers, using a 0.5 Å root-mean-square similarity threshold. Any conformers with an atom closer than 1 Å to any atom in the protein was discarded. Energy minimisation was applied using a hybrid machine learning/molecular mechanics energy function in a mechanical embedding scheme.15 The ANI-2x potential17 was used for the ligand, in cases where all elements in the molecule are covered by the model, or the Open Force Field Sage41 potential otherwise. The lowest energy conformer was retained for scoring.
An active learning library based on scikit52 and modAL53 python packages was adopted from another study.26 A set of molecules to initialise the active learning cycle can be selected via RDKit's MaxMin picker38 from the chemical space, or picked at random. The processing was parallelised using the python library Dask,54 which supports a diverse set of technologies, including the Slurm Workload Manager that is deployed ubiquitously on high-performance computing clusters. Dask is used to secure resources (scheduling workers on Slurm), submitting work and retrieving results. The three major computationally-expensive components were parallelised: (1) building and scoring of the molecules, (2) computing the Morgan fingerprints, and (3) computing the Tanimoto similarity across the chemical space for the Gaussian process modelling.
Fig. 2b shows the distribution of predicted binding affinities, computed using the gnina convolutional neural network scoring function19 from FEgrow built structures. The scores are symmetrically distributed around pK = 4.5, with a maximum affinity of around 6.0, which is indicative of a set of low molecular weight (range between 100 and 350 Da, Fig. S1†), unoptimised compounds at the start of a hit finding effort. Indeed, it is at this stage where the options for expansion are vast, and strategies to suggest exploration of hits are particularly valuable. Note that compounds that could not be built (for example, due to steric clashes with the protein) are arbitrarily assigned a pK of zero, so that this information can be included in the active learning model.
Fig. 2c further shows the UMAP projection of the chemical space, coloured by predicted pK. The visualisation shows a diverse composition of linkers and functional groups, with well-spread clusters of the highest affinity binders, potentially providing a challenging search space for active learning. Fig. 2b also shows locations in the chemical space of example linker and R-groups, attached to the pyridyl core, that make up the stronger predicted binders. Favourable predicted linkers include amides, sulfonylurea and various 6-membered ring heterocycles, and relatively bulky R-groups are feasible, which is generally expected given the size and shape of the binding pocket.3,4 (Note that at this stage no consideration is given to synthetic accessibility or stability of the compound designs).
We next sought to use active learning to accelerate the search through this chemical space, using the oracle to assess the performance of model hyperparameters, and using the predicted binding affinity as the optimisation target. In particular, we have investigated the effects of initial compound selection (random or diverse), number of compounds picked per cycle (in the range 200–500), machine learning model (GBM or GP) and acquisition method (greedy or UCB). As discussed, the dependence of active learning efficiency on the choice of model parameters is well documented, and so we do not devote much space to it here.
By way of example, Fig. 3 shows the effect of the number of compounds picked per cycle on model recall and precision (F1 score) for the two machine learning models (GBM and GP). For a fixed total number of compounds selected (here, 2500), one might expect the model to improve at small sample sizes (hence, more active learning cycles), but we find that the efficiency is already well converged when picking 500 per cycle. Similarly, the choice of machine learning model has little effect, with slightly higher metrics for the GBM model, but both recall and precision comparisons are within the error bars. Fig. 4 further shows the effect of using the UCB uncertainty-based acquisition function, instead of greedy selection, in conjunction with the GP machine learning model. There is some small improvement in recall over greedy selection, but no significant change in the metrics used either as a function of cycle size or the β parameter in eqn (2). Fig. S4–S7† show the results of additional hyperparameter experiments, with similar conclusions.
Note that for the current dataset, random selection would give a recall of 0.05 and F1 score of 0.03 for identification of the top 2% of compounds. Therefore, with recall of around 0.25–0.30 for most of our experiments, we see efficiency improvements with active learning of around a factor of 5× compared to random selection. For reference, the growth and scoring of this compound set in FEgrow requires around 1000 cpuhrs, which is not prohibitive, but automated acceleration at no cost is clearly worthwhile.
In the next section, we choose to use a GP model with UCB acquisition function, with a cycle size of 200 and a diverse set of starting compounds. The overall accuracy of the chosen regression model (using β = 10), following training on 5% of the dataset, is 0.97 pK units (Fig. S2†), which is competitive with typical models used in active learning with fingerprint-based representations.27Fig. 5 shows a similar UMAP projection as in Fig. 2, but now only showing compounds acquired by our chosen active learning model in the first (left) and final (right) cycles. We observe both a wide exploration of the chemical space, which is important to increase diversity in the final set, and a focusing of the explored regions in the final cycle to compounds with a higher predicted binding affinity, which is important for the use of the model to identify strong binders.
![]() | ||
Fig. 5 Difference in selection for first (left) and final (right) active learning cycles, showing a narrowing into areas predicted to be potent and avoiding unpromising areas. |
Fig. 6 shows an example design run, optimising the compounds for predicted pK using the gnina scoring function (further examples are given in Fig. S8–S10†). The distribution of predicted affinity increases over the first 10 active learning cycles then starts to saturate with a mean predicted pK close to 6 (micromolar affinity). Over the full run, 95% of the compounds were successfully built (assigned pK > 0) and 15% had a predicted pK > 6. For comparison, we also extracted 1000 molecules at random that contained the pyridyl substructure from the REAL database used to seed the active learning cycles. For this set, 377 molecules (38%) could be successfully built, with an average predicted pK = 4.9 and only two compounds with predicted pK > 6.0 (0.2%).
Fig. 7a shows the highest scoring compound from this run, with a predicted affinity of 88 nM. The compound extends hydrophobic contacts into the S3 and S1′ pockets, for example with Met165 and Thr25, but despite this does not form any specific polar interactions (other than the original core interaction with His163). Since an early fragment screen had provided valuable information about the nature of potential protein–ligand interactions in this binding pocket, we sought to reduce the reliance on the gnina scoring function and drive the active learning towards compounds that recovered known crystallographic information (see Methods). Fig. 7b shows the top-scoring compound, as defined by the Tanimoto similarity to the vector of reference interactions. In this case, the grown molecule forms additional hydrogen bonding interactions with Asn142, Gly143, Ser144, Cys145 and Glu166, and hydrophobic interactions with Thr25 and Glu166. The majority of these interactions are recapitulated by, for example, fragments PDB: 5RGI and 5RF7 (Fig. 7d).
![]() | ||
Fig. 7 (a) Top-scoring compounds optimised for (a) predicted pK, (b) protein–ligand interaction profile and (c) combined scoring function. (d) Fragment 5RGI shown in pink (H-bond donation by Gly143, Ser144, Cys145 and His163), and 5RF7 in green (hydrophobic and H-bond donation with Glu166). |
Finally, we sought to combine the strengths of both docking scores and crystallographic information to optimise a combined scoring function. Fig. 7c shows the top-scoring compound as defined by eqn (1) after 33 cycles of active learning. Although this compound is scored much lower by the gnina scoring function (predicted affinity 2 μM), it extends into the S3 and S1′ pockets and retains many of the interactions observed in Fig. 7b (e.g., hydrogen bonding interactions with Asn142, Gly143, Ser144, Cys145 and Glu166).
Finally, to investigate whether the relatively low affinity of designed compounds is due to insufficient exploration of chemical space or the empirical objective functions used to optimise molecules, we performed a retrospective analysis of the designed compound space against known binders resulting from the COVID moonshot crowd-sourced discovery campaign.4 In particular, Fig. 10 shows the three most similar compounds from the active learning runs (as defined by Tanimoto similarity search between RDKit Morgan fingerprints with a radius of 3 and size of 2048) to a curated set of 292 hit compounds. Considering that our FEgrow runs took as input only a single PDB receptor structure and pyridyl fragment core, it is clear that this fragment growing and on-demand library screening approach holds promise for suggesting biologically active compounds early in hit discovery campaigns. However, further work is needed to ensure that the most promising compounds are located at the top of ranked lists for synthetic prioritisation and testing.
With the design of FEgrow, we hope to overcome some of the current limitations of de novo drug design discussed in the Introduction. Some of these limitations are addressed in the current study, and some will be addressed in future aided by ongoing advances in molecular modelling and machine learning. For example, we tackle the question of binding pose optimisation by using a fast and accurate machine learning potential (ANI-2x17) to describe the ligand energetics in a hybrid ML/MM mechanical embedding scheme. However, with the flexibility of the FEgrow interface with OpenMM,18 new models could be substituted in, and these are now approaching sufficient speed and accuracy (including for long-ranged interactions) such that geometry optimisation of the entire protein–ligand complex could be implemented using a single, consistent machine learning potential.58,59 In this study, we made the approximation that the protein binding pocket is rigid and used a single receptor structure for design. However, now that ligand building and scoring is fully automated, future studies could use, for example, ensembles of receptor structures, which may be beneficial in cases where the pocket is more flexible.
A limitation of this and other similar studies is the choice of objective function in the active learning cycles. To demonstrate the flexibility of the FEgrow package, we demonstrated four design cycles here, two optimising for predicted affinity using the gnina CNN scoring function and two including a more direct optimisation of protein–ligand contacts extracted from crystallographic fragment screens. While we do not have enough data to assess the relative merits of these scoring functions, we expect the latter to be useful where experimental structural data exists, at least as part of a multi-objective optimisation in future.60 As a flexible alternative to PLIP scores trained on system-dependent crystal structures, it has also been shown that transferable neural networks can be trained on the PDBbind structural database to recognise favourable protein–ligand interactions.61
As shown in Fig. 1c, to address the issue of synthetic tractability of the de novo built compounds, we inserted regular queries of the Enamine REAL database into the active learning cycles. In this way, we can use the initial chemical space to train the active learning regression models, and then over time seed the chemical space with compounds that are both similar to predicted actives and purchasable. In this way, we were able to test the predictions of the active learning workflow with a turn around time of a few weeks from order to biological testing. We note that the methods described here are not intended to be specific to the REAL database, and expanding the search query to a range of suppliers would further increase the similarity between top-scoring designs and seeded compounds. Furthermore, the question of how to efficiently search these very large databases is undergoing continual scrutiny, and methods that employ 3D similarity searches62,63 or that formulate the search in synthon-based rather than enumerated form64,65 could be incorporated into FEgrow in future. Of the 19 designed compounds that were purchased here, three showed measurable activity, but none approached the desired levels for further progression. Nevertheless, a similarity search showed the presence of effective inhibitors in the built chemical space, and so further investigation will focus on ranking compound designs ahead of purchase. Such post-processing could include molecular dynamics simulations to assess stability of predicted binding modes with a flexible receptor, or ideally an extra stage of physics-based free energy calculations to more rigorously predict binding affinities of compound designs.26,66
While the test case employed here is intended to showcase possible applications of FEgrow in prospective computer-aided design, it is challenging to comprehensively benchmark the utility of new computational methods. In this regard, the CACHE (critical assessment of computational hit-finding experiments) provide an invaluable resource for prospectively validating computational predictions.67 Indeed, we entered an earlier version of FEgrow into CACHE challenge 2, with some success in identifying a chemical starting point for SARS-CoV-2 NSP13.68 To facilitate further design and validation work by the community, we have made FEgrow freely available, along with a full tutorial for the active learning workflow, at: https://github.com/cole-group/FEgrow.
Footnotes |
† Electronic supplementary information (ESI) available: Further analysis of oracle dataset and active learning runs, 2D structures of the Enamine compounds ordered, and full experimental methods (PDF). See DOI: https://doi.org/10.1039/d4dd00343h |
‡ B. C. and M. K. B. contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2025 |