José Jiménez-Luna,a Laura Pérez-Benito,bc Gerard Martínez-Rosell,f Simone Sciabola,d Rubben Torella,e Gary Tresadernc and Gianni De Fabritiis*afg
aComputational Science Laboratory, Parc de Recerca Biomèdica de Barcelona, Universitat Pompeu Fabra, C Dr Aiguader 88, Barcelona, 08003, Spain. E-mail: gianni.defabritiis@upf.edu
      
bLaboratori de Medicina Computacional, Unitat de Bioestadística, Facultat de Medicina, Universitat Autònoma de Barcelona, Spain
      
cJanssen Research and Development, Turnhoutseweg 30, 2340 Beerse, Belgium
      
dBiogen Chemistry and Molecular Therapeutics, 115 Broadway Street, Cambridge, MA 02142, USA
      
ePfizer I&I, 610 Main Street, Cambridge, MA 02139, USA
      
fAcellera, Carrer del Dr Trueta, 183, 08005 Barcelona, Spain
      
gInstitució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, 08010 Barcelona, Spain
    
First published on 16th October 2019
The capability to rank different potential drug molecules against a protein target for potency has always been a fundamental challenge in computational chemistry due to its importance in drug design. While several simulation-based methodologies exist, they are hard to use prospectively, and predicting potency in lead optimization campaigns therefore remains an open challenge. Here we present the first machine learning approach specifically tailored for ranking congeneric series, based on deep 3D-convolutional neural networks. Furthermore, we prove its effectiveness by blindly testing it on datasets provided by Janssen, Pfizer and Biogen, totalling over 3246 ligands and 13 targets, as well as several well-known openly available sets, representing one of the largest evaluations ever performed. We also performed online learning simulations of lead optimization using the approach in a predictive manner, obtaining a significant advantage over experimental choice. We believe that the evaluation performed in this study provides strong evidence of the usefulness of a modern deep learning model in lead optimization pipelines against more expensive simulation-based alternatives.
It is therefore common to focus on relative binding free energy (RBFE) simulation methods,4–13 where the difference in affinity between two ligands is computed using a thermodynamic cycle that alchemically perturbs only the small region associated with the changing atoms. RBFE methods have shown good results in several studies, with accuracy close to 1 kcal mol−1 and reasonable correlations. Despite this, these methods suffer from several issues, such as system preparation, treatment of waters, force-field selection, protein flexibility and computational cost, making their prospective application difficult in practice.14 On the other hand, many empirical,15,16 knowledge-based17,18 and machine learning19–24 scoring functions have been designed for the task of predicting absolute binding affinities. They mostly tackle the problem in a regression setup, where the binding affinity is predicted from a set of protein–ligand descriptors modelling the interaction between the two. The fact that they model absolute affinities and are trained on very chemically diverse bodies of data, such as iterations of the PDBbind25 database, limits their applicability when predicting small structural differences between two ligands, as in the congeneric series case. While other machine learning approaches have been presented for this task,26–28 here we propose a modern 3D-convolutional-neural-network-based continuous learning approach for relative binding affinity prediction in congeneric series and show strong predictive power on multiple blind benchmarks as well as public datasets, at negligible computational cost. Thanks to a joint collaboration with several pharmaceutical companies, this study serves as a very large evaluation of a modern machine-learning pipeline for lead optimization in a real-life drug discovery scenario.
Out of the 645 congeneric series available in BindingDB, 495 with IC50 affinity values were extracted and processed for further evaluation, as this was the unit with the most data available, covering a diverse set of targets. The majority of these sets include a single protein–ligand crystal structure, with the rest of the ligands modelled against that reference using the Surflex docking software.35 We then assign each protein structure in the database to a family cluster using a 90% sequence similarity threshold, as per PDB conventions.36 For each series in the same protein cluster we use a maximum common substructure (MCS) protocol as available in rdkit37 to remove identical ligands. This procedure ensures that the same ligand is not repeated against similar targets, avoiding potential overfitting problems and overoptimistic evaluations.38 Affinity values were log-converted to avoid target scaling issues (pIC50 = −log10 IC50). Ligands that could not be read by rdkit were removed. Histograms of the number of ligands and their affinity range per series are provided in ESI Fig. 7,† with the average number of ligands per series being 8.84. In the Schrödinger and BRD4 sets, since only ΔG (in kcal mol−1) information was available, we converted affinity values to the pIC50 range assuming non-competitive binding. Descriptive information on these series is provided in ESI Table 1.† Compounds provided by Janssen were docked using a common scaffold structure via the Glide software. These congeneric series range from 48 up to 900 different compounds with varying affinity ranges (ESI Table 2†).
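As an illustration of this preprocessing, a minimal Python sketch using rdkit is given below; the helper names, the nM input unit and the MCS settings are our assumptions rather than the authors' released pipeline.

```python
import math

from rdkit import Chem
from rdkit.Chem import rdFMCS

def pic50_from_ic50_nm(ic50_nm):
    """Convert an IC50 given in nM to pIC50 = -log10(IC50 in molar units)."""
    return -math.log10(ic50_nm * 1e-9)

def same_ligand(mol_a, mol_b):
    """Treat two ligands as identical when their maximum common
    substructure covers every atom of both molecules (a heuristic)."""
    mcs = rdFMCS.FindMCS([mol_a, mol_b], timeout=10)
    return mcs.numAtoms == mol_a.GetNumAtoms() == mol_b.GetNumAtoms()

# Drop unparseable ligands, mirroring the rdkit filter described above.
smiles = ["CCO", "CC(=O)Nc1ccc(O)cc1", "not-a-smiles"]
mols = [m for m in (Chem.MolFromSmiles(s) for s in smiles) if m is not None]
```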
| n(r) = 1 − exp(−(r_vdw/r)^12) | (1) |
We define several channels for both protein and ligand, in the sense that the atomic contribution to each voxel depends on the atom type. For the protein we define eight pharmacophoric-like descriptors, as detailed in ESI Table 3.† For the ligands we use a simpler representation based on atom types contained in the set {C, N, O, F, P, S, Cl, Br, I, H}, for a total of 18 stacked channels. We note that there is no particular reasoning behind this choice of descriptors other than that they showed promising practical performance in previous studies.
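For concreteness, below is a minimal numpy sketch of a per-channel voxelization consistent with eq. (1), under the assumption that occupancy follows n(r) = 1 − exp(−(r_vdw/r)^12); the box size, resolution and per-voxel combination rule are illustrative assumptions.

```python
import numpy as np

def voxelize(coords, vdw_radii, channels, n_channels=18, box=24.0, res=1.0):
    """Accumulate per-channel atom occupancies on a cubic grid centred at
    the origin. coords: (N, 3) in Å, vdw_radii: (N,), channels: (N,) ints."""
    n = int(box / res)
    axis = np.linspace(-box / 2 + res / 2, box / 2 - res / 2, n)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    centres = np.stack([gx, gy, gz], axis=-1)              # (n, n, n, 3)
    grid = np.zeros((n_channels, n, n, n), dtype=np.float32)
    for xyz, rvdw, ch in zip(coords, vdw_radii, channels):
        r = np.linalg.norm(centres - xyz, axis=-1) + 1e-9  # avoid div by 0
        occ = 1.0 - np.exp(-((rvdw / r) ** 12))            # eq. (1)
        grid[ch] = np.maximum(grid[ch], occ)               # per-voxel maximum
    return grid
```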
The neural network we propose has a novel zero-symmetric architecture whose main building blocks are 3D-convolution operations. In this work we focus on predicting relative affinities for close analogues in lead optimization; our approach is therefore to build a network whose input is a pair of voxelized ligand-binding representations belonging to the same series. A two-input convolutional neural network was designed, with weights shared between both legs. The inputs are forwarded through several convolution and pooling operations and then flattened into a 192-dimensional latent representation. The symmetry property of relative binding affinity requires that inverting the order of the input ligands should change the sign of the predicted value, and we embed this symmetry in the network by computing the difference between the latent representations. A final linear layer with no bias is then applied to the result of this difference, ensuring zero-symmetry by design and producing the desired predicted difference in affinity. In contrast, calculating relative affinities from an absolute predictor inevitably accumulates the errors of two separate predictions.
A schema of our architectural choice is provided in Fig. 1. It features two convolution operations with a kernel size of 3 in each leg, followed by a max-pooling operation, and finally another convolution operation with the same kernel size before flattening and computing the latent difference between analogues. The ReLU activation function was used for all layers in the network except the last, which features none. We include a dropout layer at the end to control for overfitting. Xavier initialization was used for the weights. Training is performed using the Adam stochastic gradient descent optimizer52 with standard hyperparameters (β1 = 0.9, β2 = 0.999, ε = 10−4), a batch size of 32 samples and 50 epochs. Data augmentation is performed during training by randomly rotating input coordinates to mitigate the lack of rotational equivariance in CNNs. Furthermore, given a set of relative binding predictions, their absolute counterparts can always be retrieved given a single experimentally determined absolute reference, such as the one provided by a lead. If more than one is available, absolute affinities can be computed towards each, in practice providing a predictive absolute affinity distribution, whose average can be interpreted as a maximum a posteriori (MAP) estimate of the absolute affinity and its standard deviation as a measure of its uncertainty, given the current model state.
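A minimal PyTorch sketch of such a zero-symmetric two-leg network is shown below. The 18 input channels, 192-dimensional latent, kernel size of 3 and bias-free output layer follow the text; the filter counts and dropout rate are our assumptions.

```python
import torch
import torch.nn as nn

class DeltaDeltaNet(nn.Module):
    def __init__(self, in_channels=18, latent=192, p_drop=0.5):
        super().__init__()
        # A single module is applied to both inputs, so the two legs
        # share their weights by construction.
        self.leg = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3), nn.ReLU(),
            nn.Conv3d(32, 48, kernel_size=3), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(48, 24, kernel_size=3), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent), nn.ReLU(),
            nn.Dropout(p=p_drop),
        )
        self.head = nn.Linear(latent, 1, bias=False)  # no bias => f(x, x) = 0

    def forward(self, vox_a, vox_b):
        # Swapping the two inputs flips the sign of the prediction exactly.
        return self.head(self.leg(vox_a) - self.leg(vox_b))

# Usage: given one experimental reference ligand, an absolute affinity
# follows as reference_pIC50 + model(candidate_voxels, reference_voxels);
# averaging over several references gives the MAP estimate described above.
```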
Now we present results on the Wang et al.30 and BRD4 inhibitor datasets.31 In this and all remaining cases, we pretrained a model with all difference pairs available in the BindingDB database, which provides a prior for further fine-tuning. We then mixed the data newly available at each sequential iteration of each set with the rest of the BindingDB database and trained for only 3 epochs, significantly reducing computational overhead. A FEP baseline provided by Wang et al.30 is used for comparison. The model efficiently interpolates differences for unseen ligands, achieving high correlation coefficients and low errors in all series with as few as 3–4 additional ligands and associated activity pairs, surpassing in many cases the much more expensive FEP baseline (ESI Fig. 2†). For instance, for the MCL1 target, after testing 3 ligands, the correlation coefficient is above 0.8, surpassing the FEP baseline, and the RMSE is below 1.2 pIC50 units.
The same evaluation procedure was applied to the compounds available in the Janssen PDE sets (Fig. 2 and ESI Fig. 3†) for both a random and a temporal split, where a baseline against the Glide score55 is also added. Excellent performance was seen on the random split given enough training data and, as expected, although the temporal split performance is lower, it is still sufficiently high to be useful in a real-life prospective lead optimization scenario. For instance, for the first PDE2 activity set, after 20 ligands sorted by time, the Pearson's correlation coefficient R and RMSE were 0.77 and 1.35 (in pIC50 units) respectively. Results for the ROS1 and BACE sets show a similar trend (Fig. 3 and ESI Fig. 4†). Interestingly, performance slightly decreases for the BACE target late in the evaluation, suggesting that the network may have found an unexplored activity cliff or that the last tested ligands are harder to predict than others in the series. Furthermore, we also provide a type of split where only differences among the most chemically close ligands are predicted, based on ECFP4 fingerprint similarity as available in rdkit. That is, in each training step we predict, from the remaining untested pool of ligands, those that are closest to the ones in our training set, with the intention of resembling a real-life lead optimization RBFE scenario, typically applied to close analogues. Results for this fingerprint-similarity split on the first PDE2 set (ESI Fig. 5†) show that after 20 ligands sorted by chemical similarity the R and RMSE were 0.83 and 1.12 (in pIC50 units). These results suggest better performance in this scenario than in the proposed temporal split, closer to the random one.
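A minimal rdkit sketch of this similarity-driven selection follows; the fingerprint length and batch size k are illustrative.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def ecfp4(mol):
    # Morgan fingerprint with radius 2, the rdkit analogue of ECFP4.
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def closest_untested(train_mols, pool_mols, k=10):
    """Return indices of the k pool ligands most similar to any ligand
    already in the training set (max Tanimoto over the training set)."""
    train_fps = [ecfp4(m) for m in train_mols]
    scored = []
    for i, mol in enumerate(pool_mols):
        sims = DataStructs.BulkTanimotoSimilarity(ecfp4(mol), train_fps)
        scored.append((max(sims), i))
    scored.sort(reverse=True)
    return [i for _, i in scored[:k]]
```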
Fig. 2 Average Pearson's correlation coefficient R (±1 standard deviation) based on 25 independent runs on different sets for the Janssen PDE2, PDE3 and PDE10 targets.
Fig. 3 Average Pearson's correlation coefficient R (±1 standard deviation) based on several independent runs on two sets for the Janssen ROS1 and BACE targets.
We then present the results provided by Pfizer using a temporal split in Table 1, where specific target names cannot be disclosed. We compare these results with several baselines: molecular weight, clog P, an MM-GBSA pipeline56,57 and the deep-learning absolute affinity predictor KDEEP,20 trained on the v.2016 iteration of the PDBbind database. The model proposed here performs considerably better than the rest when given only 10% of the training data, again highlighting the importance of incrementally training these models on the congeneric series of interest. An exception, however, is found in the Kinase #3 series, for which no significant improvement is observed when providing extra training data. This last case appears to be particularly hard to predict, as all tested methods perform poorly. We provide results using a temporal split for the last two congeneric series provided by Biogen, for which we also compare against several baselines: (a) Glide score, (b) an MM-GBSA pipeline, and (c) a standard QSAR approach using MACCS, ECFP4 and rdkit descriptors with a random forest model (Fig. 4 and ESI Fig. 6†). Our model yields similar conclusions, significantly outperforming all baselines. Interestingly, the proposed method does not perform significantly worse than the aforementioned baselines in the second target even when no training data is used; once some is used, such as only 5 analogues, our proposed machine-learning model significantly outperforms all baselines.
| Target | # ligands | Mol. weight (ρ) | clog P^a (ρ) | MM-GBSA (ρ) | KDEEP (ρ) | This work (10% training, ρ) | This work (20% training, ρ) | This work (30% training, ρ) |
|---|---|---|---|---|---|---|---|---|
| ^a Calculated log P as available in rdkit. | | | | | | | | |
| Kinase #1 | 362 | 0.19 | 0.06 | 0.56 | 0.42 | 0.49 | 0.64 | 0.73 | 
| Kinase #2 | 106 | 0.1 | 0.28 | 0.25 | 0.25 | 0.25 | 0.41 | 0.51 | 
| Kinase #3 | 95 | 0 | 0.04 | 0.25 | −0.27 | 0.3 | 0.3 | 0.31 | 
| Enzyme | 93 | 0.43 | 0.24 | 0.01 | 0.49 | 0.43 | 0.26 | 0.59 | 
| Phosphodiesterase | 100 | 0.37 | 0.36 | 0.67 | 0 | 0.49 | 0.64 | 0.73 | 
| Activator of transcription | 199 | 0.13 | 0.08 | 0.66 | 0.29 | 0.72 | 0.84 | 0.94 |
| Weighted avg. | — | 0.19 | 0.14 | 0.47 | 0.25 | 0.49 | 0.59 | 0.69 |
| Simple avg. | — | 0.2 | 0.18 | 0.4 | 0.18 | 0.45 | 0.52 | 0.64 |
Fig. 4 Results over 5 runs on Biogen's tyrosine-protein kinase and receptor-associated kinase using a temporal split, and MM-GBSA and QSAR random forest pipelines as baselines.
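For reference, a minimal sketch of such a QSAR random forest baseline is given below; the exact descriptor list and hyperparameters used by Biogen are not described here, so these are assumptions.

```python
import numpy as np
from rdkit.Chem import AllChem, Descriptors, MACCSkeys
from sklearn.ensemble import RandomForestRegressor

def featurize(mol):
    """Concatenate MACCS keys, ECFP4 bits and a few rdkit descriptors."""
    maccs = np.asarray(MACCSkeys.GenMACCSKeys(mol))
    ecfp4 = np.asarray(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024))
    descs = np.array([Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
                      Descriptors.TPSA(mol)])
    return np.concatenate([maccs, ecfp4, descs])

# train_mols / train_pic50 would come from the tested part of the series:
# X = np.stack([featurize(m) for m in train_mols])
# model = RandomForestRegressor(n_estimators=500).fit(X, train_pic50)
```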
All the tests regarding internal pharmaceutical data were carried out blindly by providing fully containerized software to our collaborators, who executed the application and reported the corresponding results. Furthermore, only one pretrained model was provided, without any opportunity to overfit to each specific test set.
One aim of our study was to test whether machine-learning-driven relative affinity predictions could efficiently identify key high-potency compounds in a close-to-real-life lead optimization scenario, by retrospectively comparing them to the experimental order of synthesis. With some of the large industrial datasets it was possible to test this, and we used the most active compound as a surrogate for an interesting lead molecule. The model is trained on the first experimentally tested compounds and is then incrementally trained by choosing from the remaining ones based on an upper confidence bound (UCB)-like criterion,58 defined as:
| UCB(x) = μ(x) + βσ(x), | (2) |

where μ(x) and σ(x) are the mean and standard deviation of the absolute affinity distribution predicted for a candidate x, as described above, and β is a hyperparameter balancing exploitation against exploration.
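A minimal numpy sketch of one selection step under this criterion follows, assuming μ(x) and σ(x) come from the per-reference absolute predictions described earlier; the value of β is an assumption, while the batch size of 10 matches the synthesis epochs reported in Table 2.

```python
import numpy as np

def select_next_batch(pred_matrix, beta=0.3, batch_size=10):
    """pred_matrix: (n_candidates, n_references) array of absolute-affinity
    predictions, one column per experimental reference ligand."""
    mu = pred_matrix.mean(axis=1)          # MAP-style affinity estimate
    sigma = pred_matrix.std(axis=1)        # model uncertainty per candidate
    ucb = mu + beta * sigma                # eq. (2)
    return np.argsort(-ucb)[:batch_size]   # highest-UCB candidates first
```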
We stop the procedure once the model retrieves the analogue with the highest associated affinity, and compare this with its original experimental synthesis order in the corresponding series. We present results for this simulation-based benchmark in Table 2. In 4 out of 5 sets our proposed model is able to reach the compound with the highest affinity faster than either its experimental order or random selection. Surprisingly, in all ten independent runs of the second set for the PDE2 target, the compound with the highest affinity was found after only a single synthesis epoch. Furthermore, one would expect the average affinity in the training set to increase at each synthesis epoch (as the model is tasked to pick compounds with increasingly higher UCB). This is again the case for 4 out of 5 sets (Fig. 5), with the exception of the ROS1 target, which shows a non-monotonic trend, although its model still reaches the compound with the highest affinity before its experimental order. In all tested cases, the average training-pool affinity for the ligands selected by the model is higher than for the experimental choice. Overall, the results are very promising and suggest that the proposed method could be applied successfully in a prospective scenario. In particular, in the first PDE2 set, we were able to reach potent compounds while synthesizing up to six times fewer molecules than the baseline approach used by the medicinal chemistry team.
| Target | Set | # ligands | Chosen per synthesis epoch | Experimental order | Found at synthesis epoch | Total sampled ligands | Advantage over experimental choice | Advantage over random choice | 
|---|---|---|---|---|---|---|---|---|
| PDE2 | 1 | 900 | 10 | 766 | 12.2 | 132 | 634 | 318 | 
| PDE2 | 2 | 303 | 10 | 61 | 1 | 20 | 41 | 131.5 | 
| PDE2 | 3 | 278 | 10 | 253 | 5.9 | 69 | 184 | 70 | 
| ROS1 | — | 165 | 10 | 73 | 3.1 | 41 | 32 | 41.5 | 
| BACE | — | 229 | 10 | 190 | 20.8 | 218 | −28 | −103.5 | 
The number of ligand pairs grows for training as n(n−1)/2 and for testing as nm, where n and m are the number of training and testing instances respectively; for example, a series with n = 50 training ligands yields 1225 training pairs. At the moment, predictions are limited to a default total of 1000 molecules per congeneric series, with runtimes averaging an hour on a modern GeForce 1080Ti GPU. Larger experiments can be arranged for users willing to run more computationally demanding experiments.
    
    
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9sc04606b
This journal is © The Royal Society of Chemistry 2019