Isidro Cortés-Ciriano† a, Qurrat Ul Ain† b, Vigneshwari Subramanian c, Eelke B. Lenselink d, Oscar Méndez-Lucio b, Adriaan P. IJzerman d, Gerd Wohlfahrt e, Peteris Prusis e, Thérèse E. Malliavin* a, Gerard J. P. van Westen* f and Andreas Bender* b
aUnité de Bioinformatique Structurale, Institut Pasteur and CNRS UMR 3825, Structural Biology and Chemistry Department, 25-28, rue du Dr. Roux, 75 724 Paris, France
bUnilever Centre for Molecular Informatics, Department of Chemistry, Lensfield Road, CB2 1EW Cambridge, UK
cFaculty of Pharmacy, University of Helsinki, FIN-00014 Helsinki, Finland
dDivision of Medicinal Chemistry, Leiden Academic Centre for Drug Research, Einsteinweg 55, 2333 CC, Leiden, The Netherlands
eComputer-Aided Drug Design, Orion Pharma, Orionintie 1, FIN-02101 Espoo, Finland
fEuropean Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
First published on 7th October 2014
Proteochemometric (PCM) modelling is a computational method to model the bioactivity of multiple ligands against multiple related protein targets simultaneously. Hence it has been found to be particularly useful when exploring the selectivity and promiscuity of ligands on different proteins. In this review, we will firstly provide a brief introduction to the main concepts of PCM for readers new to the field. The next part focuses on recent technical advances, including the application of support vector machines (SVMs) using different kernel functions, random forests, Gaussian processes and collaborative filtering. The subsequent section will then describe some novel practical applications of PCM in the medicinal chemistry field, including studies on GPCRs, kinases, viral proteins (e.g. from HIV) and epigenetic targets such as histone deacetylases. Finally, we will conclude by summarizing novel developments in PCM, which we expect to gain further importance in the future. These developments include adding three-dimensional protein target information, application of PCM to the prediction of binding energies, and application of the concept in the fields of pharmacogenomics and toxicogenomics. This review is an update to a related publication in 2011 and it mainly focuses on developments in the field since then.
Isidro Cortés-Ciriano received an MSc in Biology and an MSc in Biochemistry from the University of Navarra (Spain) in 2011. Since 2012, he has been a fellow of the Pasteur-Paris International PhD Programme in the Structural Bioinformatics Unit at the Institut Pasteur, where he works on the development and application of machine learning methods for predictive bioactivity modelling in the context of multi-target systems.
Qurrat Ul Ain has been an IDB-CCT (Islamic Development Bank-Cambridge Commonwealth Trust) PhD scholar at the University of Cambridge since 2012. Her research focuses on bioactivity modelling approaches. She received her BS (Hons) in Bioinformatics from the International Islamic University Islamabad and her MPhil in Bioinformatics from Quaid-i-Azam University, Islamabad, Pakistan.
Vigneshwari Subramanian studied Bioinformatics at the University of Helsinki, Finland and is currently doing her PhD in Computational Drug Discovery at the same university. Her research focuses on proteochemometric modelling involving 3D protein field-based descriptors.
Eelke B. Lenselink is currently pursuing his PhD at the LACDR in Leiden, where he focuses on ligand- and structure-based design for GPCRs.
Oscar Méndez-Lucio received a BSc in pharmaceutical and biological chemistry and an MSc in chemistry from the National Autonomous University of Mexico (UNAM). Since 2012 he has been a PhD student at the University of Cambridge working on the bioactivity and selectivity of kinase inhibitors.
Ad IJzerman is a full professor of medicinal chemistry at the Leiden Academic Centre for Drug Research of Leiden University, The Netherlands. He has a keen interest in using computer science methods for medicinal chemistry needs.
Gerd Wohlfahrt earned his PhD degree in chemistry at the University of Braunschweig, Germany. Currently, he serves as a Principal Research Scientist, Computer-Aided Drug Design, Orion Pharma, Espoo, Finland. His areas of expertise include bioinformatics, chemoinformatics, drug discovery, and structural biology. His research interests comprise comparison of protein families, integration of protein- and ligand-based data for drug discovery, oncology targets, and drug discovery.
Peteris Prusis defended his PhD thesis at Uppsala University, Sweden, which included the discovery of the proteochemometric modelling approach. After many years of an academic career at Uppsala University he shifted his focus to industry, starting as a post-doc at AstraZeneca, Sweden, and now working as a Senior Research Scientist at Orion, Finland.
Thérèse Malliavin defended her PhD at the Université Paris-Sud and is a CNRS research fellow, working at the Institut Pasteur in Paris. Her main scientific interest concerns the relationship between the internal mobility and structure of biomolecules, and their interactions with other biomolecules and ligands.
Gerard JP van Westen finished a PhD in proteochemometrics at Leiden University (Netherlands) after an internship at Johnson and Johnson (Belgium). Subsequently he spent three years at the European Bioinformatics Institute in the ChEMBL Group (UK), and is returning to Leiden University.
Andreas Bender is a Lecturer for Molecular Informatics at the Centre for Molecular Informatics of the University of Cambridge, where he leads a research group comprising ca. 20 members performing research on various aspects of chemical and biological data integration and analysis. He received his PhD from the University of Cambridge in 2005, and returned to Cambridge after a Presidential Postdoctoral Fellowship with Novartis in Cambridge, MA and an Assistant Professorship at the University of Leiden in The Netherlands.
Predictive bioactivity methods, such as Quantitative Structure–Activity Relationship (QSAR) models, are based upon the compound similarity principle.5,6 However, it has been shown that the activity of a compound against a single target is not sufficient to understand its actions in a biological system. In fact promiscuity is intrinsic to chemical compounds,7,8 bioactivity against related targets frequently needs to be considered for efficacy of e.g. CNS-active drugs and anti-cancer drugs,9,10 and promiscuity has been used to anticipate side-effects.11 Hence, only the simultaneous modelling of both the chemical and the target domain, across a series of protein targets, permits the meaningful mining of the compound–target interaction space.12
The term chemogenomics comprises techniques capable of capitalizing on this huge amount of bioactivity data by considering compound and target information, in order to find unknown interactions between (new) compounds and their (new) targets.13,14 Proteochemometrics (PCM) modelling describes methods where a computational description from the ligand side of the system is combined with a description of the biological side being studied and both are related to a particular readout of interest.15,16
In this context, ligands are typically small molecules, although biologics have also been explored. Conversely, the biological parameters in the model can comprise protein binding sites, but also e.g. gene expression levels of particular cell lines. The readout describes the biological effect of a particular ligand on the protein or cell line of interest (such as an IC50 value of this particular combination of compound and biological system). Additionally, PCM relates to personalized medicine as it can predict the effect of a ligand on a complex biological system, e.g. a cell line, from genotypic information.17
In ligand space, chemogenomic approaches relying only on ligand data have shown that there is an unequal distribution of ligand data. This is due to the fact that some target classes (e.g. GPCRs or kinases) have been traditionally regarded as more interesting from a medicinal chemistry standpoint, and are thus overrepresented in bioactivity databases.23 Moreover, while some chemogenomic methods implicitly consider target information using bioactivity profiles of groups of similar ligands, i.e. the interaction between these compounds and a panel of targets, they are outperformed by techniques that explicitly consider target information.24,25 In addition, bioactivity profiles for related compounds are not always available.
In target space, techniques were employed which benefit from the structural or sequence information available and rely on groups of related targets with the aim to identify possible off-target effects and drug specificity for a particular target of interest.25 Based on the inverse similarity principle, related proteins are likely to interact with similar compounds. As in the previous case, the unavailability of data also constitutes a limitation for target-based chemogenomics.
The combination of ligand and target data allows the creation of predictive models that can rationalize e.g. viral or cancer cell line selectivity, whereas models exclusively based on ligands cannot explain the role of the target in selectivity.26 Merging data from ligand and target sources into the frame of a single machine learning model allows the prediction of the most suitable pharmacological treatment for a given genotype (personalized medicine), which ligand-only and protein-only approaches are not able to perform. This is precisely the underlying principle in proteochemometrics (PCM), which employs both ligand and target features simultaneously, and which therefore enables the deconvolution of both the target and the chemical spaces in parallel.15,16
A PCM model is trained on a dataset composed of a series of targets and compounds, where ideally compounds have been measured on as many targets as possible (illustrated in Fig. 1). The simultaneous modelling of the target and the ligand space permits a better understanding of complex drug–target interactions (e.g. selectivity)30–33 than would be possible with chemogenomics, as the effect of both target and chemical variability can be evaluated (e.g. protein mutations or the effect of chemical substructures on bioactivity). Thus, the aim of PCM is the complete modelling of the compound–target interaction space (Fig. 1), including the prediction of the bioactivity of novel compounds on yet untested targets.
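The input to such a model can be illustrated with a minimal sketch, assuming that each measured compound–target pair is represented by the concatenation of a ligand descriptor vector and a target descriptor vector, optionally extended with ligand–target cross-terms. The function name `pcm_features` and all descriptor values below are hypothetical and serve only as illustration.

```python
from itertools import product


def pcm_features(ligand_desc, target_desc, cross_terms=False):
    """Build one PCM input vector from a ligand and a target descriptor vector."""
    features = list(ligand_desc) + list(target_desc)
    if cross_terms:
        # Optional ligand-target cross-terms: all pairwise descriptor products.
        features += [l * t for l, t in product(ligand_desc, target_desc)]
    return features


# Hypothetical toy descriptors (not taken from any dataset).
ligands = {"cpd1": [1.0, 0.0, 2.0], "cpd2": [0.0, 1.0, 1.0]}
targets = {"tgtA": [0.5, -1.0], "tgtB": [1.5, 0.2]}

# One entry per measured compound-target pair; the interaction matrix
# does not need to be complete (cpd2 on tgtB is missing here).
measurements = {
    ("cpd1", "tgtA"): 7.2,
    ("cpd1", "tgtB"): 5.1,
    ("cpd2", "tgtA"): 6.0,
}

X = [pcm_features(ligands[c], targets[t]) for (c, t) in measurements]
y = list(measurements.values())
```

Any regression or classification algorithm can then be trained on `X` and `y`; a prediction for an untested pair only requires the descriptors of the compound and of the target.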
Initial attempts to incorporate description of several proteins and their ligands in a single QSAR model involved modelling of the interaction between mutated glucocorticoid receptors and DNA.34,35 The first full scale PCM study involving different proteins was devoted to the interaction of chimeric melanocortin receptors with chimeric peptides at Uppsala University.36 The name “proteochemometrics” was coined later by the same research group.15 Since then PCM has been applied on various diverse datasets (Table 1).37,38 While the current review will focus on recent developments in the field, a comprehensive discussion of PCM-related work has been presented in a previous review by van Westen et al. from 2011 to which we would like to refer the reader.16
Dataset (datapoints) | Receptor | Ligand descriptors | Target descriptors | Bioactivity type | Machine learning technique | In silico model validation | Prospective validation? | Remarks, inferences | References |
---|---|---|---|---|---|---|---|---|---|
a The wide applicability of PCM is evidenced by the increased coverage of drug targets in the studies of the last three years. Although traditional drug targets, such as GPCRs or kinases, are still widely represented, new applications (e.g. the modelling of viral genotypes or pharmacogenomics) are gaining ground steadily. BPN – Back Propagation Networks, BS – Bootstrapping Validation, CTD – composition and transition of amino acid properties, CV – Cross-Validation, DCNB – Dual Component Naive Bayes, DCSVM – Dual Component Support Vector Machines, DT – Decision Trees, DTV – Decoy Test Validation, ENR – Elastic Net Regression, EV – External Validation, GP – Gaussian Processes, KNN – K-Nearest Neighbors, LCO – Leave-Cluster-Out Validation, LOTO – Leave-One-Target-Out Validation, NB – Naïve-Bayes, NN – Neural Network, MLR – Multiple Linear Regression, OOB – Out-Of-Bag Validation, PCA – Principal Component Analysis, PLS – Partial Least Squares, RF – Random Forest, RS – Random Splitting, SVM – Support Vector Machines, SVR – Support Vector Regression, Y-Sc – Y-Scrambling. | |||||||||
PDBbind170 (1300) | 1300 protein–ligand complexes | Atom-type based | Atom-type based | Kd, Ki | RF | Y-Sc, OOB, EV | No | Increasing the training set size improves the model's predictability | Ballester et al., 2010 (ref. 208) |
ProLINT database209 (3595) | 62 kinases | Structural fragments and 2D autocorrelation vectors | Sequence-based structural fragments and amino acid sequence autocorrelation | IC50 | SVM | 3-fold CV, EV | No | SVM based on autocorrelation descriptors perform better than fragment-based approaches | Fernandez et al., 2010 (ref. 210) |
PDBbind170 (1255) | Diverse proteins | Property-encoded shape distributions | Property-encoded shape distributions | K d, Ki | SVM | 5-fold CV, EV | No | Training set enrichment and expansion enhances prediction accuracy | Das et al., 2010 (ref. 175) |
Stanford HIV drug resistance database211 (4495) | 728 reverse transcriptases | Dragon descriptors71 | Z-scales48 | IC50 | PLS | 7-fold CV, EV | No | Receptor–ligand and receptor–receptor cross-terms improved model performance | Junaid et al., 2010 (ref. 149) |
Immune epitope database212 (31992) | 12 HLA-DRB1 proteins | Z-scales48 | Z-scales48 | IC50 | PLS | 7-fold CV, EV | No | Identified protein residues and peptide positions for binding predictions | Dimitrov et al., 2010 (ref. 213) |
Karaman et al. dataset214 (12046) | 317 human kinases | Dragon descriptors71 | Z-scales,48 amino-acid composition, sequence order and CTD | Kd | PLS, SVM, KNN, DT | Double CV | No | SVM outperforms all other machine learning approaches | Lapins et al., 2010 (ref. 215) |
CSAR-NRC HiQ176 | 346 protein–ligand complexes | Atom counts | Atom counts | K d | MLR | RS | No | Distance dependent atom descriptors make the regression models more robust | Kramer et al., 2011 (ref. 176) |
Gold standard set (1933) | 313 diseases (OMIM)216 | Diverse drug–drug similarity measures | Disease–disease similarity measure | Classifier score | Logistic regression classifier | 10-fold CV, EV | No | Possibilities to include patient specific gene expression profiles make the models suitable for pharmacogenomics studies | Gottlieb et al., 2011 (ref. 217) |
Sc-PDB218 (2882) | 581 targets | Hashed fingerprints | Protein sequence and 3-D structure based | Actives/inactives | SVM | 5-fold CV, EV | No | Structure-based approaches perform better than sequence-based approaches | Meslamani et al., 2011 (ref. 60) |
GLIDA database119 (5207) and GVK kinase database (15616) | 317 GPCRs and 143 kinases | Dragon descriptors71 | Protein sequence and feature-based | Ki, IC50, EC50 | SVM | 5-fold | 9 compounds for ADRB2; 5 inhibitors for EGFR | Highly active compounds predicted by SVM not identified by ligand-based/structure-based approaches | Yabuuchi et al., 2011 (ref. 62) |
Tibotec BVBA (4024) | 14 HIV RT | Circular fingerprints | Hashed fingerprints | EC50 | SVM | Y-Sc, E CV, LosoV | 317 novel predictions were experimentally verified | Viral mutants PCM models can assist the development of drugs for HIV infection | van Westen et al., 2011 (ref. 26) |
Bioinfo-DB61 (336678) | Oxytocin receptor | MACCS structural keys | Fingerprints based on the properties of amino acids in active site | Actives/inactives | RF | 10-fold CV, EV | Biological evaluation of 37 compounds (2 hits) | PCM models yield better hits than the conventional virtual screening procedures | Weill, et al., 2011 (ref. 61) |
PDBbind refined set (1387) | 23 protein families (1387 proteins) | Atom-type based | Atom-type based, distance-dependent protein ligand atom type pairs | K d | MLR, PLS | 5-fold CV, LCO | No | Inclusion of descriptors from PCM models predict free energies more accurately than docking programs | Kramer et al., 2011 (ref. 169) |
Stanford HIV drug resistance database (4794 protease and 4495 RT sequence-inhibitor combinations) | 828 HIV-1 protease variants | GRIND alignment independent descriptors219 | Z-scales48 | Inhibitor concentration | PLS | Double loop CV, Y-Sc and EV | No | Intra-protease cross-terms improve model performance | Spjuth et al., 2011 (ref. 150) |
Kinase SARfari3 (85908) | 342 human kinase domains | Extended connectivity fingerprints (ECFP-6)70 | Fingerprints based on amino acid residues and physiochemical properties | IC50, Kd, Ki | DCSVM and DCNB | RS, EV | No | DCSVMs provide better activity prediction | Niijima et al., 2012 (ref. 94) |
BindingDB220 (1275) | 5 HDAC isoforms | Physical properties and topological indices of compounds | Sequence similarity, structure similarity, geometry descriptors | IC50 | SVR | 10-fold CV, EV | No | SVR models with PUK kernels have stronger mapping capabilities | Wu et al., 2012 (ref. 92) |
Docked complexes (2335 PDB structures & 3671 FDA drugs) | 2335 human targets | Ligand shape descriptors | Binding site shape descriptors | Ligand contact point score | PCA | DTV, EV | VEGFR2 inhibition by Mebendazole and Cadherin 11 inhibition by Celecoxib were verified | TFMS PCM approach can assist in drug repositioning studies | Dakshanamurthy et al., 2012 (ref. 221) |
Literature (160 protein–ligand complexes) | 47 HIV-1 proteases | Physical properties, topological indices of compounds | Z-scales48 | K i | SVR | 10-fold CV, EV | No | Protein–ligand interaction fingerprints improved models over cross-terms | Huang et al., 2012 (ref. 41) |
CHEMBL 23 (10999) | 8 human and rat adenosine receptors | Circular fingerprints | Hashed fingerprints | K i | SVM | Y-Sc, EV, DTV | 6 novel compounds were experimentally identified | Addition of orthologue information increased model quality | van Westen et al., 2012 (ref. 22) |
CHEMBL 83 (81689; 43965) | 136 GPCRs and 176 kinases | MACCS keys | Sequence descriptors | K i, IC50 | SVM | 5-fold CV, EV | No | Feature selection improved the predictive accuracy of the models | Cheng et al., 2012 (ref. 222) |
GVK biosciences database223 (628120) | 238 class A GPCRs | Chemical kernels based on ECFP-6 fingerprints and dragon descriptors | Protein kernels based on full length, TM and loop sequences | Agonists/antagonists | SVM | RS, DT, EV | No | Protein kernels based on TM sequences showed higher prediction accuracy | Shiraishi et al., 2013 (ref. 158) |
GDSC dataset224 (38930) | 639 cancer cell lines | PaDEL descriptors72 | CNV, sequence variation and microsatellite instability status | IC50 | RF and NNs | 8-fold CV, EV | No | PCM based on existing drugs allows drug repositioning and pharmacogenomics studies | Menden et al., 2013 (ref. 29) |
Peptide library (180) | 4 proteases | Binary and physiochemical descriptors | Binary descriptors | K i | PLS | 5-fold CV | No | Inclusion of intra-peptide cross-terms improved model performance | Prusis et al., 2013 (ref. 151) |
Kinase SARfari (54012) | 372 kinases | Topological fingerprints | Amino-acid composition and CTD | IC50, Kd, Ki | RF and NB | OOB, 5-fold CV, EV | No | Random forests outperform Naïve Bayes | Cao et al., 2013 (ref. 126) |
Virco (300000) | HIV mutants (10700 NNRTI, 10500 NRTI, 27000 PI) | Circular fingerprints | Z-scales48 | IC50 | SVM | Y-Sc, 5-fold CV, EV | No | Phenotypic resistance for novel mutants can be predicted via PCM | van Westen et al., 2013 (ref. 145) |
GPCRDB225 (310) | 9 human amine GPCRs | Physical properties and topological indices of compounds | Z-scales48 and TM identity descriptors | Ki | SVR and GP | 10-fold CV, EV | No | SVR is superior to GP; TM identity descriptors perform better than Z-scales descriptors | Gao et al., 2013 (ref. 112) |
PubChem BioAssay dataset4 (63391) | 5 CYP 450 isoforms | Molecular signatures | CTD | AC50 | KNN, SVM and RF | CV, EV | No | Non-linear methods (SVM and RF) perform better | Lapins et al., 2013 (ref. 88) |
Binding and PDSP KI database220 (13079) | 514 human targets | Topological fingerprints | Amino-acid composition and CTD | K i | RF and NB | OOB, 5-fold CV, EV | No | Random forests outperform KNN, SVM, NB and BPN | Cao et al., 2013 (ref. 226) |
In vitro OATP modulation data (2000) | OATP1B1 and OATP1B3 | Circular fingerprints | Z-scales48 and feature-based ProtFP | K i | RF | OOB, EV | Agreement between experiment and prediction | 4 class models are superior to 2-class models and provide information about selectivity | Bruyn et al., 2013 (ref. 54) |
Karaman et al.,214 Davis et al.227 and Metz et al.228 datasets | 50 kinases | Mold2,229 Open Babel230 and VolSurf231 descriptors | Knowledge-based fields123 and WaterMap124 derived fields | Kd/Ki | PLS | 7-fold CV, EV, LOTO, Y-Sc | No | Field-based models are superior to sequence-based models | Subramanian et al., 2013 (ref. 66) |
When no reliable alignment is possible, target descriptors can be calculated from the whole protein sequences without aligning them.49 The use of primary sequence descriptors alone to predict protein–protein interactions was shown to be efficient by Shen et al.,50 who were able to train an SVM model based on more than 16000 protein–protein pairs described with conjoint triad amino acid descriptors. Similarly, analyses of sequence variability among targets exhibiting divergent bioactivity profiles enabled the characterization of binding pocket residues energetically important for ligand binding and selectivity for GPCRs and kinases.51–53
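An alignment-free descriptor of this kind can be sketched as follows. The example counts conjoint triads, i.e. triplets of consecutive residues after mapping each amino acid to one of seven classes, which yields a fixed-length vector of 7 × 7 × 7 = 343 values for a sequence of any length. The class assignment below is given as recalled from Shen et al. and should be checked against the original publication; the min-max scaling is an assumption and may differ from the normalisation used there.

```python
from itertools import product

# Seven residue classes (assumed grouping; verify against Shen et al.).
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: i for i, group in enumerate(CLASSES) for aa in group}
TRIADS = list(product(range(len(CLASSES)), repeat=3))  # 343 possible triads


def conjoint_triad(sequence):
    """Return a 343-dimensional, alignment-free descriptor of a sequence."""
    counts = dict.fromkeys(TRIADS, 0)
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    # Slide a window of three consecutive residues along the sequence.
    for triad in zip(classes, classes[1:], classes[2:]):
        if None not in triad:  # skip windows with non-standard residues
            counts[triad] += 1
    values = [counts[t] for t in TRIADS]
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0] * len(values)
    # Min-max scaling so that sequences of different length are comparable.
    return [(v - lowest) / (highest - lowest) for v in values]


descriptor = conjoint_triad("MKVLAAGIVGLLLAQ")  # hypothetical sequence
```

Because the vector length does not depend on the sequence length, such descriptors can be computed for targets that cannot be reliably aligned.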
If present, structural information from crystallographic structures can be used by selecting residues near the ligand binding site (e.g. 5 or 10 Å sphere around the co-crystallized ligand).21,43,44,47 Subsequently, the corresponding residues for other targets can be obtained from sequence alignment. This semi-structural method is less reliable than a full structural superposition and alignment gaps might appear. However, in practice, the former appears to have better resolution, which might be due to the fact that domains not involved in ligand binding are not considered.22,54,55 To date, binding sites in PCM models have been derived from single crystallographic structures,22,42,55,56 thus ignoring the intrinsically dynamic nature of proteins. However, databases such as Pocketome57 might facilitate the introduction of dynamic properties of protein binding sites in PCM models as they contain ensembles of conformations for druggable binding sites extracted from co-crystal structures in the Protein Data Bank. To the knowledge of the authors, descriptors accounting for the dynamic properties of binding site amino acids have not been reported in the literature. Including this dynamic information might lead to a better description of protein targets in cases where small molecule binding is dependent on the binding site conformation, e.g. kinases.
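The selection of binding site residues within a sphere around the co-crystallized ligand can be sketched as below. This is a minimal illustration that assumes atomic coordinates have already been parsed from a structure file; the function name `binding_site_residues` and the toy coordinates are hypothetical.

```python
import math


def binding_site_residues(protein_atoms, ligand_atoms, cutoff=5.0):
    """Return the residues with at least one atom within `cutoff` Å of the ligand.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples
    ligand_atoms:  iterable of (x, y, z) tuples
    """
    selected = set()
    for residue_id, coord in protein_atoms:
        if any(math.dist(coord, lig) <= cutoff for lig in ligand_atoms):
            selected.add(residue_id)
    return sorted(selected)


# Hypothetical coordinates in Å (not taken from a real structure).
protein_atoms = [
    (45, (0.0, 0.0, 0.0)),
    (45, (1.5, 0.0, 0.0)),
    (46, (4.0, 0.0, 0.0)),
    (120, (12.0, 0.0, 0.0)),
]
ligand_atoms = [(2.0, 0.0, 0.0), (3.0, 1.0, 0.0)]

site_5 = binding_site_residues(protein_atoms, ligand_atoms, cutoff=5.0)
site_10 = binding_site_residues(protein_atoms, ligand_atoms, cutoff=10.0)
```

The residue positions obtained for the reference structure would then be mapped onto the other targets through the sequence alignment, and only these positions would be described in the model.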
Beyond sequence similarity, targets have also been described in different ways to model compound bioactivities on multiple targets.58–62 Among others, targets have been characterized by: (i) the incorporation of biological tests and inverse virtual screening data; (ii) structural pocket similarity analyses; (iii) topology analyses of both compound–target and protein–protein interaction networks; (iv) the combination of pharmacophoric and interaction fingerprints; and (v) 3-dimensional alignment-free methods of binding sequences.7,63–66 The availability of a plethora of target descriptors enables the application of PCM to target families where, for instance, little structural information is available. The advantages brought to the PCM field by each of these descriptor types will be reviewed in Sections 4 and 5. In cases where targets are not proteins, but more complex biological systems, such as cell lines, the target space can be described with ‘omics’ data, namely: copy-number variation (CNV) data, gene expression levels, exome sequencing data, cell line fingerprints, protein abundance, and miRNA expression levels.17,29
Next to circular fingerprints, physicochemical descriptors, such as DRAGON or PaDEL,71,72 have been widely used in recent years (Table 1). Other ligand descriptors, such as atom types, topological indices, MACCS keys or ligand shape descriptors, have also been applied in the context of PCM.
In the experience of the authors, the description of compounds with circular Morgan fingerprints permits the generation of statistically validated PCM models but on several occasions the addition of physicochemical properties to fingerprints has been demonstrated to improve performance.54 This was especially true on data sets with a large chemical diversity, e.g. resulting from screening a diverse set or resulting from covering a group of targets with diverse ligands.
However, the degree of completeness of the ligand–target interaction matrix is only one parameter influencing the predictive ability of a model. The variability on the chemical side and on the target side are the other two factors that need to be considered, both in model validation and when assessing the applicability domain of a model.75 Hence, the authors strongly suggest validating PCM models following a number of basic guidelines, which are in line with the recommendations from Park and Marcotte.77 Firstly, external validation (e.g. 70–30 validation): a model is trained on 70% of the data (training set) and the bioactivity for the remaining 30% (test set) is predicted. In this case, all targets and compounds are present in both the training and the test set. This method corresponds to a Park and Marcotte C1 validation and serves to determine if a reliable model can be fit on the data set.
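The 70–30 external validation can be sketched as a random split at the level of compound–target pairs. The sketch below is a minimal illustration with hypothetical names and data; note that a purely random split does not by itself guarantee that every compound and every target occurs in both sets, which should be checked for small or sparse datasets.

```python
import random


def random_split(pairs, test_fraction=0.3, seed=0):
    """Randomly split compound-target datapoints into training and test sets."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical datapoints: (compound, target, bioactivity).
data = [(f"cpd{i}", f"tgt{i % 2}", 5.0 + 0.1 * i) for i in range(10)]
train, test = random_split(data)
```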
Secondly, Leave-One-Target-Out (LOTO) validation: all the bioactivity data annotated on a target is excluded from the training set. A model is subsequently trained on the training set, which is used to predict the bioactivities for the compounds annotated on the hold-out target. This process is repeated for each target. This validation scheme corresponds to a Park and Marcotte C2 validation and reflects the common situation in prospective validation where there is no information for a given target for which we intend to find hits.
Thirdly, Leave-One-Compound-Out (LOCO) validation: the bioactivity data for a compound on all targets is excluded from the training set. Similarly to the LOTO validation, the PCM model trained on the remaining data is used to predict the bioactivity for the hold-out compound on each target. This data availability scenario corresponds to a Park and Marcotte C2 validation and resembles the situation where a PCM model is applied to novel chemistry, e.g. in a prospective screening campaign. If the number of compounds in the training dataset is large, compound clusters can be used instead of single compounds, thus leading to the Leave-One-Compound-Cluster-Out (LOCCO) validation scenario.17
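The LOTO and LOCO schemes differ only in the attribute used to define the hold-out group, so both can be sketched with a single generator. The record layout and the name `leave_one_group_out` are hypothetical; for LOCCO, each record would additionally carry a precomputed cluster label that is passed as the grouping key.

```python
def leave_one_group_out(records, key):
    """Yield (held_out, train, test), holding out one group at a time.

    key="target" gives LOTO, key="compound" gives LOCO; a cluster label
    stored under another key would give LOCCO.
    """
    for held_out in sorted({r[key] for r in records}):
        train = [r for r in records if r[key] != held_out]
        test = [r for r in records if r[key] == held_out]
        yield held_out, train, test


# Hypothetical datapoints.
records = [
    {"compound": "c1", "target": "A", "activity": 6.2},
    {"compound": "c2", "target": "A", "activity": 7.1},
    {"compound": "c1", "target": "B", "activity": 5.4},
    {"compound": "c3", "target": "B", "activity": 8.0},
    {"compound": "c3", "target": "C", "activity": 6.6},
]

loto_folds = list(leave_one_group_out(records, "target"))
loco_folds = list(leave_one_group_out(records, "compound"))
```

In each fold a model would be trained on `train` and evaluated on `test`, so that the held-out target (or compound) is never seen during training.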
In addition to these scenarios, the authors suggest comparing the performance of the PCM model trained on all data to that of single-target QSAR models. The goal of this validation is twofold. Firstly, a direct comparison to QSAR can determine whether it is wise to apply PCM to a data set. Secondly, as was touched upon above, bias in the data can be the cause of some targets being reliably modeled and some targets being poorly modeled (see Section 6).23–25 When validation parameters (such as the correlation coefficient) are calculated on the full test set, poorly modeled targets can be masked. In order to notice such discontinuities, the authors recommend calculating the validation parameters not only on the full test set, but also on test set data points grouped per target and grouped per ligand.45 The values of the statistical metrics calculated per target can be directly compared with those obtained with single QSAR models (comparing values calculated on the full test set would not be an accurate comparison).
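How a global metric can mask a poorly modeled target is illustrated by the following sketch, which computes the root mean squared error (RMSE) on the full test set and on data points grouped per target or per ligand. RMSE is used here only as an example metric; the record layout and all values are hypothetical.

```python
import math
from collections import defaultdict


def rmse(pairs):
    """Root mean squared error over (observed, predicted) pairs."""
    return math.sqrt(sum((o - p) ** 2 for o, p in pairs) / len(pairs))


def grouped_rmse(records, key):
    """RMSE computed separately for each value of `key` (target or compound)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append((r["observed"], r["predicted"]))
    return {group: rmse(pairs) for group, pairs in groups.items()}


# Hypothetical test set predictions.
test_set = [
    {"compound": "c1", "target": "A", "observed": 6.0, "predicted": 6.0},
    {"compound": "c2", "target": "A", "observed": 7.0, "predicted": 7.0},
    {"compound": "c1", "target": "B", "observed": 5.0, "predicted": 7.0},
    {"compound": "c2", "target": "B", "observed": 8.0, "predicted": 6.0},
]

overall = rmse([(r["observed"], r["predicted"]) for r in test_set])
per_target = grouped_rmse(test_set, "target")
per_compound = grouped_rmse(test_set, "compound")
```

In this toy example the overall error lies between a perfectly predicted target A and a poorly predicted target B, which only becomes visible in `per_target`.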
Ideally, the final validation is one where a target and all compounds that have been tested on this (and other targets) are iteratively excluded from the training set. This approach corresponds with a Park and Marcotte C3 validation. C3 validation is considered extrapolation rather than interpolation, as both parts of the pair (the ligand and the target) have not been seen in the training set by the model.
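A C3-type split can be sketched as below, under the assumption that, for each hold-out target, the test set consists of all data points measured on that target, while the training set excludes both the target and every compound tested on it. The name `c3_splits` and the toy records are hypothetical.

```python
def c3_splits(records):
    """Yield (held_out_target, train, test) where neither the target nor any
    test compound is present in the training set."""
    for held_out in sorted({r["target"] for r in records}):
        test = [r for r in records if r["target"] == held_out]
        unseen_compounds = {r["compound"] for r in test}
        train = [
            r for r in records
            if r["target"] != held_out and r["compound"] not in unseen_compounds
        ]
        yield held_out, train, test


# Hypothetical datapoints.
records = [
    {"compound": "c1", "target": "A"},
    {"compound": "c2", "target": "A"},
    {"compound": "c1", "target": "B"},
    {"compound": "c3", "target": "B"},
    {"compound": "c3", "target": "C"},
    {"compound": "c4", "target": "C"},
]

splits = {held_out: (train, test) for held_out, train, test in c3_splits(records)}
```

Because compounds tested on the hold-out target are also removed from the other targets, the training set can shrink considerably for promiscuously tested compounds, which should be kept in mind when interpreting the resulting statistics.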
Taken together, these validation scenarios enable a thorough and earnest validation of PCM models and a comparison to the state of the art. Finally, the authors also suggest calculating the statistical metrics on the predictions of at least three models trained on different subsets of the complete dataset, and reporting them together with the standard deviation observed over the repetitions.75 Similarly, it is advisable to carefully estimate the maximum achievable performance given the uncertainty of the data.17,75
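Reporting a metric over repeated models can be sketched as follows. The sketch assumes a user-supplied function `fit_and_score(train, test)` that trains a model and returns one number; here a trivial mean-value baseline stands in for a real PCM model, and all names and values are hypothetical.

```python
import math
import random
import statistics


def mean_baseline_rmse(train, test):
    """Placeholder model: predict the training mean, return the test RMSE."""
    prediction = statistics.mean(train)
    return math.sqrt(sum((v - prediction) ** 2 for v in test) / len(test))


def repeated_evaluation(values, fit_and_score, n_repeats=3, test_fraction=0.3):
    """Return the mean and standard deviation of a metric over repeated splits."""
    scores = []
    for seed in range(n_repeats):
        shuffled = list(values)
        random.Random(seed).shuffle(shuffled)  # a different subset per repeat
        n_test = round(len(shuffled) * test_fraction)
        scores.append(fit_and_score(shuffled[n_test:], shuffled[:n_test]))
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical bioactivity values (e.g. pIC50).
activities = [5.1, 6.3, 7.0, 4.8, 6.9, 5.5, 7.4, 6.1, 5.9, 6.6]
mean_score, sd_score = repeated_evaluation(activities, mean_baseline_rmse)
```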
Machine learning method | Short description | Advantages | Disadvantages | References |
---|---|---|---|---|
a New algorithms have been introduced in PCM focusing on: (i) increasing interpretability; (ii) reducing training times; (iii) providing individual intervals of confidence for the predictions; and (iv) considering the experimental uncertainty in the modelling. | ||||
Support Vector Machine (SVM) | Maps the input space into a higher dimensional space where a hyper-plane is defined by ‘support vectors’, lying at the interface between classes | – Medium training time | – Optimize bandwidth hyper-parameter | Gao et al., 2013 (ref. 112) |
– PUK kernel uses an approximation of linear, polynomial and RBF kernels | – No consideration of experimental error | Hur et al., 2008 (ref. 87) | ||
– No error bars for the predictions | Genton et al., 2001 (ref. 90) | |||
van Westen et al., 2012 (ref. 22) | ||||
Dual-component SVM (DC-SVM) | Amino acid residues and compound fragments are treated as two components | – Accurate prediction of active versus inactive | – Huge kernel matrix | Niijima et al., 2012 (ref. 94) |
– Reduced efficiency due to size | ||||
Transductive SVM (TSVM) | Semi-supervised text mining technique | – Effective with unbalanced datasets | – Difficult to implement without proper tuning | Kondratovich et al., 2013 (ref. 96) |
– Smoothen the decision boundaries | Wang et al., 2005 (ref. 97) | |||
Collobert et al., 2006 (ref. 98) | ||||
Relevance Vector Machine (RVM) | Probabilistic counterpart of SVM | – Contains sparse descriptors | – Non informative predicted variance | Tipping, 2001 (ref. 99) |
– Fast prediction | Lowe et al., 2011 (ref. 100) | |||
– Easy retrieval of important descriptors | ||||
Random Forest (RF) | – Constructs multiple decision trees with random selection of variables | – Computationally less expensive than SVM | – Requires relatively large amounts of memory | De Bruyn et al., 2013 (ref. 54) |
– Short training time | ||||
– High interpretability | ||||
Gaussian Processes (GP) | – Non-parametric Bayesian technique | – Measurable interval of confidence (IC) | – Long training time | Schwaighofer et al., 2007 (ref. 113) |
– Gives each prediction as Gaussian distribution | – Consideration of experimental uncertainty | Cortes-Ciriano et al.75 | ||
Matrix factorization (CF) | – Calculates activities as dot product of compound and target features | – Missing values are predicted efficiently | – Performance on sparse data | Gao et al.115 |
– Interpretability | ||||
– Multi-task learning | – Inferred features could be used as descriptors in the activity model | Erhan et al.117 | ||
– Estimates relatedness between targets |
In a recent study from Lapins et al.88 Random Forest (RF), K-Nearest Neighbors (KNN), and SVMs were applied to construct a PCM model of Cytochrome P450 (CYP) inhibition. The models were trained on 5 CYPs and 17143 compounds. CYPs were described with transition and composition description of amino acids, while compounds were described with structural signature descriptors. These PCM models were shown to outperform single target models in terms of Area Under the Curve (AUC: PCM: >0.90, QSAR: 0.79–0.89) that were constructed in parallel by Cheng et al.89 Of the methods used, RF and SVM were shown to be comparable in terms of accuracy and AUC. The high performance of the SVM model in the external validation (AUC: 0.940) evidences the suitability of this approach to correctly extrapolate in both the target and compound space.
SVMs can use different internal functions (kernels) to derive bioactivity predictions, the most widely used being the Radial Basis Function (RBF) kernel.90 RBF kernels have been shown to perform well on PCM data.16,22 Recently the Pearson VII function-based Universal Kernel (PUK)91 was also applied to PCM. Wu et al.92 showed that they were able to improve the mapping power of their PCM models for 11 histone deacetylases (HDACs) by using a PUK kernel. Nonetheless, the radial kernel remains a common option when training bioactivity models, as only one kernel parameter, i.e. σ, needs to be tuned, which in practice means shorter training times. Based on those results, users should keep in mind that although the radial kernel is a robust option with reliable results (in the experience of the authors), the kernel should be chosen on the basis of the data at hand.93
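The difference in tuning effort between the two kernels can be made concrete with a minimal sketch. It assumes the commonly used parametrisations: a Gaussian RBF kernel with a single width σ, and a Pearson VII kernel with a width σ and a tailing factor ω. The function names and the descriptor vectors are hypothetical and are not taken from the studies cited above.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two descriptor vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel, exp(-||x - y||^2 / (2 sigma^2)): one parameter to tune."""
    return math.exp(-squared_distance(x, y) / (2.0 * sigma ** 2))


def puk_kernel(x, y, sigma=1.0, omega=1.0):
    """Pearson VII universal kernel: two parameters (width sigma, tailing omega)."""
    dist = math.sqrt(squared_distance(x, y))
    scaled = 2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega


# Hypothetical concatenated compound-target descriptors for two pairs.
pair_a = [0.2, 1.0, 0.0, 0.5]
pair_b = [0.1, 0.8, 0.3, 0.5]

print(rbf_kernel(pair_a, pair_b, sigma=1.0))
print(puk_kernel(pair_a, pair_b, sigma=1.0, omega=0.5))
```

In this parametrisation the Pearson VII kernel equals 0.5 at a distance of σ/2 for any ω, so σ sets the peak width while ω changes the shape of the tails. This extra parameter is what makes the kernel more flexible and more expensive to tune than the RBF kernel.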
Dual Component SVMs (DC-SVM) are an extension of the classical SVM and have been applied by Niijima et al.94 to a kinase dataset spanning the whole kinome. They proposed a dual component naïve Bayesian model in which kinase–inhibitor pairs are represented by protein residues and ligand fragments that form dual components. Hence the probability of being active is simply estimated as the ratio of bioactivity values between active and inactive pairs. This method was further extended to SVMs by modifying a Tanimoto kernel to include compound fragments. PCM DC-SVMs outperformed ligand-based SVMs (QSAR) in internal validation, with accuracies of 90.9% and 86.2%, respectively. However, the same level of accuracy was not achieved on external datasets, where accuracies of 73.9% and 81.3% were obtained for DC-SVM and ligand-based SVM, respectively. Therefore, these results do not allow the conclusion that DC-SVMs outperform SVMs, although this might happen with other datasets.
A second type of SVM, the Transductive SVM (TSVM), has been applied to model 10 small (between ∼1000 and ∼3000 datapoints) and unbalanced QSAR datasets from the Directory of Useful Decoys (DUD)95 repository, with balanced accuracies more than 30% higher than those of SVM on some datasets.96 The concept relies on transduction, which allows the modelling of partially labeled data that cannot be included when using regular SVMs. TSVMs could potentially be extended to PCM and have been shown to outperform SVMs in some cases.97,98
A third flavor of SVM is the Relevance Vector Machine (RVM).99 The added value of RVMs is the interpretability of the models, which is a consequence of their Bayesian nature. Each descriptor is associated with a coefficient, which determines its relevance for the model. Coefficients associated with low-relevance descriptors are close to zero; hence the model becomes sparse, which permits shorter prediction times. Although the predicted variance is not informative in regression studies, class probabilities can be efficiently determined in classification.100 RVMs have been evaluated as binary classifiers trained on a subset of the MDDR database.100 Therein, it was demonstrated that RVMs performed on par with ‘classic’ SVMs, encouraging the authors to conclude that RVMs should be added to the current chemoinformatic tools and as such potentially applied to future PCM studies.
On the basis of the above, SVM constitutes a useful algorithm in which initial drawbacks such as limited interpretability (e.g. the determination of which chemical substructures most contribute to compound bioactivity) can be overcome with new developments (e.g. RVM).
Although RFs are highly interpretable, it should be noted that they do not output error estimates (as is also the case with SVMs); nevertheless, recent papers suggest that the variance of the predictions across the trees of a random forest model can be used to determine its applicability domain.102,103 Error estimates are of great importance given the high levels of noise and annotation errors in public bioactivity databases. Thus, fully informative predictions should be accompanied by individual uncertainties. This issue can be remedied by applying Quantile Regression Forests (QRF), which infer quantiles from the conditional distribution of the response variable.104 To our knowledge QRFs have not yet been applied to QSAR or PCM. Gaussian processes are a machine learning technique with inherent error estimation capabilities that has been used in PCM, as described below.
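The idea of using the spread of the individual trees as a per-prediction uncertainty can be sketched as follows. This is only an illustration of the tree-variance heuristic: it is not a Quantile Regression Forest, which derives quantiles from the training responses falling in the same leaves. The per-tree predictions are made-up values and the function name is hypothetical.

```python
import statistics


def summarize_tree_predictions(tree_predictions, lower=0.05, upper=0.95):
    """Summarise the predictions of the individual trees for one compound-target pair."""
    ordered = sorted(tree_predictions)
    n = len(ordered)

    def quantile(q):
        # Simple empirical quantile: pick the closest order statistic.
        index = min(n - 1, max(0, int(round(q * (n - 1)))))
        return ordered[index]

    return {
        "mean": statistics.mean(ordered),   # the usual forest prediction
        "std": statistics.stdev(ordered),   # spread across trees
        "lower": quantile(lower),
        "upper": quantile(upper),
    }


# Made-up per-tree predictions (e.g. pKi units) for two compound-target pairs.
trees_agree = [6.9, 7.0, 7.1, 7.0, 6.95, 7.05, 7.0, 6.9, 7.1, 7.0]
trees_disagree = [4.5, 8.2, 6.0, 7.7, 5.1, 8.8, 6.4, 5.5, 7.9, 4.9]

print(summarize_tree_predictions(trees_agree))
print(summarize_tree_predictions(trees_disagree))
```

A pair for which the trees disagree strongly would, under this heuristic, be flagged as lying further from the applicability domain of the model.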
Fig. 3 illustrates the basic idea underlying GP modelling. The prior probability distribution (Fig. 3A) covers all candidate functions that could model the data, each of which has a different weight determined by the kernel (covariance) parameters. Subsequently, only those functions from the prior distribution in agreement with the experimental data are kept (Fig. 3B). The mean of these functions is considered the best fit to the data. Given that each prediction is a Gaussian distribution, different confidence intervals can be defined from its variance (Fig. 3B).
Fig. 3 Illustrative example of GP theory in a two-dimensional problem. (A) The prior probability distribution embraces all possible functions which can potentially model the dataset. A subset of six prototypical functions is depicted. Normally, the mean of the distribution is set to zero (black dashed line). (B) The inclusion of bioactivity information (red dots) accompanied by its experimental uncertainty (blue error bars) updates the prior distribution into the posterior probability distribution. In the posterior probability distribution, only those functions in agreement with the experimental data are kept. The uncertainty (pink area) notably increases in those areas with little experimental information available. The mean of the posterior distribution (black dashed line) is considered the best fit to the data. A prototypical function from the posterior is shown in blue. For a new compound–target combination, the bioactivity is predicted as a Gaussian distribution, in which the mean is the best prediction and its variance the uncertainty. A radial-kernelled GP with σ = 1 was employed to generate the figure. The python infpy package helped to produce the plots.207 |
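The update from prior to posterior described above can be written down compactly for a one-dimensional toy problem. The sketch below assumes a zero-mean prior, a radial kernel with σ = 1, and a separate noise variance for each measurement to represent its experimental uncertainty. It uses the standard Cholesky-based GP regression equations in plain Python. All data values and function names are illustrative and not taken from the figure.

```python
import math


def rbf(x1, x2, sigma=1.0):
    """Radial kernel between two scalar inputs."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * sigma ** 2))


def cholesky(a):
    """Lower-triangular L such that L L^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def solve_lower(low, b):
    """Forward substitution for L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return y


def solve_upper_transposed(low, y):
    """Back substitution for L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def gp_predict(x_train, y_train, noise_var, x_new, sigma=1.0):
    """Posterior mean and variance at x_new for a zero-mean GP prior."""
    n = len(x_train)
    # Covariance of the training data; measurement noise goes on the diagonal.
    k = [[rbf(x_train[i], x_train[j], sigma) + (noise_var[i] if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    low = cholesky(k)
    alpha = solve_upper_transposed(low, solve_lower(low, y_train))
    k_star = [rbf(x_new, xi, sigma) for xi in x_train]
    mean = sum(ks * a for ks, a in zip(k_star, alpha))
    v = solve_lower(low, k_star)
    variance = rbf(x_new, x_new, sigma) - sum(vi * vi for vi in v)
    return mean, max(variance, 0.0)


# Made-up centred bioactivities with a small experimental variance each.
x_train = [-2.0, -1.0, 0.0, 1.5]
y_train = [0.5, 1.0, -0.3, 0.8]
noise_var = [0.01, 0.01, 0.01, 0.01]

print(gp_predict(x_train, y_train, noise_var, 0.0))   # close to the data
print(gp_predict(x_train, y_train, noise_var, 10.0))  # far from the data
```

Far from the training data the prediction reverts to the prior (mean 0, variance 1), which corresponds to the widening of the uncertainty band in regions with little experimental information.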
Gao et al.112 showed that SVMs performed, in general, slightly better than GPs when modelling a dataset composed of 128 ligands and 9 human amine GPCRs, although the models trained on the best combination of descriptors exhibited Q2 values of 0.744 and 0.742 for GP and SVM respectively. It is worth mentioning that the difference in performance between GP and SVM was assessed neither statistically nor by comparing the results of a series of models trained on different resamples of the whole dataset. Moreover, the error bars predicted by the GP PCM models were not considered. More recently, Cortes-Ciriano et al.75 showed the actual potential of GPs by applying both SVMs and GPs implemented with a panel of diverse kernels to multispecies PCM datasets, namely: human and rat adenosine receptors, mammalian GPCRs and Dengue virus proteases. GP and SVM performed comparably, as the absolute differences were not statistically significant. However, GP provided notable added value via: (i) the determination of the model applicability domain (AD), (ii) the probabilistic nature of the predictions, and (iii) the inclusion of the experimental uncertainty in the model.
In the experience of the authors regarding the application of GP in PCM,75 and in agreement with Schwaighofer et al.,113 the intervals of confidence (IC) calculated by GP are in accordance with the cumulative Gaussian distribution. Therefore, these intervals of confidence provide valuable information about individual prediction errors. In practice, knowing the error for each prediction can guide decision-making about which compounds should be tested in prospective experimental validation of in silico PCM models. Overall, GP appears to be an appealing approach for PCM in spite of the longer CPU time required for training, as GP is an algorithm of O(N³) time complexity (i.e., it scales with the third power of the size of the dataset).114
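One simple way to check whether predicted intervals agree with the cumulative Gaussian distribution is to compare the fraction of test-set values falling within k predicted standard deviations with the fraction expected for a Gaussian. The sketch below assumes that a predicted mean and standard deviation are available for each test compound–target pair. The numbers and function names are made up for illustration.

```python
import math


def expected_coverage(k):
    """Fraction of a Gaussian lying within mean +/- k standard deviations."""
    return math.erf(k / math.sqrt(2.0))


def observed_coverage(observed, predicted_mean, predicted_std, k):
    """Fraction of observed values inside the predicted +/- k std intervals."""
    inside = sum(
        1
        for y, m, s in zip(observed, predicted_mean, predicted_std)
        if abs(y - m) <= k * s
    )
    return inside / len(observed)


# Made-up test-set values (e.g. pKi units).
observed = [6.1, 7.4, 5.0, 8.2, 6.6]
predicted_mean = [6.0, 7.0, 5.5, 8.0, 6.5]
predicted_std = [0.3, 0.3, 0.3, 0.3, 0.3]

for k in (1.0, 2.0):
    print(k, expected_coverage(k), observed_coverage(observed, predicted_mean, predicted_std, k))
```

With a realistic test set, well-calibrated intervals would show observed fractions close to the expected ones (about 68% for k = 1 and 95% for k = 1.96). Five points, as used here, are far too few to draw such a conclusion.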
Gao et al.115 applied a CF approach to 93 cyclopamine derivatives and four cell lines (BxPC-3, NCI-H446, SW1990 and NCI-H157), and showed that collaborative filtering multi-target QSAR outperforms normal QSAR for their dataset. The mean Root-Mean Squared Error (RMSE) for the four cell lines was 0.65 log units for CF, whereas it increased to 0.85 log units for (single-target) SVR. The collaborative QSAR framework, combined with a feature selection methodology based on collaborative filtering and content-based recommender systems (a type of system used by electronic retailers and content providers such as Amazon.com),116 enabled the definition of weights for the compound descriptors (drug-like index). When interpreting their models the authors could determine that molecular volume, polarity, and the cyclic degree are the most influential compound features for multi-cell line inhibitors for this particular pathway (features which, from the chemical standpoint, would however sometimes be difficult to interpret structurally). Erhan et al.117 also used CF with a large library of compounds against a family of 12 related targets screened in AstraZeneca's HTS campaigns. The authors elegantly demonstrated how the principles of CF can be used to derive a predictive model with the capability to extrapolate on the target side. However, better results were obtained when using target descriptors (binding pocket fingerprints of 14 bins in this case, where each bin accounts for a type of interaction – ionic, polar, or hydrophobic – in the binding site). Another novelty of this work was the introduction of the kernel-based method Jrank (a kernel perceptron algorithm), which was able to outperform the multi-task neural network in most cases and never produced significantly worse models. Indeed, in 6 out of 7 cases, this kernel outperformed the random retrieval of compounds.
Moreover, the authors also noted that improvements are still possible since Jrank did not always outperform the single-target models.
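The core of matrix-factorization CF, in which each activity is calculated as the dot product of inferred compound and target features, can be illustrated with a minimal sketch. It assumes a plain stochastic gradient descent fit with L2 regularisation on a small, made-up compound × cell line matrix with one missing entry. This is not the specific algorithm of the studies discussed above, and all names and values are hypothetical.

```python
import math
import random


def factorize(observed, n_compounds, n_targets, rank=2, lr=0.05, reg=0.01,
              epochs=500, seed=0):
    """Fit latent compound (p) and target (q) features to the observed entries."""
    rng = random.Random(seed)
    mean = sum(observed.values()) / len(observed)
    p = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_compounds)]
    q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_targets)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            pred = mean + sum(a * b for a, b in zip(p[i], q[j]))
            err = value - pred
            for f in range(rank):
                p_if, q_jf = p[i][f], q[j][f]
                p[i][f] += lr * (err * q_jf - reg * p_if)
                q[j][f] += lr * (err * p_if - reg * q_jf)
    return mean, p, q


def predict(model, i, j):
    """Activity of compound i on target j: global mean plus latent dot product."""
    mean, p, q = model
    return mean + sum(a * b for a, b in zip(p[i], q[j]))


def rmse(model, observed):
    """Root-mean-squared error on the observed entries."""
    return math.sqrt(
        sum((v - predict(model, i, j)) ** 2 for (i, j), v in observed.items())
        / len(observed)
    )


# Made-up activities (log units) for 4 compounds on 3 cell lines; (3, 2) is missing.
observed = {
    (0, 0): 6.2, (0, 1): 5.9, (0, 2): 6.5,
    (1, 0): 5.1, (1, 1): 4.8, (1, 2): 5.4,
    (2, 0): 7.0, (2, 1): 6.6, (2, 2): 7.3,
    (3, 0): 5.6, (3, 1): 5.3,
}

untrained = factorize(observed, 4, 3, epochs=0)
model = factorize(observed, 4, 3)
print(rmse(untrained, observed), rmse(model, observed))
print(predict(model, 3, 2))  # prediction for the missing entry
```

The inferred rows of `p` and `q` play the role of learned descriptors. This is why such features can be reused in an activity model, but also why they are harder to interpret than explicit chemical or binding-site descriptors.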
The overview presented above shows that PCM heavily draws on recent developments in the machine-learning field. However, given that the methods used are only the means to an end, we will in the following also summarize PCM applications in the medicinal chemistry and chemical biology fields, to different target classes as well as different types of biological readout.
Overall, PCM models trained on GPCR binding site amino acid descriptors have proven to be a powerful approach to identify the GPCR targets for a given compound, and to predict ligands for orphan GPCRs. The increasing availability of bioactivity data on GPCRs of interest and orthologous sequences,75 as well as the development of novel methodologies to assess GPCR similarity, is likely to increase the application of PCM on this target family in drug discovery campaigns.
In a recent study by Cao et al.,126 the full kinase sequence space was described by alignment-independent ‘Composition, Transition and Distribution’ (CTD) features,127 along with topological features of compounds. The dataset comprised a total of 54012 kinase–compound interactions, with data from 22229 compounds and 372 kinases. The best RF model exhibited a classification accuracy in five-fold cross-validation of 93.7%, and a sensitivity of 92.26%. Moreover, this high predictive power was maintained in the four validation levels suggested by Park and Marcotte,77 as the following accuracies and sensitivities (respectively and in percentage units) were obtained: (i) L1: 93.15 and 91.23; (ii) L2: 89.53 and 88.24; (iii) L3: 90.71 and 89.48; and (iv) L4: 87.30 and 85.82. Hence the statistical soundness of this PCM model enabled the classification of compound–kinase pairs as interacting, using a 100 nM concentration as cut-off, or non-interacting. The high predictive ability of the models should nevertheless be considered with caution, as the degree of completeness of the bioactivity matrix used in the training was only 0.65%. Therefore, these PCM models should be iteratively updated as more bioactivity values become available. Interestingly, kinases similar in the sequence space exhibited high dissimilarity when their similarity was assessed on the basis of inhibitor bioactivities. This was assessed using 120 kinases with more than 15 bioactivity annotations, 14400 datapoints in total. Thus, these data highlight the adequacy of considering both chemical and target space to optimize kinase inhibitors.
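The quoted degree of completeness follows directly from the dataset dimensions given above, taking completeness to be the number of measured compound–kinase pairs divided by the number of possible pairs. The function name is illustrative.

```python
def matrix_completeness(n_measured, n_compounds, n_targets):
    """Fraction of the compound x target matrix for which a measurement exists."""
    return n_measured / (n_compounds * n_targets)


# Dimensions quoted in the text: 54012 interactions, 22229 compounds, 372 kinases.
completeness = matrix_completeness(54012, 22229, 372)
print(f"{100 * completeness:.2f}%")  # prints 0.65%
```

In other words, more than 99% of the possible compound–kinase pairs carry no measurement, which is why the reported performance should be read with the caution noted above.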
While high affinity is generally desired for drugs (except possibly in the case of multicomponent therapeutics),128 selectivity is equally important when targeting a protein family with highly similar binding sites, such as, in this case, kinases. Subramanian et al.66 applied PCM models to a kinase dataset comprising 50 different proteins in the DFG-in conformation to better understand which residue features of the kinase ATP-binding site and which compound features are involved in compound binding. The resulting PLS models, which included cross-terms (see Section 2.3), demonstrated the added value of PCM over ligand-based approaches, as statistically satisfactory QSAR models were reported for only 44% of the targets. More importantly, the models could be visually interpreted, thus enhancing the practical usefulness of PCM for the optimization of compound selectivity. (Further details on the study are given in Section 4.4, as the targets in these models were encoded with 3-dimensional information.)
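As a reminder of what cross-terms add to a linear method such as PLS, the sketch below builds the input vector for one compound–target pair. It assumes the common definition of cross-terms as pairwise products of compound and target descriptors. The descriptor values and the function name are hypothetical, and the exact construction used in the cited study may differ.

```python
def pcm_descriptor(compound, target, cross_terms=True):
    """Concatenate compound and target descriptors, optionally with cross-terms."""
    features = list(compound) + list(target)
    if cross_terms:
        # One product per (compound descriptor, target descriptor) combination.
        features += [c * t for c in compound for t in target]
    return features


compound = [1.0, 3.0, 2.0]  # hypothetical compound descriptors
target = [0.5, -1.0]        # hypothetical binding-site descriptors

print(pcm_descriptor(compound, target))
```

Without the product terms, a linear model can only add a compound contribution to a target contribution. The cross-terms let it represent that a given compound feature matters for some binding sites but not for others, which is the information relevant to selectivity.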
The distinction between Type I and Type II inhibitors has been shown to be amenable to PCM by Mendez-Lucio et al.129 In order to distinguish between Type I and Type II inhibitors, the authors trained a PCM model on a dataset consisting of 463 data points from the interaction matrix defined by 50 known kinase Type I (ATP-competitive) inhibitors against 12 different sequences of ABL1 (five of them) in both the phosphorylated and non-phosphorylated state.130 The model exhibited sound predictive ability, assessed by cross-validation, with RMSE and Q2 values of 0.420 and 0.887 respectively. In addition, the model allowed the full interpretation of both compound (inhibitor) and protein (kinase) features. Hence, along with the prediction of pKd, a PCM model can provide information about the effect of both compound structural features and protein amino acid residues.131–133 The importance of a given compound substructure, or a given amino acid residue, can be evaluated by calculating the difference between the bioactivity predicted for a compound with and without that substructure.75 Fig. 4 displays how this information can be presented in practice and shows the average effect (over the whole data set) of the presence of a number of features on the pKd of inhibitor–kinase pairs.
Fig. 4 The effect of presence of compound and amino acid features on bioactivity. (A) Bar plot showing the features of kinase Type I inhibitors and amino acids that affect the pKd value. For this model, the electronic properties related to amino acid 315 and 317 have large impact on pKd (shown as green bars), because of their relevance to enzyme–ligand interactions. (B) Kinase inhibitors containing the highlighted compound features responsible for change in pKd value. The presence of ECFP4_7, ECFP4_34, ECFP4_57 and ECFP4_124 increase the activity, whereas ECFP4_24, ECFP4_41 and ECFP4_120 decrease it.130 |
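The with/without comparison used to assess feature importance can be sketched in a few lines. The example assumes binary (presence/absence) features and an already trained model exposed as a `predict` function. Here a toy linear function stands in for the PCM model, and all names, weights and descriptors are hypothetical.

```python
def feature_effect(predict, descriptors, feature_index):
    """Average change in predicted pKd when a binary feature is present vs absent."""
    deltas = []
    for descriptor in descriptors:
        with_feature = list(descriptor)
        with_feature[feature_index] = 1
        without_feature = list(descriptor)
        without_feature[feature_index] = 0
        deltas.append(predict(with_feature) - predict(without_feature))
    return sum(deltas) / len(deltas)


# Toy stand-in for a trained PCM model (weights are made up).
WEIGHTS = [0.8, -0.4, 0.1]


def toy_model(descriptor):
    return 5.0 + sum(w * x for w, x in zip(WEIGHTS, descriptor))


# Hypothetical binary descriptors for three inhibitor-kinase pairs.
pairs = [[1, 0, 1], [0, 1, 1], [1, 1, 0]]

for index in range(3):
    print(index, feature_effect(toy_model, pairs, index))
```

For a linear stand-in the effect simply recovers the weight of each feature. With a non-linear model the difference depends on the rest of the descriptor, which is why averaging over the whole data set is informative.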
As shown by these recent PCM studies on the kinase superfamily, PCM can support new concepts for kinase inhibition implicating the simultaneous interaction of kinase inhibitors with several targets leading to multi-target kinase chemotherapy.129,134 Therefore, PCM constitutes a suitable technique to help in the design of kinase inhibitors with respect to their potency and selectivity (Fig. 4).129
Recently, Wu and co-workers utilized structural similarity between three classes of HDACs and generated a predictive model for a novel candidate anti-tumour drug.92 They implemented various descriptors (physicochemical properties) and similarity descriptors (sequence and structure) of compounds and targets in the PCM model and successfully identified class-selective inhibitors for class-I and class-II HDACs. The best model exhibited high predictive ability, as the authors reported a Q2 value on the external set of 0.754. Overall, the increasing importance of epigenetic targets in drug discovery, as well as the availability of large-scale resources of epigenetic targets and their modulators,143,144 will facilitate the application of PCM to this target family.
Next to these applications, PCM has been used to model the sensitivity of viral mutants to antiretroviral drugs, which could potentially guide HIV treatment.145 Resistance testing and prediction using these models is achieved by incorporating genotypic (protein) and drug (chemical) data and subsequently linking them to phenotypic data (resistance). PCM then allows the prediction of optimal treatment regimens. The advantage of PCM over established sequence-based approaches is that interpretation of a single model allows the combined elucidation of residues responsible for the change in efficacy and the complementary chemical features affected.146–149 For instance, van Westen et al.145 trained PCM models based on a large clinical dataset composed of circa 300000 datapoints combining both phenotypic and genotypic data. The application of PCM enabled the integration of the similarity of marketed drugs together with protein sequence similarity. The best model exhibited a fold change error of 0.76 log units, which constitutes an improvement of 0.15 log units with respect to previously reported models trained on only protein sequence similarity (0.91 log fold change error). In addition, the authors identified novel mutations of both HIV reverse transcriptase and HIV protease conferring drug resistance, underlining the ability of PCM models not only to model bioactivity information, but to also learn about features relevant for activity from both the ligand and the protein target side.
Similarly, drug susceptibility profiles were predicted based on PCM. Two models have been reported for the prediction of: (i) the susceptibility (bioactivity profile) of a given HIV protease genotype to seven commonly used protease inhibitors;146 and (ii) the susceptibility of HIV reverse transcriptase to eight nucleoside/nucleotide reverse transcriptase inhibitors.149 PCM models were trained on 4792 HIV protease–inhibitor combinations, with the best model achieving a Q2 value of 0.87 on the external set. These models have been made publicly available as web services at http://www.hivdrc.org/services, allowing free use of these algorithms.150
While the ligands of most PCM studies discussed here were small molecules, protease peptide substrates are also amenable to PCM. This has been demonstrated recently by Prusis et al.,151,152 who used PCM modelling to study the enzyme kinetic parameters of designed small peptide substrates on four dengue virus NS3 proteases. It was found that the PCM models for Km and kcat were significantly different. Therefore, by optimizing the peptide amino acid properties important for Km, it was possible to improve the affinity of the peptides for the protease while losing their catalytic activity, thereby obtaining peptides that were dengue protease inhibitors.
These studies by Prusis et al. and van Westen et al. are some of the few reports in which predictions have been validated prospectively, demonstrating the predictive power of PCM in different scenarios.
The binding site-focused techniques used in the studies described above allowed for the identification of orthosteric and allosteric sites on the same target for different ligand families. Along these lines, Gao et al.93 showed the higher predictive ability of models trained on trans-membrane identity descriptors (Q2 = 0.74) over Z-scales (Q2 = 0.72) when modelling the inhibition constant of 9 human aminergic GPCRs and 128 ligands (310 ligand–target combinations). Similarly, Shiraishi et al.158 revealed specific chemical substructures binding to relevant TM pocket residues, which is not only relevant to mutational analysis but also serves as a complementary approach to Structure-Based Drug Discovery (SBDD).62,158 TM identity descriptors and TM kernels are more discriminating than Z-scales for GPCRs and allow identification and interpretation of GPCR residues associated with binding of ligands (of a particular chemotype). Therefore, the identification of chemical moieties and residues involved in ligand binding enables the development and optimization of GPCR inhibitors with respect to both potency and selectivity.
Jacob et al.118 found no improvement through the use of 3-D information. In this study an analysis of 2446 ligands interacting with 80 human GPCRs was performed using a linear vector representing conserved amino acids in the binding pockets. While the binding pocket kernel implicitly encodes 3-D information, the spatial arrangements were derived from the comparison to only two template proteins. Overall, the 3-D kernels (∼77% prediction accuracy) did not show improvements compared to lower dimensional protein descriptions (∼77% prediction accuracy with a protein similarity kernel). Likewise Wassermann et al.159 found little improvement using 3-D information in their analysis of interactions of 12 proteases with 1359 ligands using the TopMatch similarity score,160 which used all amino acids within 8 Å around the catalytic residues to describe the target proteins. This 3-D description did not perform better (∼61% recovery rate) than the sequence (∼57%) and protein class-based (∼62%) kernels used in this publication.
Conversely, early work by Strömbergsson et al.161 used local protein substructures, encoded as motifs of stretches of 5 amino acids that are closer than 6.5 Å to each other. For a set of 104 enzymes, this local substructure method showed an improvement over the use of global SCOP (Structural Classification of Proteins) folds, as the RMSE values on the external validation set decreased from 2.06 to 1.44 pKi units. Additionally, it was found that local substructures close to the ligand binding sites were assigned more importance in the models than more distant ones, which is intuitively understandable. Similarly, Meslamani and Rognan did find an improvement by using 3-D information.60 In their study, 581 diverse proteins were described by the 3-D cavity descriptor FuzCav,65 which is a vector of 4834 integers reporting counts of pharmacophoric feature triplets mapped to Cα-atoms of binding site-lining residues. The use of cavity 3-D kernels showed a clear advantage (F-measure 0.66) over sequence-based descriptions (F-measure 0.54) in predicting target–ligand pairings for a large external test set (>14000 ligands, 531 targets), especially in local models. This difference seems to be even more pronounced for datasets with limited ligand data (<50 ligands). Likewise, a recent study by Subramanian et al.66 described the superimposed binding sites of 50 (unique) kinases by molecular interaction fields derived from knowledge-based potentials and Schrödinger's WaterMaps.162,163 In this example, too, a significant improvement for 3-D methods (r2 = 0.66, q2 = 0.44) compared to sequence-based methods (r2 = 0.50, q2 = 0.34) was reported. Additionally, this combination of methods allows interpretation and easy visualization of PCM results within the context of ligands and binding pockets.
Earlier studies have not clearly shown the advantages of 3D PCM over solely sequence-based approaches, whereas more recent studies show that including 3D information appears to improve performance. The particular data set used (e.g. number of ligands) and the quality of the data provided likely determine whether this type of description brings a gain. However, the constantly increasing number of protein structures, more robust alignment-free methods (e.g. Nisius and Gohlke164 or Andersson et al.81), and the introduction of protein descriptors with easier interpretability (e.g. Desaphy et al.165) might help the interpretation and the visualization of PCM models in the future.
Kramer et al.169 demonstrate this concept by building a structure-based PCM scoring function. Their method trains a bagged stepwise multiple linear regression model on a subset of 1387 protein–ligand complexes extracted from the PDBbind09-CN database.170 Subsequently a new compound–target interaction descriptor based upon distance-binned Crippen-like atom type pairs was introduced. The best model outperformed commercially available scoring functions assessed on the PDBbind09 database and was able to explain 48% of the variance of the external set, with an RMSE of 1.44. Although similar methods had been previously proposed,171–175 this was the first study in which a sufficiently large validation was carried out to ascertain the model's predictive power. Additionally, the implementation of bagged stepwise multiple linear regression (MLR) and PLS enabled the evaluation of the importance of ligand and target descriptors for the PCM model.
Similarly, a subsequent study reported the development of a scoring function based upon the CSAR-NRC HiQ benchmark dataset (http://csardock.org).176 The best model exhibited acceptable statistics with a cross-validated R2 = 0.55 and RMSE = 1.49.176 Finally, Koppisetty et al.177 were able to predict for the first time ligand binding free energies where the enthalpic and entropic contributions for a given binding event were deconvoluted. Therein, the authors demonstrated the importance of including ligand descriptors (QIKPROP and LIGPARSE calculated in Schrödinger suite)178 to the models in addition to 3-dimensional ligand–protein interaction descriptors.
As demonstrated above, PCM overlaps with methods that originate from the structure-based field, since PCM in principle describes any method that relates ligand features and protein/target features on a large scale to an output variable of interest. Another source of complementary information is that from divergent and convergent homologous sequences. This allows PCM models to extrapolate the bioactivity of ligands to the same protein target in different species, as shown below.
This has also been shown to be true for the affinities of ligands binding to these orthologues by analyzing bioactivity data. In a recent study, Kruger et al.21 demonstrated that the same small molecule exhibits similar binding affinities when acting on orthologues (though some exceptions were found, e.g. Histamine H3 receptor). Moreover, the authors verified that larger differences in binding affinity are observed for paralogues than for orthologues by analyzing the differences in binding for a total of 20309 compounds on 516 human targets, with 651 being the final number of orthologous pairs. These observations aid in optimizing ligands for their interaction with conserved residues across a given protein family, thus making them more desirable lead compounds (and avoiding their interaction with unrelated targets).180
In the field of PCM, Lapinsh et al.37 demonstrated for the first time the capability of PCM to successfully combine the pKi values of 23 organic compounds on 17 human (paralogues) and 4 rat (orthologues) amine GPCRs. The authors were able to deconvolute the binding site interactions into two types, namely: those involved in specificity and those involved in affinity. Therefore, compound design can be envisioned from the viewpoint of affinity or specificity. Similarly, the contribution of TM regions involved in the interactions of amine GPCRs and compounds to compound affinity was also quantified. For example, TM regions 2, 3, 4, 6 and 7 are responsible for low overall affinity in β2 receptors; however, the same regions are positive contributors to overall high affinity in α1a receptors. van Westen et al.22 built on this by including in a PCM model bioactivity data from four human and rat adenosine receptors (A1, A2A, A2B and A3). The authors screened a commercial chemolibrary composed of 791162 compounds with the most predictive PCM model obtained, which exhibited Q2 and RMSE values of 0.73 and 0.61 pKi units, respectively. Prospective experimental validation led to the discovery of new high-affinity inhibitors, among which a compound with a pKi value of 8.1 on the A1 receptor. Finally, the authors have applied PCM to model the pIC50 value of 3228 distinct compounds on 11 mammalian cyclooxygenases (COX) using ensemble PCM.55 The final ensemble PCM model, trained on the cross-validation predictions of a panel of 282 RF, SVM and Gradient Boosting Machine (GBM) models, each trained with different values of the hyperparameters, led to predictions on the test set with RMSE and R02 values of 0.71 and 0.65, respectively. Additionally, the description of compounds with unhashed Morgan fingerprints permitted a chemically meaningful model interpretation, which highlighted chemical moieties responsible for selectivity towards COX-2 in agreement with the literature.55
The ability of PCM to embrace multispecies information using sequence descriptors allows the creation of models capable of predicting compound activity on targets with few available data points on the human orthologue. The existing large body of bioactivity data collected on organisms other than human (e.g. rat and mouse) provides a good resource. These data were derived from the traditional usage of rodent tissues as a source of proteins for biochemical and pharmacological assays. Moreover, the difference in bioactivity between a compound acting on its human target with respect to its orthologue in another species (e.g. the CCR1 antagonist BX471) hampers the utilization of animal models to study human diseases at a molecular level.181 Thus, PCM can help not only to reduce the number of experiments required to complete the compound–target interaction matrix,29 but also appears to be a practical tool to understand complex diseases in scenarios where current experimental settings are insufficient (e.g. undeveloped enzymatic assays for a given protein). Similarly, PCM might be applied as a supporting tool in allometric scaling to predict the behavior of clinical candidate drugs in humans.182,183 Nonetheless, the extrapolation capabilities of PCM models are subject to the completeness of the bioactivity matrix (Fig. 1). In practice, even though high performance can be attained with a matrix completeness level below 3%, the variability of the chemical space plays a key role in determining the extrapolation capability of a PCM model on the chemical side.75 Therefore, a balance has to be found between the coverage of chemical and target space, and the degree of completeness of the bioactivity matrix.
The availability of pharmacogenomics and toxicogenomics data has enabled predictive modelling of cancer cell line sensitivity. These models consider as the dependent variable the response of a whole cell to a given drug, for example in the form of EC50 values, i.e. the concentration at which a chemical exerts half of its maximal effect. Therefore, the ‘target’ component in the PCM model is no longer a single protein described in terms of binding site properties, but is instead described by more complex (usually genomic) features such as oncogene mutations, cell karyotypes or gene expression levels.
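The definition of EC50 can be illustrated with a dose-response curve. The sketch below assumes a Hill-type sigmoidal model, which is a common choice for fitting such data but is an assumption here and not something stated in the studies discussed. The parameter values and the function name are made up.

```python
def hill_response(concentration, ec50, max_effect=1.0, hill_slope=1.0):
    """Effect of a compound at a given concentration under a Hill-type model."""
    c = concentration ** hill_slope
    return max_effect * c / (ec50 ** hill_slope + c)


ec50 = 2.0  # hypothetical EC50, in arbitrary concentration units
for concentration in (0.2, 2.0, 20.0):
    print(concentration, hill_response(concentration, ec50))
```

Whatever the slope, the modelled effect at a concentration equal to the EC50 is half of the maximal effect. This is the property that makes the EC50 a convenient single-number summary of a cell line's sensitivity.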
In the context of human cell lines, the work on the NCI-60 cell line panel, which covers cells from 9 different cancer types, has helped to find novel molecular determinants of drug sensitivity, as well as to develop drugs targeting specific tumor types (disease-oriented); e.g. 9-Cl-2-methylellipticinium acetate for central nervous system tumours.184 However, the number of cancer cell lines with drug sensitivity data has vastly increased with the release in 2012 of two major cancer cell line panels, namely: the Cancer Cell Line Encyclopedia (CCLE) consisting of 947 cancer cell lines185 and the Genomics of Drug Sensitivity in Cancer (GDSC) consisting of 727 cancer cell lines.186 The setup of both cell line collections, which share a total of 471 cell lines, enabled their large-scale pharmacological profiling. In that way, Barretina et al.185 measured the chemotherapeutic effect of 24 drugs on the CCLE panel, while Garnett et al.186 tested 130 chemical compounds on the GDSC cell line collection. In both cases, the cell lines were further characterized genomically, by measuring gene expression data, chromosomal copy numbers, oncogene mutations, and microsatellite instability. Recently, Basu et al.187 measured the sensitivity of 242 cell lines from the CCLE panel to an Informer Set composed of 354 diverse molecules, including 54 clinical candidates and 35 FDA-approved drugs. The sensitivity data are publicly available at the Cancer Therapeutics Response Portal (CTRP).188
The availability of public bioactivity profiles for compounds, in combination with detailed genetic information on the cell lines, constitutes a scenario where ML can be applied for predictive cell line sensitivity modelling. In this area, Menden et al.29 exploited cell line drug sensitivity information from the GDSC and combined genomic features with chemical descriptors in non-parametric models, i.e. neural networks and random forests. These models allowed Menden et al. to predict the missing drug response (IC50) values in the original cell line–compound matrix. The best model predicted the sensitivity on the external (blind) test set with a correlation between observed and predicted values of 0.64, while a value of 0.61 was obtained when predicting the response on a tissue unseen by the model in the training phase. Recently, we have integrated PCM random forest models with conformal prediction for the large-scale prediction of cancer cell line sensitivity with error bars.17,189 Compounds were described with Morgan fingerprints, whereas a total of 16 cell line profiling datasets were benchmarked for their predictive signal. Gene expression data consistently led to the highest predictive power. Interestingly, we found statistically significant differences in predictive power between PCM models trained on cell line identity fingerprints (inductive transfer of knowledge between cell lines)190 and those trained on cell line profiling data, suggesting that the explicit inclusion of cell line information improves the prediction of cell line sensitivity. Of practical relevance, the predicted bioactivities enabled the prediction of growth inhibition patterns on the NCI60 panel and the identification of genomic markers of drug sensitivity.
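To illustrate how error bars can be attached to point predictions, the following is a minimal sketch of split (inductive) conformal regression. It is not the implementation used in the cited studies: the absolute-residual nonconformity score, the function names and the toy residuals are assumptions made for illustration. Any regressor, for instance a random forest, could supply the point predictions and the residuals on a held-out calibration set.

```python
import math


def conformal_half_width(calibration_residuals, confidence=0.8):
    """Half-width of a prediction interval from calibration residuals.

    Uses the absolute residual (observed minus predicted) as the
    nonconformity score, an assumption of this sketch.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    scores = sorted(abs(r) for r in calibration_residuals)
    n = len(scores)
    # Rank of the score that bounds the interval at this confidence level.
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        # Too few calibration points to guarantee this confidence level.
        return math.inf
    return scores[rank - 1]


def predict_with_error_bars(point_prediction, half_width):
    """Symmetric interval around a point prediction."""
    return (point_prediction - half_width, point_prediction + half_width)


# Toy residuals (observed minus predicted) on a held-out calibration set.
residuals = [0.1, -0.4, 0.2, -0.3, 0.5, 0.05, -0.25, 0.35, 0.15]
width_80 = conformal_half_width(residuals, confidence=0.8)
interval = predict_with_error_bars(6.0, width_80)
```

With only nine calibration residuals, a 95% confidence level cannot be guaranteed, so the sketch returns an infinite half-width in that case rather than an overconfident interval.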
The cancer cell line collections described above remain to be fully exploited. While they constitute a great opportunity for PCM to integrate both drug sensitivity and genomics data in single models, this data integration remains challenging owing to the disagreement between the drug sensitivity measurements of the CCLE and the GDSC.191,192 Overall, the principles of PCM, namely the combination of chemical and cell line (target) information in single machine learning models, are suited to integrate and exploit the increasing availability of drug sensitivity measurements on cancer cell line panels. The application of PCM in pharmacogenomics is a recent sub-field that the authors are confident will grow in the near future. Moreover, in silico drug sensitivity prediction is a cost-efficient method capable of relating large-scale pharmacogenomics data to drug response, which is likely to foster the identification of chemotherapeutic lead compounds in both academic and pharmaceutical cancer drug discovery pipelines.
As described in this review, a large variety of protein targets have been modelled using PCM. Beyond the modelling of the activity of compounds on targets of diverse nature, the interaction between nucleic acids and proteins is also amenable to PCM modelling. In this context, Bellucci et al. predicted protein–RNA interactions based upon the physicochemical properties of both the polypeptide and the nucleotide chains.202 However, few studies have been published in this area to date.50,202
In addition to being informative for biologists, these confidence intervals constitute a valuable source of information about the applicability domain (AD) of a given model.75 The AD is defined as the amount of ligand and target space to which a given model can be reliably applied. Thus, in addition to the model validation schemes presented above, an estimation of model AD should accompany any reported model in order to be of practical usefulness.
Another limitation that is often inherent to bioactivity data is data skewness. Some datasets mostly report active203 or inactive molecules,204 and thus compound–target combinations untested experimentally are normally considered as inactive or active interactions, respectively. Moreover, public data in general tend to favour a relatively small number of protein classes that have been extensively explored (e.g. GPCRs and kinases).23–25,205 As such, for some targets the available data might not be sufficient for PCM projects, given that imbalanced datasets can lead to models with high false negative or false positive rates. Nevertheless, the modelling of cell line sensitivity has shown that PCM displays high interpolation power, as the accuracy of prediction reached a plateau when 20% of the whole compound–cell line matrix was included in the training set.29
Beyond the quality of the data, descriptor choice still constitutes a field of active research, especially with respect to protein descriptors, whose development will deeply influence the success of PCM in the coming years.45 A recent paper by Brown et al.190 suggested that PCM mostly relies on inductive transfer of knowledge and that protein descriptors mostly act as labels and do not account for structural differences among targets. However, we have recently shown that both amino acid descriptors and cell line profiling datasets account for structural information of eukaryotic, mammalian and bacterial DHFR, and of cancer cell lines, where the difference in performance on the test set between inductive transfer and PCM models was statistically significant.17,56
PCM requires the concatenation of ligand and target descriptors, and sometimes also cross-terms, which substantially increases the dimensionality of the input space with respect to QSAR. Although this higher dimensionality might lead to overfitting in PCM,206 in practice, PCM has been shown to exhibit higher predictive power on the test set than QSAR.22,26,75
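The growth in dimensionality described above can be sketched as follows. The function name, the use of pairwise products as cross-terms and the toy descriptor values are illustrative assumptions; published PCM studies differ in how, and whether, cross-terms are computed.

```python
def pcm_descriptor(ligand, target, cross_terms=False):
    """Build one PCM input vector for a ligand-target pair.

    ligand and target are sequences of numeric descriptors. With
    cross_terms=True, every pairwise product of a ligand descriptor
    and a target descriptor is appended (one possible definition).
    """
    ligand = list(ligand)
    target = list(target)
    vector = ligand + target
    if cross_terms:
        vector += [l * t for l in ligand for t in target]
    return vector


# Toy descriptors: 3 ligand features and 2 target features.
ligand_features = [1.0, 0.0, 2.0]
target_features = [0.5, 3.0]

plain = pcm_descriptor(ligand_features, target_features)
with_cross = pcm_descriptor(ligand_features, target_features, cross_terms=True)
```

For n ligand and m target descriptors, the plain vector has n + m entries, whereas adding all pairwise cross-terms yields n + m + n × m entries, which is why cross-terms are often restricted or subjected to feature selection in practice.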
Overall, the ability of PCM to become a customary technique in both the public and the private domain in the following years will certainly rest on its capability to capitalize on biological data of diverse nature, including personalized ‘omics’ data (personalized medicine), in combination with structural data of ligands, be those small molecules, antibodies or peptides.
3D | 3-Dimensional |
CF | Collaborative Filtering |
GP | Gaussian Process |
GPCR | G Protein-Coupled Receptor |
IC50 | Half Maximal Inhibitory Concentration |
Kd | Dissociation constant |
Ki | Inhibition constant |
PCM | Proteochemometric(s) |
PLS | Partial Least Squares |
QSAR | Quantitative Structure–Activity Relationship |
R&D | Research and Development |
RF | Random Forests |
SVM | Support Vector Machines |
TM | Trans Membrane |
Footnote |
† Authors contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2015 |