Polypharmacology modelling using proteochemometrics (PCM): recent methodological developments, applications to target families, and future prospects

Proteochemometric (PCM) modelling is a computational method to model the bioactivity of multiple ligands against multiple related protein targets simultaneously. Hence it has been found to be particularly useful when exploring the selectivity and promiscuity of ligands on different proteins. In this review, we will firstly provide a brief introduction to the main concepts of PCM for readers new to the field. The next part focuses on recent technical advances, including the application of support vector machines (SVMs) using different kernel functions, random forests, Gaussian processes and collaborative filtering. The subsequent section will then describe some novel practical applications of PCM in the medicinal chemistry field, including studies on GPCRs, kinases, viral proteins (e.g. from HIV) and epigenetic targets such as histone deacetylases. Finally, we will conclude by summarizing novel developments in PCM, which we expect to gain further importance in the future. These developments include adding three-dimensional protein target information, application of PCM to the prediction of binding energies, and application of the concept in the fields of pharmacogenomics and toxicogenomics. This review is an update to a related publication in 2011 and it mainly focuses on developments in the field since then.


Introduction
1.1 Available bioactivity data is growing: but can we make sense of it?
The cost of developing new drugs has been continuously increasing in recent years and it is now estimated to be on the order of $1.8 billion per drug. In addition, price pressure from health care providers has been increasing and there is a growing relevance of more targeted medicine. Hence, the 'blockbuster model' of the pharmaceutical industry is being challenged. 1,2 However, at the same time the amount of bioactivity data available both inside companies and in the public domain has significantly increased, for example with the introduction of ChEMBL and PubChem BioAssay. 3,4 This trend can be expected to only pick up further speed in the future. 3 The question now arises how this growing amount of bioactivity data can be used in real-world drug discovery and chemical biology projects, both to make drug discovery in commercial settings more efficient and to understand on a more fundamental level how we can use data in order to design a ligand with desired properties in a biological system.
Predictive bioactivity methods, such as Quantitative Structure-Activity Relationship (QSAR) models, are based upon the compound similarity principle. 5,6 However, it has been shown that the activity of a compound against a single target is not sufficient to understand its actions in a biological system. In fact, promiscuity is intrinsic to chemical compounds, 7,8 bioactivity against related targets frequently needs to be considered for the efficacy of e.g. CNS-active drugs and anti-cancer drugs, 9,10 and promiscuity has been used to anticipate side-effects. 11 Hence, only the simultaneous modelling of both the chemical and the target domain, across a series of protein targets, permits the meaningful mining of the compound-target interaction space. 12 The term chemogenomics comprises techniques capable of capitalizing on this huge amount of bioactivity data by considering compound and target information, in order to find unknown interactions between (new) compounds and their (new) targets. 13,14 Proteochemometric (PCM) modelling describes methods where a computational description from the ligand side of the system is combined with a description of the biological side being studied and both are related to a particular readout of interest. 15,16 In this context, ligands are typically small molecules, although biologics have also been explored. Conversely, the biological parameters in the model can comprise protein binding sites, but also e.g. gene expression levels of particular cell lines. The readout describes the biological effect of a particular ligand on the protein or cell line of interest (such as an IC50 value of this particular combination of compound and biological system). Additionally, PCM relates to personalized medicine as it can predict the effect of a ligand on a complex biological system, e.g. a cell line, from genotypic information. 17

Synergy between ligand and target space
An analysis of the drug-target interaction network demonstrated that a given ligand interacts with six protein targets on average at therapeutic concentrations. 7 Targets with correlated bioactivity profiles might be related or distant from a sequence similarity standpoint. It has been recently shown that for class A GPCRs protein classification based on ligand activity differs considerably from the classic description of proteins based upon sequence alignments. 18,19 Hence, full sequence similarity from multiple sequence alignments would not generally correlate with similar ligand affinity. Conversely, kinases exhibiting a sequence identity higher than 60% tend to have similar ATP-binding sites and hence they tend to be inhibited by similar compounds. 20 Similarly, compound binding is more conserved between human and rat orthologous proteins than between paralogues. 21,22 Thus, to better understand intra-family and inter-species selectivity both the target and the compound space need to be considered simultaneously.
Vigneshwari Subramanian studied Bioinformatics at the University of Helsinki, Finland and is currently doing her PhD in Computational Drug Discovery in the same university. Her research focuses on proteochemometric modelling involving 3D protein field-based descriptors.
Eelke B. Lenselink is currently pursuing his PhD at the LACDR in Leiden, where he focuses on ligand- and structure-based design for GPCRs.
In ligand space, chemogenomic approaches relying only on ligand data have shown that there is an unequal distribution of ligand data. This is due to the fact that some target classes (e.g. GPCRs or kinases) have been traditionally regarded as more interesting from a medicinal chemistry standpoint, and are thus overrepresented in bioactivity databases. 23 Moreover, while some chemogenomic methods implicitly consider target information using bioactivity profiles of groups of similar ligands, i.e. the interaction between these compounds and a panel of targets, they are outperformed by techniques that explicitly consider target information. 24,25 In addition, bioactivity profiles for related compounds are not always available.
In target space, techniques were employed which benefit from the structural or sequence information available and rely on groups of related targets with the aim to identify possible off-target effects and drug specificity for a particular target of interest. 25 Based on the inverse similarity principle, related proteins are likely to interact with similar compounds. As in the previous case, the unavailability of data also constitutes a limitation for target-based chemogenomics.
The combination of ligand and target data allows the creation of predictive models that can rationalize e.g. viral or cancer cell line selectivity, whereas models exclusively based on ligands cannot explain the role of the target in selectivity. 26 Merging data from ligand and target sources into the frame of a single machine learning model allows the prediction of the most suitable pharmacological treatment for a given genotype (personalized medicine), which ligand-only and protein-only approaches are not able to perform. This is precisely the underlying principle in proteochemometrics (PCM), which employs both ligand and target features simultaneously, and which therefore enables the deconvolution of both the target and the chemical spaces in parallel. 15,16

2 Proteochemometric modelling
2.1 PCM as a practical approach to use chemogenomics data
PCM modelling is a computational technique which combines both ligand and target information within a single predictive model in order to predict an output variable of interest (usually the activity of a molecule in a particular biological assay). 15,16 It is this combination of orthogonal information that sets PCM apart from both QSAR and chemogenomics. 25,27 Generally, the term 'target' refers to proteins since the majority of PCM models in the literature have been devoted to the study of the activity of compounds on protein targets. Yet, target can also refer to a certain protein binding pocket (to allow distinction between binding modes, protein conformations, or allosteric/orthosteric binding), to a protein complex, or even to a cell line. 28,29 Each binding site and each binding mode can be regarded (computationally) as a 'different target'.
A PCM model is trained on a dataset composed of a series of targets and compounds, where ideally compounds have been measured on as many targets as possible (illustrated in Fig. 1). The simultaneous modelling of the target and the ligand space permits a better understanding of complex drug-target interactions (e.g. selectivity) 30-33 than would be possible with chemogenomics, as the effect of target and chemical variability can be evaluated (e.g. protein mutations or the effect of chemical substructures on bioactivity). Thus, the aim of PCM is the complete modelling of the compound-target interaction space (Fig. 1), including also the prediction of the bioactivity of novel compounds on yet untested targets.
Initial attempts to incorporate description of several proteins and their ligands in a single QSAR model involved modelling of the interaction between mutated glucocorticoid receptors and DNA. 34,35 The first full scale PCM study involving different proteins was devoted to the interaction of chimeric melanocortin receptors with chimeric peptides at Uppsala University. 36 The name "proteochemometrics" was coined later by the same research group. 15 Since then PCM has been applied to various diverse datasets (Table 1). 37,38 While the current review will focus on recent developments in the field, a comprehensive discussion of PCM-related work has been presented in a previous review by van Westen et al. from 2011 to which we would like to refer the reader. 16

Practical relevance of PCM
The novel way that PCM considers the unity of chemical and target space permits a better understanding and prediction of the influence of target variability on compound activity. For instance, predicting compound activity on a cancer cell line panel can identify selective compounds towards a particular cell line. 17 Similarly, the influence of viral protein mutations on compound activity can be quantified. 39 Therefore, PCM opens new avenues: (i) to mine drug affinity databases with the goal to create multi-target and multispecies models, (ii) to integrate toxicogenomics and phenotypic data in predictive models, (iii) to identify designed or natural ligands for orphan receptors (receptor deorphanization), and (iv) to design personalized medicine for viral infections or a defined cancer type based on genotypic information. The ability of PCM to model these data depends on the structure of the input matrix, as we will elaborate on below, and concrete examples referring to the above fields will be presented in the subsequent sections.
Fig. 1 Ligand-target interaction space. The interaction between ligands (chemical compounds) and targets (biological macromolecules) can be envisioned as a matrix, where rows are indexed by target ids and columns by compound ids. Each matrix cell contains the binding affinity of a given compound on a given target, indicated by the following colors: blue means low affinity and yellow means high affinity. Traditional bioinformatics techniques have dealt with the similarity between targets, normally based upon sequence similarity. On the other hand, ligand based (QSAR) models have studied series of compounds acting on a given target. In contrast to both of them, PCM relates the chemical-target interaction space by describing targets and compounds with numerical descriptors, permitting the prediction of activities of a given compound on a given target. The wide applicability of PCM is evidenced by the increased coverage of drug targets in the studies of the last three years. Although traditional drug targets, such as GPCRs or kinases, are still widely represented, new applications (e.g. the modelling of viral genotypes or pharmacogenomics) are gaining ground steadily.

Input data for PCM
The ligand-target interaction space can be visualized as a matrix containing the activities of all possible ligand-target combinations (Fig. 1). 40 PCM attempts to predict the activity of a ligand on any target and vice versa, the activity of any ligand on a given target. The integration of these independent compound-target interactions is however possible in PCM due to the combination of chemical and target information in a single machine learning model. Fig. 2 gives an overview of how different sources of data can be integrated for modelling a particular aspect of bioactivity of a given ligand in different biological settings. Fig. 2A displays how compound and target information relate and are combined in a predictive model which permits the extrapolation in either (or both) the chemical or target space (to the extent the training data allows). These two input spaces are numerically described (Fig. 2B) [...] interpretation of the target space, which can identify residues that are implicated in e.g. drug resistance of a viral protein.
Thus, compounds can be developed by considering potency and selectivity towards a given target or target family. The final panel shows how PCM models can help to determine the best drug regimen given a patient's genotype (personalized medicine).
Here, the activity of all drugs would be predicted on that genotype and the drug predicted to exhibit the highest activity would be preferentially selected.
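The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `toy_model` is a hypothetical stand-in for a trained PCM model, and the drug names, descriptor vectors and genotype vector are invented for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def rank_drugs(
    drugs: Dict[str, Sequence[float]],
    genotype: Sequence[float],
    predict_activity: Callable[[Sequence[float], Sequence[float]], float],
) -> List[Tuple[str, float]]:
    """Predict every drug on one genotype and sort by predicted activity (highest first)."""
    scored = [(name, predict_activity(desc, genotype)) for name, desc in drugs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def toy_model(ligand: Sequence[float], target: Sequence[float]) -> float:
    """Hypothetical placeholder for a trained PCM model (a plain dot product)."""
    return sum(l * t for l, t in zip(ligand, target))


# Invented ligand descriptors and an invented genotype descriptor.
drugs = {"drug_A": [1.0, 0.0], "drug_B": [0.0, 1.0], "drug_C": [0.5, 0.5]}
genotype = [0.2, 0.9]

ranking = rank_drugs(drugs, genotype, toy_model)
best_drug = ranking[0][0]  # the drug that would be preferentially selected
```

In practice the placeholder would be replaced by the prediction function of a validated model, and the predicted values would ideally be reported together with an estimate of their uncertainty.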

Target descriptors
As was touched upon above, PCM is rather flexible and can deal with a multitude of different target descriptors. Here, we will summarize some of the more common descriptors and later on in the review focus on novel descriptor types; for a full overview of established descriptors please see van Westen et al. 2011. 16 By far the most common descriptors are alignment-dependent sequence descriptors. 43 The authors refer the reader to a pair of recently published benchmark studies for more information on this type of descriptor. 44,45 This type of protein descriptor is usually obtained from a concatenation of individual amino acid descriptors and requires the individual sequences to be aligned. This can be done using full sequence alignment with established tools such as ClustalW; subsequently these alignments are converted to position-dependent numerical descriptors, e.g. the Z-scales by Sandberg. 46-48 When no reliable alignment is possible, target descriptors can be calculated using the whole protein sequence without aligning them. 49 The usage of only primary sequence descriptors to predict protein-protein interactions was shown to be efficient by Shen et al., 50 who were able to train an SVM model based on more than 16 000 protein-protein pairs described with conjoint triad feature amino acid descriptors. Similarly, analyses of sequence variability among targets exhibiting divergent bioactivity profiles enabled the characterization of binding pocket residues energetically important for ligand binding and selectivity for GPCRs and kinases. 51-53 If present, structural information from crystallographic structures can be used by selecting residues near the ligand binding site (e.g. a 5 or 10 Å sphere around the co-crystallized ligand). 21,43,44,47 Subsequently, the corresponding residues for other targets can be obtained from sequence alignment. This semi-structural method is less reliable than a full structural superposition and alignment gaps might appear.
However, in practice, the former appears to have better resolution, which might be due to the fact that domains not involved in ligand binding are not considered. 22,54,55 To date, binding sites in PCM models have been derived from single crystallographic structures, 22,42,55,56 thus ignoring the intrinsically dynamic nature of proteins. However, databases such as Pocketome 57 might facilitate the introduction of dynamic properties of protein binding sites in PCM models as they contain ensembles of conformations for druggable binding sites extracted from co-crystal structures in the Protein Data Bank. To the knowledge of the authors, descriptors accounting for the dynamic properties of binding site amino acids have not been reported in the literature. Including this dynamic information might lead to a better description of protein targets in cases where small molecule binding is dependent on the binding site conformation, e.g. kinases.
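As an illustration of the alignment-dependent descriptors discussed above, the following minimal sketch concatenates per-residue descriptors over selected alignment positions. All specifics are assumptions made for the example: the values in `TOY_AA_DESCRIPTORS` are placeholders and not the published Z-scales, the two aligned sequences are invented, and encoding an alignment gap as zeros is only one possible convention.

```python
from typing import Dict, Iterable, List, Optional

# PLACEHOLDER per-residue descriptors, three values per residue.
# These are NOT the published Z-scale values; substitute a real table in practice.
TOY_AA_DESCRIPTORS: Dict[str, List[float]] = {
    "A": [0.1, -0.2, 0.3],
    "G": [0.2, -0.4, 0.1],
    "K": [0.5, 0.3, -0.1],
    "D": [0.4, 0.1, -0.3],
    "-": [0.0, 0.0, 0.0],  # alignment gap, encoded as zeros (an assumption)
}


def describe_aligned_sequence(
    aligned_seq: str,
    positions: Optional[Iterable[int]] = None,
    table: Dict[str, List[float]] = TOY_AA_DESCRIPTORS,
) -> List[float]:
    """Concatenate residue descriptors for the chosen alignment columns."""
    if positions is None:
        positions = range(len(aligned_seq))  # full-sequence descriptor
    vector: List[float] = []
    for pos in positions:
        vector.extend(table[aligned_seq[pos]])
    return vector


# Invented alignment of two targets; columns 1-3 play the role of binding-site residues.
alignment = {"target_1": "AKD-G", "target_2": "GKDAG"}
binding_site_columns = [1, 2, 3]

target_descriptors = {
    name: describe_aligned_sequence(seq, binding_site_columns)
    for name, seq in alignment.items()
}
```

Because every target is described over the same alignment columns, all target vectors have the same length and each position in the vector refers to the same residue position across targets, which is what makes this type of descriptor interpretable.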
Beyond sequence similarity, targets have also been described in different ways to model compound bioactivities on multiple targets. 58-62 Among others, targets have been characterized by: (i) the incorporation of biological tests and inverse virtual screening data; (ii) structural pocket similarity analyses; (iii) topology analyses of both compound-target and protein-protein interaction networks; (iv) the combination of pharmacophoric and interaction fingerprints; and (v) 3-dimensional alignment-free methods of binding sequences. 7,63-66 The availability of a plethora of target descriptors enables the application of PCM to target families where, for instance, little structural information is available. The advantages brought to the PCM field by each of these descriptor types will be reviewed in Sections 4 and 5. In cases where targets are not proteins, but more complex biological systems, such as cell lines, the target space can be described with 'omics' data, namely: copy-number variation (CNV) data, gene expression levels, exome sequencing data, cell line fingerprints, protein abundance, and miRNA expression levels. 17,29

Ligand descriptors
Similarly, from the ligand side a large number of descriptors have been employed in PCM in the last decade. 67,68 Circular fingerprints are the most commonly applied due to both their consistent good performance and interpretability when using the unhashed (keyed) version. 69,70 Keyed circular fingerprints, in both binary and counts format, where each bit in the descriptor accounts for the number of occurrences of a substructure in a given molecule, enable the interpretation of models and the identification of chemical substructures implicated in compound potency and selectivity. The performance of models trained on hashed and unhashed circular Morgan fingerprints does not vary significantly. 55 Therefore, we advocate for the customary usage of unhashed fingerprints in order to enhance the interpretability of PCM models.
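The difference in interpretability between keyed and hashed fingerprints can be illustrated with a toy sketch. It assumes that the substructures of a molecule have already been enumerated by a cheminformatics toolkit; the substructure labels below are hypothetical strings, and CRC32 is used only as a convenient deterministic hash, not as the hash function of any particular fingerprint implementation.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def keyed_fingerprint(counts: Dict[str, int], vocabulary: List[str]) -> List[int]:
    """Unhashed (keyed) counts: position i always refers to vocabulary[i]."""
    return [counts.get(key, 0) for key in vocabulary]


def hashed_fingerprint(counts: Dict[str, int], n_bits: int) -> List[int]:
    """Hashed counts: every substructure is folded onto one of n_bits positions."""
    vector = [0] * n_bits
    for key, count in counts.items():
        vector[zlib.crc32(key.encode("utf-8")) % n_bits] += count
    return vector


def bit_to_substructures(vocabulary: List[str], n_bits: int) -> Dict[int, List[str]]:
    """List which substructures share each hashed position."""
    mapping = defaultdict(list)
    for key in vocabulary:
        mapping[zlib.crc32(key.encode("utf-8")) % n_bits].append(key)
    return dict(mapping)


# Hypothetical substructure labels and counts for one molecule.
vocabulary = ["aromatic_ring", "carboxylic_acid", "amide", "primary_amine", "halogen"]
molecule = {"aromatic_ring": 2, "amide": 1, "halogen": 1}

keyed = keyed_fingerprint(molecule, vocabulary)
hashed = hashed_fingerprint(molecule, n_bits=4)

# Five substructures folded onto four positions must collide somewhere,
# so at least one hashed position no longer maps to a single substructure.
collisions = {
    bit: keys
    for bit, keys in bit_to_substructures(vocabulary, 4).items()
    if len(keys) > 1
}
```

With the keyed vector, an important feature in a model can be traced back to exactly one substructure, whereas a hashed position listed in `collisions` can only be attributed to a set of candidate substructures.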
Next to circular fingerprints, physicochemical descriptors, such as those calculated with DRAGON or PaDEL, 71,72 have been widely used in recent years (Table 1). Other ligand descriptors, such as atom types, topological indices, MACCS keys or ligand shape descriptors, have also been applied in the context of PCM.
In the experience of the authors, the description of compounds with circular Morgan fingerprints permits the generation of statistically validated PCM models, but on several occasions the addition of physicochemical properties to fingerprints has been demonstrated to improve performance. 54 This was especially true on data sets with a large chemical diversity, e.g. resulting from screening a diverse set or resulting from covering a group of targets with diverse ligands.

Cross-term descriptors
Thirdly, some PCM studies have defined an additional class of descriptors, called cross-terms, by multiplying ligand and target descriptors. These descriptors serve as descriptors for the non-linear components in the interaction between ligand and target (e.g. a hydrogen bond that can be formed in one target but not in another). 43,73 Therefore, their application is advisable when using linear modelling techniques (such as Partial Least Squares (PLS)). In the case of non-linear techniques, cross-terms are not essential as the models should be able to capture this information. 22,74 Nonetheless, the experience of the authors indicates that they might still be useful to improve model performance when using SVM or GP, even though their interpretability might not be straightforward. For further reading on different types of descriptors applied in PCM we refer the reader to van Westen et al. 16
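A minimal sketch of cross-term construction is given below. It assumes that cross-terms are formed as all pairwise products of ligand and target descriptor values; individual studies may use only a subset of the products or scale the descriptors first, and the numerical values are invented.

```python
from typing import List, Sequence


def cross_terms(ligand: Sequence[float], target: Sequence[float]) -> List[float]:
    """All pairwise products of ligand and target descriptor values."""
    return [l * t for l in ligand for t in target]


def pcm_features(
    ligand: Sequence[float],
    target: Sequence[float],
    with_cross_terms: bool = True,
) -> List[float]:
    """Concatenate ligand and target descriptors, optionally followed by cross-terms."""
    features = list(ligand) + list(target)
    if with_cross_terms:
        features += cross_terms(ligand, target)
    return features


# Invented descriptor values for one ligand-target pair.
ligand_descriptor = [1.0, 2.0]
target_descriptor = [3.0, 4.0, 5.0]

products = cross_terms(ligand_descriptor, target_descriptor)
row = pcm_features(ligand_descriptor, target_descriptor)
```

The number of cross-terms grows with the product of the two descriptor lengths, which is one practical reason why they are mainly combined with compact descriptors and linear methods such as PLS.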

Validation of PCM models
Due to the previously mentioned bias in bioactivity data (from both a chemical and a target point of view), the ligand-target interaction matrix is virtually never complete. 23-25 The authors have trained PCM models on sparse datasets with a degree of matrix completeness in the 2-3% range that demonstrated good performance on the test set. 75 The statistical metrics proposed by Golbraikh and Tropsha 76 can be used (similar to QSAR) to validate models using observed and predicted values on the test set. Recent studies recommend the usage of nested cross-validation (NCV) to report model performance. 77-80 In NCV, two validation loops are nested: the inner one serves to optimize the values of the hyperparameters through traditional k-fold cross-validation, whereas the outer loop serves to assess the predictive ability of the model trained on the whole training set. This procedure is repeated k′ times, each time changing the composition of the training and the test sets. Thus, NCV does not provide a single best parameter combination, as in each of the k′ rounds the best values of the hyperparameters might change due to the variance of the different training sets. Still, it provides the best estimate of the CV error as it provides an error interval, which can be wide depending on the dataset modeled. 80 However, the degree of completeness of the ligand-target interaction matrix is only one parameter influencing the predictive ability of a model. The variability on the chemical and the target side are the other two factors that need to be considered both in model validation and to assess its applicability domain. 75 Hence, the authors strongly suggest validating PCM models following a number of basic guidelines, which are in line with the recommendations from Park and Marcotte. 77
Firstly, external validation (e.g. 70-30 validation): a model is trained on 70% of the data (training set) and the bioactivity for the remaining 30% (test set) is predicted. In this case, all targets and compounds are present in both the training and the test set. This method corresponds to a Park and Marcotte C1 validation and serves to determine if a reliable model can be fit on the data set.
Secondly, Leave-One-Target-Out (LOTO) validation: all the bioactivity data annotated on a target is excluded from the training set. A model is subsequently trained on the training set, which is used to predict the bioactivities for the compounds annotated on the hold-out target. This process is repeated for each target. This validation scheme corresponds to a Park and Marcotte C2 validation and reflects the common situation in prospective validation where there is no information for a given target for which we intend to find hits.
Thirdly, Leave-One-Compound-Out (LOCO) validation: the bioactivity data for a compound on all targets is excluded from the training. Similarly to the LOTO validation, the PCM model trained on the remaining data is used to predict the bioactivity for the hold-out compound on each target. This data availability scenario corresponds to a Park and Marcotte C2 validation and resembles the situation where a PCM model is applied to novel chemistry, e.g. in a prospective validation screening campaign. If the number of compounds in the training dataset is large, compound clusters can be used instead of single compounds, thus leading to the Leave-One-Compound-Cluster-Out validation scenario (LOCCO). 17 In addition to these scenarios, the authors suggest comparing the performance of the PCM model trained on all data to single-target QSAR models. The goal of this validation is twofold. Firstly, a direct comparison to QSAR can determine whether it is wise to apply PCM to a data set. Secondly, as was touched upon above, bias in the data can be the cause of some targets being reliably modeled and some targets being poorly modeled (see Section 6). 23-25 When calculating validation parameters (such as the correlation coefficient) on the full test set, poorly modeled targets can be masked. In order to notice such discontinuities, the authors recommend calculating the validation parameters not only on the full test set, but also on test set data points grouped per target and grouped per ligand. 45 The values of the statistical metrics calculated per target can be directly compared with those obtained with single-target QSAR models (comparing values calculated on the full test set would not be an accurate comparison).
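The per-target and per-ligand evaluation recommended above can be sketched as follows. The records are invented (compound, target, observed, predicted) tuples, and the root mean square error is used as an example metric; any other validation parameter could be grouped in the same way.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (compound id, target id, observed activity, predicted activity); invented values.
Prediction = Tuple[str, str, float, float]


def rmse(records: List[Prediction]) -> float:
    """Root mean square error over a list of predictions."""
    return (sum((obs - pred) ** 2 for _, _, obs, pred in records) / len(records)) ** 0.5


def grouped_rmse(records: List[Prediction], key_index: int) -> Dict[str, float]:
    """RMSE per group; key_index 0 groups by compound, 1 groups by target."""
    groups: Dict[str, List[Prediction]] = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record)
    return {key: rmse(group) for key, group in groups.items()}


test_set = [
    ("c1", "T1", 6.0, 6.1),
    ("c2", "T1", 7.0, 6.9),
    ("c1", "T2", 5.0, 7.0),
    ("c2", "T2", 8.0, 6.0),
]

overall = rmse(test_set)
per_target = grouped_rmse(test_set, key_index=1)
per_compound = grouped_rmse(test_set, key_index=0)
```

In this toy test set the overall error is an average over one well-modeled and one poorly modeled target, so only the per-target values reveal that predictions for `T2` are unreliable.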
Ideally, the nal validation is one where a target and all compounds that have been tested on this (and other targets) are iteratively excluded from the training set. This approach corresponds with a Park and Marcotte C3 validation. C3 validation is considered extrapolation rather than interpolation, as both parts of the pair (the ligand and the target) have not been seen in the training set by the model.
Taken together, these validation scenarios enable a thorough and rigorous validation of PCM models and a comparison to the state of the art. Finally, the authors also suggest calculating the statistical metrics on the predictions of at least three models trained on different subsets of the complete dataset, and accompanying them with the standard deviation observed over the repetitions. 75 Similarly, it is advisable to carefully estimate the maximum achievable performance given the uncertainty of the data. 17,75

2.8 Review outline
Table 1 summarizes the main features of the PCM studies published between 2010 and 2013. In addition to traditional therapeutic targets (e.g. kinases or GPCRs), which continue to be well represented in recent PCM studies, other applications and techniques are gaining ground steadily, namely: (i) the modelling of the selectivity of viral protein mutants, mainly HIV; (ii) the inclusion of bioactivity information from mammalian orthologues; (iii) the usage of 3-dimensional target information; and (iv) toxicogenomics and pharmacogenomics. In this review, we will focus on: (Section 3) (novel) machine learning techniques successfully applied in recent PCM studies (Table 2) and other predictive modelling contexts such as chemoinformatics; (Section 4) recent applications of PCM on established protein target classes; (Section 5) novel applications; and (Section 6) pitfalls of PCM. Future perspectives and concluding remarks (Section 7) close the review.

Machine learning in PCM
Most of the currently used machine learning techniques (PLS, rough set modelling, neural net modelling, Naïve Bayesian classifiers, and decision tree algorithms) as well as data preprocessing techniques in PCM have been described in recent reviews by Andersson et al. 81 and van Westen et al. 16 Moreover, feature selection methods and common algorithms have been recently benchmarked, with the overall conclusion that kernel and tree methods, such as SVM or RF, do not benefit from feature selection, and that no particular algorithm-feature selection pair appears to be preferable. 82-84 Therefore, only recent applications of novel techniques applied to PCM or chemoinformatic modelling will be discussed here, namely: Support Vector Machines (SVM), Random Forest (RF), Gaussian Processes (GP) and Collaborative Filtering (CF). A detailed description of the machine learning algorithms discussed in the following subsections is given in Table 2.

Support Vector Machines (SVM)
Support Vector Machines (SVMs) are a group of non-linear machine learning techniques commonly used in computational biology, and in PCM in particular. 16,22 SVMs became popular in the last decade due to their performance and efficient capacity to deal with large datasets also in high-dimensional variable spaces, even though interpretability can be challenging. 85-87 Furthermore, SVMs require proper tuning of the so-called hyperparameters, usually determined by an exponential grid search.
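The exponential grid search mentioned above is typically run inside the inner loop of the nested cross-validation scheme recommended in the validation section. The sketch below shows that structure with the standard library only. It rests on several assumptions: a one-feature ridge regression is used as a toy stand-in for an SVM, its penalty `lam` plays the role of the hyperparameter, the grid of powers of two is an arbitrary example, and the data are synthetic.

```python
from statistics import mean
from typing import List, Tuple

Data = List[Tuple[float, float]]  # (descriptor value, activity) pairs; synthetic


def k_folds(n: int, k: int) -> List[List[int]]:
    """Split the indices 0..n-1 into k interleaved folds."""
    return [list(range(i, n, k)) for i in range(k)]


def fit_ridge(train: Data, lam: float) -> float:
    """Toy one-feature ridge regression without intercept (stand-in for an SVM)."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    return sxy / (sxx + lam)


def rmse(weight: float, test: Data) -> float:
    """Root mean square error of the fitted weight on a test set."""
    return mean((y - weight * x) ** 2 for x, y in test) ** 0.5


def cv_error(data: Data, lam: float, k: int) -> float:
    """Mean k-fold cross-validation error for one hyperparameter value."""
    errors = []
    for fold in k_folds(len(data), k):
        held = set(fold)
        train = [p for i, p in enumerate(data) if i not in held]
        errors.append(rmse(fit_ridge(train, lam), [data[i] for i in fold]))
    return mean(errors)


def nested_cv(data: Data, grid: List[float], outer_k: int = 3, inner_k: int = 3):
    """Outer loop estimates the error; inner loop selects the hyperparameter."""
    results = []
    for fold in k_folds(len(data), outer_k):
        held = set(fold)
        train = [p for i, p in enumerate(data) if i not in held]
        test = [data[i] for i in fold]
        best_lam = min(grid, key=lambda lam: cv_error(train, lam, inner_k))
        results.append((best_lam, rmse(fit_ridge(train, best_lam), test)))
    return results


# Exponential grid: 2^-5, 2^-3, ..., 2^5.
grid = [2.0 ** exponent for exponent in range(-5, 6, 2)]
data = [(float(x), 2.0 * x + (0.3 if x % 2 else -0.3)) for x in range(1, 13)]

results = nested_cv(data, grid)  # one (selected lam, outer test error) pair per outer fold
```

The selected hyperparameter may differ between outer folds, which is why the outer-loop errors are reported as a distribution rather than being tied to one parameter combination.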
In a recent study by Lapins et al., 88 Random Forest (RF), K-Nearest Neighbors (KNN), and SVMs were applied to construct a PCM model of Cytochrome P450 (CYP) inhibition. The models were trained on 5 CYPs and 17 143 compounds. CYPs were described with transition and composition descriptions of amino acids, while compounds were described with structural signature descriptors. In terms of Area Under the Curve (AUC), these PCM models were shown to outperform the single-target models constructed in parallel by Cheng et al. 89 (AUC: PCM: >0.90, QSAR: 0.79-0.89). Of the methods used, RF and SVM were shown to be comparable in terms of accuracy and AUC. The high performance of the SVM model in the external validation (AUC: 0.940) supports the suitability of this approach to correctly extrapolate in both the target and compound space.
SVMs can use different internal methods (kernels) to derive bioactivity predictions, the most dominant being the Radial Basis Function (RBF) kernel. 90 Radial basis function kernels have been shown to perform well on PCM data. 16,22 Recently the Pearson VII function-based Universal Kernel (PUK) 91 was also applied to PCM. Wu et al. 92 showed that they were able to improve the mapping power of their PCM models for 11 histone deacetylases (HDACs) by using a PUK kernel. Nonetheless, the radial kernel still constitutes a common option when building bioactivity models given the necessity to tune only one kernel parameter, i.e. σ, which in practice means shorter training times. Based on those results, the experienced user should keep in mind that although the radial kernel is a robust option with reliable results (in the experience of the authors), a proper kernel choice should be made on the basis of the data at hand. 93 Dual Component SVMs (DC-SVM) are an extension of the classical SVM and have been applied by Niijima et al. 94 to a kinase dataset spanning the whole kinome. They proposed a dual component naïve Bayesian model in which kinase-inhibitor pairs are represented by protein residues and ligand fragments that form dual components. Hence the probability of being active is simply estimated as the ratio of bioactivity values between active and inactive pairs. This method was further extended to SVMs by modifying a Tanimoto kernel to include compound fragments. PCM DC-SVMs outperformed ligand-based SVMs (QSAR) in internal validation, with accuracies of 90.9% and 86.2%, respectively. However, the same level of accuracy was not achieved on external datasets, which produced accuracies of 73.9% and 81.3% for DC-SVM and ligand-based SVM, respectively. Therefore, these results do not allow the conclusion that DC-SVM outperforms SVM, although this might happen with other datasets.
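For reference, the two kernel families named above can be written down compactly. The sketch below assumes one common parametrization of the radial kernel, exp(-||x - y||^2 / (2 sigma^2)), with a single width parameter called `sigma` here; other texts use an equivalent gamma parameter instead. The Tanimoto kernel is shown in its plain form and not in the modified dual component form, and returning 1.0 for two all-zero vectors is a convention chosen for this example.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], sigma: float) -> float:
    """Radial basis function kernel with a single width parameter sigma."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def tanimoto_kernel(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Tanimoto kernel: <x, y> / (<x, x> + <y, y> - <x, y>)."""
    dot = sum(a * b for a, b in zip(x, y))
    denominator = sum(a * a for a in x) + sum(b * b for b in y) - dot
    if denominator == 0.0:
        return 1.0  # both vectors are all zeros; treated as identical (a convention)
    return dot / denominator


# Invented binary fingerprints for two compounds.
fingerprint_a = [1, 1, 0]
fingerprint_b = [1, 0, 1]

similarity_rbf = rbf_kernel(fingerprint_a, fingerprint_b, sigma=1.0)
similarity_tanimoto = tanimoto_kernel(fingerprint_a, fingerprint_b)
```

Both functions return 1 for identical non-zero inputs and decrease as the inputs diverge; the radial kernel needs its width tuned, whereas the plain Tanimoto kernel has no free parameter.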
A second type of SVMs, Transductive SVMs (TSVMs), have been applied to model 10 small (between ~1000 and ~3000 datapoints) and unbalanced QSAR datasets from the Directory of Useful Decoys (DUD) 95 repository, displaying a balanced accuracy more than 30% higher than that of SVM on some datasets. 96 The concept relies on transduction, allowing the modelling of partially labeled data which cannot be included using regular SVM. TSVMs could potentially be extended to PCM and have been shown to outperform SVMs in some cases. 97,98 A third flavor of SVMs are Relevance Vector Machines (RVMs). 99 The added value of RVM is the interpretability of the models, which is a consequence of their Bayesian nature. Each descriptor is associated with a coefficient, which determines its relevance for the model. Coefficients associated with low relevance descriptors are close to zero, hence the model becomes sparse and therefore permits shorter prediction times. Although the predicted variance is not informative in regression studies, class probabilities can be efficiently determined in classification. 100 RVMs have been demonstrated with binary classifiers trained on a subset of the MDDR database. 100 Therein, it was demonstrated that RVMs performed on par with 'classic' SVM, encouraging the authors to conclude that RVM should be added to the current chemoinformatic tools and as such potentially applied to future PCM studies. On the basis of the above, SVM constitutes a useful algorithm in which initial drawbacks such as interpretability (e.g. the determination of which chemical substructures most contribute to compound bioactivity) can be overcome with new developments (e.g. RVM).
Table 2 Selection of machine learning prediction methods used for PCM

Random Forests (RF)
Random Forest (RF) models are oen comparable in performance to SVMs, 16 and are also non-linear. However, contrary to SVMs RFs tend to have relatively short training times and do not require extensive parameter tuning. 101 Furthermore, in addition to their comparable performance, RFs permit an evaluation of both feature contribution and feature importance in PCM models, as shown by de Bruyn et al. 54 An example of such evaluation is given in the identication of organic aniontransporting polypeptide (OATP) inhibitors, where continuous descriptors, both Z-scales (proteins) and physiochemical features (compounds), were binned into discrete classes. For each feature (protein and ligand) the correlation to activity and importance was calculated for each target class. In that way, compound inactivity was correlated with the presence of chemical substructures positively charged at pH 7.4, number of atoms <20, and molecular weight <300. Conversely, chemical substructures with a number of ring bonds between 18 and 32, without atoms with positive charge, and with a log D value between 3.4 and 7.5 were found to favour OATP inhibition.
Although RFs are highly interpretable, it should be noted that they do not output error estimates (as is also the case with SVM). Recent papers nevertheless suggest that the variance of the predictions across the trees of a random forest model can be used to determine its applicability domain. 102,103 Error estimates are of tremendous importance given the high levels of noise and annotation errors in public bioactivity databases. Thus, fully informative predictions should be accompanied by individual uncertainties. This issue can be remedied by applying Quantile Regression Forests (QRF), which infer quantiles from the conditional distribution of the response variable. 104 To our knowledge, QRFs have not yet been applied to QSAR or PCM. A machine learning technique with inherent error estimation capabilities that has been used in PCM is the Gaussian process, as described below.

Gaussian Processes (GP)
The determination of the applicability domain (AD) of a model (when are model predictions reliable, or when can a model extrapolate) is one of the major concerns in bioactivity modelling (see previous studies 105-107 for comprehensive reviews). Major obstacles to AD determination are the errors and uncertainties contained in bioactivity databases, 108-111 which are mainly due to data curation and experimental errors, 110 as well as the accurate quantification of distances in the descriptor and the biological space, which would make it possible to anticipate prediction errors. Gaussian processes (GP) aim to address these concerns by allowing data uncertainty to be handled as input to a probabilistic model. Fig. 3 illustrates the basic idea underlying GP modelling. The prior probability distribution (Fig. 3A) covers all candidate functions to model the data, each of which has a different weight determined by the kernel (covariance) parameters. Subsequently, only those functions from the prior distribution in agreement with the experimental data are kept (Fig. 3B). The mean of these functions is considered the best fit to the data. Given that each prediction is a Gaussian distribution, different confidence intervals can be defined from its variance (Fig. 3B). For a new compound-target combination, the bioactivity is predicted as a Gaussian distribution, in which the mean is the best prediction and the variance its uncertainty. A radial-kernelled GP with σ = 1 was employed to generate the figure. The Python infpy package helped to produce the plots. 207 Gao et al. 112 showed that SVMs performed, in general, slightly better than GPs when modelling a dataset composed of 128 ligands and 9 human amine GPCRs, although the models trained on the best combination of descriptors exhibited Q² values of 0.744 and 0.742 for GP and SVM respectively. 
It is worth mentioning that the difference in performance between GP and SVM was assessed neither statistically nor by comparing the results of a series of models trained on different resamples of the whole dataset. Moreover, the error bars predicted by the GP PCM models were not considered. More recently, Cortes-Ciriano et al. 75 showed the actual potential of GPs by applying both SVMs and GPs implemented with a panel of diverse kernels to multispecies PCM datasets, namely: human and rat adenosine receptors, mammalian GPCRs and Dengue virus proteases. GP and SVM performed comparably, as absolute differences were statistically insignificant. However, GP provided notable added value via: (i) the determination of the model AD, (ii) the probabilistic nature of the predictions, and (iii) the inclusion of the experimental uncertainty in the model.
In the experience of the authors regarding the application of GP in PCM, 75 and in agreement with Schwaighofer et al., 113 the confidence intervals calculated by GP are in accordance with the cumulative Gaussian distribution. Therefore, these confidence intervals provide valuable information about individual prediction errors. In practice, knowing the error for each prediction can guide decision-making about which compounds should be tested in prospective experimental validation of in silico PCM models. Overall, GP appears to be an appealing approach for PCM in spite of the longer CPU time required for training, as GP is an algorithm of O(N³) time complexity (i.e., it scales with the third power of the size of the dataset). 114
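The way a GP returns both a prediction and its uncertainty can be sketched in a few lines of plain Python. The example below is a minimal illustration under stated assumptions: a zero-mean prior, an RBF kernel with σ = 1, a fixed noise variance standing in for experimental uncertainty, and a hypothetical one-dimensional descriptor with mean-centred activity values. All function names are illustrative, and this is not the implementation used in the cited studies.

```python
import math

def rbf(x, y, sigma=1.0):
    """RBF covariance between two descriptor vectors."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / (2.0 * sigma ** 2))

def cholesky(a):
    """Lower-triangular Cholesky factor; this step is the O(N^3) bottleneck."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def solve_lower(low, b):
    """Forward substitution: solve L z = b."""
    z = []
    for i in range(len(b)):
        z.append((b[i] - sum(low[i][k] * z[k] for k in range(i))) / low[i][i])
    return z

def solve_upper_t(low, z):
    """Back substitution: solve L^T x = z."""
    n = len(z)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x

def gp_predict(X, y, x_new, noise_var=0.01, sigma=1.0):
    """Posterior mean and variance of a zero-mean GP at x_new."""
    n = len(X)
    # Training covariance with the assumed experimental noise on the diagonal.
    K = [[rbf(X[i], X[j], sigma) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y))
    k_star = [rbf(xi, x_new, sigma) for xi in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = rbf(x_new, x_new, sigma) - sum(t * t for t in v)
    return mean, var

# Hypothetical training data: one descriptor per pair, mean-centred activities.
X = [[0.0], [1.0], [2.0]]
y = [-0.5, 0.5, 0.0]

near_mean, near_var = gp_predict(X, y, [1.0])   # inside the training region
far_mean, far_var = gp_predict(X, y, [10.0])    # far outside the training region
```

For a query close to the training data the predicted variance is small, whereas far from the data it returns to the prior variance, which is the behaviour that makes the GP variance usable as an applicability domain indicator.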

Collaborative Filtering (CF)
One of the requirements for PCM is that target (protein) features need to be defined explicitly (usually by physicochemical characterization of amino acids). While this approach is effective, it nevertheless requires a certain level of information about target sequences and structures. An alternative approach would be to infer target features in an unsupervised manner rather than using them as model input a priori. This was done quite recently in a multi-target QSAR study of multiple cell lines for the hedgehog signalling pathway. 115 Gao et al. 115 applied a CF approach to 93 cyclopamine derivatives and four cell lines (BxPC-3, NCI-H446, SW1990 and NCI-H157), and showed that collaborative filtering multi-target QSAR outperforms normal QSAR for their dataset. The mean Root-Mean-Squared Error (RMSE) for the four cell lines was 0.65 log units for CF, while it increased to 0.85 log units for (single target) SVR. The collaborative QSAR framework, combined with a feature selection methodology based on collaborative filtering and content-based recommender systems (a type of system used by electronic retailers and content providers such as Amazon.com), 116 enabled the definition of weights for the compound descriptors (drug-like index). When interpreting their models the authors could determine that molecular volume, polarity, and the cyclic degree are the most influential compound features for multi-cell line inhibitors for this particular pathway (which, from the chemical standpoint, would however sometimes be difficult to interpret structurally). Erhan et al. 117 also used CF with a large library of compounds against a family of 12 related targets screened in AstraZeneca's HTS campaigns. The authors elegantly demonstrated how the principles of CF can be used to derive a predictive model with the capability to extrapolate on the target side. 
However, better results were obtained when using target descriptors (binding pocket fingerprints of 14 bins in this case, where each bin accounts for a type of interaction, i.e. ionic, polar, or hydrophobic, in the binding site). Another novelty of this work was the introduction of the kernel-based method Jrank (a kernel perceptron algorithm), which was able to outperform the multi-task neural network in most cases and never produced significantly worse models. Indeed, in 6 out of 7 cases, this kernel outperformed the random retrieval of compounds. Moreover, the authors also noted that improvements are still possible, since Jrank did not always outperform the single-target models.
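The idea of inferring target features from the activity matrix itself, rather than supplying them a priori, can be illustrated with a low-rank matrix factorisation trained by stochastic gradient descent. This is a generic collaborative filtering technique chosen here for illustration; it is not the method of Gao et al. or Erhan et al., and the activity values, function names and hyperparameters below are all hypothetical.

```python
import math
import random

def factorize(observed, n_compounds, n_targets, rank=2, lr=0.05,
              reg=0.01, epochs=500, seed=0):
    """Learn latent compound (P) and target (Q) factors from observed activities."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_compounds)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_targets)]
    for _ in range(epochs):
        for (i, j), value in observed.items():
            err = value - sum(p * q for p, q in zip(P[i], Q[j]))
            for k in range(rank):
                p_ik, q_jk = P[i][k], Q[j][k]
                # Gradient step on squared error with L2 regularisation.
                P[i][k] += lr * (err * q_jk - reg * p_ik)
                Q[j][k] += lr * (err * p_ik - reg * q_jk)
    return P, Q

def predict(P, Q, i, j):
    """Predicted activity of compound i on target j."""
    return sum(p * q for p, q in zip(P[i], Q[j]))

def rmse(P, Q, observed):
    """Root-mean-squared error over the observed compound-target pairs."""
    sq = [(v - predict(P, Q, i, j)) ** 2 for (i, j), v in observed.items()]
    return math.sqrt(sum(sq) / len(sq))

# Hypothetical mean-centred activities for 4 compounds x 3 targets (cell lines);
# three of the twelve entries are left unmeasured.
compound_effect = [1.0, -0.5, 0.8, -1.0]
target_effect = [1.0, 0.5, -1.0]
missing = {(0, 1), (2, 0), (3, 2)}
observed = {(i, j): compound_effect[i] * target_effect[j]
            for i in range(4) for j in range(3) if (i, j) not in missing}

rmse_before = rmse(*factorize(observed, 4, 3, epochs=0), observed)
P, Q = factorize(observed, 4, 3)
rmse_after = rmse(P, Q, observed)
unmeasured_estimate = predict(P, Q, 3, 2)  # estimate for a pair that was never measured
```

Note that no target descriptors enter this sketch: the rows of `Q` play the role of target features and are learned purely from the pattern of measured activities.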
The overview presented above shows that PCM draws heavily on recent developments in the machine-learning field. However, given that the methods used are only the means to an end, we will in the following also summarize PCM applications in the medicinal chemistry and chemical biology fields, to different target classes as well as different types of biological readout.

PCM applied to protein target families
As was touched upon above, PCM has been applied to a very diverse selection of protein targets. Here we will focus on a small selection of targets relevant for drug discovery, namely G Protein-Coupled Receptors (GPCRs), kinases, epigenetic markers, viral enzymes, and human cancer cell lines.

G protein-coupled receptors
Early PCM virtual screening studies by Bock and Gough to identify ligands of orphan GPCRs (oGPCRs) used physicochemical properties of the amino acids of the entire primary sequence of GPCRs, such as accessible surface area or surface tension, rather than binding site residues. The authors screened 1.9 million ligand-oGPCR combinations and were able to identify 4357 highly active ligands of oGPCRs. The method, based on SVM, outputs a ranked list of putative oGPCR ligands. In practice, the most relevant feature of their predictive pipeline is the description of GPCRs with only physicochemical descriptors, thus avoiding the use of exact 3-dimensional information of the receptors. 38 Subsequently, Jacob et al. 118 demonstrated that the use of bioactivity data from 4051 GPCR-ligand combinations (80 human GPCRs from classes A, B and C, and 2446 ligands) extracted from the GLIDA GPCR ligand database 119 in PCM models improves performance over single receptor models, leading to more reliable predictions. The authors used Tanimoto 2D and pharmacophore 3D kernels to describe the ligands, and 5 kernels to describe the GPCRs, namely: Dirac, multitask, hierarchy, binding pocket and poly binding pocket. The best combination thereof was shown to be 2D Tanimoto on the compound side and the binding pocket kernel for the GPCRs, with a reported accuracy of 78.1% when predicting ligands for orphan receptors. These findings were further capitalized upon in the papers of Frimurer et al., 120 and Weill and Rognan. 121 Both papers devised features for the 7TM core ligand-binding site and cavity fingerprints to improve structure-guided drug discovery approaches and provide a general class A GPCR similarity metric. 120,121 The former approach introduced an in silico pipeline to relate 7TM GPCRs based upon the physicochemical properties of the ligand binding site, taken from the crystal structure of bovine rhodopsin. 
The pipeline is composed of ve steps, which are: (i) sequence alignment of the TM domain of the GPCRs of interest, (ii) selection of the residues in the core binding site important for ligand binding, (iii) denition of binding site signatures and generation of physicochemical descriptors for them, and (iv) use these descriptors to rank, cluster or compare 7TM GPCRs. The authors applied this pipeline to identify ligands for the rhodopsin-like receptor, CRTH2, which by that time only had one annotated ligand besides prostaglandin D2, namely indomethacin. The screening of a library of 1.2 million compounds yielded 600 candidate hit compounds. 10% thereof were conrmed as ligands in a CRTH2 receptor-binding assay, with a IC 50 cut-off value to consider a compound as active of 10 mM. On the other hand, Weill and Rognan 121 introduced a new type of protein-ligand ngerprint (PLFP), which encodes pharmacophoric properties of ligands and their binding cavities. These ngerprints were applied to two GPCRs datasets, namely: (i) 168 536 GPCR-ligand combinations (160 286 inactive and 8250 active combinations), and (ii) 234 137 GPCR-ligand combinations (202 019 inactive and 32 118 active combinations). The total number of GPCRs considered was 160. The authors reported a cross-validated classication accuracy higher than 0.9 when using SVM, though the most predictive models on external datasets were not those presenting the highest accuracy values in cross-validation. 122 Overall, PCM models trained on GPCRs binding site amino acid descriptors have proven to be a powerful approach to identify the GPCRs targets for a given compound, and to predict ligands for orphan GPCRs. The increasing availability of bioactivity data on GPCRs of interest and orthologous sequences, 75 as well as the development of novel methodologies to assess GPCRs similarity, is likely to increase the application of PCM on this target family in drug discovery campaigns.

Kinases
Another important protein family in drug discovery subjected to PCM studies is the kinase superfamily, which comprises more than 500 different human proteins. 123 The role of kinases in cell signalling and their involvement in more than 400 human diseases have rendered this protein family an attractive target. 124,125 Kinases generally contain a conserved kinase domain that binds ATP in its active site, though some contain more than one kinase domain. Inhibitors targeting this conserved binding site are known as Type I inhibitors. The activation loop of kinases, necessary for the transfer of a phosphate group, exhibits two different conformations, namely DFG-in and DFG-out (where DFG stands for the catalytic triad, Asp-Phe-Gly). Type II inhibitors bind to both the conserved ATP-binding site and to an adjacent pocket present in the DFG-out conformation. These compounds are more selective and thus attractive as drug candidates. Given the ability of PCM to model bioactivities against related targets, it is very well suited to model the affinity of small molecule inhibitors to the kinase family. 16 Different PCM models have been reported to analyze drug selectivity and predict bioactivity profiles against kinases. 66,126 In a recent study by Cao et al., 126 the full kinase sequence space was described by alignment-independent 'Composition, Transition and Distribution' (CTD) features. 127 Hence the statistical soundness of this PCM model enabled the classification of compound-kinase pairs as interacting, using a 100 nM concentration as cut-off, or non-interacting. The high predictive ability of the models should nevertheless be considered with caution, as the degree of completeness of the bioactivity matrix used in the training was only 0.65%. Therefore, these PCM models should be iteratively updated as more bioactivity values become available. 
Interestingly, kinases that are similar in sequence space exhibited high dissimilarity when their similarity was assessed on the basis of inhibitor bioactivities. This was assessed using 120 kinases with more than 15 bioactivity annotations, 14 400 datapoints in total. Thus, these data highlight the appropriateness of considering both chemical and target space when optimizing kinase inhibitors.
While high affinity is generally desired for drugs (except possibly in the case of multicomponent therapeutics), 128 selectivity is equally important when targeting a protein family with highly similar binding sites, such as, in this case, kinases. Subramanian et al. 66 applied PCM models to a kinase dataset comprising 50 different proteins in the DFG-in conformation to better understand both the residue and compound features that determine whether the ATP-binding site of kinases is involved in compound binding. The resulting PLS models, which included cross-terms (see Section 2.3), demonstrated the added value of PCM over ligand-based approaches, as statistically satisfactory QSAR models were reported for only 44% of the targets. More importantly, the models could be visually interpreted, thus enhancing the practical usefulness of PCM for the optimization of compound selectivity. (Further details on the study are given in Section 4.4, as the model targets were encoded with 3-dimensional information.) The distinction between Type I and Type II inhibitors has been shown to be amenable to PCM by Mendez-Lucio et al. 129 In order to distinguish between Type I and Type II inhibitors, the authors trained a PCM model on a dataset consisting of 463 data points from the interaction matrix defined by 50 known kinase Type I (ATP-competitive) inhibitors against 12 different sequences of ABL1 (five of them) in both the phosphorylated and non-phosphorylated state. 130 The model exhibited sound predictive ability, assessed by cross-validation, with RMSE and Q² values of 0.420 and 0.887 respectively. In addition, the model allowed the full interpretation of both compound (inhibitor) and protein (kinase) features. Hence, along with the prediction of pKd, a PCM model can provide information about the effect of both compound structural features and protein amino acid residues. 131-133
The importance of a given compound substructure, or a given amino acid residue, can be evaluated by calculating the difference between the bioactivity predicted for a compound with and without that substructure. 75 Fig. 4 displays how this information can be presented in practice and shows the average effect (over the whole data set) of the presence of a number of features on the pKd of inhibitor-kinase pairs.
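The with-and-without comparison described above can be sketched as follows. The function `toy_model` is a hypothetical stand-in for a trained PCM model (including one compound-residue cross-term), and the binary feature layout and coefficients are invented for illustration; only the averaging procedure reflects the text.

```python
def feature_effect(predict, pairs, feature_index):
    """Average change in predicted activity when one binary feature is switched on."""
    diffs = []
    for features in pairs:
        present = list(features)
        present[feature_index] = 1   # same pair, feature forced to be present
        absent = list(features)
        absent[feature_index] = 0    # same pair, feature forced to be absent
        diffs.append(predict(present) - predict(absent))
    return sum(diffs) / len(diffs)

def toy_model(f):
    """Hypothetical pKd model; f = [substructure A, substructure B, residue X]."""
    return 6.0 + 0.8 * f[0] - 0.3 * f[1] + 0.5 * f[0] * f[2]

# Hypothetical inhibitor-kinase pairs encoded as binary feature vectors.
pairs = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]

effect_a = feature_effect(toy_model, pairs, 0)  # substructure A, averaged over the data set
effect_b = feature_effect(toy_model, pairs, 1)  # substructure B, averaged over the data set
```

Because the toy model contains a cross-term, the effect of substructure A depends on whether residue X is present in the target, and the reported value is the average over all pairs in the data set.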
As shown by these recent PCM studies on the kinase superfamily, PCM can support new concepts for kinase inhibition involving the simultaneous interaction of kinase inhibitors with several targets, leading to multi-target kinase chemotherapy. 129,134 Therefore, PCM constitutes a suitable technique to help in the design of kinase inhibitors with respect to their potency and selectivity (Fig. 4). 129

Histone modication and DNA methylation
Epigenetic markers have been identied as emerging therapeutic targets in various malignancies and diseases by correlating phenotypes and differential expression patterns. 135 Key protein families involved in these processes are readers (bromodomains), writers (DNA modifying enzymes, histone acetylases, methyltransferases) and erasers (histone deacetylases). 136 Most of the bromodomain epigenetic targets have the ability to selectively modulate the gene expression pattern and contribute to post-translational modications, chromatin binding, inammation, oncogenesis. 137 Moreover there is a clear linkage to some diseases, e.g. multiple myeloma. [138][139][140] Vidler et al. 141 studied the druggability of the different members of the bromodomain family focusing on amino acid signatures in the bromodomain acetyl-lysine binding site, which resulted in a bromodomain family classication more correlated with the binding of small molecules in comparison with a wholesequence similarity classication. Numerous successful chemical probes like JQ1 have also been identied as bromodomain inhibitors by the Structural Genomics Consortium (SGC). 142 However, the bromodomain family still has unexplored therapeutic potential. To date there are no PCM studies performed on this family.
Recently, Wu and co-workers utilized structural similarity between three classes of HDACs and generated a predictive model for a novel candidate anti-tumour drug. 92 They implemented various descriptors (physicochemical properties) and similarity descriptors (sequence and structure) of compounds and targets in the PCM model and successfully identified class-selective inhibitors for class-I and class-II HDACs. The best model exhibited high predictive ability, with a reported Q² value of 0.754 on the external set. Overall, the increasing importance of epigenetic targets in drug discovery, as well as the availability of large-scale resources of epigenetic targets and their modulators, 143,144 will facilitate the application of PCM to this target family.

Viral mutants
Previous sections highlighted the ability of PCM to model bioactivities of several human protein superfamilies, yet PCM-based approaches are not limited to human protein targets. PCM has also been applied in a number of studies to predict activity profiles of ligands against different viral protein variants. 26 In the field of HIV, van Westen et al. 26 used 451 compounds tested against 14 HIV reverse transcriptase sequences to train a model that was able to predict the bioactivity of 317 new compound-mutant pairs. Interestingly, when the prediction was validated prospectively with 'wet lab' experiments, it was found that the prediction error (RMSE of 0.62 log units) was comparable to the experimental uncertainty of the assay (0.50 log units). In a similar setting, Huang et al. 41 showed that the inclusion of Protein-Ligand Interaction Fingerprints (PLIFs) of viral residues and ligand structures as cross-terms improved model predictive power over models lacking them. PCM models were trained on 92 compounds and 47 HIV-1 protease variants with about 160 Ki values. The best PCM model exhibited a Q² value of 0.827 on the external set. Next to these applications, PCM has been used to model the sensitivity of viral mutants to antiretroviral drugs, which could potentially guide HIV treatment. 145 Resistance testing and prediction using these models is achieved by incorporating genotypic (protein) and drug (chemical) data and subsequently linking them to phenotypic data (resistance). PCM then allows the prediction of optimal treatment regimens. The advantage of PCM over established sequence-based approaches is that interpretation of a single model allows the combined elucidation of residues responsible for the change in efficacy and the complementary chemical features affected. 146-149 For instance, van Westen et al. 145 trained PCM models based on a large clinical dataset composed of circa 300 000 datapoints combining both phenotypic and genotypic data. 
The application of PCM enabled the integration of the similarity of marketed drugs together with protein sequence similarity. The best model exhibited a fold change error of 0.76 log units, which constitutes an improvement of 0.15 log units with respect to previously reported models trained only on protein sequence similarity (0.91 log fold change error). In addition, the authors identified novel mutations of both HIV reverse transcriptase and HIV protease conferring drug resistance, underlining the ability of PCM models not only to model bioactivity information, but also to learn about features relevant for activity from both the ligand and the protein target side.
Similarly, drug susceptibility profiles were predicted based on PCM. In that way, two models have been reported for the prediction of: (i) the susceptibility (bioactivity profile) of a given HIV protease genotype to seven commonly used protease inhibitors; 146 and (ii) the susceptibility of HIV reverse transcriptase to eight nucleoside/nucleotide reverse transcriptase inhibitors. 149 PCM models were trained on 4792 HIV protease-inhibitor combinations, with the best model achieving a Q² value of 0.87 on the external set. These models have been made publicly available via web services at http://www.hivdrc.org/services, allowing free use of these algorithms. 150 While the ligands of most PCM studies discussed here were small molecules, protease peptide substrates are also amenable to PCM. This has been demonstrated recently by Prusis et al., 151,152 who used PCM modelling to study the enzyme kinetics parameters of designed small peptide substrates on four dengue virus NS3 proteases. It was found that the PCM models for Km and kcat were significantly different. Therefore, by optimizing the peptide amino acid properties important for Km, it was possible to improve the affinity of the peptides for the protease while they lost their catalytic activity, hence obtaining peptides that were dengue protease inhibitors.
These studies by Prusis et al. and van Westen et al. are some of the few reports in which predictions have been validated prospectively, demonstrating the predictive power of PCM in different scenarios.

Novel target similarity measure
In the context of GPCR studies, the development of better similarity metrics has helped to determine key binding residues within the GPCR trans-membrane (TM) helical bundle, 51,63,120 aided intra-family similarity determination using cavity fingerprints, 153 and boosted high-throughput homology models that supported cavity detection programs. 65,153-155 PCM approaches including these features have also helped in off-target predictions, retrieval of new lead compounds, and target prediction for GPCR-focused combinatorial chemolibraries. 156,157 The binding site focused techniques used in the studies described above allowed for the identification of orthosteric and allosteric sites on the same target for different ligand families. Along these lines, Gao et al. 93 showed the higher predictive ability of models trained on trans-membrane identity descriptors (Q² = 0.74) over Z-scales (Q² = 0.72) when modelling the inhibition constant of 9 human aminergic GPCRs and 128 ligands (310 ligand-target combinations). Similarly, Shiraishi et al. 158 revealed specific chemical substructures binding to relevant TM pocket residues, which is not only relevant to mutational analysis but also serves as a complementary approach to Structure-Based Drug Discovery (SBDD). 62,158 TM identity descriptors and TM kernels are more discriminating than Z-scales for GPCRs and allow identification and interpretation of GPCR residues associated with binding of ligands (of a particular chemotype). Therefore, the identification of chemical moieties and residues involved in ligand binding enables the development and optimization of GPCR inhibitors with respect to both potency and selectivity.

Including 3D information of protein targets in PCM
The binding of a ligand to a protein is a complex process, governed on the structural level by the 3-dimensional (3-D) composition of the protein binding site, the 3-D conformation of the approaching ligands, and the complementarity of their pharmacophoric features. Hence it is expected that inclusion of spatial information from the protein binding sites would improve the predictive power of PCM. Unfortunately, this approach is frequently limited by the lack of high quality 3-D structures, poor understanding of ligand-induced conformational changes, and inaccurate superimposition of protein structures. The latter can be (partly) overcome by the use of alignment-free protein descriptors, 65,81 but usually at the cost of lower resolution, loss of target-related information and poor interpretability.
Jacob et al. 118 found no improvement through the use of 3-D information. In this study an analysis of 2446 ligands interacting with 80 human GPCRs was performed using a linear vector representing conserved amino acids in the binding pockets. While the binding pocket kernel implicitly encodes 3-D information, the spatial arrangements were derived from the comparison to only two template proteins. Overall, the 3-D kernels (~77% prediction accuracy) did not show improvements compared to lower dimensional protein descriptions (~77% prediction accuracy with a protein similarity kernel). Likewise, Wassermann et al. 159 found little improvement using 3-D information in their analysis of interactions of 12 proteases with 1359 ligands using the TopMatch similarity score, 160 which used all amino acids within 8 Å of the catalytic residues to describe the target proteins. This 3-D description did not perform better (~61% recovery rate) than the sequence (~57%) and protein class-based (~62%) kernels used in this publication.
Conversely, early work by Strömbergsson et al. 161 used local protein substructures, encoded as motifs of 5 amino acid stretches, which are closer than 6.5 Å to each other. For a set of 104 enzymes, this local substructure method showed an improvement over the use of global SCOP (Structural Classification of Proteins) folds, and the RMSE values on the external validation set decreased from 2.06 to 1.44 pKi units. Additionally, it was found that local substructures close to the ligand binding sites were assigned more importance in the models than more distant ones, which is intuitively understandable. Similarly, Meslamani and Rognan did find an improvement by using 3-D information. 60 A set of 581 diverse proteins was described by the 3-D cavity descriptor FuzCav, 65 which is a vector of 4834 integers reporting counts of pharmacophoric feature triplets mapped to Cα-atoms of binding site-lining residues. The use of cavity 3-D kernels showed a clear advantage (F-measure 0.66) over sequence-based descriptions (F-measure 0.54) in predicting target-ligand pairings for a large external test set (>14 000 ligands, 531 targets), especially in local models. This difference seems to be even more pronounced for datasets with limited ligand data (<50 ligands). Likewise, a recent study by Subramanian et al. 66 described the superimposed binding sites of 50 (unique) kinases by molecular interaction fields derived from knowledge-based potentials and Schrödinger's Water-Maps. 162,163 In this example, too, a significant improvement for 3-D methods (r² = 0.66, q² = 0.44) compared to sequence-based methods (r² = 0.50, q² = 0.34) was reported. Additionally, this combination of methods allows interpretation and easy visualization of PCM results within the context of ligands and binding pockets.
Earlier studies have not clearly shown the advantages of 3D PCM over solely sequence-based approaches, whereas more recent studies show that including 3D information appears to improve performance. The particular data set used (e.g. number of ligands) and the quality of the data provided likely determine whether there is a possible gain from this type of description. However, the constantly increasing number of protein structures, more robust alignment-free methods (e.g. Nisius and Gohlke 164 or Andersson et al. 81 ), and the introduction of protein descriptors with easier interpretability (e.g. Desaphy et al. 165 ) might help the interpretation and the visualization of PCM models in the future.

PCM in predicting ligand binding free energy
The application of PCM to docking might not be directly obvious. Yet, the concepts used in PCM, quantitatively relating ligand- and protein-side descriptors to affinity/activity, very much resemble empirical scoring functions. Molecular docking has led to the discovery of active compounds, 166 yet it suffers from several well-described limitations, among which is the relatively low performance in the prediction of interaction energies. 167,168 In contrast, PCM models can predict the difference in Gibbs free energy (ΔG = −RT ln Kd) between the initial state, where the protein and the compound do not interact, and the final ligand-target complex. Therefore, the principles of PCM can be applied to develop PCM-based scoring functions.
Kramer et al. 169 demonstrated this concept by building a structure-based PCM scoring function. Their method trains a bagged stepwise multiple linear regression model on a subset of 1387 protein-ligand complexes extracted from the PDBbind09-CN database. 170 Subsequently a new compound-target interaction descriptor based upon distance-binned Crippen-like atom type pairs was introduced. The best model outperformed commercially available scoring functions assessed on the PDBbind09 database and was able to explain 48% of the variance of the external set, providing an RMSE equal to 1.44. Although similar methods had been previously proposed, 171-175 this was the first study in which a sufficiently large validation was carried out to ascertain the model's predictive power. Additionally, the implementation of bagged stepwise multiple linear regression (MLR) and PLS enabled the evaluation of the importance of ligand and target descriptors for the PCM model.
Similarly, a subsequent study reported the development of a scoring function based upon the CSAR-NRC HiQ benchmark dataset (http://csardock.org). 176 The best model exhibited acceptable statistics, with a cross-validated R² = 0.55 and RMSE = 1.49. 176 Finally, Koppisetty et al. 177 were able to predict for the first time ligand binding free energies in which the enthalpic and entropic contributions for a given binding event were deconvoluted. Therein, the authors demonstrated the importance of including ligand descriptors (QIKPROP and LIGPARSE, calculated in the Schrödinger suite) 178 in the models in addition to 3-dimensional ligand-protein interaction descriptors.
As demonstrated above, PCM overlaps with methods that originate from the structure-based field, because PCM in principle describes any method that relates ligand features and protein/target features on a large scale to an output variable of interest. Another source of complementary information is that obtained from divergent and convergent homologous sequences. This allows PCM models to extrapolate the bioactivity of ligands to the same protein target in different species, as shown below.

PCM as an approach to extrapolate bioactivity data between species
Given that PCM considers bioactivity data from related targets, these related targets can also include similar targets from different species. Within a group of related targets, a distinction can be made from an evolutionary standpoint between gene pairs originating from intra-species gene duplication events (paralogy, within species) and those originating from speciation events (orthology, across species). 179 Since orthologous genes tend to maintain the original function, binding modes also tend to be more conserved than in paralogues, where the original protein function is less conserved.
This has also been shown to hold for the affinities of ligands binding to these orthologues. In a recent analysis of bioactivity data, Kruger et al. 21 demonstrated that the same small molecule exhibits similar binding affinities when acting on orthologues (though some exceptions were found, e.g. the histamine H3 receptor). Moreover, the authors verified that larger differences in binding affinity are observed for paralogues than for orthologues by analyzing the differences in binding for a total of 20 309 compounds on 516 human targets, with 651 being the final number of orthologous pairs. These observations aid in optimizing ligands for their interaction with conserved residues across a given protein family, thus making them more desirable lead compounds (and avoiding their interaction with unrelated targets). 180 In the field of PCM, Lapinsh et al. 37 demonstrated for the first time the capability of PCM to successfully combine the pKi values of 23 organic compounds on 17 human (paralogues) and 4 rat (orthologues) amine GPCRs. The authors were able to deconvolute the binding site interactions into two types, namely those involved in specificity and those involved in affinity. Compound design can therefore be envisioned from the viewpoint of either affinity or specificity. Similarly, the contribution to compound affinity of the TM regions involved in the interactions between amine GPCRs and compounds was also quantified. For example, TM regions 2, 3, 4, 6 and 7 are responsible for low overall affinity in β2 receptors; however, the same regions are positive contributors to overall high affinity in α1a receptors. van Westen et al. 22 built on this by including in a PCM model bioactivity data from four human and rat adenosine receptors (A1, A2A, A2B and A3).
The authors screened a commercial chemolibrary composed of 791 162 compounds with the most predictive PCM model obtained, which exhibited Q² and RMSE values of 0.73 and 0.61 pKi units, respectively. Prospective experimental validation led to the discovery of new high-affinity inhibitors, including a compound with a pKi value of 8.1 on the A1 receptor. Finally, we have applied PCM to model the pIC50 values of 3228 distinct compounds on 11 mammalian cyclooxygenases (COX) using ensemble PCM. 55 The final ensemble PCM model, trained on the cross-validation predictions of a panel of 282 RF, SVM and Gradient Boosting Machine (GBM) models, each trained with different hyperparameter values, led to predictions on the test set with RMSE and R0² values of 0.71 and 0.65, respectively. Additionally, the description of compounds with unhashed Morgan fingerprints permitted a chemically meaningful model interpretation, which highlighted chemical moieties responsible for selectivity towards COX-2, in agreement with the literature. 55
The ability of PCM to embrace multispecies information using sequence descriptors allows the creation of models capable of predicting compound activity on targets for which few data points are available on the human orthologue. The large existing body of bioactivity data collected on organisms other than human (e.g. rat and mouse) provides a good resource. This data was derived from the traditional usage of rodent tissues as a source of proteins for biochemical and pharmacological assays. Moreover, the difference in bioactivity between a compound acting on its human target and on its orthologue in another species (e.g. the CCR1 antagonist BX471) hampers the utilization of animal models to study human diseases at a molecular level. 181 Thus, PCM can help not only to reduce the number of experiments required to complete the compound-target interaction matrix, 29 but also appears as a practical tool to understand complex diseases in scenarios where current experimental settings are insufficient (e.g. undeveloped enzymatic assays for a given protein). Similarly, PCM might be applied as a supporting tool in allometric scaling to predict the behavior of clinical candidate drugs in humans. 182,183 Nonetheless, the extrapolation capabilities of PCM models are subject to the completeness of the bioactivity matrix (Fig. 1). In practice, even though high performance can be attained with a matrix completeness level below 3%, the variability of the chemical space plays a key role in determining the extrapolation capability of a PCM model on the chemical side. 75 Therefore, a balance has to be found between the coverage of chemical and target space and the degree of completeness of the bioactivity matrix.
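The notion of matrix completeness can be made concrete with a short sketch. It assumes that completeness is defined as the fraction of distinct compound-target pairs with at least one measurement, relative to all possible pairs between the compounds and targets present in the dataset; the function name and the toy data are hypothetical.

```python
def matrix_completeness(measurements):
    """Fraction of the compound x target matrix covered by at least one measurement.

    `measurements` is an iterable of (compound_id, target_id, value) triples;
    repeated measurements of the same pair are counted once.
    """
    pairs = {(compound, target) for compound, target, _ in measurements}
    if not pairs:
        return 0.0
    compounds = {compound for compound, _ in pairs}
    targets = {target for _, target in pairs}
    return len(pairs) / (len(compounds) * len(targets))


# Toy dataset: 3 compounds, 2 targets, 4 distinct pairs (one pair measured twice).
toy = [
    ("cmpd1", "A1_human", 7.2),
    ("cmpd1", "A2A_rat", 6.1),
    ("cmpd2", "A1_human", 8.0),
    ("cmpd3", "A2A_rat", 5.5),
    ("cmpd1", "A1_human", 7.0),
]
print(f"completeness = {matrix_completeness(toy):.2f}")
```

Reporting this figure alongside the number of compounds and targets gives a quick indication of how much of the interaction matrix a PCM model has to infer rather than interpolate.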

PCM applied to pharmacogenomics and toxicogenomics data
The biological space in a PCM model can be further extended from single proteins to whole cell lines. A step forward in this regard is the inclusion of cell line descriptors in a PCM model in order to model cell line sensitivity to cancer drugs or toxic compounds. Given that individual cell lines have been shown to exhibit diverse profiles with respect to drug sensitivity, the variability on the cell line side, which now constitutes the target side of PCM, can be exploited to concomitantly predict both drug potency and cell line selectivity. 17 Additionally, PCM can also facilitate the interpretation of differential gene expression or of the mechanism of toxicity of compounds, 88 as will be shown below.
The availability of pharmacogenomics and toxicogenomics data has enabled predictive modelling of cancer cell line sensitivity. These models consider as the dependent variable the response of a whole cell to a given drug, for example in the form of EC50 values, which denote the concentration at which a chemical exerts half of its maximal effect. Therefore, the 'target' component in the PCM model is no longer a single protein described in terms of binding site properties, but a cell line described by more complex (usually genomic) features such as oncogene mutations, cell karyotypes or gene expression levels.
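How the 'target' side is swapped for cell line features can be sketched as follows. The example simply concatenates a compound descriptor with a cell line descriptor for every measured compound-cell line pair; the descriptor values, identifiers and function name are made up for illustration and do not correspond to any of the datasets discussed here.

```python
def pcm_rows(compound_desc, cell_desc, responses):
    """Build (features, response) rows for a PCM model.

    compound_desc: dict compound_id -> list of compound descriptor values
    cell_desc:     dict cell_line_id -> list of cell line descriptor values
    responses:     dict (compound_id, cell_line_id) -> measured response (e.g. pEC50)
    """
    rows = []
    for (compound, cell_line), response in responses.items():
        # Concatenation: chemical space first, biological (cell line) space second.
        features = compound_desc[compound] + cell_desc[cell_line]
        rows.append((features, response))
    return rows


# Hypothetical toy descriptors: 3 fingerprint bits and 2 gene expression levels.
compound_desc = {"drugA": [1, 0, 1], "drugB": [0, 1, 1]}
cell_desc = {"line1": [0.2, 1.5], "line2": [0.9, 0.1]}
responses = {("drugA", "line1"): 5.3, ("drugB", "line2"): 6.8}

for features, response in pcm_rows(compound_desc, cell_desc, responses):
    print(features, response)
```

Each row then describes one compound-cell line pair, so a single model trained on such rows can, in principle, be queried for any combination of a described compound and a described cell line.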
In the context of human cell lines, the work on the NCI-60 cell line panel, which covers cells from 9 different cancer types, has helped to find novel molecular determinants of drug sensitivity, as well as to develop drugs targeting specific tumor types (disease-oriented), e.g. 9-Cl-2-methylellipticinium acetate for central nervous system tumours. 184 However, the number of cancer cell lines with drug sensitivity data increased vastly with the release in 2012 of two major cancer cell line panels, namely the Cancer Cell Line Encyclopedia (CCLE), consisting of 947 cancer cell lines, 185 and the Genomics of Drug Sensitivity in Cancer (GDSC), consisting of 727 cancer cell lines. 186 The setup of both cell line collections, which share a total of 471 cell lines, enabled their large-scale pharmacological profiling. In that way, Barretina …
The availability of public bioactivity profiles for compounds, in combination with detailed genetic information on the cell lines, constitutes a scenario where machine learning (ML) can be applied for predictive cell line sensitivity modelling. In this area, Menden et al. 29 exploited cell line drug sensitivity information from the GDSC and incorporated genomic features in combination with chemical descriptors in non-parametric models, i.e. neural networks and random forests. These models allowed the authors to determine the missing drug response (IC50) values in the original cell line-compound matrix. The best model predicted the sensitivity on the external (blind) test set with a correlation between observed and predicted values of 0.64, while a value of 0.61 was obtained when predicting the response on a tissue unseen by the model in the training phase. Recently, we have integrated PCM random forest models with conformal prediction for the large-scale prediction of cancer cell line sensitivity with error bars. 17,189 Compounds were described with Morgan fingerprints, whereas a total of 16 cell line profiling datasets were benchmarked for their predictive signal. Gene expression data consistently led to the highest predictive power. Interestingly, we found statistically significant differences in predictive power between PCM models trained on cell line identity fingerprints (inductive transfer of knowledge between cell lines) 190 and those trained on cell line profiling data, suggesting that the explicit inclusion of cell line information improves the prediction of cell line sensitivity. Of practical relevance, the predicted bioactivities enabled the prediction of growth inhibition patterns on the NCI60 panel and the identification of genomic markers of drug sensitivity.
The cancer cell line collections described above still remain to be fully exploited. While they constitute a great opportunity for PCM to integrate both drug sensitivity and genomics data in single models, this data integration remains challenging due to the disagreement in drug sensitivity measurements between the CCLE and the GDSC. 191,192 Overall, the principles of PCM, namely the combination of chemical and cell line (target) information in single machine learning models, are suited to integrating and exploiting the increasing availability of drug sensitivity measurements on cancer cell line panels. The application of PCM in pharmacogenomics is a recent subfield which the authors are certain will grow in the near future. Moreover, in silico drug sensitivity prediction is a cost-efficient method capable of relating large-scale pharmacogenomics data, which is likely to foster the identification of chemotherapeutic lead compounds in both academic and pharmaceutical cancer drug discovery pipelines.

Other potential PCM applications
As reviewed above, PCM has been applied in a wide range of drug discovery settings, yet more applications remain unexplored. The prediction of compound toxicity on cell lines (toxicogenomics), 193–196 beyond the aforesaid cancer cell line collections, is also amenable to PCM. Recently, Kaggle, 197 a crowd-sourcing platform, hosted two competitions in the field of chemoinformatic modelling. Two pharmaceutical companies, Boehringer Ingelheim and Merck, provided structure-activity relationship datasets to the community in order to find the most predictive machine learning algorithms. The Merck challenge consisted of 15 datasets, each containing the bioactivities of a series of molecules on a different target. The winners of the competition applied restricted Boltzmann machines (deep learning). 198 Interestingly, the winning team noted that the similarity between the datasets (targets) could be exploited by training a single neural network on all datasets, with an output layer of fifteen different units (neurons). On the other hand, Boehringer Ingelheim provided a dataset with 1776 compound descriptors. The response variable was binary: 0 corresponded to a compound not eliciting the expected activity, whereas 1 corresponded to a compound showing activity. In this case, the highest predictive ability was obtained with model ensembles (random forests, gradient boosting machines and K-nearest neighbors). In a similar vein, the modelling challenge DREAM8 was proposed to the scientific community to model the toxicity of 106 compounds on 884 lymphoblastoid cell lines, which were characterized by SNP genotypes and gene transcript levels quantified by RNA sequencing. 199–201
As described in this review, a large variety of protein targets have been modelled using PCM. Beyond the modelling of the activity of compounds on targets of diverse nature, the interaction between nucleic acids and proteins is also amenable to PCM modelling. In this context, Bellucci et al. predicted protein-RNA interactions based upon the physicochemical properties of both the polypeptide and the nucleotide chains. 202 However, to date few studies have been published in this area. 50,202

PCM limitations
The usefulness of PCM in computational drug design has been extensively proven in silico (see Section 2.7) and in prospective experimental validation. Nevertheless, there are a number of limitations that should not be overlooked. Publicly available bioactivity databases contain a non-negligible degree of experimental uncertainty, 108-111 which should certainly be taken into account in the modelling phase, as recently proposed by Cortes-Ciriano et al. 75 Similarly, confidence intervals for individual predictions should be reported, which can be calculated with algorithm-dependent approaches, e.g. Gaussian processes, 75 or with algorithm-independent techniques, such as conformal prediction. 17,189 In addition to being informative for biologists, these confidence intervals constitute a valuable source of information about the applicability domain (AD) of a given model. 75 The AD is defined as the region of ligand and target space to which a given model can be reliably applied. Thus, in addition to the model validation schemes presented above, an estimation of the model AD should accompany any reported model in order for it to be of practical use.
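The algorithm-independent route can be illustrated with a minimal sketch of split (inductive) conformal prediction for regression. It assumes that a model has already been trained and that its predictions on a held-out calibration set are available; the absolute residual is used as the nonconformity score, which is one common choice and not necessarily the one used in the cited studies. All names and numbers below are hypothetical.

```python
import math


def conformal_halfwidth(y_cal, pred_cal, confidence=0.8):
    """Half-width of the prediction interval from calibration residuals.

    Uses the absolute residual |y - prediction| as nonconformity score and
    returns the ceil((n + 1) * confidence)-th smallest score. If the calibration
    set is too small for the requested confidence, the interval is unbounded.
    """
    scores = sorted(abs(y - p) for y, p in zip(y_cal, pred_cal))
    n = len(scores)
    rank = math.ceil((n + 1) * confidence)
    if rank > n:
        return math.inf
    return scores[rank - 1]


def predict_interval(prediction, halfwidth):
    """Symmetric interval around a point prediction for a new compound-target pair."""
    return (prediction - halfwidth, prediction + halfwidth)


# Toy calibration set: observed and predicted pKi values (made-up numbers).
y_cal = [6.1, 7.4, 5.0, 8.2, 6.9, 7.7, 5.5, 6.3, 7.0]
pred_cal = [6.4, 7.0, 5.6, 8.1, 6.2, 7.9, 5.4, 6.8, 7.1]

halfwidth = conformal_halfwidth(y_cal, pred_cal, confidence=0.8)
print(predict_interval(7.3, halfwidth))
```

Provided that calibration and new data points are exchangeable, intervals built this way contain the true value with at least the requested frequency, regardless of the underlying learning algorithm. This sketch yields the same width for every prediction; making the width depend on the individual data point, which is what links interval size to the applicability domain, requires a normalized nonconformity score and is not shown here.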
Another limitation that is often inherent to bioactivity data is data skewness. Some datasets mostly report active 203 or inactive molecules, 204 and thus compound-target combinations that have not been tested experimentally are normally considered as inactive or active interactions, respectively. Moreover, public data in general tend to favor a relatively small number of protein classes that have been extensively explored (e.g. GPCRs and kinases). 23–25,205 As such, for some targets the available data might not be sufficient for PCM projects, given that imbalanced datasets can lead to models with high false negative or false positive rates. Nevertheless, the modelling of cell line sensitivity has shown that PCM displays high interpolation power, as the accuracy of prediction reached a plateau when 20% of the whole compound-cell line matrix was included in the training set. 29
Beyond the quality of the data, the choice of descriptors still constitutes a field of active research, especially with respect to protein descriptors, whose development will strongly influence the success of PCM in the coming years. 45 A recent paper by Brown et al. 190 suggested that PCM mostly relies on inductive transfer of knowledge and that protein descriptors mostly act as labels and do not account for structural differences among targets. However, we have recently shown that both amino acid descriptors and cell line profiling datasets account for structural information of eukaryotic, mammalian and bacterial DHFR, and of cancer cell lines, where the difference in performance on the test set between inductive transfer and PCM models was statistically significant. 17,56
PCM requires the concatenation of ligand and target descriptors, and sometimes also cross-terms, which substantially increases the dimensionality of the input space with respect to QSAR. Although this higher dimensionality might lead to overfitting in PCM, 206 in practice PCM has been shown to exhibit higher predictive power on the test set than QSAR. 22,26,75

7 Conclusions
PCM is becoming a mature technique that allows the simultaneous use of both the chemical and the biological spaces in predictive bioactivity modelling. Both retrospective and prospective validation have underscored the advantages of PCM over ligand-based methods. However, it is the extensive expertise developed in the fields of QSAR and chemoinformatics on which PCM can build. Nowadays, a wide choice of properly benchmarked ligand and protein descriptors is available, as well as different linear and nonlinear modelling algorithms. Nonetheless, conceptually diverse machine learning algorithms (e.g. GP), the inclusion of three-dimensional information on both ligands and targets, and the use of pharmacogenomics data are still under exploration.
Overall, the ability of PCM to become a customary technique in both the public and the private domain in the following years will certainly rest on its capability to capitalize on biological data of diverse nature, including personalized 'omics' data (personalized medicine), in combination with structural data of ligands, be those small molecules, antibodies or peptides.