Andreas Mayr‡a, Günter Klambauer‡a, Thomas Unterthiner‡a, Marvin Steijaertb, Jörg K. Wegnerc, Hugo Ceulemansc, Djork-Arné Clevertd and Sepp Hochreitera
aLIT AI Lab and Institute of Bioinformatics, Johannes Kepler University Linz, Austria. E-mail: hochreit@bioinf.jku.at; Fax: +43-732-2468-4539; Tel: +43-732-2468-4521
bOpen Analytics NV, Belgium
cJanssen Pharmaceutica NV, Belgium
dBayer AG, Germany
First published on 6th June 2018
Deep learning is currently the most successful machine learning technique in a wide range of application areas and has recently been applied successfully in drug discovery research to predict potential drug targets and to screen for active molecules. However, due to (1) the lack of large-scale studies, (2) the compound series bias that is characteristic of drug discovery datasets and (3) the hyperparameter selection bias that comes with the high number of potential deep learning architectures, it remains unclear whether deep learning can indeed outperform existing computational methods in drug discovery tasks. We therefore assessed the performance of several deep learning methods on a large-scale drug discovery dataset and compared the results with those of other machine learning and target prediction methods. To avoid potential biases from hyperparameter selection or compound series, we used a nested cluster-cross-validation strategy. We found (1) that deep learning methods significantly outperform all competing methods and (2) that the predictive performance of deep learning is in many cases comparable to that of tests performed in wet labs (i.e., in vitro assays).
Conducting these experiments is a time- and cost-intensive process. Usually, a cell line must be cultivated to obtain a single data point. For example, even the Tox21 project,2 an unprecedented multi-million-dollar effort, could test only a few thousand compounds for as few as twelve toxic effects. Therefore, accurate computational target prediction methods are of great value in supporting and improving the drug discovery process.
Deep learning, a new computational technique that has made an impact in many research areas, has recently been applied very successfully not only to target prediction,3,4 but also to other tasks in chemistry. Examples are the automatic generation of molecules,5–9 chemical synthesis planning,10 drug synergy prediction,11 modelling quantum interactions12 and speeding up quantum mechanical computations,13 which might further help in the design of new, efficient molecular organic light-emitting diodes.14 The main goal of this study was therefore to compare the performance of deep learning with that of other methods for drug target prediction.
Deep learning architectures seem well suited to target prediction because they both allow multitask learning15–17 and automatically construct complex features.17 First, multitask learning has the advantage that it naturally allows for multi-label information and can therefore utilize relations between targets. Multitask learning allows hidden unit representations to be shared among prediction tasks. This is particularly important because for some targets very few measurements are available, and therefore single task prediction may fail to construct an effective representation. In contrast, deep learning can exploit representations learned across different tasks and can boost the performance on tasks with few training examples. Fig. 1 shows that many compounds were measured by multiple assays (left), and – based on this observation – that there are strongly correlated assays available (right). Second, deep networks provide hierarchical representations of a compound, where higher levels represent more complex properties.18 A hierarchy of features naturally emerges: single atoms are grouped together as functional groups and reactive centers, which in turn define pharmacophores. Such features are the state-of-the-art way in which chemists and drug designers think about the properties of each chemical compound.19
Fig. 1 Assay correlation (left: number of compounds (log-scaled) measured on both assays; right: Pearson correlation on commonly measured compounds).
There are several pitfalls in comparing drug target prediction methods, which especially concern the selection of a comparison dataset, the compound series bias inherent in chemical datasets and hyperparameter selection.
First, many method comparison studies comprise only a single assay or target, or very few of them,3,20–22 whereas compound databases such as ChEMBL23 contain many more assays. These studies therefore restrict the conclusions of the method comparison to a certain subset of assays and, despite the large amount of publicly available data, underestimate the multitask learning effect. Some target prediction algorithms can exploit the information of similar assays to improve the predictive performance for a particular assay of interest; such algorithms are referred to as multitask learning algorithms. Assays with few measurements in particular can benefit from information from similar assays. Aside from the underestimated predictive performance, other potential benefits of multitask settings are ignored: a multitask model can provide predictions for a large number of assays at once, which can help chemists and biologists to conceptualize how certain compounds might act at the cellular level. It is therefore highly desirable to include a large number of assays in a method comparison study, both to evaluate the benefits of multitask learning in terms of predictive performance and to provide more general, comparative statements on target prediction methods.
Second, most comparative studies suffer from the compound series bias24 and hence overestimate the performance of certain methods. The compound series bias arises from the way chemical compounds are generated: chemists typically generate new chemical scaffolds rather than individual compounds, and derive new substances from these scaffolds by adding various functional groups. Predicting the target activity of a compound from a new compound series is a more difficult task than predicting it for compounds from a series that is contained in the training set. Hence, a prediction performance estimate that suffers from the compound series bias is overoptimistic compared to the situation in which the prediction method is used in practice on compounds from new compound series.
Third, performance estimates are biased by hyperparameter selection (hyperparameter selection bias). This is especially pronounced in deep learning because it allows many combinations of architectures, activation functions, learning rates, and regularization parameters. The bias arises if label information from the test set influences the adjustment of hyperparameters for building predictive models; in practice, however, no test set labels are available for adjusting hyperparameters. The estimated prediction performance is therefore in many cases overoptimistic. Since different learning algorithms have different numbers of hyperparameters, and these hyperparameters offer different degrees of flexibility, learning algorithms differ in their tendency to overfit. Hence, a method comparison that is affected by the hyperparameter selection bias is typically unfair.
To avoid the first pitfall, we extracted a large benchmark dataset from the ChEMBL database that allows reliable assessment of the performance of machine learning methods for compound target prediction. The dataset contains about 500000 compounds and more than 1000 assays. These assays correspond to a variety of target classes (e.g. enzymes, ion channels and receptors) and differ in size. In particular, the dataset has many assays that comprise only relatively few measurements (a hundred to several hundreds), but there are also several assays with a very large number of measured compounds (tens of thousands).
The second problem is solved by cluster-cross-validation4 (for details see section “Methods”). In conventional cross-validation, the set of data points is partitioned randomly into several folds. Proceeding iteratively, each fold serves once as a test set while the remaining folds form the training set. In each iteration, the training set is available to build a new predictive model, while the prediction performance of this model is estimated on the test set. Instead of randomly assigning data points to folds, cluster-cross-validation distributes whole clusters of compounds across folds. As a consequence, chemical compounds from the same cluster are either in the training set or in the test set; cluster-cross-validation prevents the situation in which some data points of a cluster fall into the training set while other data points from the same cluster fall into the test set. In a cluster-cross-validation benchmark, a machine learning method must therefore successfully predict the activity of compounds from new scaffolds in a very large number of cases. Cluster-cross-validation thus provides performance estimates for the prediction of compounds that are based on new chemical scaffolds and takes into account the way chemical compounds are generated.
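For illustration, the fold-assignment step can be sketched as follows, assuming precomputed cluster labels (see section “Methods”); the greedy size-balancing heuristic and the helper name are assumptions for illustration, not the study's actual implementation.

```python
# Minimal sketch of cluster-cross-validation fold assignment: whole clusters,
# rather than individual compounds, are distributed across folds, so no cluster
# straddles the training/test boundary. Hypothetical helper, not the study's code.
import numpy as np

def assign_clusters_to_folds(cluster_ids, n_folds=3, seed=0):
    """Map each compound to a fold via its cluster, greedily balancing fold sizes."""
    rng = np.random.default_rng(seed)
    clusters, counts = np.unique(cluster_ids, return_counts=True)
    fold_sizes = np.zeros(n_folds, dtype=int)
    cluster_to_fold = {}
    for idx in rng.permutation(len(clusters)):
        fold = int(np.argmin(fold_sizes))   # place cluster into the smallest fold
        cluster_to_fold[clusters[idx]] = fold
        fold_sizes[fold] += counts[idx]
    return np.array([cluster_to_fold[c] for c in cluster_ids])

# Example: six compounds in three clusters; compounds sharing a cluster share a fold.
folds = assign_clusters_to_folds(np.array([0, 0, 1, 2, 2, 2]))
```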
The third problem is tackled by applying a nested cross-validation scheme.25,26 An outer loop measures the prediction performance of the algorithms, while an inner loop is used to adjust the hyperparameters of the individual methods such that the methods can choose their best settings for building predictive models in the outer loop. In total, we used three different folds in our nested cluster-cross-validation setting. In each iteration, the inner loop uses one of the three folds from our benchmark dataset for training and one fold for validating the hyperparameter combinations searched while keeping the last fold aside as a test fold for the outer loop. The outer loop uses both the training and the test fold of the inner loop for training a model. The hyperparameters in the outer loop are selected based on a prediction performance criterion from inner loop cross-validation. Thus, the performance estimates provided by nested cross-validation are not biased by hyperparameter selection.
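Schematically, the nested loop can be written as follows; `make_model`, `hyperparams` and `auc_score` are placeholders for the actual learning algorithms, search grids and evaluation metric, and a scikit-learn-style estimator interface is assumed.

```python
# Schematic nested cluster-cross-validation with three folds, reusing the
# `folds` array from the previous sketch. Placeholder names throughout.
def nested_cluster_cv(X, y, folds, make_model, hyperparams, auc_score):
    outer_scores = []
    for test_fold in (0, 1, 2):
        train_fold, valid_fold = [f for f in (0, 1, 2) if f != test_fold]
        # Inner loop: train on one fold, validate every hyperparameter setting
        # on another, and keep the best-performing setting.
        best_hp = max(
            hyperparams,
            key=lambda hp: auc_score(
                make_model(hp).fit(X[folds == train_fold], y[folds == train_fold]),
                X[folds == valid_fold], y[folds == valid_fold]))
        # Outer loop: refit on both inner folds with the selected hyperparameters
        # and evaluate once on the held-out test fold.
        model = make_model(best_hp).fit(X[folds != test_fold], y[folds != test_fold])
        outer_scores.append(auc_score(model, X[folds == test_fold], y[folds == test_fold]))
    return outer_scores
```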
Aside from carrying out an in silico prediction performance comparison study, we also carried out an experiment that compares the accuracy of in silico predictions to the accuracy of in vitro measurements. Here we explicitly consider the case in which two assays are different but should measure the same biological effect of a compound. Since our in silico prediction method can be considered a virtual assay, we assessed whether a virtual assay or a surrogate in vitro assay is more accurate at predicting the outcome of an assay of interest.
We compared the prediction performance of several deep learning architectures with that of a variety of methods, in particular with support vector machines28 (SVMs) and K-nearest-neighbours (KNNs) as representatives of similarity-based classification methods and with random forests29 (RFs) as a representative feature-based classification method. Furthermore, we included naive Bayes (NB) and SEA30–32 in the comparison, which we considered as representatives of target prediction methods that were constructed specifically for the purpose of drug discovery. More details on the individual methods are given in the “Methods” section and in ESI Section S3.†
For deep learning methods, we considered three main architectures of deep neural networks (DNNs): feed-forward neural networks (FNNs), convolutional neural networks33 (CNNs), and recurrent neural networks (RNNs). FNNs take vectorial inputs and consist of several layers of (affine) linear maps followed by an activation or the final output function. CNNs are highly successful at image processing tasks.34–37 They usually take a whole 2D image as input, and an important characteristic of this network type is that parameters are shared across neurons. CNNs consist of several convolution and pooling layers, where the convolution layer outputs are typically computed by a parametrized kernel and the pooling layer outputs are computed by a simple aggregation function. We consider graph convolutional networks, which make use of neighbourhoods defined by the molecular graph topology instead of the pixel neighbourhood of 2D images. Explicitly, we looked at two implementations: one is referred to as GC38,39 and the second one as Weave.40 Both are available in the DeepChem39,41 package. RNNs are successfully used in applications that have to process sequence data, such as natural language processing42–44 or speech recognition.45 In RNNs, network parameters are shared across the time steps of the sequences. As vanishing gradients46,47 are a problem in learning these networks, memory cells were introduced, which led to the LSTM architecture.48 Here we consider LSTM networks that take SMILES49 strings as input. We refer to this architecture as SmilesLSTM.
For comparisons of target prediction methods, we used the area under the receiver operating characteristic curve50 (ROC-AUC; since it is our default metric, we abbreviate it as AUC when there is no ambiguity) as a performance assessment criterion. AUC is a commonly used criterion for assessing computational target prediction methods.2,51
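To make the evaluation concrete, the sketch below computes one AUC per assay with scikit-learn; the +1/−1/0 encoding of the sparse multitask label matrix is an assumption for illustration, not necessarily the encoding used in the study.

```python
# Per-assay ROC-AUC over a sparse multitask label matrix (columns = assays).
# Assumed encoding: +1 active, -1 inactive, 0 not measured.
import numpy as np
from sklearn.metrics import roc_auc_score

def per_assay_auc(y_true, y_pred):
    """Return one AUC per assay (column), skipping unmeasured entries."""
    aucs = []
    for j in range(y_true.shape[1]):
        measured = y_true[:, j] != 0
        labels = y_true[measured, j] > 0
        if len(np.unique(labels)) < 2:      # an AUC needs both classes
            aucs.append(np.nan)
        else:
            aucs.append(roc_auc_score(labels, y_pred[measured, j]))
    return np.array(aucs)
```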
In order to compare in silico predictions to in vitro measurements, we identified assay pairs in ChEMBL that measure the same biological effect. We considered the assay with fewer measured compounds as the ground truth and the assay with the higher number of measured compounds as the surrogate assay. We then compared the in silico prediction accuracy against the in vitro prediction accuracy of the surrogate assay. Using stringent criteria and manual supervision, we found 22 such assay pairs (see Table 2, ESI Section S2.4.1, ESI Tables S14 and S15†).
Table 1 AUC (mean ± standard deviation across assays) of the compared methods for each feature category.

| | FNN | SVM | RF | KNN | NB | SEA | GC | Weave | SmilesLSTM |
|---|---|---|---|---|---|---|---|---|---|
| StaticF | 0.687 ± 0.131 | 0.668 ± 0.128 | 0.665 ± 0.125 | 0.624 ± 0.120 | | | | | |
| SemiF | 0.743 ± 0.124 | 0.704 ± 0.128 | 0.701 ± 0.119 | 0.660 ± 0.119 | 0.630 ± 0.109 | | | | |
| ECFP6 | 0.724 ± 0.125 | 0.715 ± 0.127 | 0.679 ± 0.128 | 0.669 ± 0.121 | 0.661 ± 0.119 | 0.593 ± 0.096 | | | |
| DFS8 | 0.707 ± 0.129 | 0.693 ± 0.128 | 0.689 ± 0.120 | 0.648 ± 0.120 | 0.637 ± 0.112 | | | | |
| ECFP6 + ToxF | 0.731 ± 0.126 | 0.722 ± 0.126 | 0.711 ± 0.131 | 0.675 ± 0.122 | 0.668 ± 0.118 | | | | |
| Graph | | | | | | | 0.692 ± 0.125 | 0.673 ± 0.127 | |
| SMILES | | | | | | | | | 0.698 ± 0.130 |
Table 2 Accuracy of surrogate in vitro assays compared with deep learning (in silico) predictions for the 22 assay pairs (confidence intervals in brackets).

| Assay | Surrogate assay | Target | Surrogate assay accuracy | Deep learning accuracy | p-value |
|---|---|---|---|---|---|
| CHEMBL1909134 | CHEMBL1613777 | CYP450-2C19 | 0.54 [0.4136, 0.653] | 0.95 [0.9198, 0.9658] | 1.95 × 10⁻¹⁷ |
| CHEMBL1909200 | CHEMBL1614521 | ERK | 0.56 [0.4012, 0.7005] | 0.98 [0.9615, 0.9912] | 9.57 × 10⁻¹⁷ |
| CHEMBL1909136 | CHEMBL1614110 | CYP450-2D6 | 0.51 [0.3923, 0.6197] | 0.91 [0.8734, 0.9319] | 2.76 × 10⁻¹⁵ |
| CHEMBL1909135 | CHEMBL1614027 | CYP450-2C9 | 0.68 [0.5567, 0.7853] | 0.95 [0.9213, 0.9664] | 1.35 × 10⁻⁹ |
| CHEMBL1909138 | CHEMBL1614108 | CYP450-3A4 | 0.86 [0.8041, 0.9071] | 0.95 [0.9278, 0.9713] | 1.76 × 10⁻⁴ |
| CHEMBL1614310 | CHEMBL1614544 | Lamin A | 0.65 [0.5797, 0.7092] | 0.74 [0.657, 0.8105] | 7.83 × 10⁻² |
| CHEMBL1737863 | CHEMBL1614255 | PMI | 0.62 [0.3869, 0.8105] | 0.75 [0.6504, 0.8337] | 2.77 × 10⁻¹ |
| CHEMBL1614016 | CHEMBL1794352 | Luciferase | 0.86 [0.8086, 0.9043] | 0.88 [0.8108, 0.9238] | 7.53 × 10⁻¹ |
| CHEMBL1794393 | CHEMBL1614512 | MKP-3 | 0.67 [0.4469, 0.8357] | 0.70 [0.5945, 0.7853] | 8.07 × 10⁻¹ |
| CHEMBL1614355 | CHEMBL1614052 | NPSR1 | 0.70 [0.5913, 0.7947] | 0.70 [0.5549, 0.8126] | 1.00 × 10⁰ |
| CHEMBL1614479 | CHEMBL1614052 | NPSR1 | 0.94 [0.835, 0.9791] | 0.91 [0.7858, 0.9647] | 7.31 × 10⁻¹ |
| CHEMBL1614197 | CHEMBL1614087 | ROR | 0.74 [0.6572, 0.8058] | 0.71 [0.6045, 0.7994] | 6.54 × 10⁻¹ |
| CHEMBL1614539 | CHEMBL1614052 | NPSR1 | 0.88 [0.724, 0.9531] | 0.79 [0.6393, 0.888] | 3.95 × 10⁻¹ |
| CHEMBL1738575 | CHEMBL1614247 | NOD2 | 0.87 [0.6533, 0.9657] | 0.75 [0.6235, 0.8462] | 3.78 × 10⁻¹ |
| CHEMBL1737868 | CHEMBL1738317 | ATAD5 | 1.00 [0.6555, 1] | 0.81 [0.7098, 0.8893] | 2.05 × 10⁻¹ |
| CHEMBL1613806 | CHEMBL1613949 | PTPN7 | 0.92 [0.5975, 0.9956] | 0.69 [0.5598, 0.8046] | 1.61 × 10⁻¹ |
| CHEMBL1614105 | CHEMBL1614290 | SUA1 | 0.96 [0.9267, 0.9814] | 0.92 [0.8715, 0.9575] | 1.20 × 10⁻¹ |
| CHEMBL1963940 | CHEMBL1794352 | Luciferase | 1.00 [0.8076, 1] | 0.87 [0.775, 0.9344] | 1.15 × 10⁻¹ |
| CHEMBL1741321 | CHEMBL1614110 | CYP450-2D6 | 0.99 [0.9889, 0.9956] | 0.83 [0.8184, 0.8352] | 5.80 × 10⁻¹⁶⁴ |
| CHEMBL1741325 | CHEMBL1614027 | CYP450-2C9 | 0.99 [0.9839, 0.993] | 0.75 [0.7428, 0.762] | 1.82 × 10⁻¹⁹⁸ |
| CHEMBL1741323 | CHEMBL1613777 | CYP450-2C19 | 0.99 [0.9822, 0.9911] | 0.77 [0.7602, 0.7789] | 1.20 × 10⁻²⁰⁴ |
| CHEMBL1741324 | CHEMBL1613886 | CYP450-3A4 | 1.00 [0.9925, 0.9971] | 0.74 [0.7328, 0.7522] | <1.0 × 10⁻²⁵⁰ |
The ChEMBL benchmark dataset which we created and used to compare various target prediction methods consists of 456331 compounds. Chemical compounds are described by their molecular graphs. However, only graph convolutional networks can process graphs directly. For the other machine learning methods compared, we generated a sequence or a vectorial representation of the compounds. We generated the SMILES representation that serves as input for the LSTM. For methods that need numerical vectors, we computed a number of chemical descriptors by means of standard software.52,53 We grouped different feature types into four categories: static features, semisparse features, toxicophore features and dynamic features. Static features are typically those identified by experts as indicating particular molecular properties. Their number is typically fixed, while dynamic features are extracted on the fly from the chemical structure of a compound in a prespecified way. Typically, dynamic features exhibit sparse binary or count distributions, which means that only a small subset of compounds possess a given feature. In contrast, static features typically exhibit continuous and/or less sparse distributions. As with static features, the number of semisparse features is predefined, but the construction idea is similar to that of dynamic features. Toxicophore features describe a compound by the absence or presence of a set of predefined structural alerts, so-called toxicophores. More details on the description of chemical compounds and on which feature types form the individual feature categories are given in ESI Section S2.2.† In the following, we treat extended connectivity fingerprint features54 (ECFP) and depth-first search features55 (DFS) each as a separate dynamic feature category and compare the prediction performances for the following feature categories or combinations thereof individually: common static features52 (StaticF), common semisparse features (SemiF) including MACCS descriptors,56 ECFP features54 with radius 3 (ECFP6), DFS features55 with diameter 8 (DFS8) and a combination of ECFP6 and toxicophore features4 (ECFP6 + ToxF).
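As an illustration of the dynamic feature generation, ECFP features with radius 3 (ECFP6) can be computed with RDKit as Morgan fingerprints; the 2048-bit folded representation below is an illustrative assumption, as the exact settings used in the study are described in ESI Section S2.2.

```python
# Sketch of ECFP6 generation with RDKit: Morgan fingerprints with radius 3
# correspond to ECFP6. Bit width (2048) is an illustrative choice.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, as an example
ecfp6 = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)
on_bits = list(ecfp6.GetOnBits())                  # sparse view of the set bits
```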
Note that we could not run NB on the static features, as the method requires binary features; for the other feature categories, NB used a binarized version of the features that mapped all count values above zero to one. Further, for SEA we calculated results only for the ECFP6 features: given its low performance compared to the other methods and its non-negligible computational demand, we skipped the computation for the other feature categories.
We observed that FNNs significantly outperform (α = 0.01) the other tested methods for each feature category. Furthermore, FNNs are significantly better than graph convolutions (GC, Weave) and SmilesLSTM for all feature categories except StaticF. The second-best method is the SVM, which is significantly better than the graph convolutional networks and SmilesLSTM if ECFP6, ECFP6 + ToxF or SemiF features are used. Interestingly, SmilesLSTM has a higher average AUC than the two graph-based approaches. Further, classical machine learning methods, such as SVMs and RFs, perform better than typical chemoinformatics methods. In general, ECFP6 + ToxF features work well for many algorithms, although the best performance was achieved by FNNs on the feature category SemiF. Overall, FNNs exhibit the best performance across prediction tasks.
The average accuracy of predicting a selected assay by a surrogate in vitro assay measuring the same target was 0.81 ± 0.17, while the average accuracy of DNN models for predicting the selected assay was 0.82 ± 0.10. For 13 of the 22 assay pairs, there was no significant difference between the accuracy of the surrogate in vitro assay and that of the computational method (see Fig. 3 and Table 2). In 5 of 22 cases, the computational method outperformed the surrogate in vitro assay in terms of accuracy, and in 4 of 22 cases the surrogate in vitro assay was significantly better than deep learning. Overall, the predictive performance of deep learning for an assay with a certain target is on par with that of surrogate assays measuring the same target.
We identified clusters in ChEMBL by applying single-linkage clustering to all 1456020 compounds of the ChEMBL database. Single-linkage clustering is an agglomerative clustering method that finds a clustering with a guaranteed minimum distance between any two clusters. This is an important property for cluster-cross-validation, as it guarantees that no compound in the training set is closer than this minimum distance to any compound in the test set. We used the Jaccard distance on binarized ECFP4 features as the distance measure between any two compounds.
A critical parameter in the agglomerative clustering process is the threshold that determines the granularity of the clustering, that is, how many clusters of what sizes emerge. The exact procedure of how we obtained a value for this parameter and further details on clustering are given in ESI Section S2.3.†
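A small-scale sketch of this clustering step under the stated choices (single linkage, Jaccard distance on binarized fingerprints) is shown below; the toy fingerprints and the threshold value are placeholders, and at ChEMBL scale one would work with sparse fingerprints rather than a dense distance matrix.

```python
# Toy sketch of the clustering step: single-linkage clustering on Jaccard
# distances between binarized fingerprints. The cutoff `t` plays the role of
# the granularity threshold discussed above (actual value in ESI Section S2.3).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

fps = np.random.default_rng(0).integers(0, 2, size=(100, 1024))  # toy binary ECFP4
dists = pdist(fps, metric="jaccard")     # condensed pairwise Jaccard distances
Z = linkage(dists, method="single")      # single-linkage dendrogram
cluster_ids = fcluster(Z, t=0.6, criterion="distance")  # 0.6 is a placeholder
```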
We investigated how the performance estimates based on cluster-cross-validation differ from those based on random cross-validation. As the distribution of actives and inactives can differ between the folds of a target, we aimed to preserve the imbalance structure to a certain degree. We estimated the performance of FNNs with ECFP6 + ToxF features (see Table 3). The random cross-validation performance estimates are on average 0.02 higher than those from cluster-cross-validation (p-value of a paired Wilcoxon signed rank test: 2.8 × 10⁻⁵¹), indicating that random cross-validation is more optimistic about the performance on future data than cluster-cross-validation.
Table 3 Per-fold AUC of FNNs with ECFP6 + ToxF features under cluster-cross-validation (folds 1–3) and random cross-validation (folds I–III).

| | Fold 1 / I | Fold 2 / II | Fold 3 / III |
|---|---|---|---|
| Cluster-cross-validation | 0.722 ± 0.138 | 0.729 ± 0.148 | 0.743 ± 0.147 |
| Random cross-validation | 0.747 ± 0.145 | 0.760 ± 0.136 | 0.754 ± 0.142 |
A common property of these DNN architectures is that they comprise neurons arranged hierarchically in layers. The first layer is usually considered the input layer, whose neurons represent the input features. The following hidden layers consist of hidden neurons with weighted connections to the neurons of the layer below. The activation pattern of these neurons can be considered an abstract representation of the input, built from the features below. The last (i.e., output) layer provides the predictions of the model. A formal description of DNNs is given in ESI Section S3.1.1.† While classical artificial neural networks consist of a small number of neurons, DNNs may comprise thousands of neurons in each layer.57
Deep learning naturally enables multitask learning by incorporating multiple tasks into the learning process. This can support the construction of indicative abstract features that are shared across tasks. Especially for tasks with small or imbalanced training sets, multitask learning allows these tasks to borrow features from related tasks, thereby increasing the performance considerably. Fig. 7 emphasizes the fact that multitask learning may be of interest, since it shows that there are many compounds measured by more than one assay.
Fig. 7 Number of different assay labels (log-scaled) per compound for the final benchmark dataset; numbers occurring only once are marked with a star.
For FNNs, the input features are usually properties of the compound structure, e.g., whether a substructure is present in the molecule. We used either a network architecture with rectified linear units58,59 or one with scaled exponential linear units60 in all hidden layers to avoid vanishing gradients, and logistic sigmoid units in the output layer to constrain values to the range between 0 and 1. The network architecture actually used was determined by hyperparameter selection. For regularization, we used dropout61 to avoid overfitting. More details, especially on the hyperparameter combinations searched, are given in ESI Section S3.1.2.†
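A minimal PyTorch sketch of such a multitask FNN follows; the layer sizes, the choice of ReLU over SELU, and the dropout rate are illustrative, since the actual values were determined by hyperparameter selection.

```python
# Multitask FNN sketch: ReLU hidden layers with dropout, one sigmoid output
# unit per assay. All sizes are illustrative placeholders.
import torch
import torch.nn as nn

class MultitaskFNN(nn.Module):
    def __init__(self, n_features, n_assays, hidden=(2048, 2048), p_drop=0.5):
        super().__init__()
        layers, size_in = [], n_features
        for size_out in hidden:
            layers += [nn.Linear(size_in, size_out), nn.ReLU(), nn.Dropout(p_drop)]
            size_in = size_out
        layers += [nn.Linear(size_in, n_assays)]   # one output unit per assay
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return torch.sigmoid(self.net(x))          # activity probability per assay

model = MultitaskFNN(n_features=2048, n_assays=1000)  # illustrative sizes
```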
The idea of graph-based neural networks is that these networks learn to extract promising features directly from the graph structure, in a similar way as CNNs do from raw pixel inputs in image processing. Usually, each neuron represents an atom of the graph and holds a hidden state. Messages on the hidden states are exchanged between neighbouring atoms in the graph, and the hidden states are updated based on the incoming messages.62 An essential property of graph convolutions is that they introduce operations that are invariant to graph isomorphisms. There are several variations of graph convolutional neural networks, in which, e.g., edges may also carry hidden states. Further details on the usage of graph convolutional networks and the hyperparameter combinations searched are given in ESI Section S3.1.3.†
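The following sketch illustrates one generic message-passing step of this kind, not the exact GC or Weave update: summing the hidden states over neighbours makes the update invariant to atom reordering.

```python
# Generic message-passing step: each atom receives the sum of its neighbours'
# hidden states and updates its own state from (own state, messages).
import torch
import torch.nn as nn

class MessagePassingStep(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h, adj):
        # h: (n_atoms, dim) hidden states; adj: (n_atoms, n_atoms) adjacency matrix
        messages = adj @ h                   # sum of neighbour states per atom
        return torch.relu(self.update(torch.cat([h, messages], dim=-1)))

h = torch.randn(5, 16)                       # 5 atoms, 16-dim hidden states
adj = torch.eye(5).roll(1, dims=0) + torch.eye(5).roll(-1, dims=0)  # toy ring "molecule"
h = MessagePassingStep(16)(h, adj)
```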
RNNs work on sequences of input feature vectors that may even differ in length. We used the LSTM architecture, which is based on LSTM memory cells that were constructed to avoid vanishing gradients. In analogy to FNNs, it is possible to stack multiple layers of LSTM cells. We used an architecture in which we did not only predict the assay outcomes, but also used structural properties of the input compound as auxiliary prediction tasks, which should help in training such a sequence-based network architecture. Further, as a compound may have multiple equivalent string representations, we used randomly chosen SMILES representations of the compounds during training to avoid overfitting. Furthermore, dropout was used against overfitting (details on the LSTM architecture we used can be found in ESI Section S3.1.4†).
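A minimal sketch of the SmilesLSTM idea is shown below; the tokenization, the auxiliary structural tasks and the SMILES augmentation are omitted, and all sizes are illustrative assumptions rather than the study's settings.

```python
# SmilesLSTM sketch: SMILES strings, tokenized into integer sequences, pass
# through an embedding and an LSTM; the final hidden state drives one sigmoid
# output per assay. Vocabulary and layer sizes are placeholders.
import torch
import torch.nn as nn

class SmilesLSTM(nn.Module):
    def __init__(self, vocab_size, n_assays, embed_dim=64, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_assays)

    def forward(self, tokens):               # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens)
        _, (h_n, _) = self.lstm(x)            # h_n: (1, batch, hidden_dim)
        return torch.sigmoid(self.head(h_n[-1]))
```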
The choice of similarity measure is crucial to the performance of SVMs. In chemoinformatics, the Tanimoto kernel is a popular and well-performing similarity measure for chemical compounds. We used the Minmax kernel, an extension of the Tanimoto kernel that can also be applied to real-valued features when analyzing compounds.4
A formal description of SVMs, implementation details and the formula of the Minmax kernel can be found in ESI Section S3.2.†
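For reference, the Minmax kernel is commonly defined as k(x, y) = Σᵢ min(xᵢ, yᵢ)/Σᵢ max(xᵢ, yᵢ), which reduces to the Tanimoto kernel on binary features; a direct NumPy sketch (usable, e.g., as a precomputed kernel in an SVM implementation) follows.

```python
# Direct implementation of the Minmax kernel for non-negative feature matrices;
# assumes every compound has at least one nonzero feature (otherwise the
# denominator is zero).
import numpy as np

def minmax_kernel(X, Y):
    """Gram matrix K[i, j] = sum(min(X[i], Y[j])) / sum(max(X[i], Y[j]))."""
    K = np.empty((X.shape[0], Y.shape[0]))
    for i, x in enumerate(X):
        K[i] = np.minimum(x, Y).sum(axis=1) / np.maximum(x, Y).sum(axis=1)
    return K

X = np.random.default_rng(0).poisson(0.3, size=(10, 64))  # toy count features
K = minmax_kernel(X, X)                                   # e.g., SVC(kernel="precomputed")
```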
Footnotes
† Electronic supplementary information (ESI) available: Overview, Data Collection and Clustering, Methods, Results, Appendix. See DOI: 10.1039/c8sc00148k |
‡ These authors contributed equally to this work. |