Ziduo Yang‡a, Weihe Zhong‡a, Lu Zhaoab and Calvin Yu-Chian Chen*acd
aArtificial Intelligence Medical Center, School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen, 510275, China. E-mail: chenyuchian@mail.sysu.edu.cn; Tel: +862039332153
bDepartment of Clinical Laboratory, The Sixth Affiliated Hospital, Sun Yat-sen University, Guangzhou, 510655, China
cDepartment of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan
dDepartment of Bioinformatics and Medical Engineering, Asia University, Taichung, 41354, Taiwan
First published on 5th January 2022
Predicting drug–target affinity (DTA) is beneficial for accelerating drug discovery. Graph neural networks (GNNs) have been widely used in DTA prediction. However, existing shallow GNNs are insufficient to capture the global structure of compounds. Moreover, the interpretability of graph-based DTA models relies heavily on the graph attention mechanism, which cannot reveal the global relationships among the atoms of a molecule. In this study, we propose MGraphDTA, a deep multiscale graph neural network based on chemical intuition for DTA prediction. We introduce dense connections into the GNN and build a super-deep GNN with 27 graph convolutional layers to capture the local and global structure of the compound simultaneously. We also develop a novel visual explanation method, gradient-weighted affinity activation mapping (Grad-AAM), to analyze a deep learning model from the chemical perspective. We evaluated our approach on seven benchmark datasets and compared the proposed method to state-of-the-art deep learning (DL) models. MGraphDTA significantly outperforms the other DL-based approaches on various datasets. Moreover, we show that Grad-AAM creates explanations consistent with pharmacologists' knowledge, which may help us gain chemical insights directly from data beyond human perception. These advantages demonstrate that the proposed method improves both the generalization and the interpretation capability of DTA prediction modeling.
Structure-based methods can explore potential binding sites by considering the 3D structures of a small molecule and a protein. Docking is a well-established structure-based method that uses numerous binding-mode definitions and scoring functions to minimize the free energy of binding. Molecular dynamics simulation is another popular structure-based method that can provide the ultimate detail concerning individual particle motions as a function of time.6 However, structure-based methods are time-consuming and cannot be employed if the 3D structure of the protein is unknown.7
Feature-based methods for DTA prediction modeling are also known as proteochemometrics (PCM),8–10 which relies on a combination of explicit ligand and protein descriptors. Any drug–target pair can be represented as a biological feature vector of fixed length, often with a binary label indicating whether the drug binds to the target. The extracted feature vectors can be used to train machine/deep learning models such as feed-forward neural networks (FNNs), support vector machines (SVMs), random forests (RFs), and other kernel-based methods.11–19 For example, DeepDTIs20 chose the most common and simple features, extended connectivity fingerprints (ECFP) for drugs and protein sequence composition (PSC) descriptors for targets, and then used a deep belief network for DTA prediction. Lenselink et al.11 compared FNNs with machine learning methods such as logistic regression, RF, and SVM on a single standardized dataset and found that FNNs are the top-performing classifiers. A study by Mayr et al.12 reached a similar conclusion that FNNs outperform the competing methods. MDeePred21 represented proteins by combining several types of protein features (sequence, structural, evolutionary, and physicochemical properties), and a hybrid deep neural network was used to predict binding affinities from the compound and protein descriptors. MoleculeNet22 introduced a featurization method called the grid featurizer that uses structural information of both the ligand and the target, considering not only features of the protein and ligand individually but also the chemical interactions within the binding pocket.
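As a concrete illustration of this feature-based pipeline, the sketch below pairs an ECFP fingerprint with a simple amino-acid-composition descriptor and trains a random forest. It is a minimal stand-in under stated assumptions, not the exact featurization of the cited studies; in particular, the `aac` helper is a simplified substitute for full PSC descriptors.

```python
# Minimal PCM-style baseline: ECFP fingerprint for the ligand plus a simple
# protein descriptor, concatenated and fed to a random forest regressor.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def ecfp(smiles: str, radius: int = 2, n_bits: int = 1024) -> np.ndarray:
    """ECFP4-like Morgan fingerprint as a binary vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.float32)

def aac(sequence: str) -> np.ndarray:
    """Amino-acid composition: a simplified stand-in for PSC descriptors."""
    alphabet = "ACDEFGHIKLMNPQRSTVWY"
    counts = np.array([sequence.count(a) for a in alphabet], dtype=np.float32)
    return counts / max(len(sequence), 1)

def featurize(pairs):
    """pairs: list of (smiles, protein_sequence) tuples."""
    return np.stack([np.concatenate([ecfp(s), aac(p)]) for s, p in pairs])

# Usage (train_pairs / y_train are user-supplied):
# model = RandomForestRegressor(n_estimators=500).fit(featurize(train_pairs), y_train)
```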
Over the past few years, there has been a remarkable increase in the amount of available compound activity and biomedical data owing to the emergence of novel experimental techniques such as high-throughput screening and parallel synthesis.23–25 The high demand for exploring and analyzing these massive data has encouraged the development of data-hungry algorithms like deep learning.26,27 Many types of deep learning frameworks have been adopted for DTA prediction. DeepDTA2 established two convolutional neural networks (CNNs) to learn representations of the drug and protein, respectively. The learned drug and protein representations are then concatenated and fed into a multi-layer perceptron (MLP) for DTA prediction. WideDTA28 further improved on DeepDTA by integrating two additional text-based inputs and using four CNNs to encode them into four representations. Lee et al.29 also applied a CNN to the protein sequence to learn local residue patterns and conducted extensive experiments to demonstrate the effectiveness of CNN-based methods. On the other hand, DEEPScreen represented compounds as 2-D structural images and used a CNN to learn complex features from these 2-D structural drawings to produce highly accurate DTA predictions.30
Although CNN-based methods have achieved remarkable performance in DTA prediction, most of these models represent drugs as strings, which is not a natural way to represent compounds.31 When using strings, the structural information of the molecule is lost, which could impair the predictive power of a model as well as the functional relevance of the learned latent space. To address this problem, graph neural networks (GNNs) have been adopted in DTA prediction.31–36 The GNN-based methods represent drugs as graphs and use a GNN for DTA prediction. For instance, Tsubaki et al.34 proposed to use a GNN and a CNN to learn low-dimensional vector representations of compound graphs and protein sequences, respectively. They formulated DTA prediction as a classification problem and conducted experiments on three datasets. The experimental results demonstrate that the GNN-based method outperforms PCM methods. GraphDTA31 evaluated several types of GNNs including GCN, GAT, GIN, and GAT–GCN for DTA prediction, in which DTA was treated as a regression problem. The experimental results confirm that deep learning methods are capable of DTA prediction and that representing drugs as graphs can lead to further improvement. DGraphDTA37 represented both compounds and proteins as graphs and used GNNs on both the compound and protein sides to obtain their representations. Moreover, to increase model interpretability, attention mechanisms have been introduced into DTA prediction models.32,36,38–40
On the other hand, some studies have focused on improving DTA prediction by using structure-related protein features as input.37,41 For example, DGraphDTA37 utilized contact maps predicted from protein sequences as the input of the protein encoder to improve DTA prediction performance. Since protein structural information is not always available, they used contact maps predicted from the sequences, which enables the model to take all sorts of proteins as input.
Overall, many novel models for DTA prediction based on shallow GNNs have been developed and show promising performance on various datasets. However, at least three problems have not been well addressed by GNN-based methods in DTA prediction. First, we argue that GNNs with few layers are insufficient to capture the global structure of compounds. As shown in Fig. 1(a), a GNN with two layers cannot know whether a ring exists in the molecule, so the graph embedding will be generated without considering information about the ring. The graph convolutional layers must be stacked deeply in order to capture the global structure of a graph: concretely, to capture structures made up of k-hop neighbors, k graph convolutional layers should be stacked.42 However, building a deep GNN architecture is currently infeasible due to the over-smoothing and vanishing gradient problems.43,44 As a result, most state-of-the-art (SOTA) GNN models are no deeper than 3 or 4 layers. Second, a well-constructed GNN should preserve the local structure of a compound. As shown in Fig. 1(b), the methyl carboxylate moiety is crucial for methyl decanoate, and the GNN should distinguish it from less essential substituents in order to make a reasonable inference. Third, the interpretability of graph-based DTA models relies heavily on the attention mechanism. Although the attention mechanism provides an effective visual explanation, it increases the computational cost. In addition, the graph attention mechanism only considers the neighborhood of a vertex (also called masked attention),45,46 and therefore cannot capture the global relationships among the atoms of a molecule.
To address the above problems, we proposed a multiscale graph neural network (MGNN) and a novel visual explanation method called gradient-weighted affinity activation mapping (Grad-AAM) for DTA prediction and interpretation. An overview of the proposed MGraphDTA is shown in Fig. 2. The MGNN, with 27 graph convolutional layers, and a multiscale convolutional neural network (MCNN) were used to extract multiscale features of the drug and target, respectively. The multiscale features of the drug contain rich information about the molecule's structure at different scales and enable the GNN to make more accurate predictions. The extracted multiscale features of the drug and target were each fused and then concatenated to obtain a combined descriptor for a given drug–target pair. The combined descriptor was fed into an MLP to predict binding affinity. Grad-AAM uses the gradients of the affinity flowing into the final graph convolutional layer of the MGNN to produce a probability map highlighting the atoms that contribute most to the DTA. The proposed Grad-AAM was motivated by gradient-weighted class activation mapping (Grad-CAM), which produces a coarse localization map highlighting the important regions in an image.47 However, Grad-CAM was designed for CNN-based classification tasks; unlike Grad-CAM, Grad-AAM is driven by the binding affinity score and operates on GNNs. The main contributions of this paper are twofold:
(a) We constructed a very deep GNN for DTA prediction and rationalized it from the chemical perspective.
(b) We proposed a simple yet effective visualization method, Grad-AAM, to investigate how a GNN makes decisions in DTA prediction.
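To make the overall pipeline concrete, the following PyTorch skeleton sketches the two-encoder design described above. The encoder internals and layer widths are illustrative placeholders, not the published configuration.

```python
# Skeleton of the two-branch MGraphDTA pipeline: a graph encoder for the
# compound and a sequence encoder for the protein, whose outputs are
# concatenated into an MLP regressor. Hidden sizes are placeholders.
import torch
import torch.nn as nn

class MGraphDTA(nn.Module):
    def __init__(self, ligand_encoder: nn.Module, protein_encoder: nn.Module,
                 lig_dim: int = 128, prot_dim: int = 128):
        super().__init__()
        self.ligand_encoder = ligand_encoder      # e.g. the deep MGNN
        self.protein_encoder = protein_encoder    # e.g. the multiscale CNN
        self.mlp = nn.Sequential(                 # affinity regressor
            nn.Linear(lig_dim + prot_dim, 512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, mol_graph, protein_seq):
        z_lig = self.ligand_encoder(mol_graph)      # graph-level embedding
        z_prot = self.protein_encoder(protein_seq)  # sequence-level embedding
        pair = torch.cat([z_lig, z_prot], dim=-1)   # combined descriptor
        return self.mlp(pair)                       # predicted affinity
```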
Fig. 3 Molecule representation and graph embedding. (a) Representing a molecule as a graph. (b) Graph message passing phase, corresponding to eqn (1). (c) Graph readout phase, corresponding to eqn (2).
$$h_v^{(k)} = U^{(k)}\!\left(h_v^{(k-1)},\ \sum_{u \in \mathcal{N}(v)} M^{(k)}\!\left(h_v^{(k-1)},\, h_u^{(k-1)}\right)\right) \qquad (1)$$
$$z_G = R\!\left(\left\{\, h_v^{(K)} \mid v \in G \,\right\}\right) \qquad (2)$$
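As an illustration of the two phases, the sketch below runs a few message-passing steps followed by a mean-pooling readout with PyTorch Geometric. GCNConv stands in for the message/update functions of eqn (1), and the weight-shared loop is a simplification for brevity.

```python
# One message-passing operator applied k times (eqn (1)), followed by a
# graph readout (eqn (2)), using PyTorch Geometric primitives.
import torch
from torch_geometric.nn import GCNConv, global_mean_pool

conv = GCNConv(in_channels=32, out_channels=32)  # weights shared for brevity

def embed_graph(x, edge_index, batch, num_layers: int = 3):
    # Each pass aggregates messages from 1-hop neighbours, so k passes give
    # every atom a k-hop receptive field over the molecular graph.
    for _ in range(num_layers):
        x = torch.relu(conv(x, edge_index))
    # Readout: collapse per-atom states into a single graph-level embedding.
    return global_mean_pool(x, batch)
```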
Fig. 4 Overview of the MGNN. (a) The network architecture of the proposed MGNN. (b) The detailed design of the multiscale block.
(3)
(4)
(5)
(6)
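The dense-connection pattern underlying the MGNN can be sketched as follows: each graph convolution consumes the concatenation of all earlier feature maps, DenseNet-style, which is what allows a very deep stack to train without over-smoothing or vanishing gradients. This is a minimal illustration under assumed dimensions and layer counts, not the exact published block.

```python
# Densely connected graph convolution stack: every layer consumes the
# concatenation of all earlier outputs, preserving multiscale features
# and easing gradient flow in very deep GNNs.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class DenseGCNBlock(nn.Module):
    def __init__(self, in_dim: int, growth: int = 32, num_layers: int = 4):
        super().__init__()
        self.convs = nn.ModuleList()
        self.norms = nn.ModuleList()
        dim = in_dim
        for _ in range(num_layers):
            self.convs.append(GCNConv(dim, growth))
            self.norms.append(nn.BatchNorm1d(growth))
            dim += growth  # input width grows with each dense connection

    def forward(self, x, edge_index):
        features = [x]
        for conv, norm in zip(self.convs, self.norms):
            h = torch.relu(norm(conv(torch.cat(features, dim=-1), edge_index)))
            features.append(h)
        return torch.cat(features, dim=-1)  # multiscale node representation
```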
$$\alpha_k = \frac{1}{N}\sum_{n=1}^{N} \frac{\partial y}{\partial A_k^n} \qquad (7)$$
where N is the number of atoms, y is the predicted binding affinity, and A_k^n denotes the activation of the k-th feature map of the final graph convolutional layer at atom n.
We then performed a weighted combination of the forward activation maps followed by a ReLU activation as
$$P_{\text{Grad-AAM}} = \mathrm{ReLU}\!\left(\sum_{k} \alpha_k A_k\right) \qquad (8)$$
Finally, min–max normalization was used to scale the probability map P_Grad-AAM to the range [0, 1]. The chemical probability map P_Grad-AAM can be thought of as a weighted aggregation of the important geometric substructures of a molecule captured by the GNN, as shown in Fig. 6.
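The full Grad-AAM procedure can be sketched as follows, assuming a PyTorch model that exposes its final graph convolution as `model.last_conv` (a hypothetical handle): the affinity is back-propagated to that layer's activations, the gradients are channel-averaged as in eqn (7), and eqn (8) plus min–max normalization yields the per-atom map.

```python
# Grad-AAM sketch (eqn (7)-(8)): gradients of the predicted affinity w.r.t.
# the final graph-convolution activations weight those activations; ReLU
# and min-max normalization give a per-atom probability map.
import torch

def grad_aam(model, mol_graph, protein_seq):
    activations, gradients = {}, {}

    def fwd_hook(_, __, output):
        activations["A"] = output                       # (num_atoms, K)

    def bwd_hook(_, __, grad_output):
        gradients["dA"] = grad_output[0]                # (num_atoms, K)

    # `model.last_conv` is a hypothetical handle to the final GNN layer.
    h1 = model.last_conv.register_forward_hook(fwd_hook)
    h2 = model.last_conv.register_full_backward_hook(bwd_hook)

    affinity = model(mol_graph, protein_seq)
    affinity.sum().backward()                           # scalar affinity score
    h1.remove(); h2.remove()

    alpha = gradients["dA"].mean(dim=0)                 # eqn (7): channel weights
    cam = torch.relu((alpha * activations["A"]).sum(dim=-1))   # eqn (8)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # min-max to [0, 1]
    return cam                                          # importance per atom
```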
Fig. 6 The chemical probability map is a weighted sum of vital substructures of a molecule captured by a GNN.
We also formulated DTA prediction as a binary classification problem and evaluated the proposed MGraphDTA on two widely used classification datasets, Human and Caenorhabditis elegans (C. elegans).34,38,46
Moreover, we conducted a case study to evaluate Grad-AAM using the ToxCast dataset.35 The ToxCast dataset contains multiple assays, meaning that one drug–target pair may have different binding affinities depending on the assay type. For simplicity, we selected only the assay containing the largest number of drug–target pairs. Table 1 summarizes these datasets. Fig. S1–S3† show the distributions of binding affinities, SMILES lengths, and protein sequence lengths for these datasets.
Dataset | Task type | Compounds | Proteins | Interactions
---|---|---|---|---
Davis | Regression | 68 | 442 | 30056
Filtered Davis | Regression | 68 | 379 | 9125
KIBA | Regression | 2111 | 229 | 118254
Metz | Regression | 1423 | 170 | 35259
Human | Classification | 2726 | 2001 | 6728
C. elegans | Classification | 1767 | 1876 | 7786
ToxCast | Regression | 3098 | 37 | 114626
Table 2 summarizes the quantitative results. On the Human dataset, the proposed method yielded significantly higher precision than the other DTA prediction methods. On the C. elegans dataset, the proposed method achieved considerable improvements in both precision and recall. These results reveal MGraphDTA's potential to master molecular representation learning for drug discovery. In addition, we observed that replacing the CNN with the MCNN yields a slight improvement, which corroborates the efficacy of the proposed MCNN.
Dataset | Model | Precision | Recall | AUC
---|---|---|---|---
Human | GNN-CNN | 0.923 | 0.918 | 0.970
Human | TrimNet-CNN | 0.918 | 0.953 | 0.974
Human | GraphDTA | 0.882 (0.040) | 0.912 (0.040) | 0.960 (0.005)
Human | DrugVQA (VQA-seq) | 0.897 (0.004) | 0.948 (0.003) | 0.964 (0.005)
Human | TransformerCPI | 0.916 (0.006) | 0.925 (0.006) | 0.973 (0.002)
Human | MGNN-CNN (ours) | 0.953 (0.006) | 0.950 (0.004) | 0.982 (0.001)
Human | MGNN-MCNN (ours) | 0.955 (0.005) | 0.956 (0.003) | 0.983 (0.003)
C. elegans | GNN-CNN | 0.938 | 0.929 | 0.978
C. elegans | TrimNet-CNN | 0.946 | 0.945 | 0.987
C. elegans | GraphDTA | 0.927 (0.015) | 0.912 (0.023) | 0.974 (0.004)
C. elegans | TransformerCPI | 0.952 (0.006) | 0.953 (0.005) | 0.988 (0.002)
C. elegans | MGNN-CNN (ours) | 0.979 (0.005) | 0.961 (0.002) | 0.991 (0.002)
C. elegans | MGNN-MCNN (ours) | 0.980 (0.004) | 0.967 (0.005) | 0.991 (0.001)
For the regression task on the filtered Davis dataset, we compared the proposed MGraphDTA with the SOTA methods on this dataset, namely MDeePred,21 CGKronRLS,63 and DeepDTA.2 Following MDeePred, we used root mean square error (RMSE, lower is better), CI, and Spearman rank correlation (higher is better) as performance indicators. The whole dataset was randomly divided into six parts; five of them were used for five-fold cross-validation, and the remaining part was used as the independent test set. The final performance was evaluated on the independent test set, following MDeePred. Note that the data points in each fold are exactly the same as in MDeePred for a fair comparison.
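For reference, the three indicators can be computed as below; the concordance index follows its standard pairwise definition (quadratic in the number of test points, which is acceptable at this scale).

```python
# Evaluation metrics used on the filtered Davis dataset: RMSE,
# concordance index (CI), and Spearman rank correlation.
import numpy as np
from scipy import stats

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def concordance_index(y_true, y_pred):
    """Fraction of correctly ordered pairs among pairs with distinct labels;
    prediction ties count as half-correct."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    total, correct = 0.0, 0.0
    for i in range(len(y_true)):
        for j in range(i + 1, len(y_true)):
            if y_true[i] == y_true[j]:
                continue  # pairs with equal labels carry no ordering signal
            total += 1.0
            ordered = (y_pred[i] - y_pred[j]) * (y_true[i] - y_true[j])
            correct += 1.0 if ordered > 0 else (0.5 if ordered == 0 else 0.0)
    return correct / total

def spearman(y_true, y_pred):
    return float(stats.spearmanr(y_true, y_pred).correlation)
```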
Tables 3 and 4 summarize the predictive performance of MGraphDTA and previous models on the Davis, KIBA, and Metz datasets. The graph-based methods surpassed CNN-based and recurrent neural network (RNN) based methods, which demonstrates the potential of graph neural networks in DTA prediction. Since CNN-based and RNN-based models represent compounds as strings, the predictive capability of such a model may be weakened by ignoring the structural information of the molecule. In contrast, the graph-based methods represent compounds as graphs and capture graph dependencies via message passing between vertices. Compared to the other graph-based methods, MGraphDTA achieved the best performance, as shown in Tables 3 and 4. The paired Student's t-test shows that the differences between MGraphDTA and the other graph-based methods are statistically significant on the Metz dataset (p < 0.05). Moreover, MGraphDTA was significantly better than the traditional PCM models on all three datasets (p < 0.01). It is worth noting that FNN was superior to the other traditional PCM models (p < 0.01), which is consistent with previous studies.11,12 Table 5 summarizes the results of the four methods on the filtered Davis dataset. It can be observed that MGraphDTA achieved the lowest RMSE. Overall, MGraphDTA showed impressive results on four benchmark datasets, significantly exceeding other SOTA DTA prediction models, which confirms the validity of the proposed MGraphDTA.
Model | Proteins | Compounds | Davis MSE | Davis CI | Davis r_m^2 index | KIBA MSE | KIBA CI | KIBA r_m^2 index
---|---|---|---|---|---|---|---|---
DeepDTA^a | CNN | CNN | 0.261 | 0.878 | 0.630 | 0.194 | 0.863 | 0.673
WideDTA^b | CNN + PDM | CNN + LMCS | 0.262 | 0.886 | — | 0.179 | 0.875 | —
GraphDTA^c | CNN | GCN | 0.254 | 0.880 | — | 0.139 | 0.889 | —
GraphDTA^c | CNN | GAT | 0.232 | 0.892 | — | 0.179 | 0.866 | —
GraphDTA^c | CNN | GIN | 0.229 | 0.893 | — | 0.147 | 0.882 | —
GraphDTA^c | CNN | GAT–GCN | 0.245 | 0.881 | — | 0.139 | 0.891 | —
DeepAffinity^d | RNN | RNN | 0.253 | 0.900 | — | 0.188 | 0.842 | —
DeepAffinity^d | RNN | GCN | 0.260 | 0.881 | — | 0.288 | 0.797 | —
DeepAffinity^d | CNN | GCN | 0.657 | 0.737 | — | 0.680 | 0.576 | —
DeepAffinity^d | HRNN | GCN | 0.252 | 0.881 | — | 0.201 | 0.842 | —
DeepAffinity^d | HRNN | GIN | 0.436 | 0.822 | — | 0.445 | 0.689 | —
KronRLS^a | SW | PS | 0.379 | 0.871 | 0.407 | 0.411 | 0.782 | 0.342
SimBoost^a | SW | PS | 0.282 | 0.872 | 0.655 | 0.222 | 0.836 | 0.629
RF | ECFP | PSC | 0.359 (0.003) | 0.854 (0.002) | 0.549 (0.005) | 0.245 (0.001) | 0.837 (0.000) | 0.581 (0.000)
SVM | ECFP | PSC | 0.383 (0.002) | 0.857 (0.001) | 0.513 (0.003) | 0.308 (0.003) | 0.799 (0.001) | 0.513 (0.004)
FNN | ECFP | PSC | 0.244 (0.009) | 0.893 (0.003) | 0.685 (0.015) | 0.216 (0.010) | 0.818 (0.005) | 0.659 (0.015)
MGraphDTA | MCNN | MGNN | 0.207 (0.001) | 0.900 (0.004) | 0.710 (0.005) | 0.128 (0.001) | 0.902 (0.001) | 0.801 (0.001)

^a These results are taken from DeepDTA.2 ^b These results are taken from WideDTA.28 ^c These results are taken from GraphDTA.31 ^d These results are taken from DeepAffinity.32 "—" indicates results not reported in the original studies.
Model | Proteins | Compounds | MSE | CI | r_m^2 index
---|---|---|---|---|---
DeepDTA | CNN | CNN | 0.286 (0.001) | 0.815 (0.001) | 0.678 (0.003)
GraphDTA | CNN | GCN | 0.282 (0.007) | 0.815 (0.002) | 0.679 (0.008)
GraphDTA | CNN | GAT | 0.323 (0.003) | 0.800 (0.001) | 0.625 (0.010)
GraphDTA | CNN | GIN | 0.313 (0.002) | 0.803 (0.001) | 0.632 (0.001)
GraphDTA | CNN | GAT–GCN | 0.282 (0.011) | 0.816 (0.004) | 0.681 (0.026)
RF | ECFP | PSC | 0.351 (0.002) | 0.793 (0.001) | 0.565 (0.001)
SVM | ECFP | PSC | 0.361 (0.001) | 0.794 (0.000) | 0.590 (0.001)
FNN | ECFP | PSC | 0.316 (0.001) | 0.805 (0.001) | 0.660 (0.003)
MGraphDTA | MCNN | MGNN | 0.265 (0.002) | 0.822 (0.001) | 0.701 (0.001)
(1) Orphan–target split: each protein in the test set is unavailable in the training set.
(2) Orphan–drug split: each drug in the test set is inaccessible in the training set.
(3) Cluster-based split: compounds in the training and test sets are structurally different (i.e., the two sets have guaranteed minimum distances in terms of structure similarity). We used Jaccard distance on binarized ECFP4 features to measure the distance between any two compounds following the previous study.12 Single-linkage clustering12 was applied to find a clustering with guaranteed minimum distances between any two clusters.
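A minimal sketch of this split follows, using RDKit Morgan fingerprints (radius 2, i.e., ECFP4-like) with SciPy's single-linkage clustering; the 0.6 distance threshold is an illustrative placeholder, not the value used in the study.

```python
# Cluster-based compound split: Jaccard distance on binarized ECFP4-like
# fingerprints, grouped by single-linkage clustering so that any two
# clusters are at least `threshold` apart in structure space.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_compounds(smiles_list, threshold: float = 0.6):
    fps = np.stack([
        np.array(AllChem.GetMorganFingerprintAsBitVect(
            Chem.MolFromSmiles(s), 2, nBits=1024), dtype=bool)
        for s in smiles_list
    ])
    dists = pdist(fps, metric="jaccard")     # 1 - Tanimoto similarity
    tree = linkage(dists, method="single")   # single-linkage clustering
    # Cutting the dendrogram at `threshold` guarantees that any two
    # resulting clusters are at least `threshold` apart.
    return fcluster(tree, t=threshold, criterion="distance")

# Whole clusters are then assigned to train or test, so no structurally
# similar compound leaks across the split.
```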
Given that the DTA prediction models are typically used to discover drugs or targets that are absent from the training set, the orphan splits provide realistic and more challenging evaluation schemes for the models. The cluster-based split further prevents the structural information of compounds from leaking to the test set. We compared the proposed MGraphDTA to GraphDTA and three traditional PCM models (RF, SVM, and FNN). For a fair comparison, we replaced the MGNN in MGraphDTA with GCN, GAT, GIN, and GAT–GCN using the source code provided by GraphDTA with the hyper-parameters they reported. We used the five-fold cross-validation strategy to analyze model performance. In each fold, all methods shared the same training, validation, and test sets. Note that the experimental settings remain the same for the eight methods.
Fig. 7 shows the experimental results for the eight methods under the orphan-based and cluster-based split settings. Compared with the results under the random split setting shown in Tables 3 and 4, the models' performance decreased greatly under the orphan-based and cluster-based splits. Furthermore, as shown in Fig. 7(a) and (c), the MSEs for MGraphDTA on the Davis, KIBA, and Metz datasets using the orphan–drug split were 0.572 ± 0.088, 0.390 ± 0.023, and 0.555 ± 0.043, respectively, while those using the cluster-based split were 0.654 ± 0.207, 0.493 ± 0.097, and 0.640 ± 0.078, respectively. In other words, the cluster-based split is more challenging for a DTA prediction model than the orphan–drug split, which is consistent with the fact that the cluster-based split prevents the structural information of compounds from leaking into the test set. These results suggest that improving the generalization ability of DTA models remains a challenge.

From Fig. 7(a), we observed that MGNN exceeded the other methods significantly on the Davis dataset under the orphan–drug split setting (p < 0.01). On the other hand, there were no statistical differences between MGNN, GAT, and RF (p > 0.05) on the KIBA dataset, while these three methods surpassed the other methods significantly (p < 0.01). In addition, SVM and FNN were significantly superior to the other methods on the Metz dataset (p < 0.01). Overall, the traditional PCM models showed impressive results that even surpassed the graph-based methods on the KIBA and Metz datasets under the orphan–drug split setting, as shown in Fig. 7(a). These results suggest that simple feature-based methods like RF may suffice in this scenario, which is consistent with a recent study.64 Since the number of drugs in the Davis dataset is far smaller than in the KIBA and Metz datasets (Table 1), the generalization ability of a model trained on so few drugs cannot be guaranteed for unseen drugs. Fig. 8 shows the correlations between predicted values and ground truths for the five graph-based models on the Davis dataset under the orphan–drug split. The range of MGNN's predictions was broader than that of the other graph-based models, as shown in Fig. 8(a). We also note that the ground truths and the predictions of MGNN have the most similar distributions, as shown in Fig. 8(b). The Pearson correlation coefficients of GCN, GAT, GIN, GAT–GCN, and MGNN for DTA prediction were 0.427, 0.420, 0.462, 0.411, and 0.552, respectively. These results further confirm that MGNN has the potential to increase the generalization ability of a DTA model. From Fig. 7(b), we observed that MGNN significantly outperformed the other models on all three datasets under the orphan–target split setting (p < 0.01). MGNN also exceeded the other methods significantly on the KIBA and Metz datasets under the cluster-based split setting, as shown in Fig. 7(c) (p < 0.05). It is worth noting that the graph-based methods outperformed the traditional PCM models under the random split setting (Tables 3 and 4), whereas their superiority was less obvious under the orphan-based and cluster-based splits (Fig. 7). Overall, these results show the robustness of MGNN across different split schemes and support the claim that both local and non-local properties of a molecule are essential for a GNN to make accurate predictions.
Fig. 8 (a) Scatter and (b) kernel density estimate plots of binding affinities between predicted values and ground truths on the Davis dataset using the orphan–drug split setting.
Model | RMSE | CI | Spearman
---|---|---|---
Without dense connection | 0.726 (0.008) | 0.726 (0.008) | 0.620 (0.019)
Without batch normalization | 0.746 (0.032) | 0.719 (0.014) | 0.604 (0.008)
MGraphDTA | 0.695 (0.009) | 0.740 (0.002) | 0.654 (0.005)
Furthermore, an ablation study was performed on the filtered Davis dataset to investigate the effect of the receptive field of the MCNN on performance. Specifically, we increased the receptive field gradually by using convolutional layers with progressively larger kernels (i.e., 7, 15, 23, 31). From the results shown in Table 7, the model performance decreased slightly as the receptive field increased. Since usually only a few residues are involved in a protein–ligand interaction,65 enlarging the receptive field to cover more regions may introduce noise from portions of the sequence that are not involved in binding.
Max receptive field | RMSE | CI | Spearman
---|---|---|---
31 | 0.718 (0.002) | 0.732 (0.005) | 0.636 (0.013)
23 | 0.713 (0.008) | 0.732 (0.004) | 0.635 (0.008)
15 | 0.710 (0.006) | 0.734 (0.005) | 0.639 (0.011)
7 | 0.695 (0.009) | 0.740 (0.002) | 0.654 (0.005)
We also conducted an experiment to show that MGraphDTA uses both compound and protein information for DTA prediction rather than learning an inherent bias in the dataset, as reported in previous studies.66,67 In particular, we computed the activation values of each unit in the last layers of the two encoders (i.e., the outputs of the ligand and protein encoders described in Sections 2.3 and 2.4) on the Davis, filtered Davis, KIBA, and Metz datasets, respectively. A higher activation value indicates a larger contribution of the unit to the model's decision-making.68 Fig. 9 shows the distribution of these activation values from the protein and ligand encoders. It can be observed that MGraphDTA used both protein and ligand information to make inferences. However, the model was biased toward the protein information in the Davis dataset. The bias is partly due to the unbalanced distribution of binding affinities (labels) in the Davis dataset, shown in Fig. S1,† because it is lessened in the filtered Davis dataset, as shown in Fig. 9. More precisely, the Davis dataset contains 63 proteins for which every associated drug–target pair is labeled with a 10 μM bioactivity value (i.e., a pKd of 5). A predictive model may therefore simply output binding affinities around 10 μM for drug–target pairs involving these proteins, which biases the model toward the protein encoder. These proteins are removed in the filtered Davis dataset, so the bias is alleviated there. Another possible reason is that the Davis dataset contains only 68 compounds, which is insufficient to learn a robust drug encoder.
Fig. 9 Distribution of activation values of the last layers in the ligand and protein encoders on the Davis, filtered Davis, KIBA, and Metz datasets. |
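Such activation statistics can be collected with standard PyTorch forward hooks; below is a sketch in which `ligand_encoder.out` and `protein_encoder.out` are hypothetical handles to each encoder's final layer, and the loader is assumed to yield (mol_graph, protein_seq, label) triples.

```python
# Capture the activations of each encoder's final layer with forward hooks,
# to compare how strongly each branch drives the model's predictions.
import torch

def collect_last_layer_activations(model, loader):
    acts = {"ligand": [], "protein": []}
    hooks = [
        # `.out` is a hypothetical attribute naming each encoder's last layer.
        model.ligand_encoder.out.register_forward_hook(
            lambda m, i, o: acts["ligand"].append(o.detach().abs().flatten())),
        model.protein_encoder.out.register_forward_hook(
            lambda m, i, o: acts["protein"].append(o.detach().abs().flatten())),
    ]
    with torch.no_grad():
        for mol_graph, protein_seq, _ in loader:
            model(mol_graph, protein_seq)
    for h in hooks:
        h.remove()
    # Concatenate per-batch activation magnitudes for plotting distributions.
    return {k: torch.cat(v) for k, v in acts.items()}
```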
(1) Visualizing MGNN model based on Grad-AAM.
(2) Visualizing GAT model based on Grad-AAM.
(3) Visualizing GAT model based on graph attention mechanism.
Specifically, we first replaced the MGNN with a two-layer GAT, in which the first graph convolution layer had ten parallel attention heads, using the source code provided by GraphDTA.31 We then trained the MGNN-based and GAT-based DTA prediction models under the five-fold cross-validation strategy using the random split setting. Finally, we calculated the atom importance using Grad-AAM and the graph attention mechanism and rendered the probability map using RDKit.48
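For reference, the per-atom probability map can be rendered with RDKit's similarity-map utilities, as sketched below; API details vary slightly across RDKit versions, and the atom weights must follow RDKit's atom ordering for the molecule.

```python
# Render a per-atom importance map on the 2-D structure with RDKit.
# `weights` should align with RDKit's atom indexing for the molecule.
from rdkit import Chem
from rdkit.Chem.Draw import SimilarityMaps

def draw_atom_importance(smiles, weights, out_path="grad_aam.png"):
    mol = Chem.MolFromSmiles(smiles)
    # Returns a matplotlib figure with atoms shaded by their weights.
    fig = SimilarityMaps.GetSimilarityMapFromWeights(mol, list(weights))
    fig.savefig(out_path, bbox_inches="tight", dpi=300)
```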
Table 8 shows the quantitative results of MGNN and GAT. MGNN outperformed GAT by a notable margin (p < 0.01), which further corroborates the superiority of the proposed MGNN. Fig. 10 shows the visualization results for several molecules based on Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention (more examples can be found in ESI Fig. S4 and S5†). According to previous studies,71–75 epoxide,73 fatty acid,72,75 sulfonate,71 and aromatic nitroso74 are structural alerts that correlate with specific toxicological endpoints. We found that Grad-AAM (MGNN) indeed assigns the highest weights to these structural alerts. Grad-AAM (MGNN) can not only identify important small moieties, as shown in Fig. 10(a)–(d), but also reveal large moieties, as shown in Fig. 10(f), which demonstrates that the MGNN captures local and global structures simultaneously. Grad-AAM (GAT) also discerned the structural alerts shown in Fig. 10(b), (c), (e), and (f). However, Grad-AAM (GAT) sometimes failed to detect structural alerts, as in Fig. 10(a) and (d), and its highlighted regions are often more extensive and do not correspond exactly to the structural alerts, as in Fig. 10(b), (c), and (e). These results suggest that the hidden representations learned by GAT were insufficient to describe the molecules well. On the other hand, graph attention revealed only some atoms of the structural alerts, as shown in Fig. 10(c), (d), and (f). The attention map contained less information about the global structure of a molecule since it only considers the neighborhood of an atom.45 One advantage of graph attention is that it highlights atoms and bonds simultaneously, whereas Grad-AAM highlights only atoms. Fig. 11 shows the distribution of atom importance for Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention. The distribution for Grad-AAM (MGNN) was left-skewed, suggesting that MGNN concentrates on the particular substituents contributing most to toxicity while suppressing less essential substituents. We also found that Grad-AAM (GAT) tends to highlight extensive sets of atoms, consistent with the results shown in Fig. 10(b), (c), and (e). Conversely, the distribution of graph attention was narrow, with most values less than 0.5, suggesting that graph attention often failed to detect important substructures. It is worth noting that some studies utilize global attention mechanisms that drop all structural information of a graph to visualize a model and may also provide reasonable visual explanations.46,76 However, these global attention-based methods are model-specific and cannot easily be transferred to other graph models. Conversely, Grad-AAM is a universal visual interpretation method that can be easily transferred to other graph models. Moreover, the visual explanations produced by Grad-AAM may be further improved by applying regularization techniques during the training of MGraphDTA.77
Model | Proteins | Compounds | MSE | CI | r_m^2 index
---|---|---|---|---|---
GraphDTA | MCNN | GAT | 0.215 (0.007) | 0.843 (0.005) | 0.330 (0.007)
MGraphDTA | MCNN | MGNN | 0.176 (0.007) | 0.902 (0.005) | 0.430 (0.006)
Fig. 10 Atom importance revealed by Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention in structural alerts of (a) and (b) epoxide, (c) and (d) fatty acid, (e) sulfonate, and (f) aromatic nitroso.
Fig. 11 Distribution of atom importance for Grad-AAM (MGNN), Grad-AAM (GAT), and graph attention. Note that we do not consider the bond importance for Grad-AAM (GAT).
Overall, Grad-AAM tends to create more accurate explanations than the graph attention mechanism, which may offer biological interpretation to help us understand DL-based DTA prediction. Fig. 12 shows Grad-AAM (MGNN) applied to compounds with symmetrical structures. The resulting importance distributions were also symmetrical, which suggests that representing compounds as graphs and extracting their patterns with GNNs preserves the structures of the compounds.
Fig. 13 The receptive fields of layers 1, 2, and 3 of the GNN for the compound 4-propylcyclohexan-1-one. (a) The receptive field of atom C2. (b) The receptive field of atom C1.
Footnotes
† Electronic supplementary information (ESI) available: Details of machine learning construction, vertex features of graphs, data distributions, hyperparameters tuning, and additional visualization results. See DOI: 10.1039/d1sc05180f |
‡ Equal contribution. |
This journal is © The Royal Society of Chemistry 2022 |