Xinke Zhan,a Tiantao Liu,a Changqing Yu,b Yu-An Huang,c Zhuhong You c and Shirley W. I. Siu*a
aCentre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, Macau SAR, China. E-mail: shirleysiu@mpu.edu.mo
bSchool of Information Engineering, Xijing University, Xi'an 710123, China
cSchool of Computer Science, Northwestern Polytechnical University, Xi'an 710072, China
First published on 16th September 2025
Accurate prediction of drug–target interactions (DTIs) is indispensable for discovering novel drugs and repositioning existing ones. Recently, numerous deep learning-based methods have made promising progress in DTI prediction. However, these methods often rely on a single attention mechanism, which limits their ability to capture the complex features of both drugs and proteins. As a result, feature representations can be incomplete, and training can become more complex and prone to overfitting; together, these issues impair the generalizability of the model. To address these problems, we propose an end-to-end neural network approach for drug–target interaction prediction called Multi-perspective Attention AggRegating DTI (MAARDTI). A multi-perspective attention mechanism is introduced that combines channel attention and spatial attention to capture a more comprehensive feature representation, and a dual-context refocusing module is used to enhance the attention representation capability and improve the generalizability of the model. Experiments show that our proposed model outperforms ten state-of-the-art methods on three public datasets, achieving AUC values of 0.8975, 0.9248, and 0.9330 on DrugBank, Davis and KIBA, respectively. In cold-splitting tests with novel targets, drugs, and their bindings, MAARDTI performs on par with existing methods for cold-drug prediction, and it outperforms them in predicting unseen targets and bindings, underscoring the effectiveness of the novel multi-perspective attention mechanism in challenging scenarios. Hence, MAARDTI has the potential to serve as an effective tool for rapid identification of novel DTIs in drug research.
In recent years, a number of machine learning-based methods have been developed to predict DTIs.9–13 These methods typically use the amino acid sequences of proteins and the Simplified Molecular Input Line Entry System (SMILES) strings of drugs as input. For instance, Wang et al.14 reported a computational method that extracts feature vectors from drug structures and protein sequences and employs a rotation forest to predict DTIs. Li et al.15 obtained PSSM features from protein amino acid sequences and substructure fingerprints from drug chemical structures. To avoid the curse of dimensionality, the authors used principal component analysis (PCA) to reduce the feature dimensions, and finally applied the local binary pattern (LBP) to predict DTIs. The aforementioned methods have made significant progress in the field, particularly by narrowing the search space for potential drug–target candidates. However, they suffer from the limitation that manual feature selection is required prior to model construction. This process can introduce bias, leading to poor model generalizability and poor robustness to noisy data.
Recently, deep learning16,17 has become a research focus since it can learn latent feature representations through backpropagation without the need for manual feature engineering. Moreover, a number of deep learning frameworks have shown outstanding performance at affordable cost compared to classical machine learning methods. In this regard, Öztürk et al.18 proposed a model called DeepDTA, which considers only one-dimensional sequence representations. It includes two convolutional neural network (CNN) blocks to extract features from the amino acid sequences of proteins and the SMILES strings of drugs separately, and then feeds them into a fully connected network (FCN) to obtain the final prediction. Lee et al.19 developed DeepConv-DTI, which expands the diversity of proteins to include diverse protein lengths and various target protein classes. In these studies, extensive experiments were conducted to validate the effectiveness of CNN-based methods. The difference between DeepDTA and DeepConv-DTI is that DeepConv-DTI adopts the extended-connectivity fingerprint (ECFP) algorithm to extract drug features. Interestingly, neither method takes into account the interaction features of known protein–drug pairs; both treat the protein and drug separately. Meanwhile, Zheng et al.20 proposed a novel end-to-end deep learning framework called DrugVQA to predict DTIs, which frames the prediction task as a classical visual question answering problem. The method employs a dynamic convolutional neural network (DynCNN) and bidirectional long short-term memory (LSTM) to extract features from 2D pairwise distance maps and represents drugs using molecular linear notation. Zhu et al.21 reported a drug–target affinity prediction model named RRGDTA. The framework enhances correlations between molecular substructures and contextual features through a multi-scale interaction module (MSI), captures local structural correlations via a rotary encoding module (ROE), and preserves critical interaction patterns using an association prediction module (APM) with intra-mask retention (IMR). Wei et al.22 reported a model named LAM-DTI to address the sequence length discrepancy between drugs and targets, in which a learnable association information matrix dynamically adjusts to capture DTI pair information and effectively identify interactions.
In addition, owing to the remarkable performance of the transformer network, attention-based and BERT-based methods23,24 have also been successfully applied to DTI prediction.25–29 Notably, Huang et al.30 proposed a transformer-based model named MolTrans. In their work, a 2D binding map of proteins and drugs is used as input, and the molecular representation features are extracted by an augmented transformer module. Zhu et al.31 proposed TDGraphDTA, a transformer and diffusion-based model for drug–target affinity prediction. The framework integrates multi-scale information interaction to capture relationships between molecular substructures and employs a diffusion-based graph optimization module to enhance molecular graph representation and interpretability. Zhao et al.32 proposed an end-to-end model named HyperAttentionDTI, which adopts an attention mechanism over feature matrices. This method utilizes the original features of proteins and drugs, but its high spatio-temporal complexity and the limited receptive field of the CNN block constrain its performance. Bian et al.33 proposed a shared-weight multi-head cross-attention network, called MCANet, which uses the cross-attention mechanism to compute attended protein and drug features. Ouyang et al.34 introduced a BERT-inspired model called Pre-trained Multi-view Molecular Representations (PMMRs), an innovative neural network approach that leverages pre-trained models to enhance the generalizability and accuracy of drug–target binding predictions. By integrating multi-view molecular representations, they attempted to address the challenges posed by limited and diverse training data. These studies have proposed increasingly complex models with large numbers of parameters, which makes training particularly difficult and prone to overfitting. Moreover, these attention-based methods primarily focus on a single type of inter-subsequence or inter-substructure relationship while ignoring the positional features of the subsequences or substructures.
In view of the problems above, we propose an end-to-end neural network approach called MAARDTI for predicting DTIs. The drug SMILES strings and protein amino acid sequences are utilized as the input of our model. Two independent CNN blocks are then used to extract the protein and drug embedding features. In addition, different from the single attention mechanism of earlier studies, the MAAR module is designed to generate an aggregating matrix which fuses channel attention and spatial attention for strengthening subspace representation. Meanwhile, the bi-contextual refocusing module is adopted, which fuses attention matrices to obtain a multi-perspective attention feature representation, for improving attention generalizability. Finally, the drug and protein feature vectors are concatenated and fed into the prediction block. We conducted extensive experiments on several widely used benchmark datasets and compared our results with state-of-the-art methods. The superior prediction results of our method demonstrate the effectiveness of MAARDTI in predicting DTIs. Overall, this work has the following novelties: (i) A novel computational method outperforming state-of-the-art methods in DTI prediction is proposed. (ii) A multi-perspective attention aggregating module is designed, which captures the channel and spatial attention features to strengthen the feature learning ability of the model. (iii) A bi-contextual refocusing module including the drug-contextual refocusing block and the protein-contextual refocusing block is designed to further improve multi-dimensional feature expression of attention. (iv) To the best of our knowledge, this is the first attempt in which the fusion attention weight matrix and aggregation matrix are utilized for protein and drug subspace feature representations. We demonstrate that effective multi-perspective attention fusion can improve DTI prediction.
and the drug embedding matrix $D_{embed} \in \mathbb{R}^{L_{drug} \times D_d}$ are obtained, where $L_{prot}$ is the length of the protein sequence, $L_{drug}$ is the length of the drug string, and $D_p$ and $D_d$ are the embedding dimensions of the protein and drug, respectively. To enhance feature representation, we adopt two independent CNN blocks. Each CNN block contains three 1D-convolutional layers in which filters of different sizes are used to better capture important local dependencies. The CNN blocks for the protein and the drug can be formulated as follows:

$$L_p^{(l+1)} = \sigma\big(\mathrm{conv}(W_p^{(l)}, b_p^{(l)}, L_p^{(l)})\big) \qquad (1)$$

$$L_d^{(l+1)} = \sigma\big(\mathrm{conv}(W_d^{(l)}, b_d^{(l)}, L_d^{(l)})\big) \qquad (2)$$
When the drug embedding matrix $D_{embed}$ and the protein embedding matrix $P_{embed}$ pass through the CNN blocks, the protein feature matrix $F_{prot} \in \mathbb{R}^{L_{prot} \times m_p}$ and the drug feature matrix $F_{drug} \in \mathbb{R}^{L_{drug} \times m_d}$ are generated in the latent feature space, where $m_p$ and $m_d$ are the dimensions of the protein and drug feature vectors. Hence, these matrices contain both semantic information and spatially associated information among the features.
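To make the encoder concrete, the following is a minimal PyTorch sketch of one such CNN block corresponding to eqns (1) and (2); the embedding dimension, filter counts, and kernel sizes here are illustrative choices, not the settings reported in this work.

```python
import torch
import torch.nn as nn

class SequenceCNNBlock(nn.Module):
    """Three stacked 1D convolutions with different kernel sizes,
    mirroring eqns (1)-(2); hyperparameters are illustrative."""
    def __init__(self, embed_dim=128, channels=(32, 64, 96), kernels=(3, 5, 7)):
        super().__init__()
        layers, in_ch = [], embed_dim
        for out_ch, k in zip(channels, kernels):
            layers += [nn.Conv1d(in_ch, out_ch, kernel_size=k, padding=k // 2),
                       nn.ReLU()]
            in_ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        # x: (batch, length, embed_dim); Conv1d expects (batch, channels, length)
        return self.net(x.transpose(1, 2)).transpose(1, 2)

# Example: protein embeddings of length 1000, drug embeddings of length 100
prot = torch.randn(8, 1000, 128)
drug = torch.randn(8, 100, 128)
prot_feat = SequenceCNNBlock()(prot)   # (8, 1000, 96)
drug_feat = SequenceCNNBlock()(drug)   # (8, 100, 96)
```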
the matrix is first fed into the max-pooling layer and the average-pooling layer to obtain average-pooled and max-pooled features, respectively:

$$F_{avg} = \mathrm{avgpool}(F_{cnn}) \qquad (3)$$

$$F_{max} = \mathrm{maxpool}(F_{cnn}) \qquad (4)$$

Both descriptors are then passed through a weight-sharing network, a two-layer perceptron whose hidden dimension is reduced by a factor of ρ, where ρ denotes the reduction ratio. After each descriptor is fed through the weight-sharing network, an element-wise summation is applied to merge the output features into the channel attention map. The channel attention is computed as follows:

$$F_c(M) = \sigma\big(W_a(W_b(F_{avg})) + W_a(W_b(F_{max}))\big) \qquad (5)$$
where $W_a$ and $W_b$ are the weight matrices shared between both inputs $F_{avg}$ and $F_{max}$; a ReLU activation is applied after $W_b$ and before $W_a$. A convolution layer is employed to learn the 2D spatial attention matrix, which can be formulated as follows:

$$F_s(M) = \sigma\big(k^{n \times n}([F_{avg}; F_{max}])\big) \qquad (6)$$

where $k^{n \times n}$ denotes a convolution with an $n \times n$ kernel.
Finally, the channel attention matrix $F_c(M)$ and the spatial attention matrix $F_s(M)$ are multiplied to obtain the aggregating attention matrix:

$$F_{att} = F_c(M) \times F_s(M) \qquad (7)$$
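A compact PyTorch sketch of the MAAR aggregation in eqns (3)–(7) is given below; the pooling axes (length for channel attention, channels for spatial attention), the 1D form of the $k^{n \times n}$ convolution, and the broadcast used to realize the product in eqn (7) are our assumptions.

```python
import torch
import torch.nn as nn

class MAARAttention(nn.Module):
    """Sketch of eqns (3)-(7): channel attention from a shared two-layer
    MLP over pooled descriptors, spatial attention from a convolution over
    pooled maps, and their product as the aggregating matrix."""
    def __init__(self, channels=96, rho=8, kernel=7):
        super().__init__()
        self.Wb = nn.Linear(channels, channels // rho)   # reduce by rho
        self.Wa = nn.Linear(channels // rho, channels)   # restore
        self.relu = nn.ReLU()
        self.conv = nn.Conv1d(2, 1, kernel_size=kernel, padding=kernel // 2)

    def forward(self, f):                                # f: (B, L, C)
        # Channel attention, eqns (3)-(5): pool over the length dimension
        f_avg, f_max = f.mean(dim=1), f.max(dim=1).values
        fc = torch.sigmoid(self.Wa(self.relu(self.Wb(f_avg)))
                           + self.Wa(self.relu(self.Wb(f_max))))        # (B, C)
        # Spatial attention, eqn (6): pool over the channel dimension
        s_avg = f.mean(dim=2, keepdim=True)
        s_max = f.max(dim=2, keepdim=True).values
        fs = torch.sigmoid(self.conv(torch.cat([s_avg, s_max], dim=2)
                                     .transpose(1, 2)))                  # (B, 1, L)
        # Aggregating attention, eqn (7): broadcast product of the two maps
        return fs.transpose(1, 2) * fc.unsqueeze(1)                      # (B, L, C)

f = torch.randn(2, 100, 96)
f_att_map = MAARAttention()(f)   # (2, 100, 96)
```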
After obtaining the aggregation attention matrix $F_{att}$, the input drug feature matrix $F_{drug}$ and protein feature matrix $F_{prot}$ are processed by the bi-contextual refocusing module. Taking the drug features as an example, the drug feature map is first projected to the query matrix $Q_d$, the key matrix $K_d$ and the value matrix $V_d$; we then fuse the original attention matrix with the aggregation attention map. For each head, the combined attention result is calculated as follows:
$$\mathrm{head}_i^d = \mathrm{softmax}\!\left(\frac{Q_d K_d^{\mathrm{T}}}{\sqrt{d_k}} + F_{att}\right) V_d \qquad (8)$$

where $d_k$ is the dimension of each head.
Afterwards, the drug output of the multi-head attention operation is computed as follows:

$$\tilde{F}_{drug} = \mathrm{Concat}(\mathrm{head}_1^d, \ldots, \mathrm{head}_h^d)\, W_O^d \qquad (9)$$

where $W_O^d$ is the learnable matrix for feature mapping. We concatenate the outputs of all the heads and compute the final drug feature vector. Likewise, the protein operation is performed per head as follows:

$$\mathrm{head}_i^p = \mathrm{softmax}\!\left(\frac{Q_p K_p^{\mathrm{T}}}{\sqrt{d_k}} + F_{att}\right) V_p \qquad (10)$$
The protein output of the multi-head attention operation is as follows:

$$\tilde{F}_{prot} = \mathrm{Concat}(\mathrm{head}_1^p, \ldots, \mathrm{head}_h^p)\, W_O^p \qquad (11)$$
Hence, we concatenate the outputs of all the heads and obtain the final protein feature vector.
After applying the bi-contextual refocusing module, we obtain the drug augmentation matrix $\tilde{F}_{drug}$ and the protein augmentation matrix $\tilde{F}_{prot}$. The latent drug and protein feature matrices $F_D$ and $F_P$ are then updated as follows:

$$F_D = \alpha F_{drug} + \beta \tilde{F}_{drug} \qquad (12)$$

$$F_P = \alpha F_{prot} + \beta \tilde{F}_{prot} \qquad (13)$$

where $\alpha$ and $\beta$ are fusion weights balancing the raw and augmented features (see the parameter experiments below).
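The sketch below illustrates one head of the refocusing operation under our reading of eqns (8)–(13); treating $F_{att}$ as an additive $L \times L$ bias on the attention scores and using a single head are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RefocusingHead(nn.Module):
    """One attention head of the bi-contextual refocusing module: the
    aggregating attention is fused with the scaled dot-product scores
    (additive fusion is an assumption), and the refocused output is blended
    with the raw features using weights alpha and beta = 1 - alpha."""
    def __init__(self, dim=96, d_k=32, alpha=0.5):
        super().__init__()
        self.q = nn.Linear(dim, d_k)
        self.k = nn.Linear(dim, d_k)
        self.v = nn.Linear(dim, d_k)
        self.out = nn.Linear(d_k, dim)   # learnable mapping W_O, eqns (9)/(11)
        self.alpha, self.d_k = alpha, d_k

    def forward(self, feats, f_att):     # feats: (B, L, dim), f_att: (B, L, L)
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        scores = q @ k.transpose(1, 2) / self.d_k ** 0.5
        attn = F.softmax(scores + f_att, dim=-1)   # fused attention, eqns (8)/(10)
        refocused = self.out(attn @ v)             # eqns (9)/(11)
        # Blend raw and augmented features, eqns (12)/(13)
        return self.alpha * feats + (1 - self.alpha) * refocused

feats = torch.randn(2, 50, 96)
f_att = torch.zeros(2, 50, 50)             # placeholder aggregating bias
out = RefocusingHead()(feats, f_att)       # (2, 50, 96)
```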
511 validated positive DTIs in our DrugBank dataset. In order to obtain an equal number of negative DTIs, we randomly selected 17 511 unlabeled drug–target pairs, generating a complete dataset of 35 022 DTIs; a minimal sketch of this sampling is given after Table 1. The summary of the three benchmark datasets is shown in Table 1.
Table 1 Summary of the three benchmark datasets

| Datasets | Drug | Protein | Interaction | Positive | Negative |
|---|---|---|---|---|---|
| Davis | 68 | 379 | 25 772 | 7320 | 18 452 |
| KIBA | 2068 | 225 | 116 350 | 22 154 | 94 196 |
| DrugBank | 6655 | 4294 | 35 022 | 17 511 | 17 511 |
In addition, we experiment with the fusion weights α and β used in the bi-contextual refocusing module. We set up nine groups of parameter experiments, with α ranging from 0.1 to 0.9 and β set to 1 − α. As shown in Fig. 3, the model achieves peak performance across key metrics, including accuracy, AUC, and AUPR, when both α and β are set to 0.5. This validates our parameter selection, demonstrating that equal weighting optimally balances the contributions of the raw and augmented features.
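A minimal sketch of this sweep is shown below; `train_and_evaluate` is a placeholder for the actual training pipeline and returns a dummy objective here.

```python
def train_and_evaluate(alpha: float, beta: float) -> float:
    """Placeholder for the real pipeline: train MAARDTI with the given
    fusion weights and return the validation AUC."""
    return 1.0 - abs(alpha - 0.5)   # dummy objective peaking at alpha = 0.5

# Nine groups: alpha = 0.1 ... 0.9, beta = 1 - alpha
results = {round(0.1 * i, 1): train_and_evaluate(0.1 * i, 1 - 0.1 * i)
           for i in range(1, 10)}
best_alpha = max(results, key=results.get)
print(best_alpha, results[best_alpha])
```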
Table 2 Five-fold cross-validation results of MAARDTI on the DrugBank dataset

| Fold | Accuracy | Precision | Recall | AUC | AUPR |
|---|---|---|---|---|---|
| 1 | 0.8287 | 0.8232 | 0.8358 | 0.9028 | 0.9113 |
| 2 | 0.8169 | 0.8066 | 0.8323 | 0.8933 | 0.8941 |
| 3 | 0.8286 | 0.8237 | 0.8346 | 0.8969 | 0.9026 |
| 4 | 0.8291 | 0.8143 | 0.8513 | 0.8999 | 0.9070 |
| 5 | 0.8195 | 0.8135 | 0.8277 | 0.8944 | 0.9008 |
| Average | 0.8246 | 0.8163 | 0.8364 | 0.8975 | 0.9032 |
Table 3 Five-fold cross-validation results of MAARDTI on the Davis dataset

| Fold | Accuracy | Precision | Recall | AUC | AUPR |
|---|---|---|---|---|---|
| 1 | 0.8766 | 0.7946 | 0.7637 | 0.9216 | 0.8442 |
| 2 | 0.8791 | 0.7855 | 0.7903 | 0.9284 | 0.8570 |
| 3 | 0.8740 | 0.7868 | 0.7637 | 0.9234 | 0.8494 |
| 4 | 0.8731 | 0.7876 | 0.7575 | 0.9280 | 0.8527 |
| 5 | 0.8702 | 0.7751 | 0.7650 | 0.9226 | 0.8445 |
| Average | 0.8746 | 0.7859 | 0.7680 | 0.9248 | 0.8496 |
Table 4 Five-fold cross-validation results of MAARDTI on the KIBA dataset

| Fold | Accuracy | Precision | Recall | AUC | AUPR |
|---|---|---|---|---|---|
| 1 | 0.8988 | 0.7269 | 0.7617 | 0.9334 | 0.8213 |
| 2 | 0.9005 | 0.7289 | 0.7713 | 0.9342 | 0.8196 |
| 3 | 0.8989 | 0.7233 | 0.7708 | 0.9334 | 0.8209 |
| 4 | 0.9002 | 0.7340 | 0.7559 | 0.9328 | 0.8160 |
| 5 | 0.9067 | 0.7350 | 0.7590 | 0.9305 | 0.8064 |
| Average | 0.8998 | 0.7296 | 0.7637 | 0.9330 | 0.8168 |
As shown in Table 2, our proposed method yields good performance on the DrugBank dataset, with high average values of 0.8246 for accuracy, 0.8163 for precision, 0.8364 for recall, 0.8975 for AUC, and 0.9032 for AUPR. When tested on the imbalanced Davis dataset (Table 3), it achieves high average values of 0.8746 for accuracy, 0.7859 for precision, 0.7680 for recall, 0.9248 for AUC, and 0.8496 for AUPR. Similarly, on the KIBA dataset (Table 4), our method achieves high average values of 0.8998 for accuracy, 0.7296 for precision, 0.7637 for recall, 0.9330 for AUC, and 0.8168 for AUPR. Meanwhile, the ROC and PR curves for these three datasets are shown in Fig. 4 for visual comparison.
Fig. 4 The five-fold ROC curves and PR curves for the DrugBank (a and b), Davis (c and d) and KIBA (e and f) datasets.
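For reference, the five reported metrics can be computed per test fold with scikit-learn as sketched below; computing AUPR as average precision is a common estimator of the PR-curve area and our assumption here.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             precision_score, recall_score, roc_auc_score)

def fold_metrics(y_true, y_score, threshold=0.5):
    """Compute the five reported metrics for one test fold."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy":  accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall":    recall_score(y_true, y_pred),
        "auc":       roc_auc_score(y_true, y_score),
        "aupr":      average_precision_score(y_true, y_score),
    }

# Toy example with random labels and scores
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = rng.random(200)
print(fold_metrics(y_true, y_score))
```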
We train the ten baseline models on the DrugBank dataset. The five-fold cross-validation results are summarized in Table 5. Our model achieves improvements of 0.66%, 0.71%, 0.55%, 0.63% and 0.69% in accuracy, precision, recall, AUC and AUPR, respectively, over the best baseline model, MCANet. These results indicate that our model predicts DTIs more accurately.
Table 5 Performance comparison with baseline models on the DrugBank dataset

| Models | Accuracy | Precision | Recall | AUC | AUPR |
|---|---|---|---|---|---|
| NB | 0.5415 | 0.5468 | 0.5415 | 0.5417 | 0.6248 |
| KNN | 0.6736 | 0.6750 | 0.6736 | 0.6736 | 0.7526 |
| DeepDTA | 0.7772 | 0.7615 | 0.8052 | 0.8607 | 0.8679 |
| DeepConv-DTI | 0.7990 | 0.7942 | 0.7990 | 0.8683 | 0.8682 |
| MolTrans | 0.7621 | 0.7355 | 0.8264 | 0.8403 | 0.8407 |
| TransformerCPI | 0.7838 | 0.7722 | 0.8116 | 0.8565 | 0.8606 |
| HyperAttentionDTI | 0.8098 | 0.8026 | 0.8221 | 0.8900 | 0.8935 |
| MCANet | 0.8180 | 0.8092 | 0.8309 | 0.8912 | 0.8963 |
| Rep-ConvDTI | 0.7773 | 0.7620 | 0.7969 | 0.8590 | 0.8634 |
| MGNDTI | 0.8089 | 0.8037 | 0.8148 | 0.8816 | 0.8826 |
| MAARDTI | 0.8246 | 0.8163 | 0.8364 | 0.8975 | 0.9032 |
| MCANet-B | 0.8477 | 0.8461 | 0.8488 | 0.9089 | 0.9109 |
| MAARDTI-E | 0.8524 | 0.8476 | 0.8581 | 0.9109 | 0.9203 |
We then train and assess our proposed model on the Davis dataset. Unlike DrugBank, this is an imbalanced dataset, which is notoriously difficult to train on and can lead to unrealistically high precision but low recall. As shown in Table 6, our model demonstrates enhanced performance on most metrics, achieving improvements of 0.55%, 0.42%, 0.26%, and 0.89% in accuracy, recall, AUC and AUPR, respectively. It is noteworthy that although the precision of our model is slightly lower than that of DeepDTA, its recall is 9.3% higher, suggesting that our training strategy effectively addresses the challenge posed by dataset imbalance and improves generalization ability.
Table 6 Performance comparison with baseline models on the Davis dataset

| Models | Accuracy | Precision | Recall | AUC | AUPR |
|---|---|---|---|---|---|
| NB | 0.6082 | 0.6429 | 0.6082 | 0.5641 | 0.4845 |
| KNN | 0.7240 | 0.6856 | 0.7240 | 0.5477 | 0.4703 |
| DeepDTA | 0.8568 | 0.7898 | 0.6776 | 0.9145 | 0.8337 |
| DeepConv-DTI | 0.8590 | 0.7761 | 0.7000 | 0.9163 | 0.8304 |
| MolTrans | 0.7847 | 0.6387 | 0.7121 | 0.8628 | 0.7299 |
| TransformerCPI | 0.8345 | 0.7400 | 0.6423 | 0.8863 | 0.7802 |
| HyperAttentionDTI | 0.8579 | 0.7428 | 0.7642 | 0.9142 | 0.8318 |
| MCANet | 0.8691 | 0.7750 | 0.7602 | 0.9222 | 0.8407 |
| Rep-ConvDTI | 0.8663 | 0.7983 | 0.7159 | 0.9222 | 0.8439 |
| MGNDTI | 0.8244 | 0.6553 | 0.7359 | 0.9093 | 0.8266 |
| MAARDTI | 0.8746 | 0.7859 | 0.7680 | 0.9248 | 0.8496 |
| MCANet-B | 0.8919 | 0.8260 | 0.7848 | 0.9441 | 0.8804 |
| MAARDTI-E | 0.8946 | 0.8354 | 0.7835 | 0.9480 | 0.8956 |
Furthermore, we train and assess our model on the KIBA dataset, which is also imbalanced and contains nearly four times as many interaction pairs as the Davis dataset. Table 7 summarizes the comparative performance. Our model achieves improvements of 0.42% in accuracy, 2.30% in recall, 0.22% in AUC and 0.42% in AUPR over the best baseline, MCANet. Meanwhile, the precision of our model is again lower than that of DeepConv-DTI but with a higher recall. This result further confirms that our training strategy is effective and that the resulting model is superior on imbalanced datasets. Finally, we examine the performance of the ensemble models. It is noteworthy that MAARDTI-E improves over its non-ensemble counterpart by 3.4%, 2.3%, and 1.6% on the DrugBank, Davis, and KIBA datasets, respectively, and performs on par with MCANet-B.
Table 7 Performance comparison with baseline models on the KIBA dataset

| Models | Accuracy | Precision | Recall | AUC | AUPR |
|---|---|---|---|---|---|
| NB | 0.6395 | 0.7231 | 0.6395 | 0.5570 | 0.3885 |
| KNN | 0.8206 | 0.7911 | 0.7206 | 0.5600 | 0.4667 |
| DeepDTA | 0.8931 | 0.7738 | 0.6324 | 0.9223 | 0.7935 |
| DeepConv-DTI | 0.7208 | 0.7967 | 0.6582 | 0.9332 | 0.8212 |
| MolTrans | 0.8891 | 0.7042 | 0.7353 | 0.9232 | 0.7949 |
| TransformerCPI | 0.8828 | 0.7087 | 0.6679 | 0.9070 | 0.7640 |
| HyperAttentionDTI | 0.8775 | 0.6730 | 0.7149 | 0.9161 | 0.7721 |
| MCANet | 0.8956 | 0.7277 | 0.7407 | 0.9308 | 0.8126 |
| Rep-ConvDTI | 0.8979 | 0.7763 | 0.6547 | 0.9255 | 0.8039 |
| MGNDTI | 0.8448 | 0.5611 | 0.7766 | 0.9227 | 0.8191 |
| MAARDTI | 0.8998 | 0.7296 | 0.7637 | 0.9330 | 0.8168 |
| MCANet-B | 0.9132 | 0.7852 | 0.7572 | 0.9488 | 0.8588 |
| MAARDTI-E | 0.9143 | 0.7788 | 0.7771 | 0.9498 | 0.8609 |
• MAARDTI-OA: the MAAR block is removed and the output features of the CNN blocks are fed directly into the corresponding transformer modules.
• MAARDTI-PA: the drug MAAR block is removed and the protein MAAR block is retained. For the protein, the basic framework is the same as in our proposed model, but the output features of the drug CNN block are fed directly into the drug-contextual refocusing module.
• MAARDTI-DA: in contrast to MAARDTI-PA, the protein MAAR block is removed and the drug MAAR block is retained for predicting drug–target interactions.
In addition, statistical tests (t-tests) are conducted to evaluate the significance of the improvements achieved by MAARDTI over each variant. Table 8 presents the prediction results of the variant models on the DrugBank, Davis and KIBA datasets over five random training runs. Lower p values correspond to higher t values, with p values below 0.05 indicating statistical significance. The results in Table 8 show that MAARDTI consistently outperforms its variants (MAARDTI-OA, MAARDTI-PA, and MAARDTI-DA) across the five random training runs, and these improvements are statistically significant, further verifying the effectiveness and reliability of the model. The ROC curves for the three datasets are shown in Fig. 5, and the attention heatmaps and t-SNE feature distributions are plotted in Fig. 6. The heatmaps show the distribution of attention weights of the different models over drug and protein features, where brighter colors indicate higher attention weights, i.e., features to which the model pays more attention. The heatmap of MAARDTI shows a more uniform and dispersed attention distribution. The t-SNE plots show the dimensionality-reduced representation of DTI pair features in two-dimensional space, with blue points representing true interactions and red points representing non-interactions. The t-SNE plot of MAARDTI shows clearer clustering, with more distinct boundaries between true interactions and non-interactions, indicating that MAARDTI better distinguishes drug–protein interactions from non-interactions. Its advantages in feature capture and classification boundaries enable more accurate prediction, making MAARDTI a more reliable and effective tool for drug discovery and biomedical research.
Table 8 Ablation results of MAARDTI and its variants on the three datasets

| Dataset | Methods | Accuracy | AUC | AUPR | t-Value | p-Value |
|---|---|---|---|---|---|---|
| DrugBank | MAARDTI-OA | 0.8204 | 0.8889 | 0.8979 | 3.485 | <0.05 |
| DrugBank | MAARDTI-PA | 0.8172 | 0.8905 | 0.8970 | 2.332 | <0.05 |
| DrugBank | MAARDTI-DA | 0.8131 | 0.8861 | 0.8948 | 8.829 | <0.005 |
| DrugBank | MAARDTI | 0.8222 | 0.8948 | 0.9031 | — | — |
| Davis | MAARDTI-OA | 0.8642 | 0.9160 | 0.8253 | 3.886 | <0.005 |
| Davis | MAARDTI-PA | 0.8653 | 0.9161 | 0.8237 | 3.774 | <0.05 |
| Davis | MAARDTI-DA | 0.8639 | 0.9173 | 0.8318 | 3.012 | <0.05 |
| Davis | MAARDTI | 0.8703 | 0.9239 | 0.8454 | — | — |
| KIBA | MAARDTI-OA | 0.8912 | 0.9240 | 0.7960 | 2.367 | <0.05 |
| KIBA | MAARDTI-PA | 0.8899 | 0.9249 | 0.7955 | 2.897 | <0.05 |
| KIBA | MAARDTI-DA | 0.8872 | 0.9225 | 0.7901 | 3.107 | <0.05 |
| KIBA | MAARDTI | 0.8982 | 0.9322 | 0.8117 | — | — |
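The significance testing described above can be sketched as follows; the paired form of the t-test and the per-run AUC values are our assumptions, since the raw per-run numbers are not reported.

```python
from scipy import stats

# AUC values from five random training runs (illustrative numbers only)
maardti = [0.8948, 0.8930, 0.8955, 0.8941, 0.8966]
variant = [0.8889, 0.8875, 0.8901, 0.8880, 0.8900]

# Paired t-test across matched runs of the full model and a variant
t_stat, p_value = stats.ttest_rel(maardti, variant)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}, significant: {p_value < 0.05}")
```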
Fig. 5 The ROC curves of MAARDTI and the variant models for the ablation experiments on (a) DrugBank, (b) Davis, and (c) KIBA.
Fig. 6 Comparison of attention heatmaps and t-SNE visualization results of the four variant models of MAARDTI.
| Drug | True positive | True negative | False positive | False negative | Accuracy |
|---|---|---|---|---|---|
| DB00157-NADH | 10 | 8 | 0 | 2 | 90% |
| DB00589-lisuride | 10 | 9 | 0 | 1 | 95% |
| Protein | True positive | True negative | False positive | False negative | Accuracy |
|---|---|---|---|---|---|
| P21728-DRD1 | 10 | 9 | 0 | 1 | 95% |
| P25100-ADRA1D | 10 | 9 | 0 | 1 | 95% |
Furthermore, to avoid biases caused by the coincidental inclusion of sequence- or structurally-similar proteins across the training and test sets, we adopt MMseqs2 (ref. 49) to cluster the protein sequences in the DrugBank dataset. We then randomly remove a cluster of targets containing at least five targets from the dataset and use it as the test set; the remaining data are used as the training set.
For protein clustering, we investigate the effects of the clustering parameters on model performance to simulate different real-application scenarios. The sequence identity (seq-id), coverage (c) and coverage mode (cov-mode) are three basic parameters of MMseqs2. We build three cluster groups based on these parameters (Fig. 7; a minimal invocation sketch follows the list below):
Fig. 7 The cluster size distribution of the (A) strict, (B) balanced, (C) loose groups of the DrugBank dataset.
(i) Strict group: this group has high similarity and coverage in each cluster and is suitable for assessing the risk of overfitting the model on closely related targets. The seq-id is set to 0.9, c is set to 0.9 and cov-mode is set to 1.
(ii) Balanced group: the number and size of clusters in this group are moderate. It reflects the real protein family distribution and is suitable for evaluating the universality of the model. The seq-id is set to 0.6, c is set to 0.7 and cov-mode is set to 2.
(iii) Loose group: this group has low similarity and low coverage, which can test the generalization ability of the model for cross-family targets. The seq-id is set to 0.3, c is set to 0.5 and cov-mode is set to 3.
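A minimal sketch of the three clustering runs using the MMseqs2 `easy-cluster` workflow is shown below; the FASTA path and output prefixes are placeholders.

```python
import subprocess

# The three clustering regimes; flag values follow the list above.
GROUPS = {
    "strict":   {"min_seq_id": 0.9, "c": 0.9, "cov_mode": 1},
    "balanced": {"min_seq_id": 0.6, "c": 0.7, "cov_mode": 2},
    "loose":    {"min_seq_id": 0.3, "c": 0.5, "cov_mode": 3},
}

for name, p in GROUPS.items():
    # easy-cluster: FASTA in, cluster results out, tmp working directory
    subprocess.run([
        "mmseqs", "easy-cluster", "drugbank_proteins.fasta", f"clu_{name}", "tmp",
        "--min-seq-id", str(p["min_seq_id"]),
        "-c", str(p["c"]),
        "--cov-mode", str(p["cov_mode"]),
    ], check=True)
```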
Our model (MAARDTI) and MCANet (the second-best model in our benchmark) are tested in all clustering groups and the prediction results are shown in Table 10. Our model achieves a prediction accuracy of 78.12% for the strict group, 87.5% for the loose group, and 93.75% for the balanced group. Compared to its performance in the DrugBank benchmark (see Table 5) with an accuracy of 82.2%, the model has only a 5% lower performance in predicting unseen targets. However, based on the loose group result, the model has an improved accuracy of 6% in predicting distant targets. This indicates that our model has good generalizability. We focus more on AUC and AUPR since the number of positive and negative examples per protein is imbalanced, and accuracy could be misleading. We found that the AUC and AUPR values for the loose group were higher than those for the balanced and strict groups. We speculate that this is due to the following reasons: (i) the clustering structure of the loose group is looser, which allows the model to more flexibly capture the differences and connections between different samples when processing the data. (ii) The clustering structure of the loose group may be more consistent with the actual data distribution. (iii) The clustering structure of the loose group may allow the model to be exposed to a more diverse combination of samples during training. Compared to MCANet, our method is superior in accuracy by 8–40% in all clustering groups. These results show that our method is highly adaptive and robust in predicting test targets with different degrees of similarity to those in the training set.
Table 10 Prediction results of MAARDTI and MCANet on the three clustering groups of the DrugBank dataset

| Cluster groups | Method | Accuracy | AUC | AUPR |
|---|---|---|---|---|
| Strict | MCANet | 0.5625 | 0.8118 | 0.6977 |
| Strict | Ours | 0.7812 | 0.8275 | 0.7271 |
| Loose | MCANet | 0.6562 | 0.7460 | 0.8081 |
| Loose | Ours | 0.8750 | 0.9667 | 0.9437 |
| Balanced | MCANet | 0.8125 | 0.8213 | 0.6623 |
| Balanced | Ours | 0.9375 | 0.8841 | 0.8722 |
Fig. 8 The predictive performance of MAARDTI compared with four existing models on cold-splitting dataset prediction.
700 unique olfactory receptor (OR) and odor molecule pairs, carefully collated and analyzed from 31 scientific papers. It covers 11 different mammalian species, including 1237 unique OR sequences and 596 different molecules. To evaluate the ability of our model to predict for unseen proteins, we report the performance for the i.i.d. case and, following previous work,53 predict in two scenarios, random and cluster, i.e., a randomly selected individual OR or a group of structurally similar ORs is placed in the test set while its occurrences are removed from the training set. The prediction results are presented in Table 11 (the values for Hladiš et al. are taken directly from their work53). Our model generally performs better in precision but has low recall, indicating that it is more conservative in positive-sample prediction and may miss some true positives. As expected, predictions in the random scenarios are overall better than in the cluster scenarios, in line with the model behavior reported by Hladiš et al.53 In comparison, while our model improves MCC by 19–83% in the cluster scenarios, the model by Hladiš et al. shows better generalization ability with 22–43% higher MCC in the random scenarios. In the cluster group, the prediction results show that our model generalizes better at the molecular level but has difficulty handling a wider range of entities. In conclusion, the weak performance of both MAARDTI and the model of Hladiš et al. indicates that the M2OR dataset is highly challenging. Further efforts will focus on enhancing recall and generalization to unseen data.
Table 11 Prediction results on the M2OR dataset under different splits

| Split | Subset | Method | AveP | Precision | Recall | F-Score | MCC |
|---|---|---|---|---|---|---|---|
| i.i.d. | — | Hladiš et al.53 | 0.780 | 0.689 | 0.698 | 0.693 | 0.605 |
| i.i.d. | — | Ours | 0.700 | 0.700 | 0.595 | 0.643 | 0.555 |
| Cluster | Molecule | Hladiš et al.53 | 0.580 | 0.544 | 0.342 | 0.418 | 0.334 |
| Cluster | Molecule | Ours | 0.423 | 0.795 | 0.242 | 0.371 | 0.399 |
| Cluster | OR | Hladiš et al.53 | 0.558 | 0.535 | 0.132 | 0.203 | 0.088 |
| Cluster | OR | Ours | 0.186 | 0.545 | 0.055 | 0.099 | 0.147 |
| Cluster | OR-keep | Hladiš et al.53 | 0.625 | 0.576 | 0.095 | 0.161 | 0.091 |
| Cluster | OR-keep | Ours | 0.190 | 0.211 | 0.227 | 0.173 | 0.167 |
| Random | Molecule | Hladiš et al.53 | 0.729 | 0.657 | 0.629 | 0.638 | 0.533 |
| Random | Molecule | Ours | 0.445 | 0.633 | 0.344 | 0.446 | 0.409 |
| Random | OR | Hladiš et al.53 | 0.684 | 0.636 | 0.491 | 0.552 | 0.417 |
| Random | OR | Ours | 0.610 | 0.757 | 0.237 | 0.361 | 0.323 |
| Random | OR-keep | Hladiš et al.53 | 0.710 | 0.670 | 0.470 | 0.548 | 0.430 |
| Random | OR-keep | Ours | 0.526 | 0.736 | 0.147 | 0.246 | 0.242 |
Although our proposed model improves prediction performance, it still has some limitations: (i) it can only predict whether a protein and a drug interact, without providing insight into the underlying interaction mechanism. (ii) The potential of the multi-perspective attention mechanism for interpretability requires further exploration. (iii) While our optimization strategy proved effective, there is room for improvement; integrating automated hyperparameter tuning tools could streamline the optimization process and potentially uncover better configurations. (iv) The current single-feature framework could limit the predictive performance of the model, and other embedding methods could be explored in future work. For example, pre-trained protein language models have been shown to strengthen sequence latent representations and can be combined with other sequence semantic information to obtain a more comprehensive representation. For drugs, graph representations, drug fingerprints, and drug motif representations can increase the diversity of feature representations and further improve the reliability of predictions. In the future, we would also like to apply MAARDTI to AI-based virtual screening of therapeutic targets for hit identification and drug repurposing, followed by validation through wet-lab experiments.
Supplementary information: detailed prediction results from the MAARDTI model and representative case studies of the model's predictive performance on drug–target interaction tasks. See DOI: https://doi.org/10.1039/d5dd00311c.
This journal is © The Royal Society of Chemistry 2025