Open Access Article
Hongli Hou,ab Qi Wei,b Dian Huang,*b Minglu Zhao,b Hongliang Duana and Shengzhong Feng*b
aFaculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, 999078, Macao, China
bGuangdong Institute of Intelligence Science and Technology, Hengqin, Zhuhai, 519031, Guangdong, China. E-mail: huangdian@gdiist.cn; fengshengzhong@gdiist.cn
First published on 10th April 2026
Accurate prediction of drug-target affinity (DTA) is crucial for drug screening and reducing drug development costs. Despite the significant progress made by deep learning methods in DTA prediction, most existing approaches neglect two critical factors: the influence of drug molecular size and the differing contributions of amino acid residues to DTA. Here, we propose an affinity prediction model, MolRes-DTA, which introduces multiview drug characterization and a dynamic residue-aware network to capture the influence of molecular size on affinity prediction and to weigh the contributions of different residues. Experiments on the Davis and KIBA datasets demonstrate that MolRes-DTA reduces prediction error relative to the baseline model by 15.58% and 20.11%, respectively. Further analysis shows that our multiview drug representation improves prediction accuracy across different molecular sizes, with particularly notable gains for larger compounds. To our knowledge, this is the first study to explore the impact of molecular size on DTA prediction, providing a novel perspective for enhancing the accuracy of DTA prediction.
The increasing availability of DTA datasets and rapid progress in machine learning have expanded the scope and capability of affinity prediction.14 Early machine learning studies relied on handcrafted descriptors for both drugs and proteins, whereas deep learning has introduced more expressive feature extractors and more flexible nonlinear interaction models. Recent work has advanced along two main directions: the expansion of input modalities, which broadens from linear sequences to graph-based structural representations, and the innovation of model architectures, which capture increasingly complex biochemical relationships.15,16
From a feature perspective, DTA prediction has progressively expanded information dimensions with the rapid advancement of deep learning methods.17 Early SMILES-based works, DeepDTA and WideDTA, treated drugs as one-dimensional sequences,18,19 and relied on convolutional or language-inspired encoders to capture local sequence motifs. Subsequent works such as ArkDTA, FingerDTA and TEFDTA, introduced molecular fingerprints to inject domain knowledge and improve chemical interpretability, demonstrating how predefined substructure encodings and global descriptors provide complementary information to sequence-derived features.20–22
As a further advance in DTA prediction, the adoption of explicit graph representations of small molecules, exemplified by GraphDTA and NG-DTA, has shown that graph neural networks can capture atomic topology and bonding patterns more effectively than one-dimensional encodings.23,24 More recent graph-centered methods emphasize richer intra-molecular relations and multi-scale interactions. For example, TDGraphDTA and MDCT-DTA incorporated modules for substructure interaction and enhanced node-level expressivity.25,26 To jointly model chemical and structural determinants of binding, PocketDTA extends this line of work by combining atomic-level graph connectivity with three-dimensional pocket-aware information.27 Despite these developments, a central challenge remains: how to capture atomic-level topological connectivity while also acquiring global functional substructure information from complementary molecular modalities, and how to integrate these views to overcome the limitations of single-modal representations in functional and structural characterization. We further note that drug molecular size may affect pharmacokinetics and delivery, including administration route, metabolic clearance and tissue penetration,28 and that such effects could in turn influence the behavior of DTA prediction models.
From a model perspective, protein feature modeling has progressed markedly from local feature capture to global context modeling. Early convolutional encoders, such as MT-DTI and MATT-DTI, used multi-layer and three-layer CNNs, respectively, to extract local residue motifs from primary sequences.29,30 Later, to link local residue signals with overall sequence context, GDilatedDTA applied a bidirectional LSTM to perform recurrent information interaction.31 With the emergence of the Transformer architecture, TransformerCPI enabled explicit modeling of long-range dependencies through self-attention and in practice has improved the ability to localize interaction regions between protein sequences and ligand atoms.32 Subsequently, to capture both short-range substructure features and long-range contextual relationships, hybrid architectures such as MT-DTA and RRGDTA combined convolutional blocks with attention mechanisms for local–global feature extraction.33,34 With the advancement of large language models in recent years, AttentionMGT-DTA and LLMDTA used the pretrained model ESM-2 to further enrich protein embeddings and improve generalization on downstream tasks.35,36 Despite these advances, most existing methods treat amino acid residues as uniformly contributing units, which can dilute signals from functionally critical residues.
To address the challenges identified above, we propose MolRes-DTA, a molecular-modality fusion and residue-aware model for DTA prediction. On the drug side, MolRes-DTA integrates a hybrid graph neural module, which captures fine-grained atomic topology with a fingerprint branch that encodes global functional semantics. This multiview fusion enables the model to represent both local bonding patterns and overall substructure chemistry, improving robustness for small and structurally complex molecules. On the protein side, we introduce a Dynamic Residue-aware Network (DRN) that combines learnable residue-wise modulation with self-attention and multi-scale convolution, allowing the model to amplify functionally relevant residues while attenuating less informative regions. An overview of the MolRes-DTA architecture is presented in Fig. 1.
Our contributions are summarized as follows:
• We present MolRes-DTA, a model that integrates a dual-modality drug encoder and a residue-aware protein network. On the drug side, the model combines atomic-level graph representations and fingerprint-derived global semantics to mitigate the limitations of single-view encodings. On the protein side, the Dynamic Residue-aware Network assigns context-dependent weights to residues to better highlight functionally important regions.
• We identify and characterize a relationship between drug SMILES length and DTA prediction performance. A molecular-size stratification analysis reveals performance gaps across molecular classes and offers suggestions for designing models and datasets that are robust to molecular-size variation.
• We provide extensive experimental and biological validation, including ablation studies, molecular-size analysis, visualization of DRN-derived residue importance, comparison with docking results, and molecular dynamics simulations. Experimental results show that MolRes-DTA improves predictive accuracy while providing reasonable interpretability.
The Davis dataset contains 30,056 drug-target interaction pairs involving 68 compounds and 442 proteins, with binding affinities measured by the dissociation constant (Kd). We adopted the corrected protein sequences provided by Li et al.22 to ensure data consistency and integrity.
The KIBA dataset includes 118,254 drug-target interaction pairs between 2111 small molecules and 229 protein targets. Binding affinities are represented by KIBA scores, which integrate multiple biochemical measurements, including Kd, the inhibition constant Ki, and the half-maximal inhibitory concentration IC50. We used the preprocessed and categorized version published by Li et al.,22 enabling consistent partitioning and fair model comparison.
The BindingDB dataset includes 2.7 million affinity records involving over 9000 proteins and 1.2 million small molecules. After the removal of ambiguous and duplicate entries by Li et al.,22 the curated version used in this study comprises 80,324 compounds, 5561 targets, and 1,254,402 interactions.
To ensure a fair and reproducible comparison with the baseline, we adopted the dataset partitioning procedures from TEFDTA. For Davis and KIBA, we used the fixed train/test split lists released by TEFDTA, corresponding to a random interaction-pair split with a 5:1 train:test ratio. For BindingDB, we used the predefined split provided by the TEFDTA benchmark without re-splitting to maintain strict comparability. More detailed information is shown in Table S1.
| MS = MACCS(S), MS ∈ {0, 1}167 | (1) |
This representation provides fixed dimensionality and strong interpretability, effectively mitigating the interference caused by varying molecular lengths during model training.
To further capture dependencies between substructures, we introduce a Transformer encoder for global modeling. Specifically, the input fingerprint matrix is projected into query, key, and value spaces:
| QD = MS × WQ, KD = MS × WK, VD = MS × WV | (2) |
The multi-head self-attention mechanism is then computed as:
| Attention(QD, KD, VD) = softmax(QDKDᵀ/√dk)VD | (3) |
Finally, a max-pooling operation is applied to the attention output to generate the fingerprint-level drug representation:
| D1 = MaxPooling(Attention(QD, KD, VD)) | (4) |
This vector encodes contextual dependencies among molecular substructures and enhances the capacity of the model to represent functional fragments of the drug.
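The fingerprint branch in eqs (2)–(4) can be illustrated with a minimal single-head NumPy sketch. This is not the authors' implementation; the embedding dimension, random initialization, and single-head simplification are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fingerprint_attention(M_s, W_q, W_k, W_v):
    """Single-head sketch of eqs (2)-(4): project the fingerprint
    matrix into Q/K/V, apply scaled dot-product attention, then
    max-pool to get a fixed-size drug vector D1."""
    Q, K, V = M_s @ W_q, M_s @ W_k, M_s @ W_v           # eq (2)
    d_k = K.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V    # eq (3)
    return A.max(axis=0)                                # eq (4): max pooling

# Toy input: 167 MACCS keys, each embedded as a 32-dim row (shapes are illustrative).
M_s = rng.standard_normal((167, 32))
W_q, W_k, W_v = (rng.standard_normal((32, 32)) for _ in range(3))
D1 = fingerprint_attention(M_s, W_q, W_k, W_v)
print(D1.shape)  # (32,)
```

The max pooling collapses the substructure axis, so D1 has a fixed dimensionality regardless of how many fingerprint bits are set.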
Each drug SMILES string is parsed to extract five atomic attributes—atomic number, degree, formal charge, number of radical electrons, and aromaticity—as node features V, and an adjacency matrix E is constructed based on covalent bonds, forming an undirected graph G = (V, E). A GCN is employed to extract local structural features. At each GCN layer, the representation of a node is updated by aggregating information from its neighbors:
| hi(l+1) = σ(Σj∈N(i) W(l)hj(l)) | (5) |
where hi(l+1) denotes the updated embedding of node i, and hj(l) represents the embeddings of its neighboring nodes N(i). The propagation rule of GCN is based on the graph Laplacian and can be expressed as:
| H(l+1) = σ(D̃−1/2ÃD̃−1/2H(l)W(l)) | (6) |
where Ã = A + I is the adjacency matrix with added self-loops, D̃ is the corresponding degree matrix, H(l) is the node representation at the l-th layer, W(l) is a learnable weight matrix, and σ is the activation function.
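A minimal NumPy sketch of the propagation rule in eq (6) on a toy three-atom path graph (identity weights and ReLU are illustrative choices, not the trained parameters):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step, eq (6):
    H' = sigma(D~^{-1/2} A~ D~^{-1/2} H W), with ReLU as sigma."""
    A_tilde = A + np.eye(A.shape[0])           # add self-loops: A~ = A + I
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))     # D~^{-1/2}
    H_next = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
    return np.maximum(H_next, 0.0)             # ReLU activation

# Toy molecule: path graph 0-1-2 (two covalent bonds), 4-dim node features.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.ones((3, 4))
W = np.eye(4)
H_out = gcn_layer(A, H, W)
print(H_out.shape)  # (3, 4)
```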
After obtaining the initial node embeddings via GCN, we further apply a Graph Attention Network (GAT) to capture more complex, asymmetric dependencies between nodes. The GAT layer assigns different attention weights to neighbor nodes and is formulated as:
| hi′ = σ(Σj∈N(i) α(hi, hj)Whj) | (7) |
Specifically, for each neighboring node j, the attention coefficient is computed as:
| eij = LeakyReLU(aᵀ[Whi ‖ Whj]) | (8) |
The coefficients are then normalized across the neighborhood using the softmax function:
| αij = exp(eij)/Σk∈N(i) exp(eik) | (9) |
The final node representation is obtained as a weighted sum over its neighbors:
| hi′ = σ(Σj∈N(i) αijWhj) | (10) |
After joint GCN and GAT encoding, we apply mean pooling over all node embeddings to obtain a graph-level representation D2, which summarizes atomic-level information into a global structural embedding.
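The GAT attention steps described above (coefficient, softmax normalization, weighted aggregation) can be sketched for a single node in NumPy. The shapes, random initialization, and ReLU nonlinearity are illustrative assumptions, not the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_node_update(h, W, a, i, neighbors):
    """Sketch of the GAT steps for node i:
    e_ij = LeakyReLU(a^T [W h_i || W h_j])  -> attention coefficient
    alpha_ij = softmax over neighbors        -> normalization
    h_i' = ReLU(sum_j alpha_ij W h_j)        -> weighted aggregation."""
    Wh = h @ W.T
    e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]]))
                  for j in neighbors])
    alpha = softmax(e)
    agg = sum(al * Wh[j] for al, j in zip(alpha, neighbors))
    return np.maximum(agg, 0.0)

# Toy graph: node 0 attends over itself and two neighbors.
h = rng.standard_normal((3, 4))   # 3 atoms, 4-dim GCN embeddings
W = rng.standard_normal((8, 4))   # shared linear transform
a = rng.standard_normal(16)       # attention vector, size 2 * 8
h0_new = gat_node_update(h, W, a, i=0, neighbors=[0, 1, 2])
print(h0_new.shape)  # (8,)
```

Mean pooling the updated node vectors over all atoms would then yield the graph-level embedding D2.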
To construct the final drug representation, we concatenate the fingerprint-based feature vector D1 with the graph-level embedding D2, and apply a fully connected transformation with a non-linear activation:
| D = ReLU(Wfusion[D1‖D2] + bfusion) | (11) |
Traditional protein encoding methods often assume uniform contributions from all residues, ignoring the functional diversity among amino acids. To address this, we introduce the DRN, which applies a learnable modulation matrix G to assign distinct weights to each residue across its feature dimensions, where each element g of G lies in [γmin, 1.0]. In practice, G is initialized as a channel-wise learnable mask that is jointly optimized with the network parameters, enabling the model to adaptively scale each residue's embedding according to its contextual relevance. This lightweight parameterization ensures fine-grained weighting without adding architectural complexity. The reweighted feature is computed as:
| p′ = G⊙p | (12) |
The weighted embeddings p′ are passed to a self-attention module to capture global dependencies among residues:
| P̃ = softmax(QPKPᵀ/√dk)VP | (13) |
where QP, KP, and VP are linear projections of the weighted embeddings p′. To further capture local structural interactions, we apply three layers of 1D convolution after attention:
| C0 = P̃ᵀ, Ci = ReLU(Conv1D(Ci−1, Wi)), i = 1, 2, 3 | (14) |
Finally, global max pooling is applied to the output of the final convolutional layer to capture the most salient activations across channels, yielding the protein-level embedding P. This representation effectively integrates both global sequence dependencies and local structural features, and serves as input for downstream DTA prediction tasks.
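The residue reweighting of eq (12) can be sketched as follows; clamping the raw mask into [γmin, 1.0] is one plausible way to enforce the stated range, and the sizes and γmin value are illustrative assumptions.

```python
import numpy as np

def drn_reweight(p, G, gamma_min=0.1):
    """Sketch of eq (12): residue-wise modulation p' = G ⊙ p,
    with each gate value clamped to [gamma_min, 1.0] so no residue
    is amplified beyond its original scale or fully suppressed."""
    G_clamped = np.clip(G, gamma_min, 1.0)
    return G_clamped * p                       # element-wise product

# Toy protein embedding: 5 residues with 8-dim features.
rng = np.random.default_rng(1)
p = rng.standard_normal((5, 8))
G = rng.uniform(-0.5, 1.5, size=(5, 8))        # raw learnable mask
p_prime = drn_reweight(p, G)
print(p_prime.shape)  # (5, 8)
```

Because the clamped gates never exceed 1.0, the modulation can only attenuate features, which matches the goal of damping uninformative residues rather than inflating arbitrary ones.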
| ĥ(l) = ReLU(W(l)ĥ(l−1) + b(l)), ĥ(0) = [D‖P], l = 1, …, L | (15) |
| ŷ = ĥ(L) | (16) |
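The prediction head of eqs (15)–(16) amounts to feeding the concatenated drug and protein vectors through fully connected layers. A minimal NumPy sketch (layer widths and initialization are illustrative assumptions):

```python
import numpy as np

def mlp_predict(D, P, weights, biases):
    """Sketch of eqs (15)-(16): concatenate the drug and protein
    embeddings, apply ReLU-activated fully connected layers, and
    read the scalar affinity from the final layer."""
    h = np.concatenate([D, P])                 # h^(0) = [D || P]
    for l, (W, b) in enumerate(zip(weights, biases)):
        h = W @ h + b
        if l < len(weights) - 1:               # no activation on output layer
            h = np.maximum(h, 0.0)
    return h.item()                            # y_hat = h^(L)

rng = np.random.default_rng(2)
D, P = rng.standard_normal(16), rng.standard_normal(16)
weights = [rng.standard_normal((8, 32)), rng.standard_normal((1, 8))]
biases = [np.zeros(8), np.zeros(1)]
y_hat = mlp_predict(D, P, weights, biases)
print(type(y_hat).__name__)  # float
```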
MolRes-DTA is optimized using the mean squared error (MSE) loss function:
| L = (1/N) Σi=1N (yi − ŷi)² | (17) |
MSE is a widely used metric in regression tasks that measures the squared difference between predicted values and the ground truth. It is defined as:
| MSE = (1/N) Σi=1N (yi − ŷi)² | (18) |
where yi is the true affinity value, ŷi is the corresponding predicted value, and N is the total number of samples. A lower MSE indicates a better model fit. However, this metric is sensitive to outliers and thus should be interpreted in conjunction with other evaluation metrics.
Concordance Index (CI) assesses the consistency between the predicted and actual value rankings, evaluating the model's ability to correctly order sample pairs. It is defined as:
| CI = (1/Z) Σyi>yj h(ŷi − ŷj) | (19) |
where Z is a normalization constant equal to the number of sample pairs with distinct true values, and h(x) is a step function:
| h(x) = 1 if x > 0; 0.5 if x = 0; 0 if x < 0 | (20) |
CI ranges from 0 to 1, with higher values indicating stronger ranking performance of the model.
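The MSE and CI definitions above (eqs (18)–(20)) can be implemented directly; the quadratic pairwise loop below is a readability-first sketch rather than an optimized evaluation routine.

```python
import numpy as np

def mse(y, y_hat):
    """Eq (18): mean squared error."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return float(np.mean((y - y_hat) ** 2))

def concordance_index(y, y_hat):
    """Eqs (19)-(20): over all pairs with y_i > y_j, count 1 for a
    correctly ordered prediction pair, 0.5 for a tie, 0 otherwise,
    and normalize by the number of comparable pairs Z."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    num, denom = 0.0, 0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                    # comparable pair
                denom += 1
                if y_hat[i] > y_hat[j]:
                    num += 1.0                 # h(x) = 1, x > 0
                elif y_hat[i] == y_hat[j]:
                    num += 0.5                 # h(x) = 0.5, x = 0
    return num / denom

y_true = [5.0, 6.0, 7.0, 8.0]
y_pred = [5.1, 6.2, 6.9, 8.3]
print(concordance_index(y_true, y_pred))  # 1.0 (every pair correctly ranked)
print(mse(y_true, y_pred))                # 0.0375
```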
rm2 is a metric designed to evaluate the degree of regression toward the mean and to assess the external predictive ability of the model. It helps determine whether the model suffers from overfitting or underfitting. It is computed as follows:
| rm2 = r2 × (1 − √(r2 − r02)) | (21) |
where r2 and r02 denote the squared correlation coefficients between the observed and predicted values with and without intercept, respectively.
| Hyperparameter | Setting |
|---|---|
| Learning rate | 0.0001 |
| Dropout rate | 0.1 |
| GCN layers | 3 |
| MLP layers | 2 |
| Transformer layers | 3 |
| Attention heads | 8 |
| Optimizer | Adam |
| Batch size | 256 |
| Epochs | 300 |
| Method | MSE (↓) | CI (↑) | rm2 (↑) |
|---|---|---|---|
| KronRLS40 | 0.379 | 0.781 | 0.407 |
| SimBoost41 | 0.282 | 0.872 | 0.644 |
| DeepDTA18 | 0.261 | 0.878 | 0.630 |
| DeepCDA42 | 0.248 | 0.891 | 0.649 |
| MATT-DTI30 | 0.229 | 0.890 | 0.682 |
| GraphDTA23 | 0.233 | 0.890 | 0.747 |
| GDilatedDTA31 | 0.232 | 0.885 | 0.686 |
| TEFDTA22 | 0.199 | 0.890 | 0.756 |
| PocketDTA27 | 0.177 ± 0.013 | 0.903 ± 0.005 | 0.731 ± 0.017 |
| AttentionMGT-DTA35 | 0.193 ± 0.001 | 0.891 ± 0.005 | 0.699 ± 0.027 |
| LLMDTA36 | 0.226 ± 0.001 | 0.884 ± 0.001 | 0.717 ± 0.010 |
| MolRes-DTA (ours) | 0.167 ± 0.001 | 0.908 ± 0.001 | 0.784 ± 0.004 |
| Method | MSE (↓) | CI (↑) | rm2 (↑) |
|---|---|---|---|
| KronRLS | 0.411 | 0.782 | 0.342 |
| SimBoost | 0.222 | 0.836 | 0.629 |
| DeepDTA | 0.194 | 0.863 | 0.673 |
| DeepCDA | 0.176 | 0.890 | 0.682 |
| MATT-DTI | 0.151 | 0.889 | 0.756 |
| GraphDTA | 0.158 | 0.887 | 0.674 |
| GDilatedDTA | 0.156 | 0.876 | 0.775 |
| TEFDTA | 0.184 | 0.860 | 0.731 |
| PocketDTA | 0.140 ± 0.004 | 0.892 ± 0.002 | 0.771 ± 0.011 |
| AttentionMGT-DTA | 0.140 ± 0.001 | 0.893 ± 0.001 | 0.786 ± 0.018 |
| LLMDTA | 0.162 ± 0.001 | 0.872 ± 0.001 | 0.768 ± 0.001 |
| MolRes-DTA (ours) | 0.147 ± 0.001 | 0.893 ± 0.001 | 0.778 ± 0.006 |
In contrast, the KIBA dataset presents greater challenges due to its broader distribution of binding affinities. Even under these more complex conditions, MolRes-DTA attains an MSE of 0.147, outperforming TEFDTA (0.184) and GDilatedDTA (0.156) with relative reductions of 20.1% and 5.8%. While PocketDTA and AttentionMGT-DTA obtain slightly lower MSE values, MolRes-DTA attains a tied-best CI of 0.893 and a competitive rm2 of 0.778, indicating that the predictive ranking and reliability of the model remain state-of-the-art. Combined with the leading results on the Davis dataset, these findings highlight the robustness and general applicability of MolRes-DTA across affinity datasets of varying complexity. Additional significance analyses are presented in Table S3.
To further validate the prediction accuracy of MolRes-DTA, we analyzed the distribution of predicted and actual values. In Fig. 3(a), the Davis dataset results indicate that most data points are highly concentrated around the reference line, especially in the affinity range of [5, 7], where the predictions closely match the true values. This suggests that MolRes-DTA exhibits a strong fitting capability in this interval and accurately captures the interactions of common binding strength samples. In the low-affinity region (<6), the model still demonstrates relatively stable predictions, although some deviations are observed. This may be partly attributed to the fact that affinity values below 5 were rounded to 5 during the dataset construction process, thereby reducing the prediction accuracy of this region.41
Fig. 3(b) presents the results on the KIBA dataset, where the predicted values exhibit a similarly strong linear trend and are densely distributed within the range [10, 14], after filtering out individual samples with very low affinity to improve visualization. Compared to the Davis dataset, fewer outliers are observed in KIBA, highlighting the robustness and generalization capability of MolRes-DTA under larger and more complex distribution scenarios. Overall, the compact scatter distribution and consistent trends across both datasets confirm the high predictive precision and stability of the model in binding affinity prediction tasks.
We further evaluate the generalization ability of MolRes-DTA on the BindingDB dataset, which contains a wider range of drug-target combinations and more diverse molecular structures than the previous two benchmark datasets. To ensure a fair comparison, we followed the experimental settings reported in TEFDTA. As summarized in Table 4, MolRes-DTA consistently outperforms all baselines across evaluation metrics, achieving an 18.1% reduction in MSE relative to TEFDTA, along with notable improvements in CI and rm2. These results confirm that the model maintains strong performance under more complex structural variations and data distributions, supporting the overall effectiveness of its multiview fusion and residue-aware representation strategies.
| Method | MSE (↓) | CI (↑) | rm2 (↑) |
|---|---|---|---|
| DeepDTA | 0.812 | 0.795 | 0.618 |
| DeepCDA | 0.832 | 0.811 | 0.628 |
| TEFDTA | 0.701 | 0.814 | 0.631 |
| MolRes-DTA | 0.573 ± 0.002 | 0.837 ± 0.001 | 0.727 ± 0.002 |
As shown in Table 5, MACCS + graph consistently outperforms unimodal variants, confirming the effectiveness of the multiview drug representation. When the residue-aware module is incorporated, the performance improves markedly, indicating that residue-level modeling contributes to more discriminative protein features. Upon further integration of the attention mechanism, the model achieves the best overall performance, suggesting that attention enhances residue-level interactions by adaptively focusing on functionally relevant regions. This trend is also clearly illustrated in Fig. 4, showing that incorporating additional data modalities and structural representations accelerates convergence and reduces final loss. In addition, we evaluated Morgan (ECFP4) fingerprints in place of MACCS. Results are provided in Table S4, confirming competitive performance across standard fingerprint choices.
| Method | MSE (↓) | CI (↑) | rm2 (↑) |
|---|---|---|---|
| Only MACCS | 0.228 | 0.882 | 0.730 |
| Only graph | 0.221 | 0.890 | 0.747 |
| MACCS + graph | 0.190 | 0.902 | 0.763 |
| +Residue-aware | 0.178 | 0.906 | 0.772 |
| +Attention | 0.168 | 0.906 | 0.778 |
The introduction of the residue-aware module initially leads to slight fluctuations in the loss curve, likely due to the additional learnable parameters and the adjustment process of residue-specific features. However, as training progresses, the model with residue-aware encoding consistently surpasses its baseline counterpart, confirming the long-term benefits of modeling residue-specific variation across protein sequences. Finally, the full MolRes-DTA model achieves both faster convergence and lower minimum loss, demonstrating the synergistic advantages of multiview drug representation and residue-aware protein encoding while also confirming the overall stability and robustness of the model. To further validate the rationality of our model's design, we additionally performed an ablation study on various feature fusion strategies employed in MolRes-DTA, with results detailed in Table S5.
Based on this classification, we compared the model's performance on different molecular sizes with GraphDTA and TEFDTA. As illustrated in Fig. 6, all three models exhibit a general trend of increasing MSE as molecular size increases.
The performance of GraphDTA shows a linear degradation with the increase of molecular size. This suggests that its representational capacity becomes increasingly strained as the molecular graph grows in complexity. Additionally, its performance on small-sized compounds is also relatively poor. This aligns with our hypothesis that when molecular graphs are small, GNN-based models capture limited local structural information, and the absence of global semantic context further constrains predictive capability.
TEFDTA performs more robustly on small compounds, benefiting from its fingerprint-based representation that captures compact pharmacophoric patterns. However, its performance drops significantly for medium and large compounds. This decline reflects the limitations of its fixed-length fingerprint representation, which lacks the structural flexibility to scale with molecular complexity, resulting in the loss of critical topological information.
By contrast, MolRes-DTA consistently achieves lower MSE across all molecular size groups and exhibits particularly notable improvements in the medium and large subsets. These results demonstrate the effectiveness of our multiview fusion strategy, which integrates atom-level topological connectivity with global pharmacophoric context. This hybrid representation not only preserves TEFDTA's strength on small compounds but also introduces the structural adaptability required to model larger and more complex molecules, as shown in Table S6. The observed improvements affirm the robustness and generalizability of MolRes-DTA in addressing molecular size-dependent prediction challenges in DTA tasks.
To further validate these findings, we additionally analyzed molecular size effects on the KIBA dataset (see Fig. S2 and Table S7), which encompasses a broader spectrum of protein families than Davis, and obtained consistent trends. Moreover, we explored alternative molecular size definitions, including molecular surface area, van der Waals volume, and molecular weight. Results from these analyses also support the conclusion that predictive accuracy decreases as molecular size increases, with the most pronounced differences observed between small and medium molecules. Detailed results are provided in Tables S8–S10.
However, the underlying mechanism driving this performance trend warrants further investigation. Although large molecules are relatively less frequent, the observed performance degradation cannot be fully explained by data imbalance alone. A likely reason is the inherently greater structural complexity and higher functional diversity of large molecules, which increases prediction difficulty. Overall, this consistent pattern across models underscores the significant impact of molecular size on prediction accuracy.
For analysis of weight visualization, the residue importance scores produced by DRN were mapped onto the corresponding experimental 3D structures for each target, and the resulting spatial distributions were visualized using PyMOL. Residues were colored from white to blue to indicate increasing absolute weight.
Under this analysis pipeline, all three representative pairs (SRC-Dasatinib, FLT3-Lestaurtinib, and TRKA-Sorafenib) consistently exhibited higher DRN-assigned importance within residues that directly interact with the ligand, as determined by AutoDock Vina docking simulations (Fig. 7). This demonstrates that DRN effectively captures meaningful structural cues and highlights functionally relevant regions along the protein sequence. The predicted affinities alongside docking-derived binding estimates are summarized in Table S11, and the corresponding molecular dynamics simulations for these protein–ligand pairs are shown in Fig. S3.
| ID | Ligand | Target | Measured affinity | Predicted affinity |
|---|---|---|---|---|
| 1033872 | SARS-CoV-2 PLpro inhibitor | Replicase polyprotein 1ab | 5.886 | 5.630 |
| 1370221 | Acsmedchemlett | B-cell lymphoma 6 protein | 6.876 | 5.762 |
| 46562 | ALPRENOLOL | Beta-3 adrenergic receptor | 6.932 | 6.784 |
| 430023 | Benzenesulfonamide | Carbonic anhydrase 7 | 8.301 | 7.265 |
| 391242 | ARC-3430 | cAMP-dependent protein kinase catalytic subunit alpha | 8.387 | 6.371 |
Across these diverse protein families, MolRes-DTA generated affinity estimates that were generally consistent with experimental measurements. Most predictions fell within an acceptable deviation range, indicating that the model possesses robust generalization ability even when confronted with unseen targets.
Our work highlights the importance of integrating molecular graphs and fingerprints to improve molecular representation. Given that drug-target interactions occur within a dynamic three-dimensional context, future research could benefit from incorporating molecular conformations and binding pocket information. Such an integration would enable richer structural and spatial representations, potentially leading to further improvements in prediction accuracy.
Supplementary information (SI): additional experimental results. See DOI: https://doi.org/10.1039/d5dd00365b.
| This journal is © The Royal Society of Chemistry 2026 |