Fathima Ridha, K. Harini, N. R. Siva Shanmugam, Rahul Nikam and M. Michael Gromiha*
Department of Biotechnology, Bhupat and Jyoti Mehta School of Biosciences, Indian Institute of Technology Madras, Chennai 600036, India. E-mail: gromiha@iitm.ac.in
First published on 22nd November 2025
The interaction of proteins with diverse molecular partners, including other proteins, nucleic acids, and carbohydrates, is essential for performing various functions, from signal transduction and gene regulation to immune recognition and cellular transport. These interactions are largely governed by the three-dimensional structures and dynamics of biomolecular complexes, which in turn dictate their binding affinities and functional specificity. While recent advances in AI-driven structure prediction have greatly improved our ability to model such complexes, accurately predicting and engineering their binding affinities remains a key challenge. In this article, we review emerging computational strategies for affinity prediction and rational design across protein–protein, protein–DNA/RNA, and protein–carbohydrate complexes. We discuss the role of machine learning and deep learning in advancing structure-based and sequence-based affinity models, assess current databases and benchmarks, and highlight recent tools for predicting the effects of mutations on binding affinity. We conclude by discussing future opportunities at the intersection of AI, high-throughput screening, and data-driven modeling to enable affinity-guided design of functional biomolecular assemblies.
Binding affinity not only underpins our understanding of molecular recognition and biological function but also plays a decisive role in drug discovery, synthetic biology, and protein engineering.5,6 Even small changes in binding affinity, often caused by mutations at interface residues, can lead to profound changes in cellular function, disease susceptibility, or therapeutic response.7,8 Experimentally, binding affinities are measured using methods such as isothermal titration calorimetry (ITC), surface plasmon resonance (SPR), and fluorescence-based assays. However, these techniques are labor-intensive and not feasible for high-throughput or proteome-scale analyses.8
The field of computational affinity prediction has advanced rapidly, propelled especially by the convergence of machine learning, deep learning, and structure modeling. The introduction of AI-powered structure prediction tools such as AlphaFold2/Multimer9,10 and RoseTTAFoldNA,11 along with more recent developments like AlphaFold3,12 enables atomic-resolution modeling of protein–protein, protein–nucleic acid, and protein–carbohydrate complexes. These advances drive computational workflows that extend beyond structure reconstruction to the estimation and optimization of binding affinities.13 Modern strategies for binding affinity prediction have evolved from classical energy calculations, such as molecular mechanics Poisson–Boltzmann/surface area (MM/PBSA) methods,14 to machine learning/deep learning approaches harnessing sequence, structure, and evolutionary data.15–18 Current state-of-the-art methods apply graph neural networks, transformer-based language models, and diverse ensembles for affinity prediction across complex types.19,20
In parallel, the development of well-curated binding affinity databases such as SKEMPI,21 PDBbind,22 PROXiMATE,23 MPAD,24 ProNAB,25 and ProCaff26 has been pivotal in supporting not only model training and benchmarking but also a deeper understanding of the thermodynamics of molecular recognition. These resources are important for refining energy functions, training machine-learning models, and designing novel protein interfaces.27,28
Despite these advances, accurately predicting binding affinities remains challenging due to the complex interplay of enthalpic and entropic contributions, solvent effects, and conformational flexibility at biomolecular interfaces.29 Traditional methods often struggle to balance computational efficiency with physical accuracy, while machine learning approaches depend heavily on the quality, diversity, and representativeness of available data. Moreover, different types of macromolecular complexes (protein–protein, protein–nucleic acid, and protein–carbohydrate) exhibit distinct physicochemical determinants of binding, further underscoring the difficulty of developing broadly generalizable frameworks for affinity prediction.
Over the past several years, our group has contributed to this field through the development of comprehensive, literature-derived databases of experimentally determined binding affinities for protein–protein,23,24 protein–nucleic acid,25 and protein–carbohydrate complexes,26 which are widely used for benchmarking and training predictive models. In addition, we have developed machine learning and deep learning–based computational tools for predicting both wild-type affinities30–34 and mutation-induced changes in binding affinity of protein complexes.7,20,35–37
In this review, we explore advances in the computational design and prediction of protein complexes with a focus on binding affinity. We describe current strategies for affinity prediction across protein–protein, protein–nucleic acid (DNA/RNA), and protein–carbohydrate complexes, and highlight recent contributions from machine learning and AI. We also survey key databases supporting the field, with special attention to computational methods that predict mutational effects on binding affinity, a crucial need for understanding disease mechanisms and therapeutic engineering. We conclude with a discussion of future directions, including AI-driven structure prediction, high-throughput computational screening, and novel affinity-aware design platforms.
Over the past two decades, several databases have been developed to catalog experimentally measured binding interactions across a wide range of biological systems, namely, protein–protein, protein–nucleic acid, and protein–carbohydrate complexes. A comparison of major binding affinity databases, including their data types, coverage, and features, is provided in Table 1. These resources have become essential not only for uncovering the physical principles of binding but also for enabling computational modeling, mutational analysis, and protein engineering.
| Database | Interaction type | Data available | URL |
|---|---|---|---|
| a Not accessible; last accessed on 05 August 2025. | | | |
| SKEMPI v2.021 | Protein–protein | Wild-type and mutant ΔG for structure-known complexes | https://life.bsc.es/pid/skempi2/ |
| PROXiMATE23 | Protein–protein | Binding affinity (Kd/ΔG/ΔΔG) for complexes with both known and unknown structures | https://www.iitm.ac.in/bioinfo/PROXiMATE/ |
| Affinity Benchmark v5.538 | Protein–protein | Benchmark set of protein–protein complexes with experimentally measured Kd and corresponding calculated ΔG values. | https://bmm.cancerresearchuk.org/~bmmadmin/Affinity |
| AB-Bind39 | Protein–protein | Affinity changes upon mutation (ΔΔG) in antibody–antigen complexes | https://github.com/sarahsirin/AB-Bind-Database/tree/master |
| SAbDab40 | Protein–protein | Structural database of antibody–antigen complexes; a few with affinity data | https://opig.stats.ox.ac.uk/webapps/sabdab-sabpred/sabdab |
| ATLAS41 | Protein–protein | T cell receptor–peptide–MHC (TCR–pMHC) complexes, linking experimentally determined binding affinities with structural information. | https://zlab.umassmed.edu/atlas/web/ |
| PPB-Affinity42 | Protein–protein | Experimental affinities for protein–protein complexes compiled from SKEMPI v2.0, AB-Bind, SAbDab, PDBbind v2020, Affinity Benchmark v5.5, ATLAS | https://github.com/Huatsing-Lau/PPB-Affinity-DataPrepWorkflow |
| MPAD24 | Protein–protein | Binding affinity (Kd/ΔG/ΔΔG) specific for membrane protein–protein complexes, along with membrane-based features | https://web.iitm.ac.in/bioinfo2/mpad/ |
| ProNIT43 | Protein–nucleic acid | Experimentally determined thermodynamic data (Kd, ΔG, ΔH, ΔS) | https://www.rtc.riken.go.jp/jouhou/pronit/pronit.html |
| dbAMEPNI44 | Protein–nucleic acid | Alanine mutations (ΔΔG) | https://zhulab.ahu.edu.cn/dbAMEPNI |
| ProNAB25 | Protein–nucleic acid | Dissociation constant (Kd), free energy change (ΔG), and its change upon mutation (ΔΔG), along with experimental conditions | https://web.iitm.ac.in/bioinfo2/pronab/ |
| PNATDB45 | Protein–nucleic acid | Includes molecular interactions information along with binding affinity | https://chemyang.ccnu.edu.cn/ccb/database/PNAT/ |
| ProCaff26 | Protein–carbohydrate | Dissociation constant (Kd), Gibbs free energy (ΔG, ΔΔG), experimental conditions, sequence, structure, and literature information | https://web.iitm.ac.in/bioinfo2/procaff/ |
| CarbDisMut46 | Protein–carbohydrate | Disease-causing mutations in human carbohydrate-binding proteins and predicted free energy change upon mutations (ΔΔG) using PCA-MutPred | https://web.iitm.ac.in/bioinfo2/carbdismut/ |
| ProCarbDB47 | Protein–carbohydrate | Structural database of protein–carbohydrate complexes, some with affinity/mutation data | https://www.procarbdb.science/procarb/ |
| PDBbind+22 | Protein–protein, protein–nucleic acid, protein–ligand, and nucleic acid–ligand | Binding affinity data for biomolecular complexes in PDB | https://www.pdbbind-plus.org.cn/ |
Early efforts were largely focused on protein–protein interactions, with databases like SKEMPI,21 PDBbind,22 and PROXiMATE,23 providing quantitative affinity data alongside structural information. PROXiMATE, developed by our group, is a curated database of thermodynamic effects of missense mutations in heterodimeric protein–protein complexes, enriched with sequence, structural, and functional annotations; it also includes binding affinity data for homodimeric complexes. These databases are critical for training machine learning prediction models, benchmarking scoring functions, and guiding rational protein design. The reported binding affinity values also allow the relative strength of different complexes to be compared.
Recently, protein–nucleic acid interactions have received growing attention, with resources like PDBbind,22 ProNIT,43 and dbAMEPNI44 compiling measured affinities along with detailed annotations. Although smaller in scale compared to protein–protein datasets, these resources are critical for understanding sequence- and structure-specific recognition of DNA and RNA targets by proteins. To address limitations in coverage and consistency, we developed ProNAB,25 currently the largest database of experimentally measured affinities for wild-type and mutant protein–DNA/RNA complexes. ProNAB enables more systematic analysis of binding energetics and supports the development of predictive models tailored to nucleic acid recognition.48
Protein–carbohydrate interactions, despite their biological significance, have remained underrepresented in quantitative binding datasets. To address this gap, we developed ProCaff,26 the first curated database of experimental binding affinities of protein–carbohydrate complexes and their mutants, collected from the literature. ProCaff is a valuable resource for understanding the importance of specific interactions at protein–carbohydrate interfaces and the underlying recognition mechanisms.
Membrane protein–protein interactions, despite their central role in signaling and therapeutic targeting, remain underrepresented in affinity databases, with data scattered across the literature and no comprehensive resource available until recently. We addressed this with MPAD,24 the first dedicated database of binding affinities for membrane protein complexes and their mutants, featuring over 5400 curated entries along with membrane-specific features. This resource enables the exploration of energetics in membrane protein complexes and the impact of mutations on binding affinity, providing deeper insights into disease mechanisms and supporting the development of more effective, targeted therapies.
To ensure consistency and reliability across our datasets, we adopted a unified curation pipeline across all our databases, including PROXiMATE, ProNAB, ProCaff, and MPAD. Relevant research articles were identified from PubMed and major journal websites using combinations of keywords related to target proteins, structural and functional classes, binding energetics (e.g., Kd, ΔG, ΔΔG), and experimental techniques (e.g., ITC, SPR). Each article was manually reviewed to extract binding affinity values, experimental conditions, structural details (PDB ID), and literature metadata. We also integrated data from existing databases (e.g., PDBbind and SKEMPI) by mapping protein IDs and, wherever available, enriching incomplete records with missing details such as experimental methods and data location, and standardized all thermodynamic parameters to ensure comparability. Additional context-specific annotations, such as membrane-specific features for MPAD, were included to enrich downstream analysis. Finally, each dataset was organized into a user-friendly database featuring robust search, filter, and visualization options, along with provisions for data upload and download, enabling easy access and utility for the broader research community.
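The standardization of thermodynamic parameters mentioned above can be illustrated with the textbook relation ΔG = RT ln(Kd), with Kd expressed relative to a 1 M standard state. The following is a minimal sketch and not the curation code used for any of the databases; the function names, the unit table, the default temperature of 298.15 K, and the sign convention ΔΔG = ΔG(mutant) − ΔG(wild type) are all assumptions made for illustration.

```python
import math

# Gas constant in kcal/(mol*K).
R_KCAL = 1.98720425864083e-3

# Hypothetical unit table: conversion factors from common units to molar (M).
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def kd_to_delta_g(kd_value, unit="M", temperature_k=298.15):
    """Return the binding free energy (kcal/mol) as RT * ln(Kd / 1 M)."""
    kd_molar = kd_value * UNIT_TO_MOLAR[unit]
    if kd_molar <= 0:
        raise ValueError("Kd must be positive")
    return R_KCAL * temperature_k * math.log(kd_molar)


def delta_delta_g(kd_wild, kd_mutant, unit="M", temperature_k=298.15):
    """Assumed convention: ddG = dG(mutant) - dG(wild type).

    Under this convention a positive value means the mutation weakens binding.
    """
    return (kd_to_delta_g(kd_mutant, unit, temperature_k)
            - kd_to_delta_g(kd_wild, unit, temperature_k))


if __name__ == "__main__":
    print(round(kd_to_delta_g(1.0, "nM"), 2))        # free energy for Kd = 1 nM
    print(round(delta_delta_g(1.0, 10.0, "nM"), 2))  # tenfold weaker mutant
```

Because sign conventions for ΔΔG differ between resources, any merged dataset would need to record which convention each source uses before values are compared.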
Altogether, these databases represent significant progress toward making quantitative affinity data accessible across diverse classes of biomolecular complexes. As experimental techniques and computational methods continue to evolve, regular updates and broader coverage will be critical. Looking ahead, an integrative platform combining thermodynamic, structural, and contextual annotations across diverse interaction types would substantially enhance accessibility and interoperability. Mapping all known interaction data for a given protein could offer a comprehensive view of its recognition landscape and support the development of more generalizable predictive models.
Protein–protein binding affinities are commonly predicted by leveraging a combination of sequence information and structure-based features. These predictions are achieved using a range of computational approaches, from traditional machine learning algorithms to more recent deep learning frameworks. Sequence-based methods typically utilize fundamental physicochemical amino acid properties, sequence conservation, and residue-level probability estimates for binding site involvement. Yugandhar and Gromiha15 developed PPA-Pred, a sequence-based prediction method, and demonstrated that stratifying datasets according to protein functional classes led to a notable improvement in predictive accuracy.
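As an illustration of what simple sequence-derived descriptors look like, the sketch below computes amino acid composition and the mean Kyte-Doolittle hydropathy of a sequence. It is not the PPA-Pred feature set; the function name and the choice of these two descriptors are assumptions made purely for illustration.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(sequence):
    """Return composition fractions and mean hydropathy for one sequence.

    Non-standard characters (e.g. 'X' or gaps) are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in HYDROPATHY]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    counts = Counter(residues)
    n = len(residues)
    features = {f"frac_{aa}": counts.get(aa, 0) / n for aa in AMINO_ACIDS}
    features["mean_hydropathy"] = sum(HYDROPATHY[aa] for aa in residues) / n
    return features


if __name__ == "__main__":
    feats = sequence_features("MKTAYIAKQR")
    print(feats["frac_K"], round(feats["mean_hydropathy"], 3))
```

In a practical setting, such per-partner vectors would be computed for both binding partners and combined before being passed to a regression model.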
Structure-based methods have been developed to incorporate geometric and interaction-specific features. These include features such as surface area49 and interfacial contacts,50 which provide detailed insights into the physical nature of protein–protein interfaces. With the advent of deep learning, model performance has improved significantly through the incorporation of graph neural networks, while protein language models, trained on large-scale sequence data, introduce contextual features from raw protein sequences.13,51 Moreover, advances in structure prediction tools such as AlphaFold-Multimer have revolutionized the field by enabling the high-confidence prediction of three-dimensional protein complexes directly from sequence data. These predicted structures can be used to derive structural features, thereby enhancing model robustness and predictive power.30 Most recently, Ridha and Gromiha31 developed a prediction model specific to membrane proteins, utilizing structural and sequence-based features to address the unique challenges posed by these complex structures. Table 2 lists different statistical and machine learning models for predicting binding affinities along with the sequence/structure-based features employed for prediction.
| Tools | Interaction type | Features | URL |
|---|---|---|---|
| PerSpect-EL52 | Protein–protein | Persistent homology and physical properties | https://github.com/ExpectozJJ/PerSpect-Ensemble-Learning |
| PRODIGY53 | Protein–protein | Network of inter-residue contacts and non-interacting surface | https://github.com/haddocking/prodigy |
| PPI-Affinity54 | Protein–protein | Structure-based features (ProtDcal) | https://protdcal.zmb.uni-due.de/PPIAffinity |
| PPA-Pred15 | Protein–protein | Sequence-based features | https://www.iitm.ac.in/bioinfo/PPA_Pred/ |
| ISLAND55 | Protein–protein | Protein sequence information using kernel representation | https://sites.google.com/view/wajidarshad/software |
| PIPR16 | Protein–protein | Robust local features and contextualized information | https://github.com/muhaochen/seq_ppi |
| PPI-Graphomer13 | Protein–protein | Sequence and structural features extracted using ESM2 and ESM-IF1; graph transformer model | https://github.com/xiebaoshu058/PPI-Graphomer |
| DeepPPAPred30 | Protein–protein | Features from the sequence information and predicted three-dimensional structures | https://web.iitm.ac.in/bioinfo2/DeepPPAPred/index.html |
| ProAffinity-GNN51 | Protein–protein | Protein language model and graph neural network (GNN) using structures | https://github.com/legendzzy/ProAffinity-GNN |
| ProBAN18 | Protein–protein | Location of atoms and their abilities to participate in various types of interactions | https://github.com/EABogdanova/ProBAN |
| AREA-AFFINITY49 | Protein–protein | Geometric characteristics such as area (both interface and surface areas) | https://affinity.cuhk.edu.cn/ |
| MPA-Pred31 | Protein–protein | Specific for membrane protein complexes, sequence and structure-based features | https://web.iitm.ac.in/bioinfo2/MPA-Pred/ |
| DNAffinity56 | Protein–DNA | Molecular dynamics simulation-based features | https://github.com/Jalbiti/DNAffinity |
| PreDBA57 | Protein–DNA | An ensemble model using sequence and structural features of the protein and DNA | https://predba.denglab.org/ |
| PDA-Pred32 | Protein–DNA | Interaction features, volume and surface area of the interface, DNA base step parameters, and atom contacts | https://web.iitm.ac.in/bioinfo2/pdapred/ |
| emPDBA58 | Protein–DNA | Sequence, structure, and interface features of the complex and the individual partners | https://github.com/ChunhuaLiLab/emPDBA/ |
| PredPRBA59 | Protein–RNA | Interface hydrophobicity, hydration pattern, and change in the conformation due to binding | https://predprba.denglab.org/ |
| PRA-Pred33 | Protein–RNA | Contact-based features, interaction energies, RNA base step parameters, and hydrogen bonding | https://web.iitm.ac.in/bioinfo2/prapred/ |
| PNAB60 | Protein–nucleic acid | Physicochemical properties, protein and nucleic acid sequence-based features | https://pnab.denglab.org/ |
| DeePNAP61 | Protein–nucleic acid | Deep learning method utilizing sequence descriptor of proteins and nucleic acids | http://14.139.174.41:8080/ |
| PCA-Pred34 | Protein–carbohydrate | Structure-based features such as contact potentials, interaction energy, number of binding residues, and contacts | https://web.iitm.ac.in/bioinfo2/pcapred/ |
| SPOT-Struc62 | Protein–carbohydrate | Knowledge-based statistical potential | NA |
| CSM-carbohydrate63 | Protein–carbohydrate | Information on both protein and carbohydrate complementarity, in terms of shape and chemistry, was captured using graph-based structural signatures | https://biosig.lab.uq.edu.au/csm_carbohydrate/ |
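Interfacial contacts, one of the structure-based features discussed for protein–protein complexes, are commonly defined with a distance cutoff between atoms of the two partners. The sketch below counts residue pairs in contact across two chains. The input layout (a dictionary of residue identifiers mapped to atom coordinates), the function name, and the 5 Å cutoff are assumptions for illustration and do not reproduce the definition used by any specific tool in Table 2.

```python
import math


def count_interface_contacts(chain_a, chain_b, cutoff=5.0):
    """Count residue pairs (one residue per chain) in contact.

    chain_a, chain_b: dict mapping a residue id to a list of (x, y, z)
    atom coordinates in angstroms. Two residues are in contact when any
    pair of their atoms lies within `cutoff` angstroms.
    """
    contacts = 0
    for atoms_a in chain_a.values():
        for atoms_b in chain_b.values():
            if any(math.dist(p, q) <= cutoff
                   for p in atoms_a for q in atoms_b):
                contacts += 1
    return contacts


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    chain_a = {1: [(0.0, 0.0, 0.0)], 2: [(20.0, 0.0, 0.0)]}
    chain_b = {10: [(3.0, 0.0, 0.0)], 11: [(50.0, 0.0, 0.0)]}
    print(count_interface_contacts(chain_a, chain_b))
```

The all-against-all loop is quadratic in the number of atoms; for large complexes a spatial index would be a natural replacement, at the cost of an extra dependency.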
The prediction of binding affinities in protein–nucleic acid complexes is also foundational for understanding gene regulation. These prediction methods employ a range of computational approaches, including molecular dynamics simulations, statistical methods, and machine learning techniques.64 Similar to protein–protein interaction studies, structure-based parameters such as buried surface area and interatomic contacts have been identified as key determinants of binding affinity of protein–nucleic acid complexes. In addition, energetic parameters such as contact potentials and electrostatic interactions are related to the binding affinities.32 Further, on the nucleic acid side, features such as base-step parameters, secondary structures, and local structural motifs are known to be important for nucleic acid recognition.33 As a concrete example, Pant et al.64 demonstrated that bicyclo-nucleotide modifications in DNA increase the affinity of protein–DNA complexes. Interestingly, intrinsically disordered regions have been found to significantly influence the binding strength of DNA-interacting proteins.65 Barissi et al.56 developed a random forest model that predicts transcription factor–DNA affinities using structural and mechanical features of DNA, geometry, and flexibility, obtained from molecular dynamics simulations. Recently, leveraging the development of the ProNAB database, a deep learning method, DeePNAP,61 has been developed using the sequence descriptors of proteins and nucleic acids.
Attention is now shifting towards the prediction of protein–carbohydrate binding affinity. Initially, predictions were based on knowledge-based statistical potentials.62 Later, the development of databases such as ProCaff and ProCarbDB accelerated the application of machine learning methods in this field. These methods mainly focus on the energetics of interactions34 and interface contacts. Nguyen et al.63 developed a prediction method based on the geometry and chemistry of interactions between the molecules using graph-based signatures. The performance of binding affinity predictions can be substantially improved through the integration of expanded high-quality experimental datasets and algorithms that better capture the structural and physicochemical determinants of molecular interactions.
Before the advent of machine learning and AI-based approaches, predictions of mutational effects focused on physics-based and knowledge-based methods. Methods like FoldX67 use empirical force fields to estimate free energy changes, while BeAtMuSiC,68 BindProf,69 and BindProfX70 leverage structural, energetic, and evolutionary features, focusing on statistical potentials rather than purely data-driven learning approaches. These traditional approaches laid the groundwork for the development of AI-driven models. Table 3 summarizes key computational methods for predicting mutation-induced binding affinity changes, including both traditional and AI-based approaches.
| Tools | Interaction type | Features | URL |
|---|---|---|---|
| BeAtMuSiC68 | Protein–protein | Statistical potentials derived from structure | https://babylone.ulb.ac.be/beatmusic/index.php |
| BindProf69 | Protein–protein | Interface structure profile, physics-based potentials, and sequence-based profile | https://zhanglab.ccmb.med.umich.edu/BindProf/ |
| BindProfX70 | Protein–protein | Interface profile and FoldX physics potential | https://zhanglab.ccmb.med.umich.edu/BindProfX/ |
| MutaBind271 | Protein–protein | van der Waals energy, solvation energy, unfolding free energy, SASA, conservation score and interfacial contacts | https://lilab.jysw.suda.edu.cn/research/mutabind2/ |
| iSEE72 | Protein–protein | Interface structure profile, evolution and energy-based features | https://github.com/haddocking/iSee |
| mCSM-PPI273 | Protein–protein | Graph-based signature, evolutionary information, complex network metrics, and energetic terms | https://biosig.lab.uq.edu.au/mcsm_ppi2/ |
| TopNetTree74 | Protein–protein | Persistent homology-based topological descriptors and CNN-derived features | https://codeocean.com/capsule/2202829/tree/v1 |
| SAAMBE-3D75 | Protein–protein | Knowledge-based features from the mutation site environment | https://compbio.clemson.edu/saambe_webserver/ |
| GeoPPI76 | Protein–protein | Geometric deep features from structure | https://github.com/Liuxg16/GeoPPI |
| DDMut-PPI17 | Protein–protein | ProtT5 embeddings and interaction-type graph edges | https://biosig.lab.uq.edu.au/ddmut_ppi/ |
| ProAffiMuSeq7 | Protein–protein | Amino acid properties, PSSM, interface-specific indices and protein functional classes | https://web.iitm.ac.in/bioinfo2/proaffimuseq/ |
| PANDA77 | Protein–protein | Amino acid composition, conservation score, physicochemical properties | https://pandaaffinity.pythonanywhere.com/ |
| SAAMBE-SEQ78 | Protein–protein | Evolutionary, sequence, and physicochemical features of the mutation site | https://compbio.clemson.edu/saambe_webserver/indexSEQ.php |
| DeepPPAPredMut20 | Protein–protein | Physicochemical, evolutionary, and graph-based features | https://web.iitm.ac.in/bioinfo2/DeepPPAPredMut/ |
| MPA-MutPred35 | Membrane protein–protein | Electrostatic interaction, SASA, conservation score and interfacial contacts | https://web.iitm.ac.in/bioinfo2/MPA-MutPred/ |
| PremPRI79 | Protein–RNA | Interface interactions and graph-based features | https://lilab.jysw.suda.edu.cn/research/PremPRI/ |
| PremPDI80 | Protein–DNA | Molecular mechanics, statistical potentials and accessibility | https://lilab.jysw.suda.edu.cn/research/PremPDI/ |
| SAMPDI-3Dv281 | Protein–DNA | Structural features and knowledge-based terms (protein and DNA) | https://compbio.clemson.edu/SAMPDI-3D/ |
| mCSM-NA82 | Protein–NA | Graph-based signatures utilizing the encoded amino acid residue | https://biosig.lab.uq.edu.au/mcsm_na/ |
| PEMPNI83 | Protein–NA | Energy-based and structural interface features, such as contacts and residue-nucleotide pairs | https://liulab.hzau.edu.cn/PEMPNI |
| PRA-Mut-Pred36 | Protein–RNA | Structural, energy-based, and network-based features with a support vector algorithm | https://web.iitm.ac.in/bioinfo2/pramutpred/ |
| PCA-MutPred37 | Protein–carbohydrate | Sequence and structure-based features using multiple linear regression techniques | https://web.iitm.ac.in/bioinfo2/pcamutpred |
The predictive accuracy of machine learning (ML) models for estimating binding affinity changes upon mutation depends on two main factors: the choice of features and the algorithmic architecture. Early efforts in this domain relied heavily on features derived from sequences or experimental structures. These features included physicochemical descriptors of mutated residues, changes in solvent-accessible surface area, hydrogen bonding patterns, and electrostatic potentials at the interface. Energetic terms, such as van der Waals contributions or binding free energy approximations obtained from empirical force fields (e.g., FoldX,67 Rosetta84), were often integrated to enhance biophysical interpretability. Sequence-based features, particularly those capturing evolutionary conservation (e.g., position-specific scoring matrices), provided an orthogonal source of information and proved especially useful in identifying mutation-sensitive hotspots at conserved interfaces.7,71
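Evolutionary conservation at a mutated position can be approximated in several ways. The sketch below uses one minus the normalized Shannon entropy of an alignment column, which is a simpler proxy than the position-specific scoring matrices mentioned above and is not the scheme used by any particular tool. The function name and the handling of gaps are assumptions.

```python
import math
from collections import Counter


def column_conservation(column):
    """Return a conservation score in [0, 1] for one alignment column.

    The score is 1 - H / log2(20), where H is the Shannon entropy of the
    residue frequencies. Gap characters and other non-letters are ignored.
    A fully conserved column scores 1.0.
    """
    residues = [c for c in column.upper() if c.isalpha()]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return 1.0 - entropy / math.log2(20)


if __name__ == "__main__":
    # Columns are given as strings, one character per aligned sequence.
    print(column_conservation("AAAA"))
    print(round(column_conservation("AG-A"), 3))
```

A score of this kind would typically be one entry in a larger feature vector alongside structural and energetic terms.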
With the emergence of deep learning, the emphasis has shifted from manual feature engineering to data-driven representation learning. Structure-based models often employ convolutional neural networks (CNNs) to capture spatial and geometric patterns around mutation sites, incorporating both local and topological information.74 More recently, graph neural networks (GNNs) have offered a more flexible framework for representing biomolecules as graphs, where nodes correspond to atoms or residues and edges capture interatomic interactions.76 Transformer-based models, originally developed for natural language processing, have also been repurposed to learn contextual embeddings of protein and nucleic acid sequences. When pretrained on large-scale sequence databases, these models (e.g., ProtBert, ProtT5,85 ESM86) can infer structural and evolutionary constraints implicitly, enabling them to generalize to unseen mutations. Hybrid models that combine these sequence embeddings with structural features,17 either explicitly or via attention mechanisms, have demonstrated improved generalization in cross-domain applications.
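The graph representation described above, with residues as nodes and spatial proximity as edges, can be sketched as follows. This is only the graph-construction step that would precede a graph neural network; the one-atom-per-residue input, the 8 Å cutoff, and the function name are assumptions for illustration rather than the setup of any cited model.

```python
import math


def build_residue_graph(residues, cutoff=8.0):
    """Build a residue-level graph from representative-atom coordinates.

    residues: list of (residue_name, (x, y, z)) tuples, one representative
    atom per residue (for example the C-alpha atom).
    Returns (nodes, edges): nodes is the list of residue names and edges is
    a list of (i, j, distance) with i < j for pairs within `cutoff` angstroms.
    """
    nodes = [name for name, _ in residues]
    edges = []
    for i in range(len(residues)):
        for j in range(i + 1, len(residues)):
            distance = math.dist(residues[i][1], residues[j][1])
            if distance <= cutoff:
                edges.append((i, j, distance))
    return nodes, edges


if __name__ == "__main__":
    # Toy coordinates, not taken from any real structure.
    toy = [("ALA", (0.0, 0.0, 0.0)),
           ("GLY", (3.8, 0.0, 0.0)),
           ("LYS", (30.0, 0.0, 0.0))]
    nodes, edges = build_residue_graph(toy)
    print(nodes, edges)
```

Node attributes such as language-model embeddings and edge attributes such as interaction type could then be attached to the same index structure.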
A key development in recent years has been the integration of structure prediction pipelines into affinity prediction workflows. Tools such as AlphaFold-Multimer10 and RoseTTAFoldNA11 are now commonly used to generate mutant complex models, which serve as inputs for downstream feature extraction. These approaches have enabled mutation scanning even in the absence of high-resolution structural data, broadening the applicability of ML methods to underrepresented systems such as membrane protein–protein or protein–glycan complexes.
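The first step of such a mutation scan is simply enumerating the mutant sequences to be modeled. The sketch below generates every single-point substitution at a set of chosen positions; the resulting sequences would then be passed to a structure predictor, which is not called here. The function name, the 1-based position convention, and the mutation label format are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def enumerate_point_mutations(sequence, positions):
    """Yield (label, mutant_sequence) for all substitutions at `positions`.

    positions are 1-based. Labels follow the wild-type/position/mutant
    pattern, e.g. 'K2A' for lysine at position 2 replaced by alanine.
    """
    for pos in positions:
        if pos < 1 or pos > len(sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        wild = sequence[pos - 1]
        for aa in AMINO_ACIDS:
            if aa == wild:
                continue  # skip the identity substitution
            mutant = sequence[:pos - 1] + aa + sequence[pos:]
            yield f"{wild}{pos}{aa}", mutant


if __name__ == "__main__":
    for label, mutant in list(enumerate_point_mutations("MKT", [2]))[:3]:
        print(label, mutant)
```

Restricting `positions` to interface residues keeps the number of structure predictions manageable, since each position contributes 19 mutants.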
Despite substantial progress, challenges remain. Available datasets are limited in size and diversity, with an overrepresentation of mutations that cause minimal changes in binding affinity. Structural complexity, such as nucleic acid flexibility and glycan dynamics, is typically underrepresented in models, which often assume static interfaces. These limitations hinder generalization, especially across diverse mutation types and binding mechanisms, underscoring the need for broader datasets and more adaptable modeling strategies.
Protein–DNA/RNA complexes are pivotal for gene regulation, epigenetics, and cellular signaling, and accurate three-dimensional structure prediction of these complexes enhances our understanding of their atomic-level recognition, molecular functions, and binding affinities. Further, this understanding holds significant potential in computer-aided and structure-based drug design. Protein–nucleic acid complexes are mainly generated using template-based docking. In addition, ab initio and machine learning approaches are also used for modeling their three-dimensional structures.92 Unlike protein–protein interactions, rational design of protein–DNA/RNA complexes remains largely unachieved due to nucleic acid flexibility and conformational heterogeneity (see Section 6 for further discussion).
Protein–carbohydrate interactions play a major role in inflammation, cell proliferation, differentiation, aggregation, signal transduction, host–pathogen recognition, and protein structure stabilization. Computational methods enable the study of diverse carbohydrate systems, providing insights into their structures, dynamics, and interactions.27 However, modeling protein–carbohydrate complexes remains challenging due to low affinity, multivalency, and structural heterogeneity, as many carbohydrate-binding proteins, including lectins and adhesins, achieve specificity by binding multiple identical glycoside units arranged in distinct patterns.
In essence, computational design of protein–protein complexes has achieved notable success. In contrast, designing protein–DNA/RNA and protein–carbohydrate complexes remains highly challenging due to conformational flexibility, multivalency, and structural heterogeneity. The underlying difficulties and potential strategies to overcome these challenges are discussed in detail in the next section, highlighting directions for future research in the field.
A major limitation in current approaches is the quality and diversity of available training data. For protein–protein systems, while curated databases such as SKEMPI and PROXiMATE have enabled affinity prediction and mutation effect estimation, biases in sequence diversity and mutation type, especially overrepresentation of alanine scanning, remain a concern. To advance predictive modeling, there is a growing need for machine-learning-grade datasets93 that are large in size, diverse, well-annotated, standardized, and curated to capture the biochemical and structural complexity necessary for training robust and generalizable models. Such a volume of data will become increasingly feasible with advances in high-throughput experimental methods, such as deep mutational scanning and multiplexed binding assays, which enable systematic and scalable measurement of binding affinities. Heyne et al.94 developed a novel high-throughput approach for obtaining changes in binding free energy for thousands of protein mutants in a single experiment, opening a new avenue for studying mutation effects in PPIs. The approach combines yeast surface display, deep sequencing, and data normalization, producing affinity measurements comparable to traditional low-throughput methods.
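Dataset biases of the kind noted above, such as the overrepresentation of alanine substitutions and of mutations with near-zero effect, can be quantified before model training. The sketch below is one way to do so for a list of single-point mutations. The record format, the function name, and the 0.5 kcal/mol threshold for a near-neutral mutation are assumptions for illustration, not a community standard.

```python
import re

# Single-point mutation labels such as 'R45A' (wild type, position, mutant).
MUTATION_PATTERN = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def summarize_mutation_bias(records, neutral_threshold=0.5):
    """Summarize alanine and near-neutral fractions in a mutation dataset.

    records: iterable of (mutation_label, ddg_kcal_per_mol) pairs.
    Labels that do not match the single-point pattern are skipped.
    """
    total = to_alanine = near_neutral = 0
    for label, ddg in records:
        match = MUTATION_PATTERN.match(label)
        if match is None:
            continue  # multi-point or unparseable entry
        total += 1
        if match.group(3) == "A":
            to_alanine += 1
        if abs(ddg) < neutral_threshold:
            near_neutral += 1
    if total == 0:
        raise ValueError("no single-point mutations found")
    return {
        "n": total,
        "frac_to_alanine": to_alanine / total,
        "frac_near_neutral": near_neutral / total,
    }


if __name__ == "__main__":
    # Toy records, not taken from any database.
    toy = [("R45A", 1.2), ("K10A", 0.1), ("D7N", -0.2), ("R45A/K10A", 2.0)]
    print(summarize_mutation_bias(toy))
```

Summaries of this kind could inform stratified sampling or reweighting, so that a model is not evaluated mainly on the most common mutation types.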
To enable robust predictive and design-driven applications, future computational tools must go beyond static sequence and structure, incorporating key determinants such as post-translational modifications, intrinsic disorder, allosteric regulation, and dynamic conformational states. Bridging these aspects will be critical for advancing from affinity prediction to rational design of synthetic interfaces, therapeutic antibodies, and other engineered biomolecular assemblies. The design of multi-specific binders, interface stabilization, and re-engineering of host–pathogen interactions (e.g., SARS-CoV-2 spike–ACE2, or broadly neutralizing antibodies) are promising areas where predictive modeling could yield tangible impact.
For protein–nucleic acid complexes, challenges are more acute. Homology/template-based docking methods are limited by the available structural templates, ab initio approaches are computationally intensive, and deep learning-based predictions have yet to achieve robust performance in this space. Significant advances have been made in enhancing prediction quality through the selection of optimal parameters and the use of large datasets.92,95 However, performance remains limited, and further developments are needed to accurately predict protein–nucleic acid complex structures, as evidenced by the recent CASP experiments.96
Protein–nucleic acid interactions have been characterized by assessing the relationship between three-dimensional structural features and binding affinity. Although a few methods are available for predicting their binding affinity, performance still needs to improve consistently across different types of complexes, which vary in structure and function. Further, there is an opportunity to develop methods that can predict the binding affinity directly from the protein and nucleic acid sequences. Improved understanding of these interactions could enable the design of high-affinity aptamers that selectively disrupt complex formation, offering potential therapeutic opportunities, for example, by targeting long non-coding RNAs (lncRNAs) and large intergenic non-coding RNAs (lincRNAs) implicated in the epigenetic regulation of cancer-related gene expression.97,98 Moreover, many protein–nucleic acid complexes involve multiple protein subunits (e.g., dimers or trimers) binding to nucleic acids, which must be appropriately accounted for in prediction models. Existing methods predominantly focus on single amino acid mutations, often overlooking the impact of mutations in the nucleic acid. Accurately predicting the effect on binding affinity of mutations at protein–nucleic acid interfaces, including disease-associated variants, is critical for understanding molecular mechanisms and guiding drug discovery.
Protein–carbohydrate interactions present a distinct set of challenges compared to protein–protein or protein–nucleic acid complexes. While tools like AlphaFold3 now enable structural modeling of protein–carbohydrate assemblies, the intrinsic flexibility, multivalency, and structural heterogeneity of glycans complicate both binding prediction and rational design. On biological surfaces, glycans are rarely isolated; instead, they are organized into glycolipid- and glycoprotein-derived clusters that form distinct glyco-surface patches. Together, these patches create a complex 3D glyco landscape, in which specific recognition motifs—termed glycotopes—govern selective binding. Accurately modeling these features, including the multivalent and water-mediated nature of glycan interactions,99 is critical for the successful design of protein–carbohydrate complexes.
Future directions in computational protein–carbohydrate interaction design are increasingly being shaped by AI-driven tools. Deep learning models like GlyNet100 and GlyBERT,101 combined with AlphaFold-like structure prediction methods, are improving the modeling of flexible glycan-binding interfaces. Expanding structural databases with glycan-bound complexes through experiments or AI predictions, integrating physics-based and graph neural network approaches,102,103 and employing deep generative models could accelerate the design of synthetic lectins, glycan sensors, and glycan-targeting proteins with tailored affinity and specificity. Iterative workflows that combine computational modeling with experimental feedback (e.g., glycan arrays or SPR) will further refine predictive accuracy. These strategies are essential for therapeutic applications, enabling the rational engineering of glycan-binding proteins that support the development of antibodies, vaccines, and inhibitors targeting lectins, carbohydrate-active enzymes, glycosaminoglycans, and other glycan-modified biomolecules. Computational workflows that account for the organization of glycans on cell surfaces, their interactions within the extended carbohydrate layer (glyco-canopy), and their multivalent binding will be key to enabling predictive and design-oriented strategies in glyco-engineering.
Looking ahead, the design of biomolecular complexes with tailored binding affinity will increasingly rely on the integration of structure prediction, mechanistic modeling, and data-driven inference. Advances in deep learning models have enabled high-resolution structural modeling across diverse interaction types, yet leveraging these structures for accurate affinity prediction and rational design remains challenging. Progress will depend on the development of standardized, diverse, and well-annotated datasets, particularly for underrepresented systems such as membrane protein–protein and protein–carbohydrate complexes. Incorporating biologically relevant features, such as conformational flexibility, post-translational modifications, and cellular context, will improve model robustness and applicability. Coupling predictive tools with high-throughput experimental platforms will enable iterative, feedback-driven design workflows. Finally, moving beyond static binary interactions toward more complex assemblies will be critical for advancing our ability to engineer functional biomolecular systems.
Ultimately, future progress will depend on unifying structural, thermodynamic, and mutational data into interoperable, diverse, and openly accessible platforms. By addressing current methodological and data limitations, the community can develop more robust models of affinity and expand the reach of computational design to novel molecular functions and therapeutic strategies.
We discussed the progression of affinity prediction methods, highlighting both classical physics-based approaches and the growing impact of machine learning and deep learning models. These tools have been instrumental in predicting the effects of mutations, guiding interface redesign, and informing therapeutic engineering, particularly in antibody–antigen and host–pathogen systems such as SARS-CoV-2. To support these efforts, databases, including those developed by our group, broaden the coverage of affinity data and contribute to a more inclusive and diverse modeling foundation. Looking ahead, the integration of high-resolution structure prediction tools, high-throughput mutational scanning, and context-aware affinity models will enable more realistic and functionally relevant designs. Unified platforms that combine thermodynamic, structural, and functional annotations will be essential to advance modeling capabilities across biomolecular interaction types.
In summary, advancing the computational design of protein complexes will depend not only on algorithmic innovation but also on the quality and diversity of underlying data. By prioritizing integrative, high-fidelity datasets and refining model evaluation strategies, the field is poised to translate affinity prediction into more reliable and application-driven biomolecular design.
This journal is © The Royal Society of Chemistry 2026