Krishna Gopal
Chattaraj
,
Joana
Ferreira
,
Allan S.
Myerson
and
Bernhardt L.
Trout
*
Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: trout@mit.edu; Tel: +1 617 258 5021
First published on 4th March 2025
Antibody-based therapeutics continue to be an important pharmaceutical development modality. Crystallization of antibodies is important for structural characterization, but in addition has the potential for use as a separation method and for use as a dosage form. Nevertheless, bringing about controlled crystallization of an antibody remains a challenging task due to its large size, high degree of segmental flexibility, and the intricacy of all the occurring interactions (e.g., protein–protein interactions, protein–solvent interactions, etc.). Methods to predict important contact sites could help to develop such crystallization methods. However, limited data and understanding have hitherto not allowed the development of such robust methods. This study employs machine learning combined with in silico modelling of crystal structures using available experimental structures to identify the crucial physicochemical features necessary for successful antibody crystallization in an attempt to remedy that gap. The developed method can with good accuracy distinguish crystal-site residues from non-crystal-site residues. A set of 510 descriptors is utilized to characterize each residue, which is treated as a distinct data point. Moreover, new algorithms have been developed to design novel descriptors that improve the model's predictive capabilities. Fragment antigen-binding (Fab) regions are investigated due to the scarcity of full-length monoclonal antibodies (mAbs) crystal structures. The current findings show that the extreme gradient boosting (XGBoost) algorithm effectively identifies crystal site residues, as evidenced by an AUPRC value that is more than 3-fold higher than that of the baseline model. The top-ranked descriptors indicate that crystal-site residues are primarily characterized by solvent-exposed residues with high spatial aggregation propensity (SAP), signifying hydrophobic patches, and their immediate surface-exposed neighbors. Moreover, these high SAP residues are often surrounded by other solvent-exposed residues that are either polar, charged, or both. In contrast, residues not involved in crystal interfaces generally lack these essential features, though some might be excluded due to specific crystal lattice arrangements. Additionally, reducing the feature set from 510 to the top 15% in the XGBoost model yields similar performance while significantly simplifying the model.
Design, System, ApplicationMonoclonal antibodies (mAbs) are pivotal therapeutic agents; however, their inherent structural complexity and instability pose significant challenges for crystallization, purification, and formulation. Effective prediction of critical contact sites is therefore crucial for advancing crystallization techniques. Despite this need, the development of robust, universally applicable methods has been limited by insufficient data and an incomplete understanding of the underlying molecular mechanisms. This study addresses these challenges by introducing a machine learning-based classification framework designed to identify structural features that differentiate crystal-site residues from non-crystal-site residues. Using in silico crystal modeling of experimentally available Fab fragments, this approach systematically analyzes residue-level properties to predict crystallization-prone regions. The study emphasizes structure-based descriptors over sequence-based features, providing deeper insights into the spatial and physicochemical characteristics influencing crystallization. Antigen-binding fragments (Fabs) are employed as a proof-of-concept due to their structural simplicity and availability, laying the groundwork for extending this methodology to full-length mAbs. Immediate applications include guiding site-directed mutagenesis to enhance crystallization propensity and optimizing antibody engineering for structural studies. In the future, these insights can be integrated into predictive tools for large-scale mAb development, facilitating progress in drug discovery, structural biology, and biopharmaceutical formulation. This framework bridges computational and experimental approaches, offering a streamlined strategy to advance mAb research and applications. |
However, mAbs are inherently unstable due to both physical and chemical degradation, especially since mAbs solutions often need to be formulated as concentrations greater than 100 mg mL−1, which can also cause viscosity issues.20 Moreover, mAbs' structural complexity leads to challenges with production,21 including purification. Investigating mAb structure is therefore important. Advanced techniques such as X-ray crystallography (XRD), nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) are critical for determining the three-dimensional structures of mAbs at atomic resolution. These methodologies provide insights into the conformational dynamics, binding interactions, and functional mechanisms of mAbs. Crystallization stands out as a crucial technique, not only in structural biology (determination of 3D structures of proteins) but also in replacing the traditional downstream processing for mAbs, in addition to potential formulation as crystals.8–25
Despite its potential use in processing and product development, mAb crystallization faces several challenges.18 The variability in mAb sequence and structure requires ad hoc development of crystallization protocols mostly relying on trial-and-error approaches, which leads to limited predictability and lack of universality. Additionally, the large molecular weight of mAbs (approximately 150 kDa) complicates their organization into a crystalline lattice, while flexible regions often result in inconsistent and irreproducible outcomes.8–19 Identifying properties that facilitate mAb crystallization would be important for creating predictive models to streamline the process, saving time and resources.
In the present work, a novel classification scheme is introduced to identify key features that influence antigen-binding fragments (Fab) crystallization through the application of computational modeling and machine learning. The method can also be used to differentiate between crystal site residues and non-crystal site residues by pinpointing critical structural features and predicting distinct elements that characterize each residue type. Understanding these differences provides valuable insights into the processes involved in antibody crystallization. Antibody fragments were investigated as a proof-of-concept due to the abundance of their structures in the Protein Data Bank: only four full-length structures of monoclonal antibodies have been determined and deposited in the PDB: 1IGT (1997, mouse IgG2), 1IGY (1998, mouse IgG1), 1HZH (2001, human IgG1), and 5DK3 (2015, human IgG4).9 In addition, antibody fragments can offer valuable scaffolding for proteins in the early stages of drug discovery.26 The principles derived from studying Fab fragments can be used to investigate the crystallization behavior of full-length mAbs as similar structural features are involved. Previous scientific investigations have proposed binary classification approaches to determine the critical properties necessary for protein crystallization, typically based on sequence data.27–33 These classification models were designed to distinguish crystallizable proteins from non-crystallizable ones, thereby identifying key features for crystallization derived from protein sequence information. In contrast, in the present study, the proposed approach emphasizes key structure-based features and developing a predictive model to find crystal-prone regions, thus uncovering essential structural attributes. Successful crystallization sometimes depends on protein engineering as a final option to create variants when wild-type proteins fail to crystallize. However, even when wild-type proteins do form crystals, engineering may be required to modify crystal contacts and generate new crystal forms, especially for high-resolution structures needed in drug design to assess interactions between lead compounds and the target protein.34 This study can also be applied to site-directed mutagenesis to increase or decrease the crystallization propensity.
To the best of our knowledge, this is the first work applying a classification strategy for crystal site residues versus non-crystal site residues using machine-learning. The employed methodology is used for all site residues from the experimentally available Fabs through in silico crystal structure modeling.
The manuscript is divided into the following sections: section 2 (Results and discussion) explains the process of generating the dataset for machine learning, the design of the feature set, the key observations, the performance of different machine learning algorithms, top-ranked features, and inferences drawn from the obtained results. Section 3 (Summary and conclusions) summarizes the main results and proposes a crystallization mechanism. Additionally, future applications and perspectives for the current research are also addressed. Lastly, section 4 presents a detailed overview of all the methods and materials used during the computational modeling and application of the machine learning algorithms.
We classified each amino acid residue into two distinct classes: crystal-site versus non-crystal-site. Residues located in the crystal interfaces within the crystal lattice were classified by their reduced solvent accessibility upon interaction, as indicated by a non-zero buried surface area (BSA) and a solvent-accessible surface area (SASA) of 50 Å2 or more. To ascertain the appropriateness of the selected SASA cut-off, the SASA values of fully exposed residues (X) within an alanine-X-alanine (ALA-X-ALA) tri-peptide configurations were developed. In Fig. 1(a), the SASA values of fully exposed residues (X) within an alanine-X-alanine (ALA-X-ALA) tri-peptide residue values is shown. Based on this, for the residues in the Fab's, we set an SASA threshold of 50 Å2, which is also consistent with the literature that shows that solvent exposure of a residue was indicative of its likelihood being within the protein–protein interface.35–39 Mishra et al., suggested that residues were considered surface-exposed if more than 50% of the solvent-accessible surface area of their side-chains was exposed.39 Thus, the more solvent exposure of a residue present in the protein–protein interface, the greater the probability of its involvement in the formation and stabilization of crystalline interfaces (Fig. 1(b)). Also, see Fig. 2 for comprehensive classification scheme implemented in the present study. The residues within 12 Å of the Fab's center of mass, despite having higher SASA values, were excluded from the machine learning (ML) model to avoid false positives due to their low likelihood of appearing at crystal sites, as these residues were typically located deep within the core of the Fab region (Fig. 2).
The structures of Fabs were obtained from the Protein Data Bank (PDB)40 and were selected through the structural antibody database (SAbDab).41 Fabs deposited as monomers were specifically utilized, excluding those complexed as dimers, trimers, tetramers, or other higher-order oligomers, or with other antigens as asymmetric unit. Such choice was driven by our intention to ensure that no potential surfaces were obscured due to complexation or the formation of interfaces. Then, the crystalline symmetric structures of these Fabs were constructed in silico using online available software PDBePISA (Fig. 2).42 Subsequently, crystal and non-crystal site residues were identified.
This led to a dataset of 15744 amino acids (Table S1, ESI†). The classification resulted in 3719 amino acids belonging to the crystal-site category (Fig. 2), the rest are in other classes not (i.e., non-crystal-site category). Despite the imbalance, the size is large enough for our purposes.
To construct the machine learning model, a set of features was developed for each residue, taking into consideration both the residue itself and its neighboring residues, to represent them numerically, as depicted in Fig. 3. The overall approach of this work is illustrated in Fig. 4.
The structure-based features also include spatial aggregation propensity (SAP)43–45 and spatial charge map (SCM)43,45,46 to quantify the surface properties of antibodies. SAP quantifies the surface-exposed hydrophobicity of patches, while SCM assesses the exposed charge of patches. Both these properties play a significant role in facilitating antibody interactions.43–48 This led to a set of 510 features, as summarized in Table 1 and detailed in Tables S2–S10, ESI.†
Residue hydrophobicity (using BM scale) |
Protrusion index49 (protrusion from Fab surface) |
Depth index50 (distance from the closest point on the Fab surface) |
Fractional exposure ‘or’ relative exposure (ratio of the sidechain SASA of residues to the standard side-chain exposure of the residue in Ala-X-Ala) |
Net charge (residue net charge) |
Charge of exposed residues (sum of partial charges of atoms with SASA > 0) |
SAP43–45,48 (measure of surface-exposed hydrophobicity of atoms) |
SCM43,45,46,48 (measure of exposed charge patch) |
SASA and side-chain SASA of various types of amino acids |
Number of various types of amino acids |
Various scales related to protein structural properties involved in protein–protein interaction51 |
SAP_adjacent (pertaining to residues adjacent to hydrophobic patches) |
Other structure-based features were also developed to include neighboring residues that were adjacent to those with higher SAP values, contingent upon certain conditions. Residues exhibiting high SAP values were typically found to be encircled by hydrophobic residues and were predisposed to engage in hydrophobic–hydrophobic interactions. Consequently, it has been demonstrated that when residues with elevated SAP values approach each other within the FAB interface, their immediate neighbors are also likely to be implicated in the interface (Fig. 5(a)). Hence, the identification of residues adjacent to those with high SAP values were crucial. These features were termed as SAP-adjacent (see Tables S8–S10, ESI†). More specifically, the residues that met various SAP cut-off values, along with their adjacent residues, were systematically identified. For example, assuming the i-th residue maintains the SAP value, the neighboring residues of the i-th residue were then selected; specifically, the i + 1 and i − 1 residues, termed as ‘nb’ residues. Additionally, residues within a certain distance from the i-th residue, termed ‘nbh’, were also chosen. Features were then established based on surface exposure for both individual residues and pairs of residues—such as (i and i + 1), (i and i − 1), and (i and inbh). These were defined using the sidechain SASA values, the overall SASA calculated for each residue, and the fractional exposure (i.e., the ratio of each residue's sidechain SASA to the standard sidechain exposure of that residue in a fully exposed Ala-X-Ala tripeptide, as shown in Fig. S1 in the ESI†). If the residues satisfied the specified conditions (see Materials and methods section and Tables S8–S10, ESI†), their corresponding feature values were designated as ‘1’; otherwise, their feature values were set to ‘0’. Consequently, the SAP-adjacent features (see Materials and methods section and Tables S8–S10, ESI†), being binary in nature, could only assume two values, 0 or 1, indicative of the absence or presence of a particular attribute, respectively. More specifically, as per the SAP-adjacent feature, 1 represents residues that were highly likely to be present in crystal sites, while 0 indicates those that were not. The summary of steps followed is provided below:
(a) The i-th residue was selected based on its spatial aggregation propensity (SAP),
(b) Neighbors of the i-th residue (both adjacent (nb) and within a specified radius (nbh)) were identified,
(c) Surface exposure conditions were checked for each pair: (i and i + 1), (i and i − 1), and (i and inbh), along with their individual solvent accessible surface areas,
(d) Feature values for each residue were set to ‘1’ if they met the conditions, and ‘0’ if they did not.
In the engineering of SAP-adjacent features, the combinations of diverse conditions (see Materials and methods section and Tables S8–S10, ESI†) relating to sidechain SASA, fractional exposure, and overall SASA were employed to produce same kind of feature but with different binary values for each residue. Furthermore, the amino acid types of the neighbouring ‘inbh’ residues at different distances from the i-th residue were also varied to assess the impact of specific amino acid characteristics in proximity to residues with elevated SAP values (see Materials and methods section and Tables S8–S10, ESI†).
Model | Accuracy (balanced accuracy) | AUPRC |
---|---|---|
RF | 0.84 (0.80) | 0.72 |
XGBoost | 0.86 (0.84) | 0.77 |
SVM | 0.85 (0.80) | 0.72 |
KNN | 0.83 (0.76) | 0.69 |
MLP | 0.84 (0.79) | 0.70 |
Baseline | 0.76 (0.50) | 0.24 |
A built-in “Gini” importance was applied to the random forest (RF) model as feature importance52 to interpret the models better and identify the top-ranked features (Table 3). This methodology probed Fab crystallization patterns based on the top-most features and led to several notable observations:
• (a) The surface-exposed residues were identified as particularly important. The residues exposed on the surface i.e., mainly those adjacent to hydrophobic residues and possessing a high SAP, were frequently observed at crystal sites.
• (b) The surface-exposed residues situated next to those with a high SAP value also played an important role in the formation of crystalline structures. Moreover, including the properties of the surrounding surface-exposed residues, whether polar, charged, or both, to those with a high SAP, were deemed essential for the integrity of the crystal sites.
• (c) Some residues, including those with crystal-site properties, were still excluded due to the growth of the crystal-lattice with specific symmetry and arrangements.
RF | |
---|---|
Feature names | Importance |
SASA_all_0 Å | 0.033 |
SAP-adjacent-SC_0.15_10–10_10–10_10–10_7_polar-charged | 0.025 |
SAP-adjacent-FE_0.15_5–5%_5–5%_5–30%_5_any-type | 0.017 |
SAP-adjacent-FE_0.15_5–5%_5–5%_5–5%_7_polar-charged | 0.016 |
SAP-adjacent-SC_0.20_10–10_10–10_10–10_7_polar-charged | 0.015 |
SAP-adjacent-FE_0.15_5–5%_5–5%_5–10%_5-hydrophobic-charged-polar | 0.014 |
XGBoost | |
---|---|
Feature names | Importance |
SAP-adjacent-FE_0.15_5–5%_5–5%_5–5%_7_polar-charged | 0.479 |
SASA_all_0 Å | 0.077 |
SASA_gly_0 Å | 0.012 |
SASA_hydrophilic_20 Å | 0.006 |
SAP-adjacent-SC_0.15_10–10_10–10_10–10_7_polar-charged | 0.004 |
Fractional exposure | 0.003 |
“Gain” importance of features (Table 3) was also calculated for the XGBoost model.53 In addition to the conclusions similar to those drawn from the RF model, the XGBoost models' feature ranking analysis also emphasized the importance of surface-exposed glycine residues at the crystal sites as one of the top-most features. Glycine, being the smallest amino acid, is commonly present in the flexible regions of proteins, particularly within the loop regions, as indicated by a previous study.54 It was also discovered that the partially exposed glycine residues on the protein surface in loop regions promoted the formation of crystal-packing contacts.55 Thus, the underscored significance of glycine might suggest that these flexible regions were actively involved in the nucleation process.
In the study reported by Hasegawa et al.,56 the essential role of a specific cluster in influencing the crystallization propensity of certain structures was proposed. This cluster, distinguished by five externally exposed negatively charged residues situated on the complementarity determining region (CDR), was identified as being crucial in steering the self-assembly processes, resulting in augmented crystallization tendencies. Their findings were suggestive of the idea that the juxtaposition of a distinct negative electrostatic patch with neighbouring exposed hydrophobic residues within the fragment variable (Fv) domain might have been recognized as a distinguishing feature for identifying IgG1 isotopes with a heightened probability of crystallization. Similarly, Smejkal et al.,57 proposed that hydrophobic patches, when surrounded by charged residues, might have been instrumental for the crystallization propensities of the Fabs. Furthermore, it was posited by Jean-Philippe Julien et al.58 that the role of surface-exposed hydrophobic patches was not merely peripheral but central to the antibody crystallization mechanism. Electrostatic interactions also played an important role. Additionally, hydrophobic patches on an antibody's surface were frequently cited as significant contributors to the propensity for Fab–Fab interactions.59–62 Other studies also indicated the importance of hydrophobicity in crystallization.57,63–67 Hydrophobic interactions were recognized to effect aqueous assemblies, resulting in protein interactions.68–70 A contemporary study concluded that the most hydrophobic protein patches comprised a notable fraction of polar/charged atoms.68 Protein hot-spots were identified in several previous studies as regions marked by a combination of hydrophobic and polar residues.71–73
Our findings were consistent with the conclusions of R. Lieu et al.74 which showed that the replacement of the human kappa constant domain FG loop with a truncated rabbit loop markedly enhanced the crystallization propensity of Fab: 6WGJ. This loop was found to facilitate the crystallization process through its β-sheet structured interactions.74 In line with these observations, our molecular modelling analysis indicated that the loop “QGTTS” exhibited hydrophobic patches with elevated SAP values (Fig. 5(b)). These patches were enveloped by surface exposed polar and charged residues (Fig. 5(b)), a pattern consistent with our model's feature significance results and other studies.56–76 Thus, our model, trained on the dataset, learns and captures the key features of crystallization at the Fab interface.
Additional feature importance metrics were further examined (Table S11, ESI†) for the XGBoost model, such as “cover”53 and “total gain”,53 to robustly validate the conclusions. Despite slight variations observed across different metrics the main conclusions remained consistent. Notably, the top features identified by these techniques were similar to those of “gain” feature importances (Tables 3 and S11, ESI†). Additionally, SHAP77 feature importance were calculated for both XGBoost and RF models, reinforcing our initial findings with consistent results (Fig. S2, ESI†). In summary, all types of feature importance analyses yielded similar and, consistent conclusions for both the XGBoost and RF models, enhancing the robustness of the conclusions drawn from these models.
It is important to emphasize that the performance of machine learning models is inherently dependent on the hyperparameters chosen during the training and testing phases. These hyperparameters of each model play a critical role in shaping the model's ability to generalize and accurately predict outcomes. In the present study, hyperparameter tuning was conducted for all the tested ML models using “GridSearchCV”52 with details provided in Table S12 of the ESI.† The dataset was divided into two segments by stratified splitting to preserve the class distribution: 90% for hyperparameter tuning and 10% for validation. The 90% hyperparameter tuning portion underwent a stratified 10-fold cross-validation, using the AUPRC score as the evaluation metric to determine the optimal hyperparameters. As we conducted hyperparameter tuning exclusively on the 90% data, carefully ensuring that the validation set remained completely unseen during this process, thus preventing any data leakage. These best-performing parameters were then applied to the previously unseen 10% validation set to assess the model's performance. The AUPRC scores of the optimal model using the cross-validation closely matched those of the validation set (Table 4), demonstrating consistent performance and reducing the risk of overfitting, underscoring the reliability of our conclusions.
Method | Cross-validation set | Validation set |
---|---|---|
XGBoost | 0.76 | 0.77 |
RF | 0.72 | 0.70 |
SVM | 0.71 | 0.74 |
KNN | 0.69 | 0.66 |
MLP | 0.70 | 0.73 |
Henceforth, the validated best-hyperparameters as described in the “Materials and methods” section were utilized to freshly retrain the model from scratch on the entire dataset, employing a 10-fold stratified cross-validation repeated 10 times on the full dataset to ensure the model's robustness and generalizability. The comprehensive performance metrics for various models obtained through this methodology are detailed in Table 2. This approach confirms good utilization of the dataset, thereby enhancing the overall performance and reliability of the model, especially given the imbalanced nature of the data.
The AUPRC values were analyzed for each subset, and it was observed that the model's performance steadily improved as more features were included. Notably, when the top 75 features (i.e., approximately the 15% of complete features) were utilized, the AUPRC reached a value near 0.77, which closely approximated the performance of the full 510-feature model (AUPRC = 0.77, Table 2). This result indicated that features beyond the top 75 contributed minimally to the model's predictive power. The top 75 features are provided in Table S13 of the ESI.†
The analysis demonstrated that a substantial reduction in the feature set could be achieved without a significant loss in performance. Using only the top 75 features (Table S13 in the ESI†) provided several advantages, including reduced model complexity, improved computational efficiency, and increased interpretability. Moreover, by limiting the feature set, the risk of overfitting to noise or irrelevant patterns was minimized. The relationship between the number of top-ranked features and the corresponding AUPRC values is depicted in Fig. 6. Upon examining the top 75 features (Table S13 in the ESI† including the top six features detailed in Table 3), we observed that SAP-adjacent features consistently appeared within this subset. Fractional exposure of amino acids made a significant contribution to the model. Additionally, the solvent-accessible surface area (SASA) of amino acids—including both backbone and side chain components—played a crucial role. Specifically, the SASA of glycine, hydrophobic, non-polar-sulfur-containing, aliphatic, hydrophilic, aromatic, polar uncharged with amide, uncharged polar with hydroxyl group, small, long, and negative amino acids were important in building the classification model (Table S13 in the ESI†). Furthermore, the number of specific amino acids was also a significant factor. Counts of long, proline, hydrophilic, polar, very small, negative, non-polar-sulfur-containing, glycine, aliphatic, aromatic, small, and positive amino acids were influential features. Additionally, the protrusion index and depth index played important roles, along with exposed charge, hydrophobic patches and charged patches, which were crucial to the model's performance (Table S13 in the ESI†). Overall, these findings highlight that both the local structural environment and the specific properties of amino acids—including their spatial arrangement, exposure, and type—are key determinants in the model's predictive ability.
(a) RF was configured with n_estimators = 300, min_samples_split = 5, class_weight = {0:
1, 1
:
1}, max_depth = none, criterion = ‘entropy’,
(b) XGBoost was tuned eval_metric = ‘logloss’, colsample_bytree = 0.8, learning_rate = 0.1, max_depth = 10, n_estimators = 500, subsample = 1, scale_pos_weight = (number of neg)/2 × (number of pos), reg_alpha = 0, reg_lambda = 10, gamma = 0.1,
(c) SVM was implemented using an ‘rbf’ kernel with parameters C = 40.0 and gamma = 0.001, class_weight = none,
(d) KNN was implemented using algorithm = ‘auto’, metric = ‘manhattan’, n_neighbors = 41, p = 1, weights = ‘distance’,
(e) MLP was implemented using max_iter = 5000, activation = ‘relu’, alpha = 0.01, early_stopping = false, hidden_layer_sizes = (200), learning_rate_init = 0.001, solver = ‘adam’.
Given the significant class imbalance in the dataset, the repeated stratified K-fold cross-validation was employed, dividing the data into 10 folds and repeating the process 10 times to ensure robustness in our model evaluations. As evaluation metric, the average AUPRC (area under the precision-recall curve), average balanced accuracy, and average accuracy values were employed, which were averaged across all folds and repetitions. The AUPRC is especially advantageous for imbalanced datasets, providing a comprehensive assessment of the model's ability to handle class imbalance. In our study, we aim to identify the key features that predict crystal site residues, distinguishing them from non-crystal site residues. Therefore, AUPRC is an essential metric for our research. A higher AUPRC value signifies superior performance, particularly in imbalanced datasets where traditional metrics like accuracy may be misleading. AUPRC is notably valuable for the positive class because it emphasizes precision and recall, metrics that are directly pertinent to positive predictions. It remains sensitive to class imbalance, ensuring that the model's capability to correctly identify positive instances is accurately assessed. This makes AUPRC an important metric for comparing models, especially in scenarios where accurately identifying the positive class is critical.43,45 The baseline evaluation metric AUPRC was 0.24, while the model demonstrated a baseline accuracy and balanced accuracy of 0.76 and 0.50, respectively. Prior to training the machine learning models, features were rescaled and normalized to have a mean of 0 and a standard deviation of 1.
To search through the predefined hyperparameter space systematically and identify the most optimal set for all five ML methods, the “GridSearchCV” method was utilized.52 The evaluation metric for best hyperparameter employed here was the AUPRC score, indicating the relevancy of model. During this grid search, 10% of the whole data was reserved for the validation set (using stratified splitting). The remaining 90% of the whole data was subjected to stratified 10-fold cross-validation for hyperparameter tuning. In this cross-validation process, the training data was divided into ten parts or folds. In each iteration, nine folds were used as training data, while the remaining fold was used for testing. This process ensured that each fold served as the test set at least once. The “stratified” approach was employed to ensure that each fold retained the same class proportion observed in the original training dataset, an essential consideration for imbalanced datasets. Once this iterative process was concluded, the hyperparameter combination that produced the highest AUPRC across all iterations was identified as the best. The model, fitted with these optimal hyperparameters, was subsequently tested on the 10% validation set, which had been set aside and not involved in the hyperparameter tunning using cross-validation to assess the model's performance. In this phase, the efficacy of the model was evaluated on previously unseen validation data. It was observed that the AUPRC value for the best hyperparameter set of a particular method was notably close to the AUPRC value of the validation set, signalling consistent performance between the training, testing, and validation stages, and the minimization of the potential risk of overfitting. After identifying the best hyperparameters, the model was retrained freshly from scratch on the entire dataset using the stratified 10-fold cross-validation technique with 10-time repetitions to ensure its robustness, generalizability, and to validate our conclusions. This approach maximizes dataset utilization, considering the imbalance nature, thereby improving the overall performance and reliability of the model.
To determine the importance of each feature, we used several metrics for our models. For the RF model, we used its built-in feature importance metric. For the XGBoost model, we evaluated feature importance using metrics such as gain, cover, and total gain. The precision of the results was ensured by avoiding features with high cardinality (i.e., those with a large number of unique values) that might have yielded biased outcomes in gain, cover, or total gain-based feature importance calculations. Since such features were absent from our dataset, thus allowing to avoid any potential bias in the results. Furthermore, the SHAP77 (SHapley Additive exPlanations) feature importance was calculated from the test data for both models to validate our model. Note that feature importance was not evaluated for SVM (with “rbf” kernel), because it does not readily offer such interpretable measures. Additionally, feature importance was not evaluated for the KNN and MLP models due to its inferior performance compared to the other models. Model training and analysis were conducted using Scikit-learn.52
1. Residue selection:
2. Initialization of variable i: i = residueselected.
3. Setting variables inb: inb = {i + 1, i − 1}.
4. Setting variables inbh: inbh = {residue|distance (i, residue ≤ d)}.
5. Checking conditions for (i and i + 1), (i and i − 1) and each (i and inbh) pairs:
(a) Conditions for (i and i + 1) pair: SC SASA of i ≥ cut-off2 and SC SASA of i + 1 ≥ cut-off2.
(b) Conditions for (i and i − 1) pair: SC SASA of i ≥ cut-off2 and SC SASA of i − 1 ≥ cut-off2.
(c) Conditions for each (i and inbh) pair: SC SASA of i ≥ cut-off2 and SC SASA of inbh ≥ cut-off3.
6. Collection of pairs that met criteria set in step 5.
7. Check SASA (≥50 Å2) for each residues collected in step 6.
8. Set feature value (FV):
Steps 1 to 8 were sequentially executed, where “FV” represents the “feature value” of the respective residues. Here surface exposure of residues is calculated based on SC SASA (which represents sidechain solvent accessible surface area). Cut-off1 was varied from 0.15 to 0.20. The distance “d” was varied from 5 to 10 Å, cut-off2 was varied from 5 to 10 Å2 and cut-off3 was varied from 10 to 75 Å2 to generate different binary value SAP-adjacent (specifically termed SAP-adjacent-SC) features for each residue of FAB. Additionally, the types of amino acids constituting the ‘inbh’ neighbouring residues were altered to evaluate the influence of distinct amino acid properties near residues with high SAP values. The various types of ‘inbh’ (neighbouring) residues are considered as “any type residues”, “hydrophobic-polar-charged”, “polar-charged”, “hydrophobic-charged”, “specific hydrophobic (ILE, LEU, TRP, ALA, VAL, and PRO)-polar-charged”, and “charged”.
Furthermore, the surface exposure of residues was determined by calculating their fractional exposure (FE), defined as the ratio of a residue's solvent accessible surface area (SASA) of side-chain to the standard side-chain exposure of the residue in an Ala-X-Ala tri-peptide. This calculation contributed to the generation of a distinct set of SAP-adjacent (specifically termed SAP-adjacent-FE) binary features. The details of steps are provided below:
1. Residue selection:
2. Initialization of variable i: i = residueselected.
3. Setting variables inb: inb = {i + 1, i − 1}.
4. Setting variables inbh: inbh = {residue|distance (i, residue ≤ d)}.
5. Checking conditions for (i and i + 1), (i and i − 1) and each (i and inbh) pairs:
(a) Conditions for (i and i + 1) pair: FE of i ≥ cut-off2 and FE of i + 1 ≥ cut-off2.
(b) Conditions for (i and i − 1) pair: FE of i ≥ cut-off2 and FE of i − 1 ≥ cut-off2.
(c) Conditions for each (i and inbh) pair: FE of i ≥ cut-off2 and FE of inbh ≥ cut-off3.
6. Collection of pairs that met criteria set in step 5.
7. Check SASA (≥50 Å2) for each residues collected in step 6.
8. Set feature value (FV):
It should be noted that the algorithm steps for SAP-adjacent-FE were executed similarly to those for SAP-adjacent-SC. The cut-off1 was set at 0.15. The distance parameter “d” ranged from 5 to 7 Å, cut-off2 was consistently set at 5%, and cut-off3 varied from 5% to 40%. This approach was employed to generate a similar, yet distinct, set of binary value features for each residue of FAB. Furthermore, the amino acid types of the ‘inbh’ neighbouring residues were varied as before, enabling the assessment of the impact of unique amino acid characteristics in proximity to residues with elevated SAP values.
Moreover, our investigation extended to further refinement in residue selection. In the initial stages, along with parameters such as SAP and side chain (SC) SASA, the overall solvent accessible surface area (SASA) of residues (≥50 Å2) was also considered to generate a specific set of binary features (specifically termed SAP-adjacent-overall). The details of steps are provided as follows:
1. Residue selection:
2. Initialization of variable i: i = residueselected.
3. Setting variables inb: inb = {i + 1, i − 1}.
4. Setting variables inbh: inbh = {residue|distance (i, residue ≤ d)}.
5. Checking conditions for (i and i + 1), (i and i − 1) and each (i and inbh) pairs:
(a) Conditions for (i and i + 1) pair: SC SASA of i and i + 1 ≥ 10 Å2 and SASA of i and i + 1 ≥ 50 Å2.
(b) Conditions for (i and i − 1) pair: SC SASA of i and i − 1 ≥ 10 Å2 and SASA of i and i − 1 ≥ 50 Å2.
(c) Conditions for each (i and inbh) pair: SC SASA of i and inbh ≥ 10 Å2 and SASA of i and inbh ≥ 50 Å2.
6. Collection of pairs that met criteria set in step 5.
7. Set feature value (FV):
The distance parameter “d” was set at 5 Å. Also, the types of amino acids constituting the ‘inbh’ neighbouring residues were diversified, as previously done, to facilitate the evaluation of how distinct amino acid properties affect nearby residues with high spatial aggregation propensity (SAP) values.
AUPRC | Area under the precision-recall curve |
BSA | Buried surface area |
CDR | Complementarity-determining region |
CIF | Crystallographic information files |
EM | Electron microscopy |
F v | Variable fragment |
FV | Feature value |
FE | Fractional exposure |
FAB/Fab | Fragment antigen biding region |
IgG1 | Immunoglobulin G1 |
IgG4 | Immunoglobulin G4 |
KNN | k-Nearest neighbours |
mAbs | Monoclonal antibodies |
MLP | Multi-layer perceptron |
ML | Machine learning |
NMR | Nuclear magnetic resonance |
PDB | Protein Data Bank |
RF | Random forest |
SAbDab | Structural antibody database |
SAP | Spatial aggregation propensity |
SASA | Solvent-accessible surface area |
SCM | Spatial charge map |
SVM | Support vector machine |
SC SASA | Sidechain solvent-accessible surface area |
SHAP | SHapley Additive exPlanations |
VMD | Visual molecular dynamics |
XGBoost | Extreme gradient boosting |
XRD | X-ray crystallography |
Footnote |
† Electronic supplementary information (ESI) available: Data set employed for training machine learning algorithms (XLSX format). Additional information on the dataset and antibody structures, including explanations of features used in the study (PDF format). See DOI: https://doi.org/10.1039/d4me00187g |
This journal is © The Royal Society of Chemistry 2025 |