 Open Access Article
      
        
          
Daniel Garzon Otero, Omid Akbari and Camille Bilodeau‡*
University of Virginia, Chemical Engineering Department, 385 McCormick Road, Charlottesville, VA 22903, USA. E-mail: cur5wz@virginia.edu
    
First published on 11th December 2024
Peptides are a powerful class of molecules that can be applied to a range of problems including biomaterials development and drug design. Currently, machine learning-based property prediction models for peptides primarily rely on amino acid sequence, resulting in two key limitations: first, they are not compatible with non-natural peptide features like modified sidechains or staples, and second, they use human-crafted features to describe the relationships between different amino acids, which reduces the model's flexibility and generalizability. To address these challenges, we have developed PepMNet, a deep learning model that integrates atom-level and amino acid-level information through a hierarchical graph approach. The model first learns from an atom-level graph and then generates amino acid representations based on the atomic information captured in the first stage. These amino acid representations are then combined using graph convolutions on an amino acid-level graph to produce a molecular-level representation, which is passed to a fully connected neural network for property prediction. We evaluated this architecture by predicting two peptide properties: chromatographic retention time (RT) as a regression task and antimicrobial peptide (AMP) activity as a classification task. For the regression task, PepMNet achieved an average R² of 0.980 across eight datasets, which spanned different dataset sizes and three liquid chromatography (LC) methods. For the classification task, we developed an ensemble of five models to reduce overfitting and ensure robust classification performance, achieving an area under the receiver operating curve (AUC–ROC) of 0.978 and an average precision of 0.981. Overall, our model illustrates the potential for hierarchical deep learning models to learn peptide properties without relying on human-engineered amino acid features.
Design, System, Application
PepMNet, a hierarchical graph neural network, addresses the limitations of current peptide property prediction models by integrating atomic and amino acid-level information using a multilevel graph convolutional architecture. Specifically, existing peptide models primarily rely on human-engineered amino acid features, which are not compatible with non-natural features such as staples or non-natural sidechains and which introduce bias into the learning process. In contrast, PepMNet learns amino acid features from their atomic structure and then learns global peptide properties from these features. We demonstrate the versatility of PepMNet by evaluating its ability to predict peptide chromatographic retention time (a regression task) and antimicrobial activity (a classification task). The immediate applications of this model lie in peptide-based drug discovery, where rapid and accurate prediction of properties such as retention time and antimicrobial activity is valuable for the development of high throughput discovery assays and the identification of drug candidates, respectively. In the future, PepMNet's design can straightforwardly be extended to other biopolymer or synthetic polymer systems, providing a powerful framework for predicting properties across a wide variety of molecular systems.
To address these limitations, we have developed PepMNet, a hybrid, deep learning approach which incorporates both atom-level and amino acid-level information via a hierarchical graph model. The first stage of our model borrows from commonly used deep learning architectures for small molecule property prediction with the model learning directly from the atom-level graph. We then compute the representations of each amino acid based on information about their constituent atoms, resulting in a coarse-grained molecular graph where nodes represent amino acids and edges represent adjacencies between them. Finally, we perform graph convolutions on the amino acid-level graph and sum over amino acid features to obtain a molecular-level representation which can be used for peptide property prediction. Importantly, by learning amino acid features from atom-level information, our model can learn the relationships between the atomic configurations of amino acids allowing it to better represent natural amino acids. Theoretically, this method can be straightforwardly extended to incorporate any peptide chemical groups such as non-natural sidechains or nonlinear peptide features, such as staples or cycles. In this way, the proposed model offers a more flexible and less biased alternative compared with current state-of-the-art property prediction models.
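The flow of information described above (atoms, then amino acids, then the whole peptide) can be illustrated with a minimal, dependency-free sketch. It is not the PepMNet implementation: the simple neighbour averaging below is an assumed stand-in for the learned graph convolutions, and the function names, feature vectors, and atom-to-residue assignment are hypothetical values chosen only for illustration.

```python
from typing import List, Tuple

Vector = List[float]


def add(u: Vector, v: Vector) -> Vector:
    """Element-wise sum of two feature vectors."""
    return [a + b for a, b in zip(u, v)]


def pool_atoms_to_residues(atom_feats: List[Vector],
                           atom_to_residue: List[int],
                           n_residues: int) -> List[Vector]:
    """Sum atom features within each amino acid to get one vector per residue."""
    dim = len(atom_feats[0])
    residues = [[0.0] * dim for _ in range(n_residues)]
    for feats, res_idx in zip(atom_feats, atom_to_residue):
        residues[res_idx] = add(residues[res_idx], feats)
    return residues


def propagate(node_feats: List[Vector],
              edges: List[Tuple[int, int]]) -> List[Vector]:
    """One round of mean aggregation over each node and its neighbours.

    Toy stand-in for a learned graph convolution on the amino acid graph.
    """
    neighbours = {i: [i] for i in range(len(node_feats))}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    updated = []
    for i in range(len(node_feats)):
        total = [0.0] * len(node_feats[i])
        for j in neighbours[i]:
            total = add(total, node_feats[j])
        updated.append([x / len(neighbours[i]) for x in total])
    return updated


def readout(node_feats: List[Vector]) -> Vector:
    """Sum residue features into a single peptide-level vector."""
    total = [0.0] * len(node_feats[0])
    for feats in node_feats:
        total = add(total, feats)
    return total


# Hypothetical tripeptide: five atoms with 2-dimensional features.
atom_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 2.0]]
atom_to_residue = [0, 0, 1, 2, 2]      # which amino acid each atom belongs to
residue_edges = [(0, 1), (1, 2)]       # linear backbone adjacency

residue_feats = pool_atoms_to_residues(atom_feats, atom_to_residue, 3)
peptide_vector = readout(propagate(residue_feats, residue_edges))
```

In the model itself, `peptide_vector` would correspond to the molecular-level representation that is handed to the prediction head; here it only shows how atom-level features are coarse-grained before the peptide-level sum.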
We evaluated our model by applying it to predict two model peptide properties: chromatographic retention time as a model regression task and antimicrobial peptide classification as a model classification task. Liquid chromatography (LC) is one of the most common techniques for identifying and quantifying the composition of peptide mixtures and plays a key role in most peptide discovery workflows.28,29 Different types of LC, such as strong cation exchange (SCX), reversed-phase LC (RPLC), and hydrophilic interaction LC (HILIC), are commonly utilized to effectively separate and analyze peptide samples.28 Chromatographic retention time (RT) is defined as the time required for a peptide to elute from a chromatography column and is determined by the strength of non-covalent interactions (e.g. charge, hydrophobicity, or hydrogen bonding) between the peptide and the stationary phase. Here, we evaluate our model's ability to predict RT for a variety of chromatographic modes because 1) an accurate RT prediction model can be used to facilitate the development of analytical and preparative peptide purification methods, and 2) multiple, high quality, publicly available datasets exist for model training.
We additionally evaluated our model using antimicrobial peptide (AMP) classification as a model classification task. AMPs are short, positively charged, amphipathic peptides that present a promising alternative to traditional antibiotics for addressing microbial resistance.30 AMPs offer a broad range of activity, low toxicity, and minimal development of microbial resistance, making them a valuable tool in the fight against resistant pathogens.30 ML-based AMP classification has gained interest in recent years as a strategy for reducing the time and resource intensive experiments required to screen new candidate peptides.2,31,32 In this way, AMP classification is a good model classification problem because 1) an accurate AMP classification model can be used to design new AMPs, and 2) there exist public AMP datasets for model training (albeit with fewer datapoints than RT datasets).
Previous studies have explored the development of various shallow and deep learning approaches for predicting both RT and antimicrobial activity. For example, DeepRT utilizes deep learning techniques to encode amino acid vectors within peptides, enabling accurate prediction of peptide retention times for various LC types.28 Similarly, AmPEP converts peptide sequences into a feature vector derived from physicochemical descriptors, which serves as input for a random forest model used to classify AMPs.1 As noted earlier, each of these methods relies on amino acid-level features instead of atom-level features making them less flexible than atomic models. A notable exception to this trend is the AMP-Net developed by Ruiz et al. which learns antimicrobial activity directly from the atom-level graph, which, when combined with peptide physicochemical properties, facilitates classification into AMPs or Non-AMPs.33 While AMP-Net provides a flexible alternative to previous methods, it sacrifices predictive power, failing to outperform random forest classifiers such as AmPEP.
In this work, we build upon these previous methodologies to develop PepMNet, a deep hierarchical graph model for peptide property prediction and we evaluate our model on two model tasks, chromatographic RT prediction and AMP classification. To quantify the impact of our hierarchical strategy we compare our model with non-hierarchical models trained on only atom-level or amino acid-level graphs. We additionally explore the impact of a series of deep learning strategies including graph convolutional layer choice, amino acid feature concatenation, and ensembling and we benchmark our model against current, state-of-the-art models. Finally, we explore how trends in the training datasets used impact molecular properties and model predictions. Overall, the resulting model achieves on par or better performance than other peptide property prediction models, while also offering greater flexibility in its ability to incorporate non-natural features. To make it straightforward to reproduce our results and repurpose our models, we have made our code publicly available on GitHub (https://github.com/danielgarzonotero/PepMNet.git).
… 587 peptides. We trained and tested our model on each dataset separately, using a random split of 90% for training and 10% for testing. We chose a single random split over k-fold cross-validation because of the large number (eight) and varied sizes of the datasets; this approach aims to balance computational feasibility with performance estimation. To account for prediction uncertainty, we performed the training process in triplicate for each dataset, and the final average performance metrics for each training and test set are reported in Table 1. We note that all datasets contain naturally occurring peptides without any synthetic modifications, and most are derived from digests of protein or peptide mixtures. The exception is the HeLa dataset, which includes peptides with modified amino acids (oxidized methionine, phosphorylated serine, phosphorylated threonine, and phosphorylated tyrosine); these were removed before model training.28
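The evaluation protocol above can be sketched as follows. This is an illustrative outline only: the seeds, the helper names, and the use of the sample standard deviation for the "±" spread are assumptions, since the text does not specify them.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def split_90_10(n_items: int, seed: int) -> Tuple[List[int], List[int]]:
    """Return (train_indices, test_indices) for a random 90%/10% split."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(0.10 * n_items))
    return indices[n_test:], indices[:n_test]


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over repeated runs (assumed spread measure)."""
    return statistics.mean(scores), statistics.stdev(scores)


# Example: split indices for a dataset the size of HeLa (1170 peptides, Table 1).
train_idx, test_idx = split_90_10(1170, seed=0)
```

Repeating the split and training with three different seeds and passing the three test-set R² values to `summarize` would give a mean and spread in the form reported in Table 1.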
        
| Dataset | LC type | No. peptides | R² (training set) | R² (testing set) |
|---|---|---|---|---|
| HeLa | RPLC | 1170 | 0.9894 ± 0.0041 | 0.9427 ± 0.0045 |
| Yeast | RPLC | 14 361 | 0.9927 ± 0.0032 | 0.9825 ± 0.0043 |
| Misc | RPLC | 146 587 | 0.9919 ± 0.0005 | 0.9885 ± 0.0006 |
| SCX | SCX | 30 482 | 0.9962 ± 0.0014 | 0.9942 ± 0.0012 |
| Luna HILIC | HILIC | 36 271 | 0.9922 ± 0.0014 | 0.9841 ± 0.0017 |
| Xbridge | HILIC | 40 290 | 0.9928 ± 0.0020 | 0.9876 ± 0.0023 |
| Atlantis silica | HILIC | 39 091 | 0.9891 ± 0.0009 | 0.9809 ± 0.0008 |
| Luna silica | HILIC | 37 110 | 0.9905 ± 0.0047 | 0.9829 ± 0.0048 |
For AMP classification, we used the datasets recently curated by Ruiz et al.,33 which contain 23 919 peptides, with 13 334 classified as AMPs and 10 585 as non-AMPs (Table 2). To assess the performance and differences between the proposed hierarchical and non-hierarchical graph models, a split of 80% and 20% was used for training and validation. To facilitate comparison across models, classification performance was evaluated on the same test dataset used previously by Ruiz et al. (Table 2). Additionally, we compared our model's performance against previous machine learning approaches, AMPEP and AMPEPpy.1,14,33 We implemented 5-fold cross-validation during training to ensure robust evaluation and mitigate overfitting. The final model is an ensemble of the models from each fold.
| Dataset | AMP | Non-AMP | Total |
|---|---|---|---|
| Training and validation | 10 667 | 8466 | 19 133 |
| AMP testing | 2667 | 2119 | 4786 |
|  | (1) | 
|  | (2) | 
| $H^{0} = a = R(\{g_{v}^{T} \mid v \in A\})$ | (3) |
|  | (4) | 
… is the modified Laplacian matrix, K is the number of parallel stacks, and T is the number of timesteps.41 In this way, during stage III, the information of each amino acid, represented by H, is shared along the coarse-grained graph through ARMA spectral convolution. This convolution stage allows the model to learn more complex relationships between different amino acids. Here, after T timesteps, the feature matrices for each skip were averaged over the K stacks to obtain a single feature matrix:
|  | (5) |
|  | (6) | 
… refers to the feature vector for amino acid V in peptide P, and y* refers to the full peptide feature vector. For models where additional amino acid-level features were introduced, these features were concatenated to H0 before performing convolutions. Finally, in stage V, y* is passed to a linear fully connected neural network (FCNN) to yield the final prediction (shown in Fig. 2V).
| Feature | Feature type | Description | Size |
|---|---|---|---|
| Atom type | Atom | Type of atom by atomic number | 4 |
| Aromaticity | Atom | Whether this atom is part of an aromatic system | 2 |
| Number of bonds | Atom | Number of bonds the atom is involved in | 3 |
| Number of H bonds | Atom | Number of bonded hydrogen atoms | 4 |
| Hybridization | Atom | sp, sp², sp³, sp³d, or sp³d² | 2 |
| Implicit valence | Atom | Number of implicit H on the atom | 4 |
| Bond type | Bond | Single, double, triple, or aromatic | 3 |
| In ring | Bond | Whether the bond is part of a ring | 2 |
| Conjugated | Bond | Whether the bond is conjugated | 2 |
| Aromaticity bond | Bond | Whether the bond is aromatic | 2 |
| Valence contribution i | Bond | Contribution of the bond to the valence of atom i | 2 |
| Valence contribution f | Bond | Contribution of the bond to the valence of atom f | 2 |
| W amino acid | Amino acid | Amino acid molecular weight | 1 |
| Aromaticity | Amino acid | Aromaticity | 2 |
| Hydrophobicity | Amino acid | GRAVY | 1 |
| Net charge | Amino acid | Charge at pH 7 | 1 |
| Isoelectric point | Amino acid | Isoelectric point | 1 |
| Log P | Amino acid | Octanol–water partition coefficient | 1 |
| Atoms number | Amino acid | Number of atoms in the amino acid | 1 |
Similarly, we compared PepMNet with non-hierarchical graph models using only amino acid information. To obtain initial representations of the amino acids, we summed the features of the atoms within each amino acid, effectively removing the atomic graph convolutions performed in stages I and II in PepMNet. We additionally concatenated the amino acid features listed in Table 3. Each of the layers tested for the atomic non-hierarchical graph was also tested for the amino acid non-hierarchical graph except NNConv, which was not included because it requires bond features (not present in the amino acid graph).
As shown in Fig. 3, PepMNet outperformed all models trained on either the atomic-level graph alone (Fig. 3a) or the amino acid-level graph alone (Fig. 3b), regardless of the type of graph convolutional layer used, achieving a mean R² of 0.9804 across the RT datasets. Further, PepMNet outperformed each non-hierarchical model, regardless of dataset size. This illustrates that integrating both atomic- and amino acid-level feature extraction leads to more robust model development than learning from either the atomic- or amino acid-level graphs alone. For the final PepMNet architecture we selected NNConv6,39 at the atomic level for its ability to incorporate bond features and the ARMA41 layer at the amino acid stage, which provided the best performance at this stage after hyperparameter optimization (Table S3†). The training for each dataset was performed in triplicate, and the results of each training along with the scatter plots for each RT dataset are shown in Table S4† and Fig. S1,† respectively.
Among the seven graph convolutional layers tested, the ARMA layer, a spectral layer that uses autoregressive filters to update node embeddings,41 exhibited the second-best performance for the atomic-level graph, achieving a mean R² of 0.9537, behind the NNConv layer (mean R² of 0.9637). Conversely, for the amino acid-level graph model, the SAGEConv layer, a spatial graph convolutional layer, performed best, with a mean R² of 0.9458.50 Overall, the relative performance of models trained on different layer types varied depending on the training dataset used, such that there were no clear “winners” among the graph convolutional layers. This highlights that while there may be advantages to using specific layer types in specific contexts, it is unclear whether in practice there are significant advantages to using one over another.
Fig. 4 illustrates the parity plots for PepMNet and each non-hierarchical model (using the ARMA layer) on two chromatographic retention time datasets, a large dataset called misc (containing 146 587 peptides) and a small dataset called HeLa (containing 1170 peptides). Interestingly, prediction errors for the non-hierarchical models vary with retention time, such that peptides with high retention times are predicted with lower accuracy than peptides with lower retention times. In contrast, PepMNet makes robust retention time predictions regardless of where the peptide falls in the retention time distribution.
For the non-hierarchical atomic-level model, the scatter plots reveal different behaviors depending on the dataset size. In the smallest dataset (HeLa), the model produces imprecise predictions at the high end of the RT distribution, with the predictions becoming increasingly dispersed for high RT values. For the largest dataset (misc), this dispersion was also evident for greater RT values. In this way, atomic-level features alone may struggle to capture the full range of retention time values. The non-hierarchical amino acid-level model exhibits analogous trends to a lesser extent, suggesting that amino acid-level features alone may not fully encapsulate the nuanced interactions governing peptide retention. Overall, the performance gap between non-hierarchical models and PepMNet was most pronounced for the smallest dataset, HeLa. This suggests that non-hierarchical models, especially when the availability of information is limited as in smaller datasets, may struggle to capture the full range of retention times.
In contrast, the hierarchical model, which integrates both atomic and amino acid-level features, achieves consistently low error independent of retention time, resulting in the highest R² among the three models. This indicates that the hierarchical approach benefits from capturing multi-scale information, effectively combining the detailed atomic interactions with the broader sequence-level context provided by the amino acid sequence. As a result, the hierarchical model generalized better across datasets of varying sizes and retention time ranges, leading to more accurate predictions. The results of the non-hierarchical models at the atomic and amino acid levels, with each training performed in triplicate, are listed in Tables S5 and S6,† respectively.
| Graph convolutional layer | AUC–ROC, Non-HierGraph atomic level | AUC–ROC, Non-HierGraph amino acid level |
|---|---|---|
| TransformerConv | 0.9087 ± 0.0083 | 0.9439 ± 0.0012 |
| SAGEConv | 0.9236 ± 0.0047 | 0.9482 ± 0.0015 |
| GCNConv | 0.9249 ± 0.0090 | 0.9087 ± 0.0428 |
| NNConv | 0.9253 ± 0.0010 | – |
| EGConv | 0.9266 ± 0.0026 | 0.9343 ± 0.0030 |
| GATConv | 0.9329 ± 0.0024 | 0.9375 ± 0.0044 |
| ARMA | 0.9492 ± 0.0008 | 0.9297 ± 0.0133 |
| PepMNet (hierarchical, both levels) | 0.9619 ± 0.0017 | |
Interestingly, in the testing dataset for antimicrobial peptide classification, we observed that non-hierarchical models based on the amino acid-level graph tended to outperform those trained only on the atomic-level graph (Table 4 and Fig. S2†), while for retention time prediction the reverse was true for the dataset employed in this study (Fig. 4). Retention time is determined by the strength of interactions between the peptide and the stationary phase and the peptide's physical properties, such as hydrophobicity and charge, depend on the distribution and arrangement of atoms at the molecular level. We hypothesize, therefore, that representing the peptide at the atomic level may more accurately capture the contributions of each atom to these interactions, improving RT prediction. On the other hand, antimicrobial activity appears to depend more on the sequence and arrangement of amino acids. The amino acid sequence can determine characteristics like the formation of secondary structures and the peptide's ability to interact with microbial cell membranes.57 In this way, the impact of the representation level (atomic vs. amino acid) may depend on the specific property being predicted and how that property relates to the peptide's structure and function. By implementing a hierarchical model, we ensure to some extent that both types of information are captured by the model for the prediction task.
These results, as well as previous studies in the literature,1,15,17 suggest that information contained at the amino acid level is key for accurately predicting antimicrobial activity. To this end, we evaluated whether our model could be further improved by concatenating amino acid features before performing graph convolutions on the amino acid-level graph. These amino acid features consist of a vector containing amino acid molecular weight, aromaticity, GRAVY score, net charge at pH 7, isoelectric point, octanol–water partition coefficient, and number of atoms. We note that all these features except for the GRAVY score can be calculated for any chemical group, not just an amino acid. In this way, only a small modification to our model would be required to introduce non-amino acid components to the molecules. We additionally explored the impact of changing the graph convolutional layer type at the amino acid level to determine whether this has a significant impact on model performance. Our results show that providing the model with amino acid features slightly improves AMP classification and generalizability, with the model achieving a small increase in AUC–ROC values for the test set regardless of graph convolutional layer choice (Fig. 5). Additionally, we observed that the ARMA layer led to a small increase in performance, with an AUC–ROC of 0.9619.
Finally, to improve model robustness and reduce variability, we trained an ensemble of five separate models using 5-fold cross-validation. Specifically, each model differed in that 1) a different subset of the training set was used for validation and 2) the model weights were initialized randomly. The ensemble prediction was then taken as the average across the five models. Thus, by averaging across multiple models with different training/validation splits and different initialization seeds, we can obtain model predictions that are less sensitive to noise in our model training procedure.58 The loss curves for each fold, as well as the correlation in predictions between folds, are illustrated in Fig. S3 and S4.†
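The ensembling procedure can be outlined with a short sketch. It assumes that each trained model can be treated as a function returning an AMP probability; the fold construction, the seed, and the stand-in models below are hypothetical and are not taken from the PepMNet code.

```python
import random
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_items: int, k: int = 5,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, validation_indices) pairs with disjoint validation folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


def ensemble_predict(models: Sequence[Callable[[str], float]],
                     peptide: str) -> float:
    """Average the predicted probabilities of all fold models."""
    probabilities = [model(peptide) for model in models]
    return sum(probabilities) / len(probabilities)


# Folds for the 19 133 training and validation peptides listed in Table 2.
splits = k_fold_indices(19133, k=5, seed=0)

# Stand-in "models" returning fixed probabilities, for illustration only.
toy_models = [lambda _p, value=value: value for value in (0.9, 0.8, 0.7, 0.95, 0.85)]
ensemble_probability = ensemble_predict(toy_models, "KLLKLLKKLLKLLK")
```

Each fold model sees a different 80%/20% training/validation partition, so averaging their outputs smooths over both the choice of validation subset and the random weight initialization.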
As shown in Fig. 6, the final model achieved an accuracy of 0.9233 using a threshold of 0.5, an average precision of 0.9813, and an AUC–ROC of 0.9775 on the held-out test dataset. These results indicate that the model not only effectively distinguishes between classes but also maintains a high level of precision, minimizing false positives. The high AUC–ROC score further emphasizes the model's ability to discriminate between positive and negative classes across various threshold settings. Together, these metrics suggest that the model is well-suited for peptide classification tasks, demonstrating strong performance across the evaluation metrics employed.
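For reference, two of the metrics quoted above can be computed as in the following sketch. It assumes that a peptide is labeled as an AMP when its predicted probability is at least the threshold and that no two peptides share exactly the same score; neither detail is stated in the text, and the function names are illustrative.

```python
from typing import Sequence


def accuracy_at_threshold(probabilities: Sequence[float],
                          labels: Sequence[int],
                          threshold: float = 0.5) -> float:
    """Fraction of peptides whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    correct = sum(int(pred == y) for pred, y in zip(predictions, labels))
    return correct / len(labels)


def average_precision(probabilities: Sequence[float],
                      labels: Sequence[int]) -> float:
    """Mean of the precision values measured at the rank of each true positive.

    Assumes untied scores; peptides are ranked from highest to lowest probability.
    """
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("average precision needs at least one positive label")
    ranked = sorted(zip(probabilities, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / n_positive
```

Accuracy depends on the chosen threshold, whereas average precision summarizes the ranking quality across all thresholds, which is why both are reported alongside AUC–ROC.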
Additionally, we used the test dataset developed by Ruiz et al.33 to compare PepMNet's performance in AMP classification with three publicly available classification models: two random forest models, AMPEP and AMPEPpy, and one deep graph network, AMP-Net. Our architecture outperformed the random forest models AMPEP and AMPEPpy in AMP classification, demonstrating higher accuracy, average precision, and AUC–ROC, as shown in Table 5. Importantly, because it was not necessary to do hyperparameter optimization for the random forest models, these models were trained with 100% of the training dataset, whereas our model was trained with 80% of the dataset and 20% was used for validation. When the random forest models were trained with a randomly selected 80% of the dataset, they achieved average precisions of 0.9709 and 0.9717 for AMPEP and AMPEPpy, respectively, slightly widening the gap between their results and those of PepMNet. Since hyperparameter optimization would normally be performed for random forest training, this scenario is the more realistic comparison. The hierarchical approach also surpassed the model from Ruiz et al., which employs an atomic-level graph representation. We attribute this improvement to the significance of the amino acid distribution stage, as highlighted by our previous results. By relying solely on atomic composition, a model may overlook important amino acid characteristics of the peptides. Overall, the multi-scale graph neural network proves to be versatile and efficient in handling diverse tasks and predicting various properties compared with previous state-of-the-art approaches. This architecture allows for a thorough assessment of the model's generalization capabilities and has emerged as a promising tool for peptide property prediction.
| Model | AUC–ROC | Accuracy | Average precision | 
|---|---|---|---|
| AMPEP | 0.9674 | 0.9061 | 0.9748 | 
| AMPEPpy | 0.9667 | 0.9067 | 0.9740 | 
| AMP-Net | 0.9444 | 0.8808 | 0.9508 | 
| PepMNet | 0.9775 | 0.9233 | 0.9813 | 
To answer these questions, we explored the correlations between chromatographic retention time and two peptide properties, charge and hydrophobicity, computed using the Python package Biopython.47 As shown in Fig. 7a, the HILIC retention time dataset was the most strongly correlated with peptide hydrophobicity, with a Pearson-R of 0.536. In contrast, two of the RPLC datasets, HeLa and misc, had lower correlations with hydrophobicity, with Pearson-Rs of 0.256 and 0.271, respectively. HILIC differs from RPLC in that it consists of a nonpolar solvent and a polar stationary phase, while RPLC consists of a nonpolar stationary phase and a polar solvent (typically water). Since both modes of chromatography rely on partitioning between a polar and non-polar phase, it is surprising that a stronger correlation is observed for HILIC compared with RPLC. Finally, we compared retention time on SCX with peptide charge and found that the two quantities were not well correlated (with a Pearson-R of 0.027). This illustrates that the determinant of retention time in SCX is more complex than the formal charge of the peptide.
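The type of correlation analysis described here can be reproduced in outline with the sketch below. The study computed peptide properties with Biopython; this dependency-free version is an assumed stand-in that calculates the GRAVY score directly as the mean Kyte–Doolittle hydropathy of the residues and then a Pearson correlation coefficient. The peptide sequences and retention times in the example are made-up placeholders, not data from the paper.

```python
import math
from typing import Sequence

# Kyte-Doolittle hydropathy values for the 20 natural amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean Kyte-Doolittle value over the sequence."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Placeholder peptides and retention times (hypothetical, for illustration only).
peptides = ["AAKK", "LLIV", "GSGS", "FWYK"]
placeholder_retention_times = [10.0, 30.0, 15.0, 25.0]
correlation = pearson_r([gravy(p) for p in peptides], placeholder_retention_times)
```

Replacing the placeholder lists with the sequences and measured retention times of one dataset would give a single-descriptor correlation of the kind reported in Fig. 7a.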
Since antimicrobial activity prediction is represented in our dataset as a binary classification task, it is not possible to correlate antimicrobial activity with peptide properties. In lieu of this, Fig. 7b illustrates the length, charge, and hydrophobicity distributions of peptides in the AMP and non-AMP classes. Overall, the property distributions for both classes of peptides are similar, with non-AMPs having similar properties on average, but with a larger standard deviation than AMPs. Additionally, as expected, non-AMPs were close to neutral on average, while AMPs contained a net positive charge on average.
We can additionally explore the connection between antimicrobial activity and peptide properties by treating each property individually as a predictor of antimicrobial activity and applying different thresholds to obtain a receiver operating curve (ROC). The area under the receiver operating curve (AUC) can then be interpreted as a measure of the strength of the relationship between the two quantities. Fig. 8 illustrates the ROCs for charge, hydrophobicity, and length in predicting antimicrobial activity. Interestingly, charge and length alone are somewhat predictive of antimicrobial activity, with AUCs of 0.694 and 0.699, respectively. Because chromatographic retention time and bacterial membrane binding are both adsorption phenomena, we were additionally interested in determining whether retention time was more predictive of antimicrobial activity than the calculated peptide descriptors. To this end, we constructed ROCs for retention times (using our retention time PepMNet model to predict retention times for any peptides that were missing experimental data) and observed a modest increase in AUC, with SCX and HILIC both achieving AUCs of 0.704. Overall, this analysis demonstrates that quantifying peptide charge or hydrophobicity individually is not sufficient to predict antimicrobial activity, and a more complex model is required to achieve accurate classification.
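Using a single descriptor as a classifier score can be sketched as follows. The function relies on the standard equivalence between the AUC and the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties counted as one half); the function name and the toy charges and labels are hypothetical.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC of a single descriptor used directly as the classifier score.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; tied pairs contribute one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: net charge as the score, 1 = AMP, 0 = non-AMP (made-up values).
toy_charges = [4.0, 1.0, 2.0, -1.0]
toy_labels = [1, 1, 0, 0]
toy_auc = auc_from_scores(toy_charges, toy_labels)
```

An AUC near 0.5 indicates that the descriptor carries little class information on its own, while values approaching 1.0 indicate that thresholding the descriptor alone separates the classes well. The pairwise loop is quadratic in dataset size, so a rank-based implementation would be preferable for large datasets.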
In the future, it would be valuable to extend this hierarchical approach to non-natural amino acids and non-linear peptide systems. The atomic graph representation can depict complex, non-linear structures such as stapled and cyclic peptides, providing a more adaptable and less biased alternative to sequence-based representations. This flexibility is crucial for accurately modeling a broader range of peptide behaviors and functions that are not captured by traditional linear and natural peptide representations. However, a key challenge in translating sequences with non-natural amino acids lies in interfacing the parsing of these sequences with existing libraries for handling chemical compounds. This requires the development of algorithms that can accurately interpret and incorporate non-natural residues. Additionally, future efforts must focus on generating comprehensive experimental datasets that include non-natural amino acids, as the current lack of data significantly limits the training and validation of predictive models. Extending the hierarchical approach to such datasets would not only improve the robustness of hierarchical graph models but also pave the way for applying deep learning models to more diverse applications in peptide research.
Finally, because the hierarchical model captures peptide information at both the atomic and amino acid levels, it is particularly useful when it is unclear whether a property depends more on atomic interactions or on the arrangement of amino acids. In this way, the approach provides a comprehensive representation of the peptide's characteristics. Further, PepMNet consistently demonstrated reliable performance across datasets of different sizes, highlighting its robustness and adaptability in diverse tasks such as antimicrobial peptide classification. In this context, testing the model's performance in the discovery of novel AMPs would be a valuable next step, especially given the urgent need for innovative therapeutic solutions.2,55 In this work, we have treated AMP classification as a binary problem; however, it would be more accurate to classify AMPs into multiple categories such as anti-bacterial, anti-cancer, or anti-fungal peptides. Thus, in the future, it would be beneficial to use PepMNet as a multiclass model to discover peptides that are capable of addressing specific medical problems. This approach could significantly accelerate the discovery pipeline by reducing the reliance on traditional trial-and-error methods in the lab, which are time-intensive and costly.2,31,32 Further, since it is often useful to know both peptide activity and retention time, the two models could be integrated into a single unified workflow. Additionally, beyond AMPs, this hierarchical approach could also be extended to other systems, such as polymers or peptide-polymer conjugates, because it relies on the atomic graph representation of molecules.
This adaptability highlights the broad value of the hierarchical approach, extending its impact beyond peptide prediction to a wider range of molecular systems where atomic-level sequence information is crucial for understanding and predicting material properties.
| Footnotes | 
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4me00172a | 
| ‡ Permanent address: 385 McCormick Road, Charlottesville, VA 22903, USA. | 
| This journal is © The Royal Society of Chemistry 2025 |