Abdulelah S. Alshehri†a,b, Michael T. Bergman†c, Fengqi You a,d,e and Carol K. Hall*c
a Robert Frederick Smith School of Chemical and Biomolecular Engineering, Cornell University, Ithaca, NY 14853, USA
b Department of Chemical Engineering, College of Engineering, King Saud University, Riyadh 11421, Saudi Arabia
c Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, North Carolina 27606, USA. E-mail: hall@ncsu.edu
d Systems Engineering, College of Engineering, Cornell University, Ithaca, NY 14853, USA
e Cornell University AI for Science Institute, Cornell University, Ithaca, NY 14853, USA
First published on 24th January 2025
Plastic pollution, particularly microplastics (MPs), poses a significant global threat to ecosystems and human health, necessitating innovative remediation strategies. Biocompatible and biodegradable plastic-binding peptides (PBPs) offer a potential solution through targeted adsorption and subsequent MP detection or removal from the environment. A challenge in discovering plastic-binding peptides is the vast combinatorial space of possible peptides (i.e., over 10¹⁵ for 12-mer peptides), which far exceeds the sample sizes typically reachable by experiments or biophysics-based computational methods. One step towards addressing this issue is to train deep learning models on experimental or biophysical datasets, permitting faster and cheaper evaluations of peptides. However, deep learning predictions are not always accurate, so time and money can be wasted on synthesizing and evaluating false positives. Here, we resolve this issue by combining biophysical modeling data from the Peptide Binder Design (PepBD) algorithm, the predictive power and uncertainty quantification of evidential deep learning, and metaheuristic search methods to identify high-affinity PBPs for several common plastics. Molecular dynamics simulations show that the discovered PBPs have greater median adsorption free energies for polyethylene (5%), polypropylene (18%), and polystyrene (34%) relative to PBPs previously designed by PepBD. The impact of including uncertainty quantification in peptide design is demonstrated by the increasing improvement in the median adsorption free energy with decreasing uncertainty. This robust framework accelerates peptide discovery, paving the way for effective, bio-inspired solutions to MP remediation.
Peptides may be valuable for tackling MP pollution.1,11,12 The promise of peptides lies in their inherent biocompatibility and biodegradability, strong adsorption to polymeric materials,13 and potential ability to bind preferentially to one material over others. Thus, PBPs could offer a sustainable solution to facilitate the detection, capture, and degradation of MPs.1
Peptide design14–16 is hampered by the vast combinatorial space of peptide sequences (i.e., over 10¹⁵ for 12-mer peptides).17 Experimental library screening has successfully identified peptides with affinity for various inorganic substrates18,19 by sampling up to 10¹⁰ sequences, but this is still only a small fraction of possible sequences, and the sequences are sampled randomly. This limitation, combined with the experimental labor and cost of library screening, encourages the use of alternative design strategies. One alternative is biophysics-based design, which reduces the use of experimental resources and can intelligently explore peptide sequence space.17 Examples include Rosetta surface design,20 iterative procedures combining experimental and computational data,21,22 and the Peptide Binder Design (PepBD) algorithm.23 PepBD is especially relevant to this work as it was recently used to design PBPs for four common plastics: polyethylene (PE), polypropylene (PP), polystyrene (PS), and polyethylene terephthalate (PET).17 However, a major drawback of biophysical methods is that they sample even fewer peptides than library screening due to the computational expense of modeling peptide–plastic interactions. This motivates the adoption of deep learning (DL) models, which can be trained on modeling or library screening datasets to identify relationships between the peptide sequence and the design target,24,25 i.e., peptide affinity to a certain type of plastic. There are many examples of DL having success in this domain, including predictions of whether a peptide will bind polystyrene using PS-Binder,12,26 discovery of quartz-binding peptides,27 and improvement of binding selectivity of peptides between gold and silver.28
DL's promise in design is often limited by methods that, while powerful, typically do not offer quantified estimations of the reliability of the predictions.29 Thus, a crucial consideration in DL-based peptide design, and in molecular design in general, is uncertainty quantification, which aims to estimate the model's confidence in its predictions.24,25 Uncertainties in model predictions can preclude sampling of peptides with high affinity and lead to over-sampling in regions where the model lacks confidence and over-generalizes.30,31 This is particularly relevant to peptide design, where the training data often covers a tiny fraction of possible amino acid sequences, and testing and synthesizing peptides is expensive and time-consuming. It thus is highly desirable to incorporate uncertainty quantification into the peptide design process to strategically navigate the vast combinatorial space and to prioritize candidates based on both predicted affinity and the reliability of predictions derived from biophysics-based calculations.32,33
Traditional uncertainty quantification methods in deep learning, such as Bayesian neural networks34 and sampling-based35 approaches, although useful, are often computationally intensive for large datasets,29,36 potentially compromising both efficiency and accuracy.37 These limitations are particularly pronounced in the context of peptide design, where the sheer number of theoretically possible sequences and the different types of peptide–plastic interactions (e.g. hydrogen bonding, pi–pi interactions, hydrophobic forces) pose significant challenges. As an alternative, evidential deep learning (EDL) directly learns and represents uncertainties without the need for extensive sampling.36 Furthermore, EDL's seamless integration with domain-specific architectures amplifies its capacity to quantify uncertainty across peptide–plastic binding affinities and diverse plastic types.38
In this work, we pair biophysical modeling with EDL to discover PBPs for several types of plastic. We hypothesize that quantifying score prediction uncertainty will lead to more effective exploration of peptide sequence space by encouraging the model to ignore sequences for which it cannot confidently predict affinity. To test this hypothesis, we train a convolutional neural network (CNN) with an EDL layer on PepBD data, then combine the trained model with a biased random key genetic algorithm (BrKGA) to search for peptides with high affinity for four common plastics: polyethylene, polypropylene, polystyrene, and PET. The trained model accurately predicts affinity calculations from PepBD and generates unique peptides with higher predicted affinity than the best corresponding PepBD designs for all plastics. Validation of EDL peptides using molecular dynamics simulations shows that the EDL peptides have greater affinity than random amino acid sequences for all plastics, and greater affinity than PepBD peptides for all plastics but PET. For PET, EDL peptides have slightly lower affinity than PepBD peptides, which we attribute to the greater chemical complexity of PET relative to the other plastics. Overall, our results show that uncertainty-aware design can discover and optimize biomaterials for microplastic remediation, and more generally develop solutions to complex environmental and technological problems.
$$\mathrm{Score} = \Delta G + \alpha\,U_{\mathrm{pep}} \tag{1}$$
[Equation (2): rendered as an image in the source; not recovered]
Peptides were represented using a one-hot encoding scheme, transforming each amino acid into a 21-dimensional vector (20 common amino acids plus one placeholder for unknown or other amino acids). This encoding method preserves the sequential nature of peptide information, enabling direct input into computational models. The dataset was split into 90% for training, 5% for validation, and 5% for testing. More complex representations, such as graphs, performed worse rather than better, likely due to the need for spatial and conformational information. This observation aligns with findings from state-of-the-art protein classification and regression models, such as ProtCNN.42
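As a minimal sketch of the encoding and split described above, the snippet below one-hot encodes a 12-mer into a 12 × 21 matrix (252 integer elements) and partitions sample indices 90/5/5. The ordering of the alphabet, the placeholder symbol `X`, and the helper names `one_hot` and `split_indices` are illustrative assumptions, not details taken from the authors' repository.

```python
import numpy as np

# 20 standard amino acids plus a placeholder "X" for unknown/other residues.
# The ordering here is an arbitrary choice made for illustration.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(peptide):
    """Encode a peptide string as a (length, 21) one-hot integer matrix."""
    encoded = np.zeros((len(peptide), len(ALPHABET)), dtype=np.int8)
    for pos, aa in enumerate(peptide):
        # Any symbol outside the 20 standard residues maps to the placeholder.
        encoded[pos, INDEX.get(aa, INDEX["X"])] = 1
    return encoded


def split_indices(n, seed=0):
    """Shuffle n sample indices and split them 90/5/5 (train/val/test)."""
    order = np.random.default_rng(seed).permutation(n)
    n_train = (90 * n) // 100
    n_val = (5 * n) // 100
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


x = one_hot("RMHWWMKWFMRR")
train_idx, val_idx, test_idx = split_indices(1000)
print(x.shape, x.size)  # (12, 21) 252
```

Each row of the matrix corresponds to one sequence position, so the residue order is kept intact when the matrix is passed to a one-dimensional convolution.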
An optimized CNN with an EDL output layer was developed to predict PBP affinities. Initially trained on PE data, the model was adapted through transfer learning to improve performance for other plastics by approximately 10%. The one-dimensional CNN architecture includes an input layer for one-hot encoded peptide sequences, three convolutional layers with 128, 64, and 32 filters, respectively, and max pooling layers to reduce dimensionality. Two fully connected layers further process the features, with dropout regularization to prevent overfitting. The EDL output layer, specifically a normal gamma layer, performs the regression task, predicting binding affinity and its associated uncertainty. The EDL layer is crucial for providing confidence intervals alongside binding affinity predictions, aiding in the selection of top candidate peptides. The model uses the Adam optimizer, an EDL-specific loss function, and mean absolute error (MAE) as an evaluation metric, with L2 regularization to mitigate overfitting. Overall, the devised CNN architecture strikes a balance between expressive power and computational efficiency with 81,700 trainable parameters, making it well suited to the compact representations of peptides featuring 252 integer elements. We observe that more complex models, such as transformers43 and ProtCNN,42 yielded slightly worse results, likely due to the low dimensionality of peptide data. Additional architectural details and hyperparameters are available in our models and data repository.
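To illustrate how an evidential regression head can return a prediction together with its uncertainty, the sketch below assumes the commonly used deep evidential regression formulation, in which the output layer emits four normal-inverse-gamma parameters (γ, ν, α, β). The text above does not specify the exact loss, so the functions `nig_summary`, `nig_nll`, and `evidential_loss`, and the regularization coefficient `coeff`, are assumptions for illustration rather than the authors' implementation.

```python
import math


def nig_summary(gamma, nu, alpha, beta):
    """Prediction and uncertainties implied by normal-inverse-gamma parameters."""
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for the variances to be finite")
    aleatoric = beta / (alpha - 1.0)          # expected data noise, E[sigma^2]
    epistemic = beta / (nu * (alpha - 1.0))   # variance of the mean, Var[mu]
    return gamma, aleatoric, epistemic


def nig_nll(y, gamma, nu, alpha, beta):
    """Negative log-likelihood of y under the implied Student-t predictive."""
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * math.log(math.pi / nu)
            - alpha * math.log(omega)
            + (alpha + 0.5) * math.log(nu * (y - gamma) ** 2 + omega)
            + math.lgamma(alpha) - math.lgamma(alpha + 0.5))


def evidential_loss(y, gamma, nu, alpha, beta, coeff=0.01):
    """NLL plus a penalty on evidence (2*nu + alpha) scaled by the error."""
    regularizer = abs(y - gamma) * (2.0 * nu + alpha)
    return nig_nll(y, gamma, nu, alpha, beta) + coeff * regularizer


# Hypothetical parameter values for a single peptide score prediction.
prediction, aleatoric, epistemic = nig_summary(gamma=-40.0, nu=2.0, alpha=3.0, beta=4.0)
print(prediction, aleatoric, epistemic)  # -40.0 2.0 1.0
```

Under this formulation a single forward pass yields both the score and its uncertainty, which is what allows uncertainty to enter the search objective without repeated sampling.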
The BrKGA optimization method,44 as implemented in Pymoo,45 was employed to optimize peptide sequences for binding affinity to plastics, leveraging the predictions from the CNN-EDL model. BrKGA was chosen for its success in solving complex combinatorial problems, efficient search mechanisms, and adaptability to various problem structures.46 The algorithm's fast convergence and effective handling of large search spaces make it well suited to peptide design. Initial populations were generated with and without bias towards amino acids known to enhance plastic binding. The fitness function integrated predicted binding affinity, uncertainty magnitude, and constraint violations, guiding the selection of promising candidate peptides. Key operators included tournament selection, simulated binary crossover, and polynomial mutation to maintain diversity. Hyperparameters were fine-tuned using a systematic grid search, optimizing crossover probability, distribution indices, population size, and the number of generations. Because no peptide in the PepBD dataset contains cysteine or proline, and each contains at most three tryptophan residues to maintain peptide solubility in water, these constraints were also enforced by BrKGA. We observe that this combination of BrKGA with EDL predictions enhances the identification of high-affinity PBPs in terms of PepBD scores.
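The search described above can be sketched with a small, self-contained biased random-key genetic algorithm written in plain NumPy. This is not the Pymoo implementation and does not reproduce the operators or tuned hyperparameters listed above; `toy_surrogate` is a hypothetical stand-in for the trained CNN-EDL model, and the uncertainty weight, penalty size, and population settings are arbitrary values chosen for illustration.

```python
import numpy as np

# Allowed residues: the 20 standard amino acids minus cysteine (C) and proline (P).
ALLOWED = "ADEFGHIKLMNQRSTVWY"


def decode(keys):
    """Map a vector of random keys in [0, 1) to a peptide string."""
    idx = np.minimum((np.asarray(keys) * len(ALLOWED)).astype(int), len(ALLOWED) - 1)
    return "".join(ALLOWED[i] for i in idx)


def toy_surrogate(seq):
    """Hypothetical stand-in for the CNN-EDL model: returns (score, uncertainty)."""
    score = -float(sum(ch in "RHMFW" for ch in seq))  # more negative = better
    uncertainty = 0.1 * len(set(seq))                  # arbitrary placeholder
    return score, uncertainty


def fitness(keys, weight=1.0, penalty=10.0):
    """Predicted score + weighted uncertainty + penalty for more than 3 Trp."""
    seq = decode(keys)
    score, uncertainty = toy_surrogate(seq)
    excess_trp = max(0, seq.count("W") - 3)
    return score + weight * uncertainty + penalty * excess_trp


def brkga(n_keys=12, pop=60, n_elite=12, n_mutant=12, bias=0.7, gens=40, seed=0):
    """Minimize `fitness` over random-key vectors; returns (peptide, fitness)."""
    rng = np.random.default_rng(seed)
    population = rng.random((pop, n_keys))
    n_offspring = pop - n_elite - n_mutant
    for _ in range(gens):
        scores = np.array([fitness(ind) for ind in population])
        population = population[np.argsort(scores)]
        elite = population[:n_elite]                      # copied unchanged
        mutants = rng.random((n_mutant, n_keys))          # fresh random keys
        elite_parent = elite[rng.integers(0, n_elite, n_offspring)]
        other_parent = population[rng.integers(n_elite, pop, n_offspring)]
        take_elite = rng.random((n_offspring, n_keys)) < bias
        offspring = np.where(take_elite, elite_parent, other_parent)
        population = np.vstack([elite, mutants, offspring])
    scores = np.array([fitness(ind) for ind in population])
    best = int(np.argmin(scores))
    return decode(population[best]), float(scores[best])


best_seq, best_fit = brkga()
print(best_seq, best_fit)
```

Restricting the decoding alphabet enforces the cysteine and proline exclusions by construction, whereas the tryptophan limit is handled as a penalty term, mirroring the way constraint violations enter the fitness function.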
The differences in binding score distributions between EDL and PepBD designs, visualized in Fig. 2A, are statistically significant for all plastics. As shown in Fig. 2B, negative t-test statistics57 and p-values < 0.001 for all four plastics demonstrate that peptides designed by EDL have much more negative scores than those generated by PepBD. Ranking of the t-values shows polyethylene has the most pronounced difference between the two methods. To address the normality assumption inherent to t-tests, a non-parametric Mann–Whitney U test58 was also performed, yielding similarly significant results with all p-values reported as 0, further corroborating the observed performance gap.59 The calculated mean improvements of EDL over PepBD, namely 9.8% (polyethylene), 6.3% (polypropylene), 4.0% (polystyrene), and 5.1% (PET), provide a quantitative measure of the EDL model's advantage. However, statistical significance does not equate to practical relevance, so we incorporated uncertainty quantification early in model development to ensure the reliability and applicability of predictions.
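A comparison of this kind can be sketched with SciPy as follows. The two score arrays are synthetic placeholders drawn from normal distributions, not the study's data. The choice of a two-sided Student t-test with equal variances and the definition of percent mean improvement used here are also assumptions, since the text does not spell them out for the score comparison.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic placeholder scores (more negative = stronger predicted binding).
edl_scores = rng.normal(loc=-50.0, scale=3.0, size=100)
pepbd_scores = rng.normal(loc=-45.0, scale=3.0, size=100)

# Two-sided Student t-test; a negative statistic means EDL scores are lower.
t_stat, t_p = stats.ttest_ind(edl_scores, pepbd_scores)

# Non-parametric check that does not rely on normality.
u_stat, u_p = stats.mannwhitneyu(edl_scores, pepbd_scores, alternative="two-sided")

# Assumed definition: how much more negative the EDL mean is, relative to PepBD.
mean_improvement = (100.0 * (pepbd_scores.mean() - edl_scores.mean())
                    / abs(pepbd_scores.mean()))

print(f"t = {t_stat:.2f}, p = {t_p:.2e}")
print(f"U = {u_stat:.0f}, p = {u_p:.2e}")
print(f"mean improvement = {mean_improvement:.1f}%")
```

Passing `equal_var=False` to `ttest_ind` would switch to Welch's test, which is a reasonable alternative when the two score distributions have visibly different spreads.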
Fig. 2C presents an assessment of the uncertainty in EDL-predicted binding affinities. While EDL outperforms PepBD in mean predicted scores, the associated uncertainties, visualized as confidence intervals, vary across plastics and candidates. Average uncertainties range from 4.3 kcal mol⁻¹ (polyethylene) to 14.6 kcal mol⁻¹ (PET), highlighting the model's reduced confidence for certain predictions, especially PET. Material-specific variation in uncertainty may stem from differing complexities in peptide–plastic interactions or biases within the PepBD training data. For example, PET has more functional groups than polyethylene, meaning there are more types of peptide–plastic interactions to consider, such as hydrogen bonding and pi–pi stacking. Recognizing this modeling limitation, the EDL framework incorporates uncertainty directly into the objective function in the generation process, prioritizing peptides with lower uncertainties to enhance the confidence and practical relevance of our designs. Although this may yield slightly less favorable mean binding scores compared to a purely score-driven approach, it prioritizes candidates with a higher likelihood of strong binding in simulation and experimental validation. We expand on this topic in the discussion section.
Comparing the amino acid composition of EDL peptides for all the plastics reveals common peptide features. The same amino acid types appear with high frequency, namely arginine (R), histidine (H), methionine (M), phenylalanine (F), and tryptophan (W) (Fig. 3B). These residues all have bulky side chains, indicating that EDL increases peptide affinity by increasing the number of possible intermolecular interactions or reducing the solvent-accessible surface area of the plastic. We note that the frequency of tryptophan is identical for all plastics due to a limit of three or fewer tryptophan residues per peptide, a constraint also used when generating the PepBD data set. While the amino acid frequencies are generally similar between all plastics, there are some differences. Methionine (M) is more frequent in PE designs, perhaps because the crystallinity of the polyethylene surface facilitates interactions with the amino acid. Designs for PET have the greatest frequency of arginine (R), likely because it can form electrostatic interactions with the oxygens in the terephthalic acid group, which carry a partial negative charge. Isoleucine (I) and leucine (L) are more common in designs for polypropylene and polystyrene, respectively. This may be because the models of these plastic surfaces used to collect PepBD data have greater surface roughness, and the greater conformational freedom of leucine and isoleucine lets them conform more closely to the plastic surface than amino acids with more rigid side chains. A question raised by Fig. 3B is how selective a peptide will be for a given plastic. The similar amino acid compositions suggest there are peptides that can bind strongly to multiple types of plastics. The amino acid composition differences also suggest that there may be peptides that bind selectively, at least to some degree, to each plastic. However, it should be noted that peptide affinity depends on the arrangement of amino acids rather than just on the amino acid composition,20 and that these hypotheses merit further investigation.
Plastic | EDL vs. PepBD: p-value (a) | EDL vs. PepBD: percent median ΔG improvement (b) | EDL vs. random: p-value | EDL vs. random: percent median ΔG improvement
---|---|---|---|---
Polyethylene | 0.140 | +5% | 2.94 × 10⁻⁴ | +78%
Polypropylene | 0.126 | +18% | 3.86 × 10⁻⁵ | +81%
Polystyrene | 2.40 × 10⁻² | +34% | 5.58 × 10⁻² | +64%
PET | 0.205 | −11% | 0.196 | +18%

(a) Calculated using a two-sided, equal-variance t-test. (b) Median improvement relative to either PepBD or random peptides; a positive improvement corresponds to higher affinity of EDL peptides for the given plastic than that of the PepBD or random peptides.
Plastic | Peptide sequence | ΔG (kcal mol⁻¹)
---|---|---
Polyethylene | RMHWWMKWFMRR | −48.0
Polyethylene | FFMWHMKWYMRW | −43.0
Polyethylene | SWMHKIHWHMRW | −34.1
Polystyrene | WWMRHMFAWRIF | −35.0
Polystyrene | FWWRTIVWRHIR | −28.3
Polystyrene | YFIWWWRMFFFR | −27.2
Polypropylene | FIFRWWQWHVRM | −20.3
Polypropylene | WWMRWHRLFFIR | −16.0
Polypropylene | WRWIRLIWQGHR | −12.5
PET | FHVWWINIFWFF | −22.0
PET | FMRWWRMYWFDF | −21.7
PET | FHEWWRMYWHRY | −19.9
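The sequences tabulated above can be used for a quick composition and constraint check. The sketch below tallies residue frequencies per plastic and verifies the design constraints stated in the methods (no cysteine or proline, at most three tryptophan residues). It covers only the twelve listed peptides, so the frequencies are not expected to reproduce Fig. 3B; the helper names are illustrative.

```python
from collections import Counter

PEPTIDES = {
    "Polyethylene": ["RMHWWMKWFMRR", "FFMWHMKWYMRW", "SWMHKIHWHMRW"],
    "Polystyrene": ["WWMRHMFAWRIF", "FWWRTIVWRHIR", "YFIWWWRMFFFR"],
    "Polypropylene": ["FIFRWWQWHVRM", "WWMRWHRLFFIR", "WRWIRLIWQGHR"],
    "PET": ["FHVWWINIFWFF", "FMRWWRMYWFDF", "FHEWWRMYWHRY"],
}


def composition(seqs):
    """Fraction of each residue type across a list of peptides."""
    counts = Counter("".join(seqs))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.most_common()}


def satisfies_constraints(seq, max_trp=3):
    """No cysteine or proline, and at most `max_trp` tryptophan residues."""
    return "C" not in seq and "P" not in seq and seq.count("W") <= max_trp


for plastic, seqs in PEPTIDES.items():
    freqs = composition(seqs)
    top = ", ".join(f"{aa}:{frac:.2f}" for aa, frac in list(freqs.items())[:4])
    ok = all(satisfies_constraints(s) for s in seqs)
    print(f"{plastic:13s} constraints ok: {ok}  top residues: {top}")
```

Every listed peptide contains exactly three tryptophan residues, so the tryptophan fraction is 0.25 for each plastic in this small sample, consistent with the tryptophan cap being the binding constraint.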
The variability in the molecular dynamics results emphasizes the need for this screening step before the designed peptides are used for MP remediation. The values of ΔG span a large range, highlighting the need to evaluate multiple designs rather than only the best one. Scores, whether from PepBD or EDL, are only predictions of affinity and do not guarantee that high affinity will be observed in MD simulations or experiments. Thus, MD can screen out false positives to minimize the cost, time, and labor of developing effective peptide-based tools for MP remediation. While the MD protocol can evaluate many peptides (∼150 total) at a reasonable computational cost, its relatively simple theoretical basis reduces the accuracy of the calculations. The MD results are best viewed as an initial screen that identifies promising peptides requiring more rigorous evaluation, as we describe in detail in the discussion.
Analysis reveals that uncertainty-aware design helps generate better peptides. Ideally, lower uncertainty in score predictions would correspond to better peptide performance. To quantify peptide performance, we use the percent difference in the median affinity of EDL peptides relative to PepBD or random peptides, as provided in Table 1. We term this quantity “improvement” for the sake of discussion. Noting that uncertainty in score predictions varied greatly between plastics (Fig. 1C), we can determine how score uncertainty relates to peptide performance by plotting improvement versus average uncertainty for all plastics (Fig. 5). The comparison between EDL and random peptides clearly displays the desired trend: as score uncertainty decreases, EDL peptides show greater improvement. The same trend appears to be present when comparing EDL and PepBD peptides, although it is weaker. We attribute this to PepBD performing better for some plastics (e.g., polyethylene, where PepBD greatly outperforms random peptides) than others (e.g., polystyrene, where PepBD performs almost the same as random peptides), so the possible improvement of EDL designs varies between plastics. Whatever the explanation may be, Fig. 5 shows that lower uncertainty is associated with better peptide performance.
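The "improvement" metric can be made concrete with a short helper. The exact formula is not given in the text, so the definition below (difference in medians divided by the magnitude of the reference median, with the sign chosen so that a more negative EDL median counts as positive improvement) is an assumption, and the ΔG values in the example are made-up numbers rather than simulation results.

```python
import statistics


def percent_median_improvement(edl_dg, reference_dg):
    """Percent by which the EDL median ΔG is more negative than the reference median."""
    median_edl = statistics.median(edl_dg)
    median_ref = statistics.median(reference_dg)
    return 100.0 * (median_ref - median_edl) / abs(median_ref)


# Made-up adsorption free energies in kcal/mol (more negative = stronger binding).
better = percent_median_improvement([-12, -10, -8], [-6, -5, -4])
worse = percent_median_improvement([-4], [-5])
print(better, worse)  # 100.0 -20.0
```

A positive value means the EDL peptides bind more strongly at the median than the reference set, and a negative value, as reported for PET against PepBD, means they bind more weakly.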
Fig. 5 Lower average uncertainty in EDL score predictions correlates with greater affinity measured by MD simulations. The x-axis is the average uncertainty in the top 100 peptides found by EDL for each plastic. The y-axis is the improvement, or percent difference in the median adsorption free energy between EDL peptides and either random peptides (left) or PepBD peptides (right). Average score uncertainty is taken from Fig. 1c, and improvement is taken from Table 1.
The EDL model possesses useful properties that facilitate peptide design. The model is statistically robust and quantifies uncertainty. This enhances confidence in the predictive power of the model and guides the selection of peptides for practical applications. The result is reduced time, labor, and cost during the development of peptide tools for MP remediation, as fewer design iterations are required. The EDL model is general: it designed high-affinity peptides for several common plastics, suggesting that the model can be readily transferred to other peptide design tasks previously addressed by PepBD, such as designing peptide binders for proteins60 or RNA.23 The worse performance of PBP designs for PET indicates that some degree of model tailoring may be needed to capture the complexity of peptide interactions. For example, improved performance for PET could be achieved either by adding more detail to the PepBD dataset or by modifying the architecture or training procedure. The EDL model is also flexible. While PepBD data was used in this study to train the model, the PepBD dataset could be complemented with the MD dataset generated in this work to give a richer dataset for training the EDL model. Similarly, if a sufficiently large experimental dataset becomes available for peptide–plastic interactions, such data could be used for model training. This can be useful given the limitations of PepBD modeling, including the limited sampling of peptide conformations.
EDL uncertainty varies greatly between plastics and is significantly larger for PET than for the other plastics. We attribute the high uncertainty for PET to two factors that make peptide interactions with PET more complex than those with the other plastics. PET contains polar ester groups and aromatic moieties that interact with the peptide through hydrogen bonding, strong electrostatic interactions, and π–π stacking.61 The chemical complexity of the PET monomer also leads to a more chemically heterogeneous surface relative to the other plastics. These two factors make PET–peptide interactions complicated and sensitive to system geometry. This differs from peptide interactions with polyethylene and polypropylene, which are driven primarily by hydrophobic and van der Waals interactions. From a deep learning standpoint, it is notable that even though the dataset for PET is only slightly smaller than that used for polypropylene or polystyrene, the greater complexity of peptide–PET interactions results in poorer performance. These issues need to be resolved; otherwise, model uncertainty will remain elevated and predictions may be biased towards simple hydrophobic binding patterns. We see two possible solutions. The first is to obtain PepBD data for many more conformations of peptides adsorbed to PET. Increasing the data set size should help the DL model learn a better implicit representation of PET heterogeneity. A second, more complicated solution is to modify our current CNN-based EDL model, which relies solely on sequence-level features, to include peptide and PET structure in the input representation. This is motivated by work highlighting the importance of capturing specific structural and electrostatic details when modeling interactions between peptides and materials.62 This could be achieved using graph neural networks.63
Additional evaluation of the best EDL PBPs will be essential to determine their usefulness for micro- and nanoplastic (MNP) remediation. The MD analysis provided in this work is coarse and preliminary, since the free energy calculations use an implicit solvent model and do not account fully for the peptide's conformational entropy. These simplifications were needed so we could evaluate a large sample of peptides. Having identified the most promising peptides, future work can focus on more rigorously evaluating the best PBPs using simulation methods like metadynamics64,65 or umbrella sampling.66,67 Experimental measurements of peptide affinity using methods like atomic force microscopy are also essential. These measurements are underway and will be reported in a future manuscript that evaluates both EDL and PepBD peptides.
The peptides identified in this work can be integrated into many existing technologies and methods developed in recent years for MNP remediation. Examples of these recent developments abound: MNP pollution can be detected using spectroscopy,68 chromatography,69 image processing,70 liquid crystal sensors,71 or surface plasmon resonance;72 MNP pollution can be captured with magnetic biochar,73 biopolymers,74,75 fungal mycelium,76 carbon-based materials, chemical coagulation,77 or lysozyme-based amyloid fibrils;78 and MNP pollution can be degraded both chemically79 and biologically.80,81 We believe PBPs can augment many of these technologies. Plastic-degrading microorganisms can be genetically engineered to express PBPs to facilitate biofilm formation and subsequent plastic degradation. PBPs could supplement amyloid fibrils in capturing MNP pollution. PBPs could help improve the sensitivity of sensors for MNPs. Peptides also possess two properties that are advantageous for MNP remediation. First, peptides are naturally biocompatible, meaning PBPs could help monitor or remove MNP pollution in biological settings. Second, peptides interact with plastic via adsorption, a process driven by surface area, which suggests that PBPs may be particularly helpful in remediating nanoplastics, which have a large specific surface area.
Potential limitations of applying PBPs to remediating MNP pollution merit discussion. Perhaps the most significant practical barrier is the cost of peptide synthesis. Two broad routes exist for synthesizing peptides: chemical82 and biological.83 Chemical synthesis is well established and several companies offer this service, but purchasing large quantities of peptide is impractical due to the high cost. We thus believe that producing PBPs through chemical synthesis would only be suitable for small-scale remediation, method development, and MNP research. Biological synthesis of PBPs can be achieved by engineering microbes to continually produce the PBPs. We think that this could be a promising way to apply PBPs to MNP remediation on a large scale, given the interest and effort in scaling up bioremediation strategies. Another possible issue is the complexity of environmental conditions. MNPs are found in essentially all environmental domains, which span large ranges of properties like salinity, pH, and temperature. Changes in these properties could influence which PBP should be selected for a given plastic in a given environment. The environment can also contain other chemical species that compete with PBPs to adsorb to MNPs. For example, a protein corona84 and biofilm85 may form on MNPs. Predicting the influence of these environmental factors on PBP adsorption to MNPs is challenging and should be explored in the future.
The designed peptides and our computational framework hold significant potential for the development of peptide-based materials and technologies for MP detection, capture, and degradation. The application of such peptides in environmental safety measures could revolutionize strategies for mitigating MP pollution, particularly in aqueous environments where such pollution is damaging and most pervasive. Our open-source approach to data and methodologies will help advance scientific understanding of peptide–plastic interactions and foster a collaborative environment that encourages further research in the application of peptides to MP remediation. This openness is intended to spur innovation across disciplines, leading to more effective solutions to difficult environmental problems and better functional biomaterials.
Footnote
† Contributed equally.
This journal is © The Royal Society of Chemistry 2025