Open Access Article
Guangyao Chen (a,b) and Fengqi You* (a,b,c)
a College of Engineering, Cornell University, Ithaca, NY 14853, USA. E-mail: fengqi.you@cornell.edu
b AI for Science Institute (CUAISci), Cornell University, Ithaca, NY 14853, USA
c Cornell AI for Sustainability Initiative (CAISI), Cornell University, Ithaca, NY 14853, USA
First published on 25th May 2026
Designing peptides for microplastic targeting is intrinsically multi-objective: sequence motifs that promote adsorption to hydrophobic polymers frequently elevate developability risks, including hemolysis, non-specific adsorption, and poor aqueous solubility. In this paper, we show that accurate developability screening can be achieved from sequence alone by focusing on the readout that converts token-level foundation model representations into peptide-level decisions. We introduce gated query pooling (GQP), a lightweight, backbone-agnostic evidence-selection head that learns a small set of query vectors to extract complementary signals from protein language model embeddings and gates them adaptively per peptide. With a consistent evaluation protocol and identical splits for all methods, GQP with sequence-only backbones reaches 91.09%, 86.30%, and 75.56% accuracy on hemolysis, non-fouling, and solubility, respectively, outperforming representative sequence-only and AlphaFold-augmented Multi-Peptide baselines. Beyond predictive accuracy, attention diagnostics and controlled counterfactual substitutions enable residue-level, testable design rules that connect model outputs to actionable sequence edits. Finally, integrating these developability constraints with PepBD-derived affinity scores for polyethylene, polypropylene, and polyethylene terephthalate supports scalable multi-objective prioritization of microplastic-binding candidates and reveals non-fouling as a dominant feasibility bottleneck, with coarse-grained molecular dynamics triage providing complementary physical evidence supporting the plausibility of the PepBD-prioritized selections.
A promising direction is to develop molecular recognition elements that can selectively bind and capture microplastics.10–13 Peptides are attractive in this context because they are programmable, chemically diverse, and amenable to high-throughput synthesis and screening.14 Experimental studies have already demonstrated that engineered peptides can bind common plastics such as polypropylene (PP) and polystyrene (PS), enabling sensitive capture or biosensing of microplastics under relevant conditions.15 However, real-world pollution is inherently multiplastic, and dominant polymers such as polyethylene (PE), PP, and polyethylene terephthalate (PET) differ in surface chemistry and polarity.11 This motivates the adoption of design objectives that are plastic-specific and, ideally, applicable across multiple plastics. Recently, biophysical modeling frameworks such as PepBD have enabled large-scale computation of peptide adsorption to plastics, and protein language model-guided generative approaches have leveraged these scores to design high-affinity peptides for PE/PP/PET. Despite this progress, microplastic-binding peptide engineering is fundamentally multi-objective. Strong adsorption to hydrophobic polymers often favors hydrophobic and aromatic motifs, but the same features can increase nonspecific membrane interactions and compromise safety or formulation feasibility.16 In practice, candidate peptides must be screened not only for binding, but also for developability-related constraints such as low hemolysis, resistance to nonspecific adsorption (non-fouling), and sufficient aqueous solubility. In this work we treat these three properties as archetypal developability endpoints. The associated datasets, however, differ in sequence-length distributions and label construction, particularly for non-fouling, where negatives include insoluble and hemolytic peptides as well as scrambled positives. 
This creates a tension between function and biocompatibility that is difficult to resolve with single-objective optimization. The central challenge, therefore, is to couple plastic-specific affinity objectives with accurate, scalable developability prediction so that large libraries can be filtered down to candidates that are both high-affinity and biocompatible.17
Sequence-based machine learning has recently made this coupling feasible. Protein language models trained on large-scale sequence corpora can encode biophysical regularities directly from primary sequence. ProtTrans18 and ESM2 (ref. 19) are representative foundation encoders that have shown strong transfer to diverse protein prediction tasks and can recover structural and functional signals from sequence alone. Building on this foundation, PeptideBERT demonstrated that transformer-based, sequence-only models can predict key peptide properties, including hemolysis, non-fouling, and solubility, without explicit structural inputs.18 Multi-Peptide subsequently explored augmenting sequence models with predicted structural information through a language-graph framework, showing that structure-aware signals can improve selected settings.20 In parallel, AlphaFold has made accurate structure prediction broadly accessible, further catalyzing interest in structure-guided pipelines.21 Yet structure-augmented workflows, while potentially improving prediction in some settings, introduce additional computational stages and rely on imperfect structure predictions that can propagate uncertainty into downstream models. For high-throughput developability-oriented screening, this raises the practical question of when the added cost and complexity of structural modeling are justified relative to what can be achieved with sequence-only approaches.
In this paper, we rethink peptide developability prediction under a sequence-only paradigm and argue that a key bottleneck lies in the readout: how token-level representations are aggregated into a fixed-length peptide embedding for classification. In transfer learning settings with limited labeled peptides, this aggregation step can dominate performance, particularly when the task depends on localized sequence patterns rather than global composition alone. Many peptide phenotypes are driven by sparse, localized motifs that reflect charge-hydrophobic patterning and amphipathic helical segments, so simple mean or max pooling can dilute or mis-weight the decisive residues, especially for shorter peptides where single substitutions can have large effects.22,23 Motivated by cross-attention mechanisms that use learnable queries to extract evidence from variable-length inputs, we introduce gated query pooling (GQP) as a lightweight, backbone-agnostic readout for peptide property prediction. GQP learns a small set of query vectors that attend over token embeddings to extract complementary evidence. It then applies input-adaptive gating on the query-to-token attention weights (token-wise and query-wise) and pools the gated query summaries into a fixed-length representation. This evidence-selective design aims to maximize what can be extracted from sequence representations without requiring explicit 3D structure generation.
A second goal of this work is to connect model predictions to actionable design guidance. Attention maps provide useful diagnostic signals for how the readout routes evidence, and in our datasets, they recover chemically plausible residue-level tendencies, such as increased emphasis on hydrophobic and aromatic residues in hemolytic peptides, consistent with large-scale analyses of experimentally curated hemolysis data.24 For non-fouling, the diagnostics highlight mixed-charge and highly hydrated chemistries, aligning with experimental anti-biofouling measurements showing that zwitterionic peptide motifs built from EK and DK repeats strongly suppress protein adsorption and cell adhesion.25 For solubility, the same framework emphasizes strongly solubilizing residues, consistent with mutational evidence that Asp, Glu, and Ser contribute particularly favorably to solubility compared with other hydrophilic residues.26 Because attention alone is not guaranteed to be a faithful explanation, we pair these diagnostics with controlled counterfactual substitutions that quantify how single-residue edits shift model outputs, yielding residue-level editing rules and a ranked notion of “intervenability” that is directly usable for sequence refinement. Finally, we integrate these developability predictors with PepBD-derived PE/PP/PET affinity scores to enable large-scale multi-objective prioritization of microplastic-binding peptides and to identify which constraints dominate feasibility at scale. Notably, we find that non-fouling filtering removes the majority of high-affinity candidates, and plastic-specific substitution landscapes indicate that binding optimization is polymer-dependent, with larger edit sensitivities for PP and PET than for PE.
These polymer-dependent patterns are mechanistically consistent with prior physics-based and AI-guided plastic-binding studies, which report that stronger PepBD scores are driven by increased van der Waals interactions and enrichment of bulky side chains (including aromatic residues such as tryptophan), and emphasize that sequence preferences differ across plastics. As a complementary physics-based sanity check, we additionally perform coarse-grained molecular dynamics (MD) triage on the PepBD-prioritized candidates. These simulations provide supporting physical evidence for the plausibility of our multi-objective selection under the tested proxy conditions.
In summary, our main contributions are:
(1) Gated query pooling (GQP) improves sequence-only prediction of hemolysis, non-fouling, and solubility in both full-data and low-data settings.
(2) A systematic benchmark clarifies how backbone selection and adaptation strategy (frozen versus fine-tuned) shape transfer performance across developability tasks.
(3) Attention-based diagnostics summarize residue-level patterns and how evidence is routed through the GQP readout.
(4) Controlled counterfactual substitutions yield residue-level editing rules and intervenability rankings that translate predictions into actionable sequence edits.
(5) A developability-aware screening workflow integrates PepBD-derived PE/PP/PET affinity objectives with developability constraints to prioritize microplastic-binding candidates and highlights non-fouling as the dominant feasibility bottleneck.
(6) A unified coarse-grained molecular dynamics triage provides complementary physics-based evidence to de-risk the PepBD-prioritized candidate panel.
Fig. 1 Sequence-only multi-property modelling enables developability-aware screening of microplastic-relevant peptides. (A) Conceptual workflow for identifying peptide candidates relevant to polypropylene (PP), polyethylene (PE) and polyethylene terephthalate (PET), while prioritizing three developability-related properties: non-hemolysis, non-fouling and aqueous solubility. (B) Simplified overview of gated query pooling (GQP). A protein sequence encoder first produces residue-level token representations (keys and values). Learnable query vectors then summarize the sequence in two explicit steps: query-wise attention assigns each query to residue-level evidence, and attention-weight gating downweights weak or noisy query-token contributions before the gated summaries are pooled into a fixed-length peptide representation for property prediction. (C) Held-out test accuracy (percent) for hemolysis, non-fouling and solubility across representative baselines (PeptideBERT,18 Multi-Peptide20) and sequence-only protein language model backbones equipped with GQP. Bars report mean accuracy (left) and mean AUROC (right). For the full-data ESM2 + GQP reruns, error bars indicate standard deviations across three independent random-seed runs (seeds 42, 43, and 44). N/A indicates results not reported for Multi-Peptide on the solubility task.
To this end, we introduce gated query pooling (GQP), a lightweight, backbone-agnostic readout that converts token embeddings from a sequence encoder into a compact peptide representation using a small set of learnable query vectors. As illustrated in Fig. 1B, each query attends over the token sequence to form a query-specific summary, and an input-adaptive attention-weight gating mechanism modulates how each query routes evidence over tokens before pooling and prediction. This design is intended to separate sequence encoding from task-specific evidence extraction, allowing the model to learn multiple complementary “views” of a peptide and to down-weight uninformative queries for a given input. Importantly, GQP operates purely on sequence representations, making it compatible with widely used foundation protein language models such as ProtT5 (ref. 27) and ESM2.19 For fair comparison, we reimplemented both PeptideBERT18 and Multi-Peptide20 using the official code and trained them under the same benchmark split protocol as our models. Across the three developability tasks, adding GQP on top of sequence-only protein language model backbones achieves comparable or higher held-out accuracy relative to representative baselines (Fig. 1C). The same trend is observed for threshold-free discrimination, with ESM2 + GQP and ProtT5 + GQP also achieving strong AUROC across tasks (Fig. 1C). In particular, ESM2 + GQP reaches 90.37% accuracy for hemolysis and 86.00% for non-fouling, and ProtT5 + GQP reaches 75.54% for solubility, exceeding representative sequence-only and structure-augmented baselines where those results are available. Three independent full-data ESM2 + GQP reruns showed stable performance, supporting the robustness of the main benchmark comparison while avoiding a formal statistical-superiority claim for every endpoint. 
ProtT5 with GQP and ESM2 with GQP achieve consistently strong performance on hemolysis, non-fouling, and solubility, matching or exceeding prior sequence-only and structure-augmented baselines while avoiding the explicit generation of 3D structures. These results support a central premise of this work: for peptide developability screening, sequence-only foundation encoders can be highly effective when paired with an evidence-selective pooling head, and GQP provides a simple, general mechanism to realize that benefit in a plug-and-play manner across backbones.
We also observe that solubility is more dependent on protein-specific pretraining than the other two tasks. In Fig. 2A, the strongest solubility results are achieved by protein-pretrained encoders (ProtT5 and ESM2), whereas NLP backbones remain substantially behind even after fine-tuning. Moreover, fine-tuning yields only modest additional gains for solubility on the strongest protein backbones, suggesting diminishing returns once the encoder already captures relevant sequence-level biophysical features, while fine-tuning remains more beneficial for weaker or domain-mismatched backbones. These results motivate two practical conclusions for developability-oriented peptide screening. First, backbone selection matters, and protein-pretrained encoders should be preferred when available. Second, adaptation strategy is a first-order design choice: freezing the encoder can be competitive in some settings, but fine-tuning generally yields more reliable improvements across tasks, mirroring the fine-tuning-centric approach used in peptide-specific transfer baselines such as PeptideBERT.18
For hemolysis (Fig. 4A), the most negative intervenability values are associated with strongly hydrophobic and aromatic residues, including L, I, W, and F, indicating that substitutions away from these residues tend to lower the hemolysis log odds on average. This agrees with large-scale analyses of experimentally validated hemolytic peptides, which report enrichment of leucine and isoleucine and, to a lesser extent, phenylalanine and tryptophan in hemolytic sequences relative to non-hemolytic controls.36 Accordingly, a practical rule to reduce hemolytic propensity is to target hydrophobic or aromatic hotspots (for example, L/I/F/W) for replacement with more polar or charged residues, consistent with the general link between hydrophobicity-driven membrane insertion and hemolysis.36 In addition, proline substitutions provide a mechanistically grounded option when the goal is to disrupt amphipathic helices, because proline is a potent α-helix breaker; experimental studies on model amphipathic peptides show that introducing or retaining a central proline can reduce membrane activity and hemolysis compared to helix-stabilizing variants.37 Non-fouling (Fig. 4B) effects separate residues whose substitution tends to decrease non-fouling propensity from those whose substitution tends to increase it. Residues with strongly negative mean CSE include K, D, and E, suggesting that these charged residues often support the non-fouling class and should be preserved when anti-adsorption is a priority. 
This is consistent with experimental anti-biofouling tests on zwitterionic peptide motifs, where surfaces presenting repeating units of EK and DK exhibit markedly reduced protein adsorption and cell adhesion compared with other charged pairings.25 Conversely, several hydrophobic residues show positive mean CSE (for example, I, L, F, V, and W), supporting a complementary rule of thumb for improving non-fouling behavior by reducing hydrophobic content or disrupting hydrophobic patches, which is consistent with hydration-based anti-fouling principles.25 Solubility (Fig. 4C) effects are smaller in magnitude than hemolysis and non-fouling, but show a clear dominant driver in our controlled analysis: methionine (M) exhibits a strongly positive intervenability, indicating that substitutions away from M tend to increase the solubility log odds. This direction is consistent with established solubility models that explicitly penalize hydrophobic residues and hydrophobic patches, including sequence-based solubility predictors such as CamSol38,39 and broader reviews of solubility-aware protein design.38 More broadly, experimental mutational analysis of RNase Sa shows that aspartate (D), glutamate (E), and serine (S) contribute particularly favorably to solubility, and are recommended targets for solubility-improving substitutions relative to other hydrophilic residues.26 Together with the strong M effect in Fig. 4C, these findings motivate a practical formulation rule: when solubility is limiting, prioritize substituting away from hydrophobic residues such as M and toward strongly solubilizing residues such as D, E, and S, subject to functional constraints.
Fig. 4 provides an editing playbook for multi-objective peptide design that is consistent with experimentally supported residue-level trends. Reduce hemolysis by mutating away from hydrophobic and aromatic residues (L/I/F/W) and, where appropriate, introducing helix-disrupting substitutions (for example, proline); improve non-fouling by preserving mixed-charge, highly hydrated motifs (notably K/E/D-rich patterns such as EK/DK); and increase solubility by prioritizing substitutions away from hydrophobic residues (highlighted by M) and toward solubilizing residues such as D, E, and S.
Fig. 5 Relationship between microplastic-binding affinity and developability predictions across PP, PE, and PET peptide datasets. (A–C) Two-dimensional density plots showing the joint distribution between microplastic-binding affinity scores (x axis) and predicted developability probabilities (y axis) for hemolysis (A), non-fouling (B), and solubility (C). Each column corresponds to peptides evaluated for binding to polypropylene (PP), polyethylene (PE), or polyethylene terephthalate (PET). Colors indicate log-scaled point density (counts), and n denotes the number of peptide samples in each plastic-specific dataset. PP, PE, and PET affinity data are taken from the PepBD-based datasets.40
We therefore applied a sequential, multi-objective screening pipeline (Fig. 6A) that first enforces developability constraints and then enriches for the high-affinity tail of the PepBD score distribution. We use sequential filtering rather than Pareto ranking42,43 because the developability objectives are treated as hard feasibility constraints: peptides predicted to be hemolytic, insoluble, or fouling-prone are not actionable regardless of affinity. In large libraries, Pareto fronts can remain broad and may retain many high-affinity but infeasible sequences, whereas feasibility-first filtering yields a compact, interpretable feasible set before optimizing affinity within that set. PepBD scores are energy-like and span roughly −64 to +12 for 12-mers (lower is better).44 We therefore set plastic-specific cutoffs in the extreme negative tail, consistent with score ranges reported for top PepBD candidates in prior PepBD-based design studies. Specifically, Step 1 retains peptides that pass all three developability classifiers (non-fouling, solubility, and non-hemolysis) using fixed probability cutoffs. Step 2 then selects high-affinity candidates using plastic-specific PepBD score thresholds (lower is better): PE ≤ −56, PP ≤ −50, PET ≤ −60 (Fig. 6B). Starting from hundreds of thousands to nearly a million candidates per plastic, the non-fouling filter produces the largest initial reduction, followed by additional attrition from solubility and hemolysis constraints,25 yielding a compact feasible set before ranking by affinity (Fig. 6A). As shown by the largest drop in remaining candidates immediately after the non-fouling constraint across PE, PP, and PET, the majority of microplastic-binding candidates do not satisfy the anti-adsorption requirement before considering solubility or hemolysis. 
Importantly, the affinity distributions shift markedly after screening: compared to the “before” distribution, the “after” distribution concentrates near the extreme affinity region for each plastic, and the final selected hits lie beyond the plastic-specific score thresholds (Fig. 6B). This behavior indicates that the pipeline is not simply removing unsafe peptides, but is actively enriching for rare sequences that satisfy developability constraints while retaining strong predicted binding. Using these plastic-specific thresholds (PE ≤ −56, PP ≤ −50, PET ≤ −60), the final screens produce a small, conservative candidate set: five hits for PE, five hits for PP, and one hit for PET. We further assessed the sensitivity of these final hits to perturbations of the PepBD score thresholds in the SI.
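The feasibility-first screen described above can be illustrated with a minimal Python sketch. The PepBD thresholds are those stated in the text; everything else is an assumption made for illustration: the probability cutoff of 0.5 (the text says only "fixed probability cutoffs"), the dictionary field names, and the four toy candidates.

```python
# Plastic-specific PepBD score thresholds from the text (lower is better).
PEPBD_THRESHOLDS = {"PE": -56.0, "PP": -50.0, "PET": -60.0}

# Hypothetical cutoff: the text does not state the numerical value used.
PROB_CUTOFF = 0.5


def screen(candidates, plastic, cutoff=PROB_CUTOFF):
    """Sequential screen: developability filters first, then PepBD threshold.

    Each candidate is a dict with illustrative keys 'seq', 'pepbd',
    'p_nonfouling', 'p_soluble', and 'p_hemolytic'.
    Returns the surviving candidates and the count remaining after each step.
    """
    steps = [
        ("non-fouling", lambda c: c["p_nonfouling"] >= cutoff),
        ("solubility", lambda c: c["p_soluble"] >= cutoff),
        ("non-hemolysis", lambda c: c["p_hemolytic"] < cutoff),
        ("affinity", lambda c: c["pepbd"] <= PEPBD_THRESHOLDS[plastic]),
    ]
    remaining = list(candidates)
    counts = [("start", len(remaining))]
    for name, keep in steps:
        remaining = [c for c in remaining if keep(c)]
        counts.append((name, len(remaining)))
    # Rank surviving hits by affinity, most negative score first.
    remaining.sort(key=lambda c: c["pepbd"])
    return remaining, counts


# Toy library (made-up 12-mers and scores, for illustration only).
library = [
    {"seq": "AAAAAAAAAAAA", "pepbd": -58.0, "p_nonfouling": 0.9, "p_soluble": 0.8, "p_hemolytic": 0.1},
    {"seq": "LLLLLLLLLLLL", "pepbd": -61.0, "p_nonfouling": 0.2, "p_soluble": 0.6, "p_hemolytic": 0.3},
    {"seq": "GGGGGGGGGGGG", "pepbd": -40.0, "p_nonfouling": 0.9, "p_soluble": 0.9, "p_hemolytic": 0.05},
    {"seq": "WWWWWWWWWWWW", "pepbd": -57.0, "p_nonfouling": 0.8, "p_soluble": 0.7, "p_hemolytic": 0.9},
]
hits, counts = screen(library, "PE")
```

Recording the count after each step reproduces the kind of attrition profile summarized in Fig. 6A, which is how a dominant bottleneck such as the non-fouling filter becomes visible.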
Finally, we used residue substitution effect maps (CSE, Δlogit) to interpret and refine plastic binding within the screened set. Fig. 6C provides an interpretable, plastic-specific map of how single-residue edits are expected to shift the microplastic binding prediction within the screened candidate set. Across PP, PE, and PET, the substitution effect landscapes differ in both magnitude and pattern, indicating that the predicted binding objective is polymer-dependent rather than governed by a single, universal residue preference. Notably, the PP and PET maps show larger effect ranges and clearer residue class structure than the PE map, suggesting that, in the local neighborhood of our selected candidates, PP and PET binding predictions are more sensitive to single substitutions. Such sensitivity is mechanistically plausible because bulky hydrophobic and aromatic residues are repeatedly implicated as key contributors to adsorption on hydrophobic polymer surfaces, whereas hydrophilic residues tend to favor solvent exposure; for example, recent work on PepBD-guided microplastic binding design highlights bulky hydrophobic residues such as tryptophan and phenylalanine as strong contributors to plastic interactions and emphasizes polymer-dependent optimization.16,45 These CSE patterns offer an interpretable bridge between high-throughput screening and validation, analogous to recent microplastic-binding peptide studies that pair model-guided design with downstream experimental or mechanistic evaluation.
To further de-risk developability liabilities beyond sequence-level screening, we complemented the multi-objective selection with a physics-based sanity check using coarse-grained molecular dynamics (MD) simulations with the Martini 3 (ref. 46) force field and GPU-accelerated GROMACS.47 We evaluated the final candidate panel (11 screened candidates plus 2 controls) under a unified protocol across three proxy assays: membrane interaction as a hemolysis-risk proxy, multi-copy self-association in bulk water as a solubility proxy, and surface proximity as a non-fouling proxy. Across three independent replicates per peptide under fixed thermodynamic conditions and analysis thresholds, these MD proxies did not indicate strong membrane-active behavior, stable multi-copy aggregation, or adsorption events within the sampled trajectories, and provided consistent relative ordering among candidates. We therefore interpret the MD outcomes as relative physical triage evidence that complements the model-based screening, rather than definitive measurements of hemolysis, solubility, or anti-fouling performance. Full simulation setups, metrics, and summary statistics are reported in the Appendix (Fig. S10). The MD simulations should also be viewed as short-timescale triage rather than exhaustive sampling: the 0.5–1.0 µs production windows may miss slower peptide aggregation, membrane insertion, or surface-adsorption events that require longer simulations, enhanced sampling, or experimental validation.
A key implication is that pooling is a high-leverage design choice for peptide transfer learning. Gated query pooling (GQP) implements a query-based evidence extraction interface that is conceptually related to learnable seed or query pooling mechanisms in attention-based set models, such as Pooling by Multihead Attention. In practice, this readout consistently improves accuracy and is most beneficial when labeled data are scarce, suggesting that a structured, evidence-selective head can compensate for limited supervision by learning where to attend within pretrained token representations rather than relying on coarse global statistics. This design also supports interpretability in a way that separates diagnostic signals from actionable, testable edits. Attention-derived summaries are useful for generating residue-level hypotheses, while controlled counterfactual substitution analyses directly quantify how model outputs change under single-residue edits while controlling for global composition. These developability models enable a multi-objective screening loop when paired with microplastic binding objectives. Using PepBD-derived PE, PP, and PET affinity scores, we find that high predicted affinity is abundant but frequently co-occurs with unfavorable developability predictions, motivating explicit constraint-based filtering rather than affinity-only selection. The sequential screen further indicates that non-fouling is the dominant feasibility bottleneck, consistent with the stringent hydration requirements needed to suppress nonspecific adsorption.
Several limitations define clear next steps, and they reflect different kinds of validity. First, the validity of benchmark developability prediction depends on how well benchmark labels and distributions match downstream use. We frame developability as binary classification, but real decisions often need calibrated probabilities or continuous readouts such as hemolysis intensity. Dataset construction can also couple properties. This can amplify correlated signals and reduce generalization. Future work should improve calibration and reduce confounding in dataset design. Second, the validity of our interpretability analyses depends on faithfulness and robustness. Attention patterns and controlled substitution effects offer plausible residue-level hypotheses. However, attention may not track causal evidence. Substitution effects can also change with the chosen controls or stratification. Sensitivity analyses and complementary faithfulness tests would strengthen these conclusions. Third, the validity of PepBD-derived binding hypotheses remains experimental. Our microplastic-binding candidates are computational hypotheses rather than validated leads. We provide mutation proposals to guide validation. However, adsorption strength and selectivity still need to be tested, especially on aged or biofilm-coated plastics. Biocompatibility also requires standardized assays and mechanistic follow-up. Despite these limitations, the broader message is that sequence-only foundation models can enable practical multi-objective peptide screening when the readout extracts task-relevant evidence.
More specifically, the microplastic screening component should be interpreted within the PE/PP/PET scope of the PepBD-derived datasets used here. We did not experimentally test binding on plastic substrates in vitro, and the prioritized peptides should therefore be viewed as computationally ranked candidates rather than experimentally validated plastic-binding leads. A direct follow-up validation campaign should synthesize the top candidates and controls, quantify adsorption to PE, PP, and PET films or particles using fluorescence-labelled peptide retention, QCM-D, or compatible surface-retention assays, and test wash-off stability and selectivity against non-plastic surfaces or serum/protein backgrounds. In parallel, solubility, hemolysis, and cytotoxicity assays should be used to verify developability before iterative model refinement. With respect to generalization, the available benchmarks support length-stratified evaluation but do not provide harmonized species-origin annotations or modification-type metadata. We therefore interpret the present GQP developability models as primarily applicable to linear peptide sequences composed of canonical amino acids and lying within the length and composition regimes represented in the benchmark training distributions. Predictions for species-shifted peptide families, D-amino-acid peptides, non-canonical residues, cyclized peptides, terminally modified peptides, or other chemically modified sequences should be treated as outside the validated scope unless supported by additional metadata-rich training data and task-specific validation.
[…] 453 sequences with 47.6% positives and 52.4% negatives. The non-fouling dataset was constructed from prior antifouling work and contains 3600 positives and 13 585 negatives. Negatives include insoluble and hemolytic peptides as well as scrambled positives. The three tasks have markedly different sequence-length distributions, which we report in Fig. S1. Hemolysis is concentrated in short peptides with a median length of 17 aa. Non-fouling is strongly skewed toward short peptides with a median of 8 aa and a long tail to 198 aa. Solubility is dominated by longer sequences with a median of 143 aa and a maximum of 198 aa. We did not truncate sequences. All models used a maximum input length of 512 and no dataset sequence exceeded this limit. Sequences were padded as needed. For hemolysis and non-fouling, we used the sequence-disjoint 80/20 split protocol from Multi-Peptide to enable direct comparisons across methods. For solubility, we created a new sequence-disjoint 80/20 split because Multi-Peptide provides fixed splits only for hemolysis and non-fouling. Split construction details are provided in the SI Methods.
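As an illustration of what a sequence-disjoint 80/20 split involves, the following Python sketch assigns each unique sequence to exactly one partition. It is not the split procedure used in this work (those details are in the SI Methods); the function name, the seed, and the toy records are assumptions.

```python
import random


def sequence_disjoint_split(records, test_fraction=0.2, seed=0):
    """Split (sequence, label) records so no sequence appears in both sets.

    The seed and the rounding rule are illustrative choices.
    """
    unique = sorted({seq for seq, _ in records})
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = round(len(unique) * test_fraction)
    test_seqs = set(unique[:n_test])
    train = [r for r in records if r[0] not in test_seqs]
    test = [r for r in records if r[0] in test_seqs]
    return train, test


# Toy data: ten unique sequences plus one duplicated entry.
records = [("PEP" + "A" * i, i % 2) for i in range(1, 11)]
records.append(("PEPA", 1))
train, test = sequence_disjoint_split(records)
```

Splitting on unique sequences rather than on rows keeps duplicated entries together, so a repeated peptide cannot leak from training into the held-out set.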
[…] 509 sequences for PE, 433 488 for PP, and 441 978 for PET, with peptides represented as fixed-length 12-mers (excluding cysteine and proline in that resource).
Let $X \in \mathbb{R}^{L \times d}$ denote the token embeddings for a peptide of length L, and let $P \in \mathbb{R}^{m \times d}$ denote m learnable query vectors. Each query is trained to extract a complementary “view” of the peptide by attending over all tokens. Unless otherwise noted, experiments use single-head query pooling with m = 4 learnable queries and full-softmax attention over all valid (non-padding) tokens. We fix the attention temperature to τ = 0.5, disable top-k sparsification, and do not use multi-head attention.

[Equation (1)]

Here $A \in \mathbb{R}^{m \times L}$ is the attention weight matrix and the query summary matrix (of size m × d) contains one d-dimensional summary vector per query. In our implementation, the effective sharpness of attention is controlled by the temperature τ. In practice, padding positions (if any) are masked before the softmax so that attention is computed only over valid tokens.

GQP computes a token-wise gate $g_t$ from token embeddings and a query-wise gate $g_q$ from the learnable queries:

$$g_t = \phi_t(X), \qquad g_q = \phi_q(P) \tag{2}$$

[Equation (3)]

The gated query summaries are then given by $\tilde{A}X$.

[Equations (4) and (5)]
The resulting representation z is passed to a task-specific prediction head.
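To make the readout concrete, the following pure-Python sketch implements one plausible instance of gated query pooling for a single unpadded peptide. Several details are assumptions rather than the authors' implementation: attention scores are plain dot products divided by τ (no learned projections or 1/√d scaling), both gates are sigmoids of hypothetical linear maps `w_t` and `w_q`, the token gate multiplies the attention weights before row renormalization, and the query gate serves as the pooling weight over query summaries.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_query_pooling(X, P, w_t, w_q, tau=0.5):
    """Pool token embeddings X (L x d) with queries P (m x d) into one d-vector.

    w_t and w_q are hypothetical linear gate parameters (length d each); the
    text states only that the gates are functions phi_t(X) and phi_q(P).
    """
    d = len(X[0])
    # Query-to-token attention: softmax over tokens with temperature tau.
    A = [softmax([dot(p, x) / tau for x in X]) for p in P]
    # Token-wise and query-wise gates, each in (0, 1).
    g_t = [sigmoid(dot(w_t, x)) for x in X]
    g_q = [sigmoid(dot(w_q, p)) for p in P]
    # Gate the attention weights token-wise, then renormalize each row.
    A_tilde = []
    for row in A:
        gated = [a * g for a, g in zip(row, g_t)]
        norm = sum(gated)
        A_tilde.append([v / norm for v in gated])
    # Gated query summaries: A_tilde @ X, one d-vector per query.
    S = [[sum(A_tilde[q][l] * X[l][k] for l in range(len(X))) for k in range(d)]
         for q in range(len(P))]
    # Assumed pooling: query-gate-weighted average of the summaries.
    total_gate = sum(g_q)
    z = [sum(g_q[q] * S[q][k] for q in range(len(P))) / total_gate
         for k in range(d)]
    return z, A_tilde


# Toy example with random embeddings (L = 6 tokens, d = 8, m = 4 queries).
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]
P = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
w_t = [random.gauss(0, 1) for _ in range(8)]
w_q = [random.gauss(0, 1) for _ in range(8)]
z, A_tilde = gated_query_pooling(X, P, w_t, w_q)
```

The returned `A_tilde` has one row per query that sums to one over tokens, which is the property the attention diagnostics rely on. A batched implementation would additionally mask padded positions before the softmax.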
We denote the gated attention matrix by $\tilde{A} \in \mathbb{R}^{m \times L}$, where $\tilde{A}_{p,l}$ is the gated-and-renormalized attention weight assigned by query p to token position l. Padding positions (if any) are masked using the same convention as in GQP (i.e., logits for padded positions are set to a large negative value before the softmax), so attention is normalized over valid (non-padding) tokens and $\sum_{l} \tilde{A}_{p,l} = 1$ for each query p. We visualize $\tilde{A}$ as the per-query attention map in Fig. 3A.
Let $\tilde{A} \in \mathbb{R}^{m \times L}$ denote the gated attention matrix produced by GQP (after masking and renormalization over valid tokens). We define the token-level mass as $M_l = \sum_{p=1}^{m} \tilde{A}_{p,l}$, which increases when multiple gated queries place probability mass on the same token position. If padding is present, we set $M_l = 0$ for masked positions.

Let $a_{n,l}$ be the amino acid identity at position l in peptide n. We compute

[Equation (6)]

where $\mathcal{I}_n(aa)$ denotes the set of positions l in peptide n with $a_{n,l} = aa$. The reported class contrast is

$$\Delta M(aa) = M_1(aa) - M_0(aa) \tag{7}$$
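The token-level mass and the class contrast can be sketched in Python as follows. Summing over queries follows the description of the token-level mass; the aggregation inside `class_mean_mass` (average within each peptide over positions holding the residue, then average across peptides of one class) is an assumption, since the exact form of eq. (6) is not reproduced here. All function names and toy inputs are illustrative.

```python
from collections import defaultdict


def token_mass(A_tilde, valid):
    """Sum gated attention over queries per token; masked positions get 0."""
    return [sum(row[l] for row in A_tilde) if valid[l] else 0.0
            for l in range(len(valid))]


def class_mean_mass(records, label):
    """Mean token mass per amino acid over peptides with the given label.

    records: list of (sequence, mass_per_position, label).
    Assumed aggregation: per-peptide mean over matching positions, then
    mean of those values across peptides.
    """
    per_aa = defaultdict(list)
    for seq, mass, y in records:
        if y != label:
            continue
        by_res = defaultdict(list)
        for aa, m_l in zip(seq, mass):
            by_res[aa].append(m_l)
        for aa, vals in by_res.items():
            per_aa[aa].append(sum(vals) / len(vals))
    return {aa: sum(v) / len(v) for aa, v in per_aa.items()}


def class_contrast(records):
    """Delta M(aa) = M1(aa) - M0(aa) for residues seen in both classes."""
    m1 = class_mean_mass(records, 1)
    m0 = class_mean_mass(records, 0)
    return {aa: m1[aa] - m0[aa] for aa in m1 if aa in m0}


# Toy example: two queries over three tokens, then two labeled peptides.
mass = token_mass([[0.5, 0.5, 0.0], [0.25, 0.25, 0.5]], [True, True, True])
toy_records = [("LKL", [1.0, 0.5, 1.0], 1), ("LKK", [0.5, 0.5, 0.5], 0)]
contrast = class_contrast(toy_records)
```

A positive contrast for a residue means the readout places more gated attention on it in positive-class peptides than in negative-class ones, which is a diagnostic signal rather than a causal claim.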
$$\Delta(x;\, i,\, a \to a') = \frac{f(\tilde{x}) - f(x)}{T} \tag{8}$$

where $f(\cdot)$ is the model logit, $x$ is the original sequence with residue a at position i, and $\tilde{x}$ denotes the mutated sequence. We report logit differences rather than probability differences because logits are additive and less sensitive to saturation near extreme probabilities. The temperature T is used only to scale logit differences; in practice we divide by max(T, 10⁻³) for numerical stability. Unless otherwise specified, we set T = 1.0 for all reported CSE results. At each position we evaluate all 19 non-identity substitutions (excluding a → a).
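The single-residue scan can be sketched in Python as below. The callable `predict_proba` stands in for a trained developability classifier, and `toy_model` is a made-up scoring function used only so that the example runs; neither is part of the original work.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def logit(p):
    return math.log(p / (1.0 - p))


def substitution_effects(seq, predict_proba, T=1.0):
    """Logit shifts for all 19 non-identity substitutions at each position.

    predict_proba maps a sequence to P(class = 1).
    Returns {(position, from_residue, to_residue): delta_logit}.
    """
    scale = max(T, 1e-3)  # guard against division by a tiny temperature
    base = logit(predict_proba(seq))
    effects = {}
    for i, a in enumerate(seq):
        for b in AMINO_ACIDS:
            if b == a:
                continue  # skip the identity substitution a -> a
            mutant = seq[:i] + b + seq[i + 1:]
            effects[(i, a, b)] = (logit(predict_proba(mutant)) - base) / scale
    return effects


def toy_model(seq):
    """Toy stand-in: probability rises with the fraction of L/I/F/W residues."""
    frac = sum(seq.count(r) for r in "LIFW") / len(seq)
    return 0.1 + 0.8 * frac


effects = substitution_effects("LKLK", toy_model)
```

With this toy scorer, replacing a leucine with lysine lowers the logit and replacing a lysine with tryptophan raises it, mirroring the direction of the hemolysis rules discussed for Fig. 4A. Real use would swap in the trained model's positive-class probability.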
[…] #(H) − #(D, E), with all other residues contributing 0. Hydrophobic fraction is defined as the fraction of residues in the hydrophobic set […]. Strata are defined by discretizing these covariates using fixed bin widths. Specifically, we use a charge bin width of 1.0, a hydrophobic-fraction bin width of 0.05, and a length bin width of 25 amino acids. Each sequence is assigned to a stratum based on the resulting (Q, H, L) bin indices. Within each stratum c, we compute the mean sequence-level substitution effect $\Delta_c(a \to a')$. We then form a standardized controlled effect by averaging stratum-specific estimates using empirical stratum weights. Importantly, weights are computed separately for each from-residue a, using only the subset of sequences that contain a. Denoting the corresponding stratum weights by $w_{c,a}$, the controlled substitution effect is

$$\mathrm{CSE}_{\mathrm{ctrl}}(a \to a') = \sum_{c} w_{c,a}\, \Delta_c(a \to a') \tag{9}$$
To ensure stable estimates, for each from-residue a we exclude strata with fewer than five sequences containing a and renormalize the remaining weights to sum to one. This procedure is a discrete form of standardization, or the g-formula, for estimating average effects under measured confounding.
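The intervenability of a from-residue a is the mean controlled effect over target residues:

$$I(a) = \frac{1}{19} \sum_{a' \neq a} \mathrm{CSE}_{\mathrm{ctrl}}(a \to a') \tag{10}$$

We exclude the identity substitution a → a and average over the remaining 19 substitutions. In Fig. 4, we visualize the full substitution matrices CSEctrl(a → a′) (left) and the intervenability ranking (right) for hemolysis, non-fouling, and solubility.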
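The stratified standardization and the intervenability score can be sketched in Python as follows. The bin widths and the five-sequence minimum come from the text. The covariates are passed in precomputed because the charge and hydrophobic-set definitions are not restated here, and the function names, row layout, and toy values are assumptions. For brevity the sketch averages over whichever target residues are present rather than all 19.

```python
import math
from collections import defaultdict


def stratum_key(charge, hydro_frac, length):
    """Bin indices using widths 1.0 (charge), 0.05 (hydrophobic fraction),
    and 25 residues (length). A small epsilon absorbs float rounding."""
    eps = 1e-9
    return (math.floor(charge / 1.0 + eps),
            math.floor(hydro_frac / 0.05 + eps),
            math.floor(length / 25 + eps))


def controlled_effects(rows, min_count=5):
    """Standardize sequence-level effects over strata for each (a, a') pair.

    rows: (stratum, from_residue, to_residue, delta), one row per sequence
    containing from_residue, so per-stratum row counts give the weights.
    """
    by_pair = defaultdict(lambda: defaultdict(list))
    for stratum, a, b, delta in rows:
        by_pair[(a, b)][stratum].append(delta)
    result = {}
    for pair, strata in by_pair.items():
        # Drop sparse strata, then renormalize weights over what remains.
        kept = {c: v for c, v in strata.items() if len(v) >= min_count}
        total = sum(len(v) for v in kept.values())
        if total == 0:
            continue
        result[pair] = sum((len(v) / total) * (sum(v) / len(v))
                           for v in kept.values())
    return result


def intervenability(cse):
    """Mean controlled effect over available target residues for each a."""
    by_from = defaultdict(list)
    for (a, b), value in cse.items():
        if a != b:
            by_from[a].append(value)
    return {a: sum(v) / len(v) for a, v in by_from.items()}


# Toy rows: two well-populated strata and one sparse stratum (dropped).
rows = ([((0, 3, 0), "L", "K", -1.0)] * 5 + [((1, 3, 0), "L", "K", -2.0)] * 5
        + [((2, 3, 0), "L", "K", 5.0)] * 2 + [((0, 3, 0), "L", "D", -3.0)] * 5)
cse = controlled_effects(rows)
scores = intervenability(cse)
key = stratum_key(2.0, 0.32, 12)
```

In the toy data the sparse stratum with only two sequences is excluded, so its large positive effect does not distort the L → K estimate.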
Supplementary information (SI): methodological details, diagnostic analyses, final peptide candidates, and MD-based triage results. See DOI: https://doi.org/10.1039/d6sc01486k.
This journal is © The Royal Society of Chemistry 2026