Open Access Article
Jokent Gaza (a,b), Monica J. Roth (c), Gaetano T. Montelione (d,e) and Alberto Perez* (a,b)
(a) Department of Chemistry, University of Florida, Gainesville, Florida 32611, USA. E-mail: perez@chem.ufl.edu
(b) Quantum Theory Project, University of Florida, Gainesville, Florida 32611, USA
(c) Department of Pharmacology, Rutgers-Robert Wood Johnson Medical School, 675 Hoes Lane Rm 636, Piscataway, NJ 08854, USA
(d) Center for Biotechnology and Interdisciplinary Sciences, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
(e) Department of Chemistry and Chemical Biology, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
First published on 10th November 2025
AI-based protein design can rapidly generate thousands of candidate binders, but most fail to fold or bind productively, creating a critical need for robust prioritization. We present a generalizable hybrid pipeline that integrates deep-learning design and physics-based simulations to filter large libraries down to a handful of high-confidence candidates.
Peptide-based inhibitors often display high specificity for their targets, but their clinical translation is limited by poor stability, proteolytic susceptibility, and structural disorder in solution.11 Miniproteins have emerged as promising alternatives: small, well-folded scaffolds that retain binding affinity while improving proteolytic stability and structural robustness.12 Recent advances in AI-based tools have democratized miniprotein design, enabling the rapid generation of candidates that embed known peptide-binding motifs into stable protein frameworks. However, the vast majority of these de novo sequences are unlikely to fold correctly or bind with high affinity, making prioritization a critical challenge. In our previous work,13 we used AlphaFold-based competitive binding assays (AF-CBA14) to identify peptide binders and binding sites from pulldown libraries. Building on these insights, we now explore whether miniproteins incorporating these peptide interaction motifs can be designed to bind with higher selectivity and remain robust to degradation. To do so, we developed a hybrid design and filtering pipeline (Fig. 1) capable of selecting high-quality binders from large AI-generated sequence libraries. Although the ET domains of BRD2, BRD3, and BRD4 all interact with murine leukemia virus integrase, we focused on BRD3-ET due to our extensive biochemical, biophysical, and structural characterization of this system.10,13,15,16
Starting from five known ET-binding peptide motifs (Fig. S1), we used RFDiffusion17 to design 3000 miniproteins per peptide with a target length of 70–120 residues. The number of initial designs was based on the reported in silico success rate of the binder design pipeline using predicted aligned error (pAE) as the confidence metric.18 We reasoned that with a pool large enough to statistically contain a reasonable number of successful designs, a prioritization pipeline should be able to recover the most promising ones. Conditional protein design preserved the peptide's hairpin interaction while scaffolding it with additional secondary structure to promote folding. ProteinMPNN19 was then used to assign amino acid sequences compatible with these backbones, resulting in a library of 15 000 de novo sequences. As many of these candidates may be misfolded or non-binders, AlphaFold2 initial guess18,20 offers a fast AI validation strategy that is orthogonal to RoseTTAFold. As a first-pass filter, we applied the pAE to estimate structural confidence and binding mode quality. Designs with low pAE scores (pAE < 10) were retained, yielding 823 candidates. However, this set remains too large for most experimental efforts. While high-throughput groups may test hundreds of designs, most collaborative settings require prioritization of a small number of candidates with the highest likelihood of success.
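The first-pass filter is a simple threshold on a per-design score. The following is a minimal sketch of that step, not the authors' script: the `Design` record, its field names, and the toy values are hypothetical, and only the cutoff (pAE < 10) and the counts 823 and 15 000 come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Design:
    name: str   # hypothetical identifier for one designed sequence
    pae: float  # pAE value reported for the predicted complex


def filter_by_pae(designs, cutoff=10.0):
    """Keep designs whose pAE is strictly below the cutoff (pAE < 10 in the text)."""
    return [d for d in designs if d.pae < cutoff]


def pass_rate(n_kept, n_total):
    """Fraction of the library that survives a filter."""
    return n_kept / n_total


# Toy library with made-up scores, for illustration only.
toy = [Design("d1", 4.2), Design("d2", 10.0), Design("d3", 17.5), Design("d4", 9.9)]
kept = filter_by_pae(toy)
print([d.name for d in kept])

# Pass rate implied by the numbers quoted in the text (823 of 15 000).
print(f"{pass_rate(823, 15000):.1%}")
```

With the reported counts, roughly one design in eighteen passes this stage, which is why further prioritization is needed before experimental testing.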
To further prioritize candidates with high predicted binding affinity, we applied our previously developed AF-CBA,14 which enables side-by-side structural prediction of multiple binders competing for a shared binding site. In this framework, sequences that consistently occupy the binding site more frequently than others are inferred to have higher relative binding affinity. We implemented this in a tiered manner to reduce computational cost:13 the first O(N) stage filtered designs against five random miniproteins in the set, followed by an O(N²) competition to rank the promising candidates. This filter reduced the pool from 823 to just 20 candidates.
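The cost argument behind the tiered scheme can be made concrete with a short sketch. This is an illustration under stated assumptions rather than the AF-CBA implementation: `compete(a, b)` is a hypothetical callback standing in for one competitive structure prediction that returns the winner, and the survival rule (`min_wins`) is an assumed threshold that the text does not specify.

```python
import random
from itertools import combinations


def stage_one(designs, compete, k=5, min_wins=4, seed=0):
    """O(N) screen: each design faces k random rivals from the same set.

    min_wins is an assumed survival threshold, not a value from the paper.
    """
    rng = random.Random(seed)
    survivors = []
    for d in designs:
        rivals = rng.sample([x for x in designs if x != d], k)
        wins = sum(compete(d, r) == d for r in rivals)
        if wins >= min_wins:
            survivors.append(d)
    return survivors


def stage_two(designs, compete):
    """O(N^2) round robin: rank designs by number of pairwise wins."""
    wins = {d: 0 for d in designs}
    for a, b in combinations(designs, 2):
        wins[compete(a, b)] += 1
    return sorted(designs, key=lambda d: wins[d], reverse=True)


def n_predictions(n, k=5):
    """Competitions needed for the O(N) screen versus a full round robin."""
    return n * k, n * (n - 1) // 2


# Toy example: a hidden score decides each competition (purely illustrative).
score = {f"m{i}": i for i in range(20)}
toy_compete = lambda a, b: a if score[a] > score[b] else b

survivors = stage_one(list(score), toy_compete)
ranking = stage_two(survivors, toy_compete)
print(ranking[:3])
print(n_predictions(823))
```

For the 823 designs entering this stage, the screen needs 4115 competitions whereas a full round robin would need 338 253 pairs, so the quadratic stage is only affordable on the reduced set.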
However, structural inspection of these top-ranked designs revealed a systematic flaw. Many miniproteins exhibited elongated helical elements that disrupted globularity, resulting in high radius of gyration (RoG) values and poor packing around the binding domain (Fig. S2a). Retrospective analysis indicated that this was likely due to biases in the default RFDiffusion model weights we used, which favored extended helices rather than compact folds. Despite this limitation, the designs preserved key features of the interface residues in the complex. All designs retained the canonical hairpin interface and a motif of alternating hydrophobic/charged residues in the binding epitope,21–23 and several formed a conserved hydrogen bond between ET-domain residue Asp612 and a nearby basic residue on the binder (Fig. S2b). Aligning and analyzing these sequences led us to define two new structural motifs. Motif I combines features from CHD4:ET and NSD3:ET complexes, and Motif II uses elements from CHD4:ET and TP:ET complexes (Fig. S2d).
For the second round of miniprotein design, we surmised that using the Complex_beta weights in RFDiffusion would improve the likelihood that the new miniproteins would fold into globular structures. Using these weights, we generated 3000 backbones for each of the two motifs (Motif I and Motif II) and proposed sequences via ProteinMPNN to produce a second design library of 6000 candidates. To determine whether improvements in design quality stemmed from the new motifs or from the updated generative weights, we also created a control set of 6000 sequences using the same motifs but the default RFDiffusion weights.
All designs were filtered using AlphaFold2 initial guess, retaining only those with high-confidence structures (pAE < 10). We then applied AF-CBA to prioritize binders predicted to outcompete the viral TP peptide, and implemented two additional structure-based criteria: (1) an RoG < 14 Å to exclude non-globular scaffolds, and (2) an RMSD < 2 Å between bound and unbound conformations to identify candidates with minimal structural rearrangement upon binding. The latter serves as a proxy for minimizing the conformational free energy cost of binding, which can improve affinity and functional robustness.
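The two structure-based criteria can be sketched as follows. This is a simplified illustration: the radius of gyration is computed without mass weighting over whatever coordinates are supplied (for example Cα atoms, which is an assumption here), and the bound/unbound RMSD is taken as a precomputed input rather than derived by superposition. Only the cutoffs (RoG < 14 Å, RMSD < 2 Å) come from the text.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration (same length units as the input)."""
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    sq = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords)
    return math.sqrt(sq / n)


def passes_structure_filters(coords, bound_unbound_rmsd, rog_max=14.0, rmsd_max=2.0):
    """Apply both cutoffs from the text: RoG < 14 A and RMSD < 2 A."""
    return radius_of_gyration(coords) < rog_max and bound_unbound_rmsd < rmsd_max


# Toy geometries (not real structures): a compact cube and a long rod.
compact = [(x, y, z) for x in (-5, 5) for y in (-5, 5) for z in (-5, 5)]
extended = [(1.5 * i, 0.0, 0.0) for i in range(41)]

print(round(radius_of_gyration(compact), 2), round(radius_of_gyration(extended), 2))
print(passes_structure_filters(compact, 1.2), passes_structure_filters(extended, 1.2))
```

The rod-like toy model mimics the elongated helices seen in the first design round: it fails the RoG cutoff even when its bound and unbound conformations agree closely.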
Among the 6000 designs generated with Complex_beta weights, 31 (9 from Motif I and 22 from Motif II) passed all filters. In contrast, although some designs from the control set showed favorable AF2 predictions, none satisfied both structural criteria. This result underscores the importance of selecting the right model weights. With this manageable set of 31 compact, stable candidates in hand, we proceeded to physics-based validation using MELD simulations as an orthogonal test of folding and binding fidelity (Fig. 2).
MELD is an enhanced sampling approach that incorporates ambiguous and noisy information, such as generic heuristics about protein folding (e.g., hydrophobic residues tend to form cores), into molecular dynamics and uses Bayesian inference to identify structures that are consistent with some subset of the data and with the physics model.24,25 During CASP evaluations, MELD has shown success in modeling designed proteins,26 motivating its application to study miniproteins.
For each of the 31 candidate designs, we carried out folding simulations using only sequence and secondary structure predictions as input. The resulting ensembles were analyzed via clustering to identify dominant metastable states. The higher the population of the top cluster and the more independent replicas (walkers) that sample it, the higher our confidence that the folded state is stable and accessible. Of the 31 designs, 21 exhibited top clusters with populations exceeding 50%, indicating a well-defined folding basin (Fig. S3 and Table S1). In 20 of these 21 cases, the representative structure of the dominant cluster was in excellent agreement with the AlphaFold-predicted model (RMSD < 5 Å), suggesting strong convergence between physics-based and AI-based predictions. Even among lower-population designs, several retained good structural agreement, indicating that MELD can recover the native fold even when sampling a more heterogeneous distribution.
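The folding criteria reduce to two numbers per design: the population of the most populated cluster and the RMSD of its representative to the AlphaFold model. The sketch below assumes cluster assignments are already available as one integer label per frame (the clustering itself is not reproduced), and the function names are hypothetical. The thresholds (population above 50%, RMSD < 5 Å) are those quoted in the text.

```python
from collections import Counter


def top_cluster(labels):
    """Return (label, population fraction) of the most populated cluster."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def folds_confidently(labels, rmsd_to_af_model, min_population=0.5, max_rmsd=5.0):
    """Top cluster must exceed 50% of frames and agree with the AF model."""
    _, population = top_cluster(labels)
    return population > min_population and rmsd_to_af_model < max_rmsd


# Toy cluster labels for ten frames (illustrative only).
frames = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
print(top_cluster(frames))
print(folds_confidently(frames, rmsd_to_af_model=3.1))
```

A fuller version would also count how many independent walkers visit the top cluster, since the text uses that as a second indicator of confidence.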
After validating that our miniprotein designs could fold reliably, we next assessed whether they could bind the BRD3 ET domain in the expected manner. We used previous chemical shift perturbation data to define the possible binding sites in ET.16 Given that both the ET domain and the designed miniproteins were predicted to be stably folded, these simulations focused exclusively on binding, without modeling folding upon association. MELD binding simulations apply restraints to preserve native-like flexibility while preventing global unfolding during enhanced sampling at elevated temperatures. These ambiguous restraints allow the proteins to explore multiple binding modes, enabling an ensemble-level view of binding specificity.
We analyzed the resulting complexes through clustering to identify dominant binding modes (Fig. S3 and Table S1). For 20 of the 31 designs, the top population cluster exceeded 50%, indicating a strong preference for a specific binding mode. Of these, 19 bound in a geometry consistent with the AlphaFold- or RoseTTAFold All-Atom27 (RFAA)-predicted models. Selecting only those designs that showed agreement across MELD, AlphaFold2, and RFAA models yielded 12 high-confidence candidates: 4 from Motif I and 8 from Motif II (Fig. S4 and Table S1). Interestingly, the hairpin regions of these designs preserved distinct features of their source motifs beyond the known pattern of alternating hydrophobic/charged residues, which creates a zipper-like interaction between the peptide and the receptor. Motif I designs based on NSD3 exhibited a flipped β-sheet orientation relative to Motif II designs derived from TP.
As a final filter, we applied the MELD Competitive Binding Assay (MELD-CBA) to evaluate whether each of the 12 surviving miniprotein designs could outcompete the TP peptide, a known high-affinity binder, for the ET domain binding site. In these simulations, both the miniprotein and TP were introduced simultaneously, and we monitored their occupancy of the ET domain across replica indices. While high-temperature replicas emphasize entropic flexibility, low-temperature replicas reflect enthalpic stabilization and shape complementarity.
Among the 12 designs, none of the four Motif I candidates was able to consistently outcompete TP. In contrast, five of the eight Motif II designs exhibited dominant binding at the lowest-temperature replicas, indicating stronger enthalpic interactions (Fig. S5 and Fig. 3). Interestingly, the replica-dependent binding profiles revealed diverse thermodynamic behaviors. For one design (Miniprotein 2183), the miniprotein outcompeted TP consistently across all replicas, suggesting both favorable entropy and enthalpy. Four miniproteins (Miniproteins 879, 522, 50, and 1147), on the other hand, dominated only at low temperatures, implying a higher entropic cost compensated by stronger binding interactions.
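The replica-dependent behavior described above can be summarized with a simple classification. This is a hedged sketch, not the MELD-CBA analysis code: it assumes each design is described by the fraction of frames in which the miniprotein (rather than TP) occupies the ET binding site at each replica index, ordered from the lowest-temperature replica upward. The 0.5 threshold and the number of replicas treated as "low temperature" (`n_low`) are assumptions.

```python
def classify_competition(mini_fraction, n_low=3, threshold=0.5):
    """Label a design from its per-replica site occupancy versus TP.

    mini_fraction[i] is the fraction of frames at replica i in which the
    miniprotein occupies the site; index 0 is the lowest temperature.
    """
    if all(f > threshold for f in mini_fraction):
        return "dominant across replicas"
    if all(f > threshold for f in mini_fraction[:n_low]):
        return "dominant at low temperature only"
    return "does not outcompete TP"


# Made-up occupancy profiles illustrating the three behaviors.
profiles = {
    "all_replicas": [0.9, 0.85, 0.8, 0.7, 0.65, 0.6],
    "low_T_only": [0.9, 0.8, 0.7, 0.4, 0.3, 0.2],
    "loses_to_TP": [0.3, 0.4, 0.45, 0.5, 0.4, 0.3],
}
for name, profile in profiles.items():
    print(name, "->", classify_competition(profile))
```

In the terms used in the text, the first label corresponds to the behavior reported for Miniprotein 2183 and the second to the four designs that dominate only at low temperature.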
These results highlight the nuanced balance between conformational flexibility and binding strength. The disordered TP peptide can rapidly sample orientations and form initial contacts but ultimately incurs a higher penalty as it folds upon binding. In contrast, pre-folded miniproteins may be slower to sample binding-compatible conformations at high temperatures but exhibit stronger, more specific interactions at lower temperatures.
In terms of protein–protein interactions, a known hotspot in the BRD3 ET domain is a hydrophobic pocket (Fig. 4) near Val596, where high-affinity binders such as the viral TP peptide and host protein NSD3 typically contribute a tryptophan or phenylalanine residue.10 Notably, 12 of our top 31 MELD-validated designs (5 from Motif I and 7 from Motif II) incorporated a tryptophan at this position, despite no explicit biasing during design. This convergence reinforces the biological relevance of the selected binders and highlights the pipeline's ability to recover key molecular recognition features. Furthermore, a tryptophan that is surface-exposed in the unbound state and becomes buried upon binding provides a convenient feature for future fluorescence-based binding assays. To assess experimental feasibility, we evaluated common N-terminal tags using MELD binding simulations (Fig. S6–S9 and Tables S2, S3). All constructs retained the expected binding mode, with AviTag variants showing the lowest perturbation. In silico assessment also showed high expected stability at 65 °C, low aggregation propensity and high solubility (Tables S4 and S5).
Fig. 4 Interactions at a key hydrophobic pocket for TP and the top three designs. The ET domain is shown as a surface, with hydrophobic regions colored orange and hydrophilic regions colored cyan.
Our results demonstrate how an integrated AI/physics pipeline can reduce thousands of designs to a handful of compact and stable candidate binders. Based on these predictions, our team is acquiring these constructs for experimental validation; the results will be published in future work and compared against our deposited predictions. While applied here to BRD3-ET, the strategy is broadly applicable to peptide-derived motifs and other flexible protein–protein interactions. By releasing all predictions as blind benchmarks, we aim to promote transparency, reproducibility, and community-wide validation, in line with FAIR principles and the spirit of CASP- and CAPRI-style challenges. This work provides a practical resource for BET targeting while also serving as a blueprint for prioritizing designs in binder discovery pipelines.
All sequences and results are available in our GitHub repository (https://github.com/PDNALab/Miniprotein_Design) and on Zenodo (DOI: https://doi.org/10.5281/zenodo.16755842).
This journal is © The Royal Society of Chemistry 2025