Kübra Kaygisiz a, Arghya Dutta ‡b, Lena Rauch-Wirth c, Christopher V. Synatschke a, Jan Münch c, Tristan Bereau §*b and Tanja Weil *a
aDepartment Synthesis of Macromolecules, Max Planck Institute for Polymer Research, Ackermannweg 10, 55128 Mainz, Germany. E-mail: weil@mpip-mainz.mpg.de
bPolymer Theory, Max Planck Institute for Polymer Research, Ackermannweg 10, 55128 Mainz, Germany. E-mail: bereau@thphys.uni-heidelberg.de
cInstitute of Molecular Virology, Ulm University Medical Center, Meyerhofstraße 1, 89081 Ulm, Germany
First published on 15th June 2023
Amyloid-like nanofibers from self-assembling peptides can promote viral gene transfer for therapeutic applications. Traditionally, new sequences are discovered either from screening large libraries or by creating derivatives of known active peptides. However, the discovery of de novo peptides, which are sequence-wise not related to any known active peptides, is limited by the difficulty to rationally predict structure–activity relationships because their activities typically have multi-scale and multi-parameter dependencies. Here, we used a small library of 163 peptides as a training set to predict de novo sequences for viral infectivity enhancement using a machine learning (ML) approach based on natural language processing. Specifically, we trained an ML model using continuous vector representations of the peptides, which were previously shown to retain relevant information embedded in the sequences. We used the trained ML model to sample the sequence space of peptides with 6 amino acids to identify promising candidates. These 6-mers were then further screened for charge and aggregation propensity. The resulting 16 new 6-mers were tested and found to be active with a 25% hit rate. Strikingly, these de novo sequences are the shortest active peptides for infectivity enhancement reported so far and show no sequence relation to the training set. Moreover, by screening the sequence space, we discovered the first hydrophobic peptide fibrils with a moderately negative surface charge that can enhance infectivity. Hence, this ML strategy is a time- and cost-efficient way for expanding the sequence space of short functional self-assembling peptides exemplified for therapeutic viral gene delivery.
For example, the emerging field of gene therapy requires efficient transduction of target cells by viral vectors that deliver the therapeutic gene.3 In this regard, self-assembled peptide fibrils that increase the colocalization of viral vectors and cellular membranes and thereby enhance gene delivery are promising candidates for new or optimized gene-therapeutic applications.4–6 However, the discovery of self-assembling peptides as enhancers of viral transduction via screening methods is challenging because of the complexity in predicting sequences that show the required physicochemical properties such as sequence amphiphilicity, charge, and assembly, for biological activity.7,8
Traditionally, new peptides with certain desired properties, e.g., viral transduction/infection enhancement, have been found mainly by serendipity during screening processes,4,9 or by nature-inspired rational design.7,10,11 A common strategy is finding recurring motifs in known active peptides to generate new sequences.12,13 Although changing one amino acid at a time to screen derivatives of a known active compound is often a direct and efficient way to find active peptides with similar structure and to study property–activity relationships, it is not a tractable way to discover new sequences. The sequence space of peptides is huge: there are 20^6 = 64 million peptides if we only consider all possible 6-mers composed of the 20 canonical amino acids. Further, the size of the sequence space increases exponentially with the number of residues. Consequently, exploring the peptide space to discover new peptides by creating randomly generated peptides or derivatives of known structures and studying them in experiments quickly becomes infeasible. In the quest to discover new peptides with a certain target bioactivity, computational methods have been established for fast and inexpensive prescreening of peptide sequences.14
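The growth of the sequence space can be made concrete with a few lines of Python. This is only an illustrative calculation; the function name is chosen here and does not come from the study's code.

```python
# Number of distinct peptides of a given length built from the
# 20 canonical amino acids (illustrative helper, not from the study).
ALPHABET_SIZE = 20


def sequence_space_size(length: int) -> int:
    """Return the number of possible sequences with `length` residues."""
    return ALPHABET_SIZE ** length


for n_residues in (6, 12):
    print(n_residues, sequence_space_size(n_residues))
```

For 6 residues this gives 64 million sequences; each additional residue multiplies the count by 20.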
In this direction, Asgari and Mofrad recently proposed a method that can convert any protein sequence into a unique, dense, 100-dimensional numerical vector, termed ProtVec.26 They used a method from natural language processing that employs an artificial neural network that, while attempting to determine the context in which a word is most likely to occur in a sentence, generates a continuous distributed representation of the word; further, the ProtVecs were found to accurately capture the physicochemical properties of proteins.
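The idea of turning a sequence into a fixed-length vector can be sketched as follows. The sketch assumes that a sequence vector is obtained by summing the vectors of its overlapping 3-grams; the embedding lookup is a hypothetical placeholder that returns reproducible random numbers, whereas a real application would load the pretrained ProtVec table.

```python
import random

DIM = 100  # dimensionality of the representation stated in the text


def split_trigrams(sequence):
    """Return all overlapping 3-grams of a peptide sequence."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def placeholder_trigram_vector(trigram):
    """Hypothetical stand-in for a pretrained 3-gram embedding lookup."""
    rng = random.Random(trigram)  # string seed gives a reproducible vector
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def embed(sequence):
    """Sum the 3-gram vectors into one fixed-length sequence vector."""
    vector = [0.0] * DIM
    for trigram in split_trigrams(sequence):
        for j, value in enumerate(placeholder_trigram_vector(trigram)):
            vector[j] += value
    return vector


print(split_trigrams("ICICLK"))
print(len(embed("ICICLK")))
```

Because the vector length does not depend on the peptide length, 6-mers and 12-mers can be fed to the same regression model.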
Here, we report a three-step approach to explore the sequence space of bioactive self-assembling peptides and discover de novo sequences with high bioactivity by only using sequence and activity information. First, we trained a LASSO (Least Absolute Shrinkage and Selection Operator) regression model using ProtVec representations of peptides from a relatively small library of 163 sequences with known activity values. LASSO aims at identifying a minimal subset of parameters relevant to the prediction, thereby enhancing explainability. Then, we utilized the trained model to systematically sample the sequence space of 6-mer peptides using a Monte Carlo approach.27 Finally, we screened the sequences with highest predicted activities based on their charge and tendency to aggregate. The search yielded 16 new peptides, which were tested experimentally and found to be active with a 25% hit rate or a 50% hit rate if further predictive parameters like aggregation are included. Strikingly, the newly created peptides are very different in sequence from the ones comprising the training set and shorter than any previously reported sequence for infectivity enhancement. Taken together, our method offers a fast and computationally inexpensive way of predicting and screening potentially bioactive, self-assembling peptides for any desired target bioactivity; consequently, it can accelerate peptide screening in the early stages of research.
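How LASSO selects a small subset of parameters can be illustrated with a minimal coordinate-descent sketch. It is not the implementation used in the study: it omits the intercept, uses a toy design matrix, and the penalty strength `alpha` is an arbitrary assumption.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero; values within the threshold become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_fit(X, y, alpha, n_iter=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw for the initial w = 0
    for _ in range(n_iter):
        for j in range(p):
            column = [row[j] for row in X]
            scale = sum(c * c for c in column) / n
            if scale == 0.0:
                continue
            # Correlation of feature j with the residual that ignores w[j].
            rho = sum(c * (r + w[j] * c) for c, r in zip(column, residual)) / n
            new_w = soft_threshold(rho, alpha) / scale
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, column)]
                w[j] = new_w
    return w


# Toy data: the target depends strongly on feature 0, weakly on feature 1.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.05, 1.95, -1.95, -2.05]
print(lasso_fit(X, y, alpha=0.1))
```

In this toy example the weights of the second and third features are driven to exactly zero, which is the sparsity property that makes the fitted model easier to interpret.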
Fig. 1 Schematic overview of the workflow. (A) A peptide library8 was created by systematic sequence derivations of the infectivity-enhancing peptide EF-C. This database contains 163 peptide sequences and their respective biological activity data (i.e. enhancement of HIV-1 infection, “infectivity”). (B) The sequence logo plot summarizes the amino acid frequency in the library and shows that most sequences are composed of positively charged (K, lysine) and hydrophobic uncharged (F, phenylalanine or I, isoleucine) amino acids in alternating order. (C) The diagram shows the length distribution of infectivity enhancement in log scale relative to the reference peptide EF-C. Despite the sequence similarity, a wide range of activities can be found within the library. (D) Flowchart summarizing the machine-learning approach in this study. As input data, the EF-C based peptide library (A) was used. Each peptide sequence in the library was represented as a 100-dimensional vector using a continuous distributed representation, ProtVec.26 The vector and respective activity were used in a supervised LASSO regression model to compute a linear relationship between the activity and relevant components of the 100-d ProtVec. The trained model was then used in a Monte Carlo search where 1 million 6-mers were generated, and the 6-mers that were predicted to perform better than EF-C were retained based on a Metropolis criterion (see Methods). By repeating the Monte Carlo step 16 times, each time starting with a random 6-mer, a broad portion of the 6-mer sequence space was covered.
Fig. 2 (A) Performance of the supervised linear regression model trained on the EF-C-based library. The experimentally measured data correlate highly with the predicted infectivity enhancement (Pearson R = 0.84). (B) Scheme showing the selection criteria for narrowing down putatively infectivity-enhancing peptides for experimental evaluation. From each independent Monte Carlo sampling run, the 1000 sequences with the best predicted activity were selected. From this subset, 3669 sequences showed a positive net charge and were further considered as promising candidates. 8 sequences predicted for aggregation and 8 sequences not predicted for aggregation were selected for experimental testing. The letter size in the sequence logo plots visualizes the amino acid frequency at the corresponding positions in the predicted 6-mer sequences. A detailed listing of amino acid abundance for the 3669 peptides is shown in Fig. S4.† The Venn diagram summarizes the main composition motifs of the 3669 peptides. (C) t-SNE dimension reduction plot of the predicted sequences (12
The strength of this ML model is that it can be applied on any other sequences made of canonical amino acids, which can be represented in a 100-d vector. With the trained ML model, we decided to screen for bioactive 6-mers. We chose peptides with only 6 amino acids because they would be shorter than any infectivity enhancing peptide from our library or the literature.
To sample the sequence space in a time- and computationally efficient way, we applied a Monte Carlo (MC) approach, so that not all possible 6-mer sequences (20^6 = 64 million) need to be evaluated. Starting from 16 initial 6-mers with randomly chosen residues, we executed 16 independent MC runs of 1 million steps each, generating a peptide at each step. The generated peptides were retained based on a Metropolis criterion (Fig. 2B, see Methods; full lists of the retained peptides, along with code, data, and the trained ML models, can be found at https://gitlab.com/arghyadutta/seq-to-infect).
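A Metropolis-type search over 6-mers can be sketched as below. Several details are assumptions made for illustration and are not taken from the Methods: the proposal move (one random point mutation per step), the temperature, the reduced number of steps, and above all `toy_score`, a hypothetical stand-in for the trained regression model.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(sequence):
    """Hypothetical stand-in for the trained model: counts C, F, I, L, V, W."""
    return float(sum(1 for residue in sequence if residue in "CFILVW"))


def metropolis_search(score, n_steps, temperature, rng, length=6):
    """Run one Metropolis chain and return every accepted sequence with its score."""
    current = "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
    current_score = score(current)
    accepted = {current: current_score}
    for _ in range(n_steps):
        # Propose a single-residue mutation at a random position (assumed move).
        position = rng.randrange(length)
        candidate = current[:position] + rng.choice(AMINO_ACIDS) + current[position + 1:]
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, candidate_score
            accepted[current] = current_score
    return accepted


rng = random.Random(42)
runs = [metropolis_search(toy_score, 2000, 0.5, rng) for _ in range(16)]
best = {seq for run in runs for seq in sorted(run, key=run.get, reverse=True)[:10]}
print(len(best), "unique top-ranked sequences from 16 runs")
```

Starting each chain from a different random 6-mer, as described in the text, lets the independent runs explore different regions of the sequence space before their top-ranked sequences are merged.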
From each of the 16 MC runs, only the 1000 sequences with the largest predicted infectivity values were kept, yielding 12320 sequences after removing duplicates (Table S3†). The predicted 6-mer sequences were projected from the 100-d vector space onto a 2-d t-distributed stochastic neighbor embedding (t-SNE)29 plot to check their semantic variety (Fig. 2C). The t-SNE algorithm attempts to cluster sequences that are semantically close to each other regarding their amino acid composition. As expected, the training set peptides, which have similar amino acid compositions by design, formed a cluster that is distinctly separate from the widely distributed clusters of the generated sequences.
Out of the 12320 predicted sequences for infectivity enhancement, 3669 peptides have a net positive charge; 424 of these 3669 peptides were predicted for aggregation by at least two of Aggrescan, PATH, and APPNN (for detailed analysis see ESI Section 2, Table S4†).
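The two filters described here, net charge and agreement of at least two out of three aggregation predictors, can be sketched as follows. The charge rule is an assumption, since the text does not state how net charge was computed: K and R count as +1, D and E as −1, histidine as a small positive fraction, and the termini are taken to cancel. The predictor outputs are hypothetical inputs; the sketch does not call Aggrescan, PATH, or APPNN.

```python
# Assumed per-residue charges near neutral pH (not taken from the study).
RESIDUE_CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def net_charge(sequence, histidine_charge=0.1):
    """Estimate the net charge; the histidine weight is an assumption."""
    total = 0.0
    for residue in sequence:
        if residue == "H":
            total += histidine_charge
        else:
            total += RESIDUE_CHARGE.get(residue, 0.0)
    return total


def predicted_to_aggregate(votes, min_votes=2):
    """Return True if at least `min_votes` predictors flag aggregation."""
    return sum(1 for flagged in votes.values() if flagged) >= min_votes


# Hypothetical predictor outputs for one candidate sequence.
example_votes = {"Aggrescan": True, "PATH": False, "APPNN": True}
print(net_charge("ICICLK") > 0, predicted_to_aggregate(example_votes))
```

Requiring agreement between independent tools is a simple way to reduce the influence of any single predictor's false positives.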
Most of these peptides contain the motif “WWN” (1600 of 3669) or the amino acid cysteine (2009 of 3669), as visualized in the Venn diagram and in the sequence logo plot (Fig. 2B, and Fig. S4†). Interestingly, the motif “WWN” does not appear in any of the training set peptides, whereas cysteine was shown previously by us to contribute positively to infectivity enhancement.7,8 For experimental evaluation, peptides with a large variation in sequence were selected from different clusters in the t-SNE map in order to cover a large sequence space (Fig. 2C). In addition to predicted infectivity, hydrophobicity, and aggregation propensity (Fig. S3†), we considered N-gram similarity scores (Fig. S5, and Table S5†) to ensure a diverse selection of sequences. Finally, of the 16 peptides selected for experimental evaluation, 8 were predicted for aggregation, and 8 were not predicted for aggregation and served as a control group (Fig. S3†). All these peptides differ strongly in sequence from the training set, as visualized via the sequence logo plot and N-gram similarity scores (Fig. 2B, and Fig. S5†).
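One possible way to quantify N-gram similarity is sketched below. The exact score used in the study is defined in the ESI and is not reproduced here; this sketch assumes a Jaccard index over the sets of overlapping n-grams, which is a common choice but only an illustration.

```python
def ngrams(sequence, n):
    """Return the set of overlapping n-grams in a sequence."""
    return {sequence[i:i + n] for i in range(len(sequence) - n + 1)}


def ngram_similarity(seq_a, seq_b, n=2):
    """Jaccard index of the n-gram sets (assumed definition, see lead-in)."""
    grams_a, grams_b = ngrams(seq_a, n), ngrams(seq_b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0
    return len(grams_a & grams_b) / len(union)


EF_C = "QCKIKQIINMWQ"  # reference peptide sequence given in the text
print(ngram_similarity("ICICLK", EF_C, n=2))
print(ngram_similarity(EF_C, EF_C, n=2))
```

Under this assumed definition, ICICLK shares no 2-gram with EF-C, which is in line with the statement that the selected peptides differ strongly from the training set.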
Four of the 16 peptides show remarkable infectivity enhancement above 10% relative to EF-C (Fig. 3A, and Fig. S7†). When comparing the infectivity enhancement of the newly found sequences with EF-C, it is important to consider the inherent difference in sequence length. The four infectivity-enhancing 6-mer peptides have roughly half the molecular weight of the 12-mer EF-C. Therefore, these peptides exhibit approximately twice the infectivity-enhancing efficiency in terms of mass concentration when compared to EF-C. Further, these four peptides were predicted for aggregation, resulting in a hit rate of 50% based on the selected 8 aggregation-prone peptides (Fig. S3†) or a hit rate of 25% relative to the entire selection. Interestingly, among these peptides only one (ICICLK) shows a positive zeta-potential. The other three peptides (HVWCIF, HICLFW, HFICIC) form fibrils, colocalize with cell membranes (Fig. 3B–E) and show infectivity enhancement despite their moderately negative zeta-potentials. We wondered whether these hit peptides show a different mode of action and applied a property–activity correlation model, which was developed with the training set (ESI section 4†).8 The newly created peptides fit the model well (R = 0.72, Fig. 3F), which indicates an interaction mode comparable to that of the training set: the peptide fibrils associate with viruses and colocalize them with cellular membranes, which facilitates uptake and increases the infection rate.4
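The molecular-weight argument and the two hit rates can be checked with a short calculation. The sketch assumes standard average residue masses (rounded to two decimals) and unmodified termini, which add one water molecule per peptide; any terminal modifications of the synthesized peptides are not considered.

```python
# Average residue masses in g/mol (rounded standard values).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER_MASS = 18.02  # added once for the free N- and C-termini (assumption)


def molecular_weight(sequence):
    """Approximate average molecular weight of an unmodified peptide."""
    return sum(RESIDUE_MASS[residue] for residue in sequence) + WATER_MASS


ef_c = molecular_weight("QCKIKQIINMWQ")
for hit in ("ICICLK", "HVWCIF", "HICLFW", "HFICIC"):
    print(hit, round(molecular_weight(hit), 1), round(molecular_weight(hit) / ef_c, 2))

hits, tested, aggregation_prone = 4, 16, 8
print("hit rate, all tested:", hits / tested)
print("hit rate, aggregation-prone subset:", hits / aggregation_prone)
```

At equal molar concentration, a peptide with about half the molecular weight of EF-C is present at about half the mass concentration, which is the basis for the efficiency comparison in the text.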
Fig. 3 (A) Summary of infectivity enhancement and physicochemical properties of de novo predicted peptides (Table S2†). Peptides were incubated in PBS at 1 mg mL−1 for 1 d at RT before characterization. Absolute HIV-1 infection rates in the presence of EF-C (QCKIKQIINMWQ) and peptides from ML-prediction at 6.5 μM, 1.3 μM, 0.26 μM and 0 μM (virus only infection). The n-fold infection rates relative to the virus only control are shown for each column. The aggregation into μm-sized colloids was determined by light scattering count rate during zeta-potential measurements. The molecular aggregation into nm-sized fibrils was determined by transmission electron microscopy (TEM, Fig. S6†). Hydrophobicity was calculated according to the Fauchere hydropathy scale.50 (B–E) TEM showing fibril morphology (scale bar 1 μm) and confocal fluorescence microscopy showing cell-fibril-colocalization (scale bar 20 μm) of hit peptides (B) HVWCIF, (C) ICICLK, (D) HFICIC, (E) HICLFW. For TEM measurements the peptides were stained with 4 wt% uranyl acetate during preparation. For confocal fluorescence microscopy the fibrils were stained with Proteostat and diluted to 20 μg mL−1 before adding to HeLa cells (40
Our training set contains 163 peptide sequences based on derivatives of an active compound. Small training sets are common in early stages of research, but they are rarely considered for ML approaches that aim to predict new sequence spaces since training the model is difficult.20,27,38–42 As shown here, new peptide sequence spaces can be discovered via a computational approach that combines ML and MC with further screening, while still using a small training set with a wide variation in bioactivity.
Interestingly, most of the predicted peptides are rich in hydrophobic amino acids cysteine and tryptophan, the latter mainly from the sequence motif “WWN”. Sequences which contain these hydrophobic amino acids enhance infectivity if they form fibrillar structures that are influenced by cysteine's capability to form disulfide bonds (ESI chapter 9, Fig. S10A–D†). Notably, in a previous study on the training set, we discovered a higher prevalence of cysteine in active peptides.7,8 However, it was found that cysteine was not essential for activity.7 In contrast to the de novo peptides found in this study, such as ICICLK, the non-essential role of cysteine in the training set can be attributed to the strong self-assembly tendency of peptides with amphiphilic sequence patterns. These patterns stabilize the structure even after the disulfide bonds are broken (Fig. S10F†). Therefore, our machine learning approach successfully extracted the importance of cysteine for peptide fibril formation and incorporated this information in the newly found sequences. We showed here that fibrillar structure formation can be predicted with an accuracy of ∼75% through the combination of open-source aggregation prediction tools Aggrescan,35 APPNN,32 and PATH.34 We found that while these algorithms were developed for polypeptides and proteins, they also perform well for short self-assembling peptides.
The infectivity enhancement of peptide fibrils occurs due to improved colocalization of viruses with the cell membrane.4 The main driving force for this interaction is believed to be electrostatic,7,9,11,43 where positively charged fibrils sequester negatively charged virions, which in turn bind to the negatively charged cellular membrane. As virion attachment to the cell membrane is the major rate-limiting step during viral entry, increased numbers of virions at the cell surface result in higher cell entry and infection rates. However, we found here that fibrils with a moderately negative zeta-potential can also increase infectivity. These kinds of fibrils were not included in the training set and have not been reported before. We hypothesize that oversimplification of the fibril–cell-membrane interaction by reducing it to solely electrostatic interactions can be misleading, since the cell interaction of fibrils is regulated by an intricate balance between charge and hydrophobicity. For example, the peptide CQFICR (Fig. 3A) forms fibrils and has a positive zeta-potential but does not enhance infectivity. The low hydrophobicity of CQFICR (0.9) results in less aggregated fibrils and decreases hydrophobic cell-membrane interaction. Another example is the hydrophobic peptide FHVWNF (Fig. 3A), which forms aggregating fibrils with a negative zeta-potential but does not enhance infectivity due to the contribution of the strongly negative zeta-potential, as supported by our property–activity model (Fig. 3F).
A further example is provided by fibrils derived from the immunoglobulin light chain, which have a net negative surface charge and retain virion-binding activity but lack cell-binding and viral transduction enhancing properties.43 More recently it has also been shown that cellular protrusions actively engage EF-C fibril/virion complexes, suggesting that not only electrostatic interactions may account for bioactivity.44 It is important to note that meaningful comparisons of zeta-potential can only be made among peptides that either all form aggregates or all do not form aggregates. This is because the size of colloidal particles has an influence on the measured zeta-potential.45
Our method demonstrates that moderately negatively charged peptide fibrils can be active if the hydrophobicity and aggregation features are both strongly pronounced. Hydrophobic amino acids such as tryptophan, phenylalanine, and cysteine can facilitate these desired properties; the continuous vector representation of peptides successfully extracts this underlying information by processing sequence and activity information of the training set without the requirement to assume a predetermined set of relevant descriptors as often done in traditional prediction approaches.46 The predicted sequences show a higher hydrophobicity, on average, than reported for the training set (Fig. S11†).
Taken together, our method offers a promising tool to yield diverse peptide structures, which cannot be created rationally from derivatives of active compounds or by using conventional approaches such as sequence–pattern analysis.8,47
Finally, all these newly found active peptides are the shortest infectivity-enhancing peptides known to us and are not found in any protein database, which makes them truly de novo.
The strength of a continuous vector representation-based approach is that it can encode sequence and physicochemical information of a peptide into a numerical vector which can then be used to train an ML model. Monte Carlo sampling, using the trained ML model, enables us to screen a large sequence space in a time- and cost-efficient way and yield de novo active peptide sequences, which are structure- and property-wise very different from the training set. We envision that our data-driven method will substantially accelerate the early stages of research by screening large sequence spaces and predicting de novo peptides starting from a small dataset, which are unexpected by human experience and rational design.
R5-tropic HIV-1 stocks and HIV-1 infection assays were prepared analogously to a previous report.7 Briefly, the effect of peptide fibrils (final concentration on cells 6.5, 1.3, 0.26, 0 μM) on HIV-1 infection was studied via a luminescence assay for detection of β-galactosidase, which is expressed upon HIV-1 infection of TZM-bl cells. The HIV-1 infection assay was conducted in three technical replicates and reproduced at least once. Note that the n-fold infectivity enhancement of peptides relative to virus-only infection rates is strongly dependent on the initial virus concentration. To compare independent measurements with each other, EF-C (QCKIKQIINMWQ) was always used as a reference peptide. EF-C is the original sequence on which the training set is based and was applied previously by us to quantify infection rates.4,7,8
Cell viability was determined after addition of peptides to TZM-bl cells via the CellTiter-Glo assay. To this end, 10000 cells were seeded, and on the next day serially diluted peptides were added. After 3 days the supernatant was removed and 100 μL CellTiter-Glo Reagent, diluted 1:1 in PBS, was added. After 10 min, 50 μL was transferred to a white microplate and luminescence was recorded with an Orion microplate luminometer.
Confocal laser scanning microscopy studies were performed to visualize the cell-peptide interaction. HeLa cells were seeded one day prior to conducting the assay (40000 per well) in an 8-well IBIDI slide. 4 μL of the preformed peptide fibrils (1 mg mL−1) were diluted with 4 μL Proteostat (Enzo Life Science, 1 μL stock in 999 μL PBS) and further diluted with medium to obtain a final peptide concentration of 20 μg mL−1. The nucleus of the HeLa cells was stained with Hoechst 33342 (NucBlue™, Thermo Fisher Scientific). The peptide solution mixture was transferred to the HeLa cells and incubated for 30 min at 37 °C before washing three times with PBS. The interaction of fibril clusters with cells was monitored after 30 min incubation time on a Stellaris 8 confocal laser scanning microscope (Leica) equipped with a 20× air objective and laser excitation wavelengths of 405 nm (Hoechst) and 561 nm (Proteostat).
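The dilution stated for the microscopy samples follows from conservation of peptide mass (C1·V1 = C2·V2). The sketch below computes the implied final volume; that the 4 μL of Proteostat solution counts toward this final volume is an assumption, and the function name is chosen here for illustration.

```python
def final_volume_ul(stock_conc_ug_per_ml, stock_volume_ul, final_conc_ug_per_ml):
    """Final volume in μL that dilutes the stock to the target concentration."""
    return stock_conc_ug_per_ml * stock_volume_ul / final_conc_ug_per_ml


total = final_volume_ul(1000.0, 4.0, 20.0)  # 1 mg/mL stock, 4 μL, 20 μg/mL target
medium = total - 4.0 - 4.0  # subtract peptide stock and Proteostat volumes (assumed)
print(total, medium)
```

Under these assumptions the 4 μL of fibril stock corresponds to a 50-fold dilution.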
All code and data used for ML and MC analysis are openly available at https://gitlab.com/arghyadutta/seq-to-infect.
The study was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—project number 316249678—SFB 1279 (A02, A03, A05, C01). T. B. acknowledges support from the Emmy Noether program of the Deutsche Forschungsgemeinschaft (DFG). A. D. acknowledges support by BiGmax, the Max Planck Society's Research Network on Big-Data-Driven Materials Science. Open Access funding provided by the Max Planck Society.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3bm00412k
‡ Present address: Institute of Biochemistry II, Faculty of Medicine, Goethe University, Theodor–Stern–Kai 7, 60590 Frankfurt, Germany.
§ Present address: Institute for Theoretical Physics, Heidelberg University, Philosophenweg 19, 69120 Heidelberg, Germany.
This journal is © The Royal Society of Chemistry 2023