From the journal Digital Discovery
Peer review history

Latent spaces for antimicrobial peptide design

Round 1

Manuscript submitted on 30 Aug 2022
 

08-Nov-2022

Dear Dr Mansbach:

Manuscript ID: DD-ART-08-2022-000091
TITLE: Latent Spaces for Antimicrobial Peptide Design

Thank you for your submission to Digital Discovery, published by the Royal Society of Chemistry. I sent your manuscript to reviewers and I have now received their reports which are copied below.

I have carefully evaluated your manuscript and the reviewers’ reports, and the reports indicate that major revisions are necessary.

Please submit a revised manuscript which addresses all of the reviewers’ comments. Further peer review of your revised manuscript may be needed. When you submit your revised manuscript please include a point by point response to the reviewers’ comments and highlight the changes you have made. Full details of the files you need to submit are listed at the end of this email.

Digital Discovery strongly encourages authors of research articles to include an ‘Author contributions’ section in their manuscript, for publication in the final article. This should appear immediately above the ‘Conflict of interest’ and ‘Acknowledgement’ sections. I strongly recommend you use CRediT (the Contributor Roles Taxonomy, https://credit.niso.org/) for standardised contribution descriptions. All authors should have agreed to their individual contributions ahead of submission and these should accurately reflect contributions to the work. Please refer to our general author guidelines https://www.rsc.org/journals-books-databases/author-and-reviewer-hub/authors-information/responsibilities/ for more information.

Please submit your revised manuscript as soon as possible using this link:

*** PLEASE NOTE: This is a two-step process. After clicking on the link, you will be directed to a webpage to confirm. ***

https://mc.manuscriptcentral.com/dd?link_removed

(This link goes straight to your account, without the need to log on to the system. For your account security you should not share this link with others.)

Alternatively, you can login to your account (https://mc.manuscriptcentral.com/dd) where you will need your case-sensitive USER ID and password.

You should submit your revised manuscript as soon as possible; please note you will receive a series of automatic reminders. If your revisions will take a significant length of time, please contact me. If I do not hear from you, I may withdraw your manuscript from consideration and you will have to resubmit. Any resubmission will receive a new submission date.

The Royal Society of Chemistry requires all submitting authors to provide their ORCID iD when they submit a revised manuscript. This is quick and easy to do as part of the revised manuscript submission process.   We will publish this information with the article, and you may choose to have your ORCID record updated automatically with details of the publication.

Please also encourage your co-authors to sign up for their own ORCID account and associate it with their account on our manuscript submission system. For further information see: https://www.rsc.org/journals-books-databases/journal-authors-reviewers/processes-policies/#attribution-id

I look forward to receiving your revised manuscript.

Yours sincerely,
Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry

************


 
Reviewer 1

The authors propose a set of deep learning applications supported by variational autoencoder strategies to develop models that improve the exploration and generation of new sequences with AMP biological activity. The authors trained 15 models in total, varying the architectures and the dimensionality of the generated encodings, and demonstrate the separation generated in the latent space by the different biological activities evaluated in this work. In general, the authors try to explain the learning process by correlating the principal components with physicochemical properties. Although the results shown by the authors are very interesting and this appears to be a nice work that tries to solve a very complex problem, the design of peptide sequences with desirable biological activities, some points and discussions in the presented work are very difficult to understand, in particular for non-computational users and readers unfamiliar with deep learning strategies. It is also necessary to improve the usability of the proposed method, for example by adding a representative scheme showing how to use it; it would also be very interesting to test the proposed autoencoder by creating new AMP peptide sequences and comparing them with known sequences not considered in the training process.

What is the difference between your autoencoders and representation-learning methods such as Seq2Vec, UniRep, Babbler, and AminoBERT, among others?
It is not clear from the presented draft how a user could create new AMP sequences or how a user could evaluate the generated sequences.
The proposed pipeline is not clear; there are different evaluations of the proposed autoencoders. It could be very interesting to demonstrate that the usability of the proposed method is higher than that of previously pretrained models. Also, the authors should clarify the requirements for applying the proposed method; for example, what happens if the methodology is applied to build anti-bacterial or anticancer peptides? Is the proposed method generalizable enough to generate these types of peptide sequences?
It is not clear how to combine the five models to create new sequences; it could be very interesting if the authors added to the draft a representative scheme with the pipeline used to develop the autoencoders, and then showed how to use it to develop new sequences, and also to create new non-AMP sequences.
The input dataset used to train the predictive model was 300,000 examples. The redundancy of information between AMP peptides in UniProt and AMP peptides obtained from StarPep is not clear.
What is the length distribution for the peptide sequences? Length 1 is only a single residue; why did the authors select these length conditions?
Please try to clarify the number of examples in the dataset, adding more information on positive and negative examples and their proportions.
The development of the negative dataset (non-AMP) is a bit complex, because a simple random selection of the peptides obtained from the UniProt query is not informative. It is necessary to demonstrate that the peptide sequences are effectively non-AMP. I think the authors could build the datasets using only reviewed sequences and then remove redundancy, perhaps using MSA evaluations, clustering, or other strategies to choose representative elements for the negative examples.
Why did the authors select VAEs to build the numerical representations?
It is not clear whether the different architectures evaluated in this work were created by the authors or were reported previously.
Figure 1 could be more informative if the scheme showed how the information is processed, the layers, and other relevant details, for example the dimensions of the input.
How did the authors find the best hyperparameter configuration? It is necessary to clarify the batch size and number of epochs used during model training.
Normally, in this type of problem, t-SNE evaluations are used more often than PCA strategies; also, it is not clear why the authors selected PCA methods.
What is the amino acid index? The authors should define it in the document to improve the understanding of the proposed work.
The highest performance appears at 128 dimensions using a Transformer strategy; is this acceptable?
It is not clear how the PCA was applied and why; please give more details about the hypotheses necessary to apply PCA.
Why did the authors apply the silhouette coefficient? There are other more interesting metrics, such as the Davies-Bouldin index and the Calinski-Harabasz index.
In Section 3.2 the authors talk about the binary classification models. What is the model? How did the authors train this model?
Depending on the type of possible readers, it is not necessary to explain the SS method; it could be added to the supporting information. Also, it is necessary to add more details about the estimation of the SS using the principal components evaluated (pair combinations).
It could be interesting to explain what it means that the ordering proposed in Section 3.2 does not appear in PC1 and PC2 but appears at other PCA ranks.
It is necessary to clarify the application of the Spearman correlation to the PC components.
Why did the authors select the Bernoulli distribution in Figure 6?

Reviewer 2

The manuscript by Renaud and Mansbach presents a comparative study of five deep learning model architectures with VAE-like latent spaces for de novo generation of antimicrobial sequences. The paper first discusses the training of these models and the analysis of their reconstruction accuracy, considering both entire-sequence and index-based reconstruction accuracy. Second, the paper analyzes the latent space using a predictive model and PCA, investigating possible correlations between the first five PCs and AMP-likeness/physicochemical properties in the latent space. Interestingly, the latent space orders sequences in a chemically meaningful way, even though the models have been trained on letter sequences only, although some of the correlations are moderate. The paper finally discusses possible approaches to improve the models, which is the disappointing part of this report: a comparative analysis and then nothing in terms of novelty. The journal must decide if this type of work belongs, also considering that the story is well written and might help the field to progress. The following comments must be addressed:

1) the introduction misses important precedents of using generative models for AMP design with actual experimental validation, which must be cited:
J Biol Chem. 2018 Mar 9;293(10):3492-3509. doi: 10.1074/jbc.M117.805499.
Chem. Sci., 2021, 12, 9221-9232, doi:10.1039/D1SC01713F

2) All presented models are trained on peptide sequences, which do not contain explicit chemical information. A comparison with models trained with explicit chemical structures (SMILES) would be interesting. The authors must in any case include a thorough discussion of possible representations and how they impact model predictions and performances.

3) Please add a use case for the best models, for instance by sampling either diverse or focused sets and investigating the properties. Going to experimental work would be even better, because this is usually when the pretty modeling collapses.

4) In figure 9, the dimensions of the latent space should be added directly in the plot (like you did in figure 4).

5) the authors discuss PCA as the representation of the latent space. There are many other, and often better, dimensionality reduction methods (e.g. t-SNE, UMAP); it would be useful to comment on why PCA is better suited in this case.


 

Thank you for your communication dated November 7, 2022 concerning our manuscript Latent Spaces for Antimicrobial Peptide Design by Samuel Renaud (Graduate Research Assistant, Physics, Concordia University) and Rachael Mansbach (Assistant Professor, Physics, Concordia University) and assigned tracking ID DD-ART-08-2022-000091.

We are grateful for your investment of time and effort in processing our manuscript, and would also like to extend our thanks to the two anonymous referees for their careful reading and helpful and considered comments on our work. As requested in your communication, we reproduce below the reviewers' critiques along with our responses and an accounting of the corresponding changes and improvements to the manuscript with associated page numbers. For convenience, we have also submitted a copy of the revised version of the manuscript with changes colored in red text (designated "other" in the file upload section), as well as a PDF version of this response with the changes reproduced within. We also delineate below the non-scientific formatting changes made to the manuscript.

We appreciate the positive reception of our work by the reviewers, and believe that our edits in response to their comments have greatly improved our manuscript. We hope that, in light of our responses and revisions, our manuscript is suitable for publication in Digital Discovery.

Response to Reviewer # 1
The authors propose a set of deep learning applications supported by variational autoencoder strategies to develop models that improve the exploration and generation of new sequences with AMP biological activity. The authors trained 15 models in total, varying the architectures and the dimensionality of the generated encodings, and demonstrate the separation generated in the latent space by the different biological activities evaluated in this work. In general, the authors try to explain the learning process by correlating the principal components with physicochemical properties. Although the results shown by the authors are very interesting and this appears to be a nice work that tries to solve a very complex problem, the design of peptide sequences with desirable biological activities, some points and discussions in the presented work are very difficult to understand, in particular for non-computational users and readers unfamiliar with deep learning strategies.

We thank the reviewer for their kind words and also their thoughtful and detailed suggestions. We address these suggestions point by point below, including the changes we have made to our article in response. We believe the revisions have significantly improved the clarity and rigor of our paper, and we very much appreciate the effort to which the reviewer has gone to help us.

1) Also, it is necessary to improve the usability of the proposed method, for example by adding a representative scheme showing how to use it; it would also be very interesting to test the proposed autoencoder by creating new AMP peptide sequences and comparing them with known sequences not considered in the training process.

Author reply: We agree with the reviewer that a section on how to use our models on a real-world problem would heighten the clarity and interest of our work, and we thank them for the suggestion. We have therefore added a new section to the paper providing a case study centered around a suite of β sheet-forming AMPs of interest to our lab. We reproduce below the new section, which occupies pages 15 and 16 in the updated paper draft.

Case study: Pipeline for de novo AMP discovery using PCA components

Our pipeline consists of three of the best performing models. We suggest the simultaneous exploration of the Transformer-128 (high reconstruction accuracy, good clustering separation in two dimensions, linear generated-sequence partitioning, and low interpretability), AAE-128 (moderate reconstruction accuracy, good clustering separation in two dimensions, non-distorted PCA space, linear generated-sequence partitioning, moderate linearity, and medium interpretability--bridge variables identified for AMP PCs), and WAE-128 (moderate reconstruction accuracy, moderate clustering separation in two dimensions, moderate distortion in PCA space, linear generated-sequence partitioning, moderate linearity, and high interpretability) models.

The pipeline takes an AMP of interest as input. The AMP is passed through the encoders of the different models and transformed into a model-specific latent representation. The AMP latent vector is appended to the embedded training data and is transformed into linear representations with PCA (Fig.10).

We then proceed to choose variables of interest to hold constant in the latent space and perturb the others. This allows us to find new peptides with the properties of interest maintained. We choose to maintain the verified-AMP-correlated PC and the hydrophobicity-correlated PC, to produce sequences likelier to share experimental AMP-likeness and hydrophobicity, an important AMP property, with GL13K. We note that this procedure could be performed with any identified property near any input point of interest. We use the previously mentioned correlation coefficients (Sect. 3.4) to find the verified-AMP-correlated PC and the top hydrophobicity-correlated PC. Having kept track of the PCA vector for our AMP of interest, we perform sampling by keeping our two PCs of interest, the AMP- and hydrophobicity-correlated ones, fixed and adding Gaussian noise to all other PCs. The added noise variance must be tuned individually for each model, as the latent space organization differs between models. The variance tuning is performed by identifying the mean and standard deviation of all training dataset points in PCA space and then sampling Gaussian noise centered at the mean with 1/5 the standard deviation. This noise vector is then summed with the PCA vector of our AMP of interest, shifting the sampling location to be near the AMP of interest. Once we have the noisy nearby PCA samples, we perform an inverse PCA transformation to return to our latent space vector representation, demonstrating one particularly valuable property of the PCA approach. We then pass the noisy latent samples to the decoder, which generates the desired candidate peptide sequences.
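For concreteness, a minimal Python sketch of this sampling step follows; the `decode` helper, array shapes, and parameter values are illustrative stand-ins rather than our production code:

```python
import numpy as np
from sklearn.decomposition import PCA

def sample_near_peptide(z_train, z_amp, decode, fixed_pcs, n_samples=5, noise_scale=0.2):
    """Hold the PCs of interest fixed for an AMP of interest (e.g. the
    verified-AMP- and hydrophobicity-correlated PCs) and add Gaussian
    noise to all other PCs before decoding.

    z_train:   (N, d) latent vectors of the embedded training data
    z_amp:     (d,) latent vector of the AMP of interest (e.g. GL13K)
    decode:    hypothetical model decoder, latent vector -> sequence
    fixed_pcs: indices of the PCs to hold constant
    """
    pca = PCA().fit(z_train)                      # linear, invertible map
    p_amp = pca.transform(z_amp[None, :])[0]      # AMP of interest in PC space
    # Noise std is 1/5 of the per-PC std of the training data; in PC
    # space the training mean is ~0 because PCA centers the data.
    sigma = noise_scale * pca.transform(z_train).std(axis=0)

    sequences = []
    for _ in range(n_samples):
        noise = np.random.normal(0.0, sigma)      # Gaussian noise on every PC...
        noise[fixed_pcs] = 0.0                    # ...except those held fixed
        z_new = pca.inverse_transform((p_amp + noise)[None, :])[0]  # back to latent space
        sequences.append(decode(z_new))           # decoder emits a candidate
    return sequences
```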

We demonstrate the use of the pipeline on GL13K. For each model, we generate five different sequences and display their properties (Figs. S6-S8). In the ideal case we should observe little change in the hydrophobicity of the samples while the other physicochemical properties vary, and in general this is what we observe. We observe that certain properties of random sequences generated in the neighborhoods tend to remain more constant irrespective of model (Aliphatic Index, Charge at pH=3, Molecular Weight, Isoelectric Point), certain properties tend to be more variable (Boman Index, Charge at pH=11, Instability Index), and certain properties are more dependent on the model. Hydrophobicity varies the most in the AAE model and the least in the WAE model. One benefit of employing multiple models is the ability to sample different local neighborhoods of GL13K, with potentially different properties; another benefit is the inclusion of both more interpretable and higher-performing models to generate samples.

The results from the pipeline for the AAE-128, Transformer-128, and WAE-128 presented in the supplementary figures (Figs. S6-S8) show that by locking PCs of interest and sampling nearby points along the other PCs, we can generate novel peptides with properties similar to those of the original GL13K. While some generated samples feature drastic changes in certain properties, such as the fourth peptide from the AAE-128 with a charge of 1 at pH 11, or the generated peptides with charges of -1 at pH 11 for the Transformer-128 and WAE-128 models, most properties fall near those of the original GL13K sequence.

This general pipeline can be extended to any peptide function by identifying the top PCs correlated with the desired function and keeping those fixed while noise is added to the remaining PCs, in order to sample within the desired functional neighbourhood.

We modify the abstract to include the new pipeline section.

In this way we demonstrate several benchmarks that must be considered for such models and suggest that for optimization of search space properties, an ensemble methodology is most appropriate for design of new AMPs. We design an AMP discovery pipeline and present candidate sequences and properties from three models that achieved high benchmark scores. Overall, by tuning models and their accompanying latent spaces properly, targeted sampling of new anti-microbial peptides with ideal characteristics is achievable.

We modify our Results section introduction to mention the new pipeline on Page 6.

We assess and compare the models and their associated latent spaces in a number of ways. We begin by performing a traditional assessment through model reconstruction accuracies on the training and testing datasets. To assess the generalizability of the models, we benchmark their generative capacities in terms of the distributions they are capable of sampling, a process which is becoming more and more imperative as generative models become more commonplace [Polykovskiy2020, Dollar2021]. To assess the meaningfulness of associated distance metrics, we perform a quantitative and qualitative assessment of the models' ability to cluster the data in a low-dimensional space produced by principal components analysis (PCA) and a detailed analysis of the top five principal components. To assess model interpretability and address a longstanding problem in the field, we employ metrics from manifold theory to assess the extent to which PCA distorts the latent space embedding. Leveraging PCA's enhanced interpretability, we connect latent representations to peptide properties through bridge variables. Finally, we demonstrate the use of our models in a pipeline. Unless otherwise noted, all analyses are performed on the test set data.

1) It is not clear in the presented draft how a user could create new AMP sequences or how a user could evaluate the generated sequences.

Author reply: The evaluation of the samples generated by the pipeline is performed by comparing the physicochemical properties of the generated candidates to those of the known peptide of interest. In the ideal scenario, one should find that the properties of interest, e.g. hydrophobicity, have been preserved while variations are observed in all other physicochemical properties. Other possible evaluation mechanisms include Minimum Inhibitory Concentration (MIC) verification and MTT assays to determine cytotoxicity and cellular activity, both of which would require wet-lab experiments and are out of scope for this research.
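As a sketch of this property-based comparison using the peptides package cited in the manuscript [Larralde2021] (method names reflect our understanding of that package's API; the sequences below are placeholders, not actual GL13K or pipeline output):

```python
import peptides

def profile(seq):
    """Compute the physicochemical properties used in the comparison."""
    p = peptides.Peptide(seq)
    return {
        "aliphatic_index": p.aliphatic_index(),
        "boman_index": p.boman(),
        "charge_pH3": p.charge(pH=3.0),
        "charge_pH7": p.charge(pH=7.0),
        "charge_pH11": p.charge(pH=11.0),
        "hydrophobicity": p.hydrophobicity(),
        "instability_index": p.instability_index(),
        "isoelectric_point": p.isoelectric_point(),
        "molecular_weight": p.molecular_weight(),
    }

# Compare a generated candidate against the peptide of interest;
# both sequences here are illustrative placeholders.
reference = profile("GKIIKLKASLKLL")
candidate = profile("GKLLKLKASLKLI")
deltas = {k: candidate[k] - reference[k] for k in reference}
```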

1) It is not clear how to combine the five models to create new sequences; it could be very interesting if the authors added to the draft a representative scheme with the pipeline used to develop the autoencoders, and then showed how to use it to develop new sequences, and also to create new non-AMP sequences.

Author reply: We have added to the proposed new section mentioned above a schematic of the pipeline we propose for the use of our three best models to identify good starting points for AMP design. (Fig.10, page 16 in the updated paper draft).

2) The proposed pipeline is not clear; there are different evaluations of the proposed autoencoders. It could be very interesting to demonstrate that the usability of the proposed method is higher than that of previously pretrained models.
Author reply: Outside of experimental measurements of specific peptide candidates, it is not yet obvious how to compare between generative models. Since we are particularly interested in interpretable latent spaces and the properties of the latent spaces themselves, as mentioned in the introduction, comparative performance of the models is not as important to us as assessing the qualities of the latent space and methods for interpreting them. To clarify this point, we modify the introduction as follows:

One primary point of interest that is beginning to attract attention is the quality of the latent space itself. Because in data-rich regimes, VAEs and other generative models can be extremely powerful, the focus in past work has been primarily on improving performance [Polykovskiy2020, Dollar2021, Prykhodko2019, Kusner2017, Grisoni2020, Dai2018]. However, construction of an interpretable and well-organized latent space not only tends to make targeted feature sample generation easier, it also significantly improves the ability of the user to apply domain expertise to the problem at hand. Therefore, although we do not ignore model performance, a major focus of this article is on interpretability, trustworthiness, and organization of the latent spaces themselves.

In work by [Gomez-Bombarelli2018, Dai2019, Grisoni2020], the authors demonstrated that training a property predictor on the latent space and forcing the model to consider the quality of its predictions as part of its overall goal leads to an ordering of the latent space, which is desirable for generating candidates with particular features. Because latent spaces tend towards dimensionalities on the order of 30 or above, this orderedness has been visually presented through projection onto a two-dimensional space, often by using principal components analysis (PCA) to identify the most relevant variables. Since PCA is a linear projection, it remains unclear how well this visual orderedness is preserved in the high-dimensional space, and therefore we make a point to address the trustworthiness of this simple, rapid, and useful analytic technique.

[removed paragraph to Methods section]

More generally, in this article, we focus on the question of latent space quality and interpretability for the relevant test case of generation of novel antimicrobial peptide sequences. We train five different deep generative models with VAE-like latent spaces and assess and compare their different behaviors in terms of reconstruction accuracy, generative capacity, and interpretability. We demonstrate that, as expected, a property predictor leads to an ordering in the latent space. We quantify this ordering and assess how well a PCA projection captures the properties of the original latent space. We argue that the better the PCA projection, the more interpretable the latent space, because we can apply a ranking to linear combinations of the latent space dimensions, allowing us to more easily identify bridge variables that tell us what the model considers important. We also show that the models are capable of generating unique and diverse sequences that grow more AMP-like in specific regions of the latent space.

3) Also, the authors should clarify the requirements for applying the proposed method; for example, what happens if the methodology is applied to build anti-bacterial or anticancer peptides? Is the proposed method generalizable enough to generate these types of peptide sequences?

Author reply: Yes, the proposed method is sufficiently generalizable that it could be applied to generate, e.g., anti-cancer peptides. One would change the positive examples supplied to the classifier to a series of anti-cancer peptides instead and retrain the already-trained models. We clarify this point by the addition of a sentence to Section 5, pages 16-17:
We have trained five deep learning models with VAE-like latent spaces on a library of sequences of short proteins and assessed the characteristics of the resultant models, including a careful examination of the properties of the latent spaces associated with them. We show that the models have the capability to create smooth, continuous latent spaces that can be sampled for de novo AMP generation and optimization. The inclusion of a simultaneously-trained property prediction network disentangles verified-AMP and non-verified-AMP sequences; however, to our surprise we find that, unlike in previous studies, this is not always apparent in the first two principal components derived by PCA. Different models associate different information with highly-varying PCs, and different models vary drastically in the ways in which they encode variance, and, hence, information. It is important to note that this presents a challenge for the incorporation of domain knowledge, since we see that the model does not necessarily place greater emphasis on user-provided information. Furthermore, it sounds a note of caution in the interpretation of latent space orderings, since observed orderings may occupy only a small fraction of the informational content in the model latent space. We have also addressed the question of how meaningful the use of PCA is as a tool for indicating properties of VAE-like latent spaces and argued that the less distortion imposed by PCA upon the different neighborhoods and interactions between points in the manifold, the more clearly interpretable the latent space is. The analysis that we present here may be applied to other short peptides of interest, such as anti-cancer peptides. Based on our results, we would suggest retraining our Transformer, AAE, and WAE models in conjunction with a new cancer/anti-cancer property predictor that relates to the property of interest. One could then employ a similar analysis and pipeline to assess the quality of the resultant latent spaces and generate sequences of interest.

1) The input dataset used to train the predictive model was 300,000 examples. The redundancy of information between AMP peptides in UniProt and AMP peptides obtained from StarPep is not clear.

Author reply: We thank the reviewer for noting this lack of clarity. We agree that our discussion of the input dataset was insufficiently detailed, and we have rewritten it accordingly (see below in our modified dataset construction Section 2.1, page 2). When pre-processing the StarPep and UniProt datasets, we found 2,100 duplicates between the two datasets, and 317 AMPs from StarPep were also found within the UniProt dataset. The duplicates were removed from the UniProt dataset, leaving only the correctly labelled peptides from StarPep. We include in the SI the physicochemical property distributions comparing the known AMPs with the unlabelled peptides from both StarPep and UniProt. We apologise for the error in the dataset construction process and have re-trained all models on the new dataset. We found that the new results are in agreement with the previous results and in some cases further reinforce conclusions previously made. The updated results and corresponding changes made to the paper can be found at the bottom of this response in the "additional changes" section.

2) What is the length distribution for the peptide sequences? Length 1 is only a single residue; why did the authors select these length conditions?

Author reply: The length distribution for the training sequences has been added to the supplementary information figures (Fig. S9). Since most known AMPs are less than fifty amino acids long, we found, unsurprisingly, that including the long tail of outlier sequences longer than this significantly impacted model performance by sparsifying the data too much; therefore, we consider only sequences of the aforementioned lengths.

We thank the reviewer for noting that this was unclear, and we have clarified this point in section 2.1 (see our response below on rewriting the section on Dataset Construction)

3) Please try to clarify the number of examples in the dataset, adding more information on positive and negative examples and their proportions.

Author reply: An updated discussion of the input data can be found on pages 2-3, Section 2.1, some details of which are discussed in our responses to the questions on the length distribution of the training sequences and the development of the negative dataset.

3) The development of the negative dataset (non-AMP) is a bit complex, because a simple random selection of the peptides obtained from the UniProt query is not informative. It is necessary to demonstrate that the peptide sequences are effectively non-AMP. I think the authors could build the datasets using only reviewed sequences and then remove redundancy, perhaps using MSA evaluations, clustering, or other strategies to choose representative elements for the negative examples.
Author reply: We thank the reviewer for this very pertinent comment. We agree that we insufficiently explained our rationale for the employed dataset, which could lead to significant confusion for the reader, and as such we have rewritten the section on Dataset Construction in a more detailed manner, in an effort to clarify our choices, rationale, and the final details of the dataset. We also noticed during our revisions that we had made a small error in curating the dataset and inadvertently removed duplicates from UniProt and StarPep separately, rather than removing duplicates from the combined set. We regret the error and have retrained our models with the corrected dataset, as well as regenerating our figures. There was no significant impact on our results.

The datasets used in this study are a subset of peptides from the UniProt database [Bateman2019], composed of 268,195 short peptide sequences with a maximum sequence length of 50 residues, and sequences from the StarPep database, which features short bioactive peptides [Pinacho-Castellanos2021]. We restrict our training set to sequences of lengths 2-50 because a significant majority of AMPs are short peptides of length <50 [Ramazi2022], and 85% of the peptides featured in StarPep are between 2 and 50 amino acids long. Preliminary testing demonstrated that retaining peptides up to 100 amino acids in length led to worse performance, presumably due to the heightened sparsification of the dataset in that regime.

Out of the total 45,120 peptides in the StarPep database, 13,437 are labelled as having antimicrobial properties (Fig. S2). StarPep aggregates bioactive peptides from over 40 existing datasets featuring peptides with sequence lengths between 2 and 100 amino acids. After removing from the dataset the 15% of peptides featured in StarPep with sequence lengths greater than 50 amino acids, as well as peptides with non-standard amino acids, we retained 35,806 of the original 45,120 sequences, of which 10,841 are labelled as having antimicrobial properties. The remaining 24,965 peptides from StarPep did not carry the label "antimicrobial".
Although certain deep learning algorithms can perform well when trained on “smaller” datasets of tens of thousands of datapoints [Nagarajan2018, Capecchi2021], these are more typically classifiers rather than generators. When we trained our models on the 35,806 datapoints from the StarPep database alone, we found that the models demonstrated poor reconstruction accuracy, low robustness, and high error. One method for solving this is to pretrain on a large corpus and fine-tune on a smaller one; however, we chose instead to train all at once on a larger corpus, as has previously been done for natural language models and chemical language models [Gomez-Bombarelli2018, Polykovskiy2020, Tucs, Chen2018a, Segler2018].

Since preliminary analysis of the StarPep database sequences indicated that 35,806 was not a large enough dataset to properly train our generative VAE-based models, we expanded the dataset by adding negative examples from UniProt. After executing the query [(length:[2 TO 50]) AND (keyword:(KW-0929))], we found 714 unreviewed peptide sequences with the label "antimicrobial" and, placing "NOT" before the keyword, 3,713,736 unreviewed peptide sequences without the label "antimicrobial." We also found 1,108 reviewed peptide sequences with the label "antimicrobial" and 11,941 reviewed peptides without the label "antimicrobial". Since we would not have been able to significantly expand our dataset using solely reviewed sequences, we chose to include some non-reviewed sequences. We recognize that in the worst case this may include a potentially substantial number of AMPs that have not been identified as such. We considered the possibility of utilizing an existing sequence predictor for AMPs to identify these possibly unlabeled AMPs; however, we decided against it to avoid introducing unknown assumptions into the data at this stage. Instead, we consider a different goal for ordering the space. We employ a property predictor that classifies as hits peptides with experimentally-verified antimicrobial properties (which we label with integer 1) and as misses peptides without experimentally-verified antimicrobial properties (which we label with integer 0), and we demonstrate that training such a predictor enforces an ordering of the space that allows generation of AMP-like sequences.

To avoid having a prohibitively small percentage of experimentally-verified AMPs in the dataset, and to match the sizes of datasets used in previous experiments with similar architectures, we subsampled the 3,713,736 unreviewed UniProt datapoints, selecting 268,195 to form an additional set of datapoints. Subsampling was done at random, subject to the constraint of retaining a roughly equal number of peptide sequences of each length to ameliorate the extent to which the models focused on this aspect, though there are fewer very short sequences due to combinatoric constraints (Fig. S9).
Together, the 268,195 peptides from UniProt and the 35,806 from StarPep form our full dataset of 304,001 peptides, of which 10,841 were labeled "verified AMPs" (all from StarPep) and 293,160 were labeled "non-verified-AMPs" (a mixture of UniProt and StarPep sequences). We note here again that we only labeled as AMPs those sequences that are experimentally verified as such. Although this could introduce a bias into the property predictor from the unreviewed sequences, we believe that this choice is justified because our goal is not to classify sequences but to order the space in a sensible manner. After ensuring there was no sequence redundancy between the merged datasets, we inspected various physicochemical properties of the "verified AMP" and "non-verified-AMP" labeled data sets to ensure that they represented a broad distribution over peptide sequence space and had largely similar distributions of various physicochemical properties (Figs. S10-S19).
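A minimal sketch of the length-balanced subsampling described above (the data-frame layout and column name are hypothetical; this is not our actual curation script):

```python
import pandas as pd

def subsample_by_length(df, n_total, seed=0):
    """Randomly subsample while retaining roughly equal counts per
    sequence length (2-50); very short lengths contribute fewer
    sequences simply because fewer unique ones exist."""
    per_length = n_total // 49                       # lengths 2..50
    parts = []
    for _, group in df.groupby(df["sequence"].str.len()):
        k = min(per_length, len(group))              # combinatoric constraint
        parts.append(group.sample(n=k, random_state=seed))
    return pd.concat(parts, ignore_index=True)
```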

We clarify the AMP/non-AMP labeling on page 9, Section 3.2:

To better visualize the latent space, in Fig. 5 we plot the test set of sequences embedded in the reduced latent representation for each model and color the scatter points according to whether they are verified-AMPs (orange) or non-verified-AMPs (blue). We observe agreement between the Silhouette scores previously discussed and the visual clustering in the scatter plots. The AAE-128 and Transformer-128 models form more distinct clusters than the other models, though all clusters still overlap, as desired.

1) What is the difference between your autoencoders and representation-learning methods such as Seq2Vec, UniRep, Babbler, and AminoBERT, among others?

Author reply: There are three major differences between our variational autoencoders and the learned representations mentioned. The first is that most of these representations are exceedingly high dimensional, on the order of thousands of dimensions, which is not desirable for interpretability, for the eventual use case of our latent spaces as search spaces, or for robust dimensionality reduction. The second is that these models are generally trained on the longer sequences of large proteins, whereas our model is specifically trained only on the short peptide sequences characteristic of bioactive peptides. The third is that most such representations are one-way, providing embeddings into a latent space without providing a way to reconstruct those points in sequence space, which is not desirable for our use case. It may be possible in the future to approach this problem differently by fine-tuning a model such as AminoBERT for our specific use case, but this is out of the scope of the current work. We thank the reviewer for the question, and we have added a paragraph to the introduction to clarify this point (page 2):

While previous methods have focused on the longer proteins that represent a larger fraction of the known proteome, this research focuses on small antimicrobial peptides (AMPs). Much of the previous work discovering protein embeddings with deep neural networks has used large latent space representations [Chowdhury2021, Alley2019, Kim2020] to maximize data throughput, or graph-based representations, which require the use of graph neural networks to process the protein graphs. In this work we emphasize small latent representations and model interpretability in order to construct interpretable search spaces for AMP design.

2) Why did the authors select VAEs to build the numerical representations?

Author reply: We selected VAE-based models in this case because we wanted a set of mathematically-rigorous generative models with clearly associated continuous latent spaces that had previously been used on biomolecular design problems. Two of the most well-known generative deep learning model types are VAEs and generative adversarial networks (GANs). GANs tend to display higher performance, at least in the more mature field of image generation; however, they can be more subject to instability due to their adversarial nature, and they do not have a simple pipeline that can take in a new sample and generate its respective latent representation. The Gaussian prior assumption on a VAE leads to benefits because it allows us a natural way of acquiring a Gaussian distribution of output samples “near” a given embedded point, if desired.

To clarify this point, we have added a paragraph to Section 2.2 on Deep Learning Models (Page 3):

The variational autoencoder is a model introduced in [Kingma2014] that combines an "encoder" and a "decoder" to form a variational inference model. Specifically, given a dataset with a marginal likelihood p_θ(x), the objective of the VAE is to learn the parameter set θ that most closely reproduces the data's distribution p(x) [Kingma2019, Kingma2015, Kingma2016]. VAEs operate under an evidence lower bound (ELBO) maximization objective that leads to a joint maximization of the marginal likelihood p_θ(x) and a minimization of the Kullback-Leibler (KL) divergence of the encoder-approximated posterior q_φ(z|x) from the true posterior p_θ(z|x), where z represents the embedding variables of the latent space and p_θ(z) is the latent space prior, which for the standard VAE is a Gaussian distribution.
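For reference, the ELBO objective described above takes the standard form from [Kingma2014], in the notation just introduced:

```latex
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta,\phi;x)
  = \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]
  - D_{\mathrm{KL}}\!\left( q_\phi(z\mid x) \,\|\, p_\theta(z) \right),
```

with p_θ(z) = N(0, I) for the standard VAE.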

There are multiple types of generative models; the two most well-known and often-used are VAEs and generative adversarial networks (GANs). We choose to employ VAE-based models because VAEs boast a number of properties of particular relevance to our specific case of developing smooth, interpretable latent spaces with potential use as search spaces for computer-aided biomolecular design. In comparison to GANs, VAEs have a more natural formulation of an associated latent space, with a rigorous mathematical derivation from Bayesian probability theory [Kingma2019]. The assumption of a Gaussian prior enforces a certain level of smoothness and continuity that is desirable in a design space, and allows for the sampling of Gaussian distributions "near" any defined point in the space. Finally, on a practical note, they have demonstrated utility in the field of de novo small-molecule design [Gomez-Bombarelli2018], and we wanted to determine their utility for sequence-based AMP design as well. VAEs also feature an encoder which can directly map new samples of interest to their respective latent vectors, a feature not present out of the box in GAN architectures.

3) It is not clear whether the different architectures evaluated in this work were created by the authors or were reported previously.

Author reply: We thank the reviewer for noting that this point was unclear. We provide the following additional information on Page 4, to clarify this point:

In what follows, we introduce the VAE models and their unique differences. Briefly, the RNN, RNN-Attention, and Trans-VAE models all employ the typical VAE bottleneck with mean (μ) and log-variance (log σ²) network layers, while the AAE and WAE employ unique loss functions which regularize the latent space to match a set distribution. We started with three publicly available models modified from work by [Dollar2021], in which the authors compared a recurrent neural network (RNN), an RNN with an attention layer, and a Transformer to assess their respective capabilities for generating SMILES strings describing novel molecular compounds for drug design. The models all made use of a VAE architecture that generated smooth latent spaces. The results demonstrated the benefits of adding self-attention to deep generative models, and the authors concluded there is a trade-off between variational models in terms of their ability to properly reconstruct input data and their ability to embed complex information about the inputs into a continuous latent space. We modified these three models to take as input and output sequences of up to fifty amino acids rather than SMILES strings. In addition, we derived two more models, an adversarial autoencoder (AAE) [Makhzani2015] and a Wasserstein autoencoder (WAE) [Tolstikhin2018], by modifying the latent spaces and architectures of the original three.

4) How did the authors find the best hyperparameter configuration? It is necessary to clarify the batch size and number of epochs used during model training.

Author reply: The batch size (200) and number of epochs (300) are reported in Table 1 in Section 2.3. To clarify this point, we have updated the title of Section 2.3 from "Model Training Procedure" to "Model Training and Hyperparameter Configuration". The batch size was selected such that batches could fit on the NVIDIA Tesla P100 graphics cards used for training, and the number of epochs was selected to mimic the total quantity of inputs used in the Dollar implementation [Dollar2021] of the models, which input around 1 million samples and trained for 100 epochs. In our case, with 300,000 samples, we train for 300 epochs. The individual model loss curves were used as an indicator of model training performance. The β KL-annealing terms and learning rate hyperparameters were obtained from the original implementation by Dollar et al. The training hyperparameter information is presented in the re-worked section "Model Training and Hyperparameter Configuration" on page 6.

The different subsets were collated into a total dataset of 304,001 peptides with lengths between 2 and 50 amino acids. This total dataset was split into an 80-20 train-test split, resulting in a training set of 243,201 sequences and a test set of 60,800 sequences.

We fixed the parameters to maintain consistency between the different models and allow authentic comparison of the fully trained networks. The RNN, RNN-Attn, and Transformer feature a Kullback-Leibler divergence annealing term β (see Table 1). β starts small and increases in weight as training progresses to avoid posterior collapse. Batch sizes of 200 were found to be ideal in that they fit on the NVIDIA Tesla P100 graphics cards used for training and result in good performance. 300 epochs was found to be an appropriate number by a posteriori inspection of the loss curves.
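As an illustration, a minimal sketch of a linear KL-annealing schedule of the kind described above (the actual β values and ramp shape are those in Table 1 of the manuscript, not these placeholders):

```python
def beta_schedule(epoch, n_epochs=300, beta_max=1.0, warmup_frac=0.5):
    """Beta starts near zero and ramps up linearly, discouraging early
    posterior collapse; the KL term is fully weighted after the warmup."""
    warmup = warmup_frac * n_epochs
    return beta_max * min(1.0, epoch / warmup)

# Per-epoch objective for the RNN, RNN-Attn, and Transformer variants:
# loss = reconstruction_loss + beta_schedule(epoch) * kl_divergence
```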

5) In Section 3.2 the authors talk about the binary classification models. What is the model? How did the authors train this model?
Author reply: The binary classification model is a two-layer neural network that takes the latent space embedding as input and produces a prediction of AMP or non-AMP as its output. It was trained with a binary cross-entropy loss in conjunction with the training of the VAE model on the ELBO loss. To clarify this point, we have added the following paragraph to page 5:

All models feature an associated property predictor in the form of a binary classifier that predicts whether the latent representation of a given peptide is a verified-AMP or not. This classifier is a two-layer fully connected neural network trained with a binary cross-entropy loss that is joined to the KL-divergence loss and the peptide reconstruction accuracy loss of each model.
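A minimal PyTorch sketch of such a classifier and the joint objective (layer widths and variable names are illustrative, not the exact values from the paper):

```python
import torch
import torch.nn as nn

class PropertyPredictor(nn.Module):
    """Two-layer fully connected binary classifier on the latent vector."""
    def __init__(self, latent_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)  # logit for "verified-AMP"

# Joint objective (sketch): the BCE term is added to the reconstruction
# and KL terms so that the classifier shapes the latent space.
# bce = nn.BCEWithLogitsLoss()(predictor(z), is_amp.float())
# loss = reconstruction_loss + beta * kl_divergence + bce
```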

1) Figure 1 could be more informative if the scheme showed how the information is processed, the layers, and other relevant details, for example the dimensions of the input.

Author reply: We included Figure 1 as more of a broad overview for the reader. Rather than changing it, we instead include a more detailed figure in the SI and refer to it in the caption of Figure 1.

2) What is the amino acid index? The authors should define it in the document to improve the understanding of the proposed work.
Author reply: The amino acid index is intended to refer to the token distance along the sequence. To clarify this point we have renamed it to “sequence position index” and modified the caption:

The position-based average reconstruction accuracies, R_i_acc, with corresponding confidence intervals for the five models. The x-axis refers to the token position along the sequence, running from the first generated token to the last generated token. Each model features three variants with latent space sizes of 32, 64, or 128 dimensions, respectively. Error bars are 95% confidence intervals on the respective index-wise means, computed under the assumption of Bernoulli sampling. As expected, model performance decreases with distance along the sequence in most cases.
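Under the Bernoulli sampling assumption, the 95% interval takes the standard normal-approximation form (notation introduced here for illustration):

```latex
\hat{p}_i \;\pm\; 1.96\,\sqrt{\frac{\hat{p}_i\,(1-\hat{p}_i)}{n}},
```

where p̂_i is the fraction of correctly reconstructed tokens at position i and n is the number of test sequences contributing to that position.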

3) Why did the authors select the Bernoulli distribution in Figure 6?

Author reply: The probabilities on the y-axis are obtained by summing the peptides AMPlify has predicted to be positive AMP samples and dividing by the total number of input peptides. The Bernoulli confidence interval is used because each peptide input into AMPlify is returned as either positive or negative. This has been clarified in the caption of Figure 6, Section 3.3.

The AMPlify QSAR classifier [Chen2018] is a binary classifier that returns True or False, and an average AMP probability is taken over the 100 local samples. All five models and their respective latent-dimension variants are plotted.

1) Normally, in this type of problem, t-SNE evaluations are used more often than PCA strategies; also, it is not clear why the authors selected PCA methods.

Author reply: We employed PCA rather than t-SNE because, as a linear dimensionality reduction technique, it provides an invertible and interpretable mapping. We clarify this point by modifying Section 3.2, page 8, lines 4-8:
It has been shown that the simultaneous training of a property predictor on a latent space leads to a desirable "ordering" of that space, visually identified in the top principal components. Although there are many such dimensionality reduction techniques available, we employ PCA here because it provides an interpretable and invertible mapping to a visualization-friendly space and has previously been used in the field of de novo molecular generation [Gomez-Bombarelli2018]. In this section, we identify ordering in the top principal components of our models and go one step further by quantifying the extent of the ordering and the PCs in which it appears.

2) It is not clear how the PCA was applied and why; please give more details about the hypotheses necessary to apply PCA.
Author reply: We clarify this point with the addition of a paragraph on Page 8, Paragraphs 2-3:
We employ a simple binary classifier trained to predict whether a given sequence falls within the experimentally validated AMP category or the unvalidated category. Although such a predictor can be of use in assessing the specific properties of a given sequence of interest, in this case we use it rather as a tool to order the space, such that we may expect interpolation between points in the latent space to generate sequences with desirable properties. In other words, for our case the property predictor primarily functions to ensure that the neighboring points of an experimentally-verified AMP in the latent space will display similar properties. We can consider this a way of enforcing a meaningful distance metric and quantifying how well we have done so. In Sec. 3.4, we also address the question of how meaningful it is to identify this "ordering" in a linear projection rather than in the full latent space.

After model training with the binary classifier, we perform principal components analysis on the associated latent space of each model (32-, 64-, or 128-dimensional, respectively) and project onto the top five highest-variance components of the PCA decomposition. Based on previous studies that employed PCA for two-dimensional projections, we expected to observe latent space ordering, that is, a partitioning of the data points according to the label predicted by the property predictor, in the top one or two principal components of each model. To our surprise, this was not the case (Fig. 5).
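A minimal sketch of this projection step using scikit-learn (the random array is a stand-in for a model's latent embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the (N, d) latent embeddings of the test set;
# in the paper d is 32, 64, or 128.
z_test = np.random.randn(60800, 128)

pca = PCA(n_components=5).fit(z_test)
top5 = pca.transform(z_test)           # (N, 5) projection examined in Fig. 5
fve = pca.explained_variance_ratio_    # per-PC fractional variance explained
```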

3) It could be interesting to explain what it means that the ordering proposed in Section 3.2 does not appear in PC1 and PC2 but appears at other PCA ranks.

Author reply: We agree with the reviewer and have modified the text to clarify and address this point via additions to Sec. 3.2 and to the Discussion and Conclusions section (pages 8, 16-17):

Further investigation revealed at least some ordering in one or two of the top five principal components for four of the five models. The fifth model, the RNN-Attn, performs so poorly that we do not expect to observe ordering no matter how many PCs we analyze. This indicates that, despite being explicitly "instructed" via the loss function of the importance of the verified-AMP/non-verified-AMP labeling, most models do not "pay attention" to it to the exclusion of other characteristics of the sequences. We may quantify this idea further by investigating the fractional variance explained by different PCs as a proxy for the amount of information contained.
When the extent to which these different components capture the overall latent space variance is assessed (Fig. S3), we observe that the models differ significantly in how much fractional variance explained (FVE) is contained within the top five PCs, ranging from only about 10% in the Transformer-128 model to over 40% in the AAE-128 model. The more variance is contained within the top five PCs, the more it tends to be concentrated in the first PC: the AAE and RNN demonstrate a much more significant drop from PC1 to PC2 than the other three models (overall we observe first-PC FVEs of 12% (min) and 35% (max) for the AAE and 4.5% (min) and 5.5% (max) for the RNN, compared with closer to 1-2% for the other models). Trends are largely independent of latent space dimensionality. In general, the Transformer and WAE models of various sizes contain the least FVE in their top five components, whereas the AAE models contain the most. We observe that in all models other than the Transformer, PC1 is correlated with molecular weight or length (see Sec. 3.4), with all other variances being of roughly equivalent magnitudes. Overall, we may interpret this as telling us that the models place the greatest importance on length, while other descriptive variables, even those we attempt to emphasize a priori, are assigned similar levels of importance.

Discussion and Conclusions

We have trained five deep learning models with VAE-like latent spaces on a library of sequences of short proteins and assessed the characteristics of the resultant models, including a careful examination of the properties of the latent spaces associated with them. We show that the models have the capability to create smooth, continuous latent spaces that can be sampled for de novo AMP generation and optimization. The inclusion of a simultaneously-trained property prediction network disentangles verified-AMP and non-verified-AMP sequences; however, to our surprise we find that, unlike in previous studies, this is not always apparent in the first two principal components derived by PCA. Different models associate different information with highly-varying PCs, and different models vary drastically in the ways in which they encode variance, and, hence, information. It is important to note that this presents a challenge for the incorporation of domain knowledge, since we see that the model does not necessarily place greater emphasis on user-provided information. Furthermore, it sounds a note of caution in the interpretation of latent space orderings, since observed orderings may occupy only a small fraction of the informational content in the model latent space. We have also addressed the question of how meaningful the use of PCA is as a tool for indicating properties of VAE-like latent spaces and argued that the less distortion imposed by PCA upon the different neighborhoods and interactions between points in the manifold, the more clearly interpretable the latent space is. The analysis that we present here may be applied to other short peptides of interest, such as anti-cancer peptides. Based on our results, we would suggest retraining our Transformer, AAE, and WAE models in conjunction with a new cancer/anti-cancer property predictor that relates to the property of interest. One could then employ a similar analysis and pipeline to assess the quality of the resultant latent spaces and generate sequences of interest.

4) It is necessary to clarify the application of the Spearman correlation to the PC components.

Author reply: Originally, because we wanted to identify a single PC, rather than a pair of PCs, that was most related to AMP/non-AMP labeling, we employed the Spearman correlation coefficient as a metric. After deliberation, however, we concluded that the Silhouette score in 1D is the ideal metric to evaluate the AMP correlation of a single PC. The analyses have been redone using the Silhouette score in 1D instead of the Spearman correlation coefficient. To clarify this point, we modify page 13 and Figure 9, as outlined in the results modification section at the end of this review response.

We consider a series of physicochemical properties of peptides that are measurable from sequence alone: Aliphatic Index, Boman Index, Isoelectric Point, Charge pH=3, Charge pH=7, Charge pH=11, Hydrophobicity, Instability Index, and Molecular Weight. We measure each of these properties for the test set sequences using the peptides python package [Larralde2021]. We employ the Silhouette Score in one dimension to measure correlation between individual PCs and verified-AMP/non-verified-AMP labeling, and we use the Pearson correlation coefficient (r) to measure correlation between individual PCs and the continuous physicochemical properties.
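A minimal sketch of these measurements, assuming the peptides.py API ([Larralde2021]); the sequences and PC coordinates below are illustrative placeholders, not data from the study.

```python
# Sketch: measuring sequence-derivable physicochemical properties with the
# peptides package and correlating one property with a principal component.
import numpy as np
import peptides
from scipy.stats import pearsonr

# Placeholder sequences (not the study's test set).
sequences = ["GIGKFLHSAKKFGKAFVGEIMNS", "ILPWKWPWWPWRR",
             "KWKLFKKIEKVGQNIRDGIIKAGPAVAVVGQATQIAK", "MSTNPKPQRKTKRNTNRRPQDVK"]

mol_weights, charges_ph7 = [], []
for seq in sequences:
    pep = peptides.Peptide(seq)
    mol_weights.append(pep.molecular_weight())
    charges_ph7.append(pep.charge(pH=7.0))

# Placeholder PC coordinates for the same sequences (stand-in for real PCA scores).
pc_scores = np.random.default_rng(1).normal(size=len(sequences))
r, _ = pearsonr(pc_scores, mol_weights)
print(f"Pearson r(PC, molecular weight) = {r:.2f}")
```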

1) 128 dimensions appears to give the highest performance using a Transformer strategy; is this ok?

Author reply: The Transformer has outperformed the other models in most metrics we analyze and is a good model to select for downstream peptide design. Drawbacks we identify for the Transformer are increased distortions in the PCA representation of the latent space and significant degradation of performance for smaller latent spaces. With this in mind, we add some clarification to the paper by introducing the top-performing models in the pipeline section (see case study).

1) Why did the authors apply the silhouette coefficient? There are other, more interesting metrics, such as the Davies-Bouldin index and the Calinski-Harabasz index.

Author reply: We thank the reviewer for this valuable suggestion. Our interest in the Silhouette coefficient stems from its boundedness: scores are limited to between -1 and 1. We identify a score of 1 as problematic given the Gaussian prior imposed on the latent space, and it is of interest to keep track of the separation between clusters. We add this justification to the text in the introduction of the Silhouette score metric, Page 8.

We use the Silhouette score as our metric because its boundedness gives it a natural interpretation in this context. Scores that are too close to zero indicate that no clustering has occurred, whereas scores too close to one indicate a disconnected latent space, which might mean the model has failed to satisfy the Gaussian prior. Ideally, therefore, we would observe moderate Silhouette scores, and indeed we do.

2) Depending on the type of possible readers, it is not necessary to explain the SS method; it could be added to the supporting information. Also, it is necessary to add more details about the estimation of SS using the principal components evaluated (pair combinations).

Author reply: We agree with the reviewer that more details about how the Silhouette score is computed for PC pairs can clarify ambiguities in the methodology. We add a few sentences about this in the text on Page 8, Section 3.2:

The Silhouette score for each PC pair is calculated by taking subsamples of PC-pair vectors and computing their Silhouette score relative to the known verified-AMP/non-verified-AMP properties. Once this has been computed for all pairs of PCs [1-5], we then find the pair that has the highest Silhouette score.
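As an illustration of this procedure, here is a sketch under assumed variable names (not the authors' implementation): score each pair of the top five PCs on repeated subsamples against the binary labels and keep the best-scoring pair.

```python
# Sketch: subsampled Silhouette scores for every pair of the top five PCs
# against binary verified-AMP labels, keeping the best-scoring pair.
from itertools import combinations
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
pc_coords = rng.normal(size=(5_000, 5))   # placeholder top-5 PC projections
labels = rng.integers(0, 2, size=5_000)   # 1 = verified AMP, 0 = non-verified

def best_pc_pair(pc_coords, labels, n_sub=1_000, n_rep=10):
    scores = {}
    for i, j in combinations(range(pc_coords.shape[1]), 2):
        reps = []
        for _ in range(n_rep):
            idx = rng.choice(len(labels), size=n_sub, replace=False)
            reps.append(silhouette_score(pc_coords[idx][:, [i, j]], labels[idx]))
        scores[(i + 1, j + 1)] = float(np.mean(reps))  # 1-indexed PC pair
    best = max(scores, key=scores.get)
    return best, scores[best]

pair, ss = best_pc_pair(pc_coords, labels)
print(f"Best PC pair: {pair}, SS = {ss:.3f}")
```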

We remove the detailed introductory explanation of the Silhouette score in the methods section on page 8 of the original article.

Response to Reviewer # 2

Comments:

The manuscript by Renaud and Mansbach presents a comparative study of five deep learning model architectures with VAE-like latent spaces for de novo generation of antimicrobial sequences. The paper first discusses the training of these models and analysis of their reconstruction accuracy considering both entire-sequence reconstruction accuracy as well as index-based reconstruction accuracy. Second, the paper analyzes the latent space using a predictive model and PCA, investigating possible correlations between the first five PCs and AMP-likeness/physicochemical properties in the latent space. Interestingly, the latent space orders sequences in a chemically meaningful way, even though the models have been trained on letter sequences only, although some of the correlations are moderate.
The paper finally discusses possible approaches to improve the models, which is the disappointing part of this report: a comparative analysis and then nothing in terms of novelty. The journal must decide if this type of work belongs, considering also that the story is well written and might help the field to progress.

Author reply: We thank the reviewer for their kind words and careful examination of our article. We address their specific points below. We also believe that they have identified a lack of clarity about the underlying thrust of our paper. We wanted to focus not on the overall performance of the models but on the quality of the latent spaces associated with them. We believe this is an important complement to the multiple extremely valuable contributions to the literature of high-performing models for AMP design. As such, we have tried to further clarify this point by modifying the introduction (reproduced below), as also mentioned in our response to Reviewer #1.

One primary point of interest that is beginning to attract attention is the quality of the latent space itself. Because in data-rich regimes, VAEs and other generative models can be extremely powerful, the focus in past work has been primarily on improving performance [Polykovskiy2020, Dollar2021, Prykhodko2019, Kusner2017, Grisoni2020, Dai2018]. However, construction of an interpretable and well-organized latent space not only tends to make targeted feature sample generation easier, it also significantly improves the ability of the user to apply domain expertise to the problem at hand. Therefore, although we do not ignore model performance, a major focus of this article is on interpretability, trustworthiness, and organization of the latent spaces themselves.

In work by [Gomez-Bombarelli2018, Dai2019, Grisoni2020], the authors demonstrated that training a property predictor on the latent space and forcing the model to consider the quality of its predictions as part of its overall goal leads to an ordering of the latent space, which is desirable for generating candidates with particular features. Because latent spaces tend towards dimensionalities on the order of 30 or above, this orderedness has been visually presented through projection onto a two-dimensional space, often by using principal components analysis (PCA) to identify the most relevant variables. Since PCA is a linear projection, it remains unclear how well this visual orderedness is preserved in the high-dimensional space, and therefore we make a point to address the trustworthiness of this simple, rapid, and useful analytic technique.

[removed paragraph to Methods section]

More generally, in this article, we focus on the question of latent space quality and interpretability for the relevant test case of generation of novel antimicrobial peptide sequences. We train five different deep generative models with VAE-like latent spaces and assess and compare their different behaviors in terms of reconstruction accuracy, generative capacity, and interpretability. We demonstrate that, as expected, a property predictor leads to an ordering in the latent space. We quantify this ordering and assess how well a PCA projection captures the properties of the original latent space. We argue that the better the PCA projection, the more interpretable the latent space, because we can apply a ranking to linear combinations of the latent space dimensions, allowing us to more easily identify bridge variables that tell us what the model considers important. We also show that the models are capable of generating unique and diverse sequences that grow more AMP-like in specific regions of the latent space.

In addition, to further improve the latter half of the paper (as also mentioned in our response to Reviewer #1), we have added a section where we perform a case study on a specific AMP of interest, which also demonstrates the pipeline by which we might use the models that we focused primarily on assessing, Page 2.

Methods

We trained five different models of three different latent-space sizes on a dataset of around 300,000 short peptides. In the following sections, we describe and justify the dataset and its properties, the model architectures, and the training procedure with relevant hyperparameters.

The following comments must be addressed:

1) The introduction misses important precedents of using generative models for AMP design with actual experimental validation, which must be cited:

J Biol Chem. 2018 Mar 9;293(10):3492-3509. doi: 10.1074/jbc.M117.805499.

Chem. Sci., 2021, 12, 9221-9232, doi:10.1039/D1SC01713F

Author reply: We thank the reviewer for drawing our attention to these relevant citations and regret missing them in our original draft. We have added them to the introduction, along with the following modifications, Page 2:

In the past five years, there has been an explosion of research in the field of deep generative models for small molecule drug design [Gomez-Bombarelli2018, Polykovskiy2020, Tucs, Chen2018a, Segler2018]. Most work in this area until now has been focused on the design of small molecules, with only a few studies on the design of short peptides [Das2021, Sercu2019a, VanOort2021a], but in the last year or two research in this area has begun to grow. For example, [Das2021] constructed a WAE model that formed a complete pipeline for de novo protein design and identified two sequences with potent antimicrobial activity against E. coli. In another paper by the same team [Sercu2019a], they created a tool (IVELS) for analysing VAE models throughout the training process, with which they could select the models that perform best. Two other relevant recent studies have reported on the use of an LSTM model to design peptides with experimentally-verified activity against multi-drug resistant pathogens, and the use of generative RNNs to design non-hemolytic AMPs [Nagarajan2018, Capecchi2021].

2) All presented models are trained on peptide sequences, which do not contain explicit chemical information. A comparison with models trained with explicit chemical structures (SMILES) would be interesting. The authors must in any case include a thorough discussion of possible representations and how they impact model predictions and performances.

Author reply: We thank the reviewer for the suggestion. While a comparison with models trained on explicit chemical structures is out of the scope of this work, it is a very useful suggestion for future work. To address the second part of this point, we have added a discussion of possible representations, along with their pros and cons, to the discussion section:

The models developed in this research used deep learning to discover embeddings for sequences of amino acids, but future work should investigate other peptide representations, such as protein structure distance graphs, which can embed structural information, and SMILES strings, as used in small-molecule drug design. SMILES strings encode chemical information into a sequence of characters, thus allowing the models to learn chemical distinctions between the amino acids.

3) Please add a use case for the best models, for instance by sampling either diverse or focused sets and investigating the properties. Going to experimental work would be even better, because this is usually when the pretty modeling collapses.

Author reply: As mentioned above, we have added a section where we demonstrate the potential use of the best models. We do agree with the reviewer that experimental verification is invaluable when focusing on design, but we feel it is slightly out of the scope of this work, because we focus not on design but on methods for engaging with, assessing, and understanding the latent spaces of the generative models themselves.

4) In figure 9, the dimensions of the latent space should be added directly in the plot (like you did in figure 4).

Author reply: We thank the reviewer for the suggestion and have added them to the plot.

5) The authors discuss PCA as the representation of the latent space. There are many other and often better dimensionality reduction methods (e.g. t-SNE, UMAP); it would be useful to comment on why PCA is better suited in this case.

Author reply: We agree with the reviewer that in terms of dimensionality reduction, many algorithms exist that can be fruitfully applied for identifying low-dimensional descriptions of higher-dimensional data. We focus on PCA in this work because it has been a popular visualization method in the de novo biomolecular design sphere, and because its linearity and invertibility afford it unique value in the assessment and utilization of latent spaces as smooth and continuous search spaces. We believe that our demonstration that PCA does not imply a significant distortion of the latent space in this case is a good a posteriori justification for its use in our and other similar works, which, as far as we know, has not been performed previously in this context. We modify the manuscript to clarify these points (as also mentioned in our response to Reviewer #1):

It has been shown that the simultaneous training of a property predictor on a latent space leads to a desirable “ordering” of that space that can be visually identified in the top principal components. Although many dimensionality reduction techniques are available, we employ PCA here because it provides an interpretable and invertible mapping to a visualization-friendly space and has been previously used in the field of de novo molecular generation. [Gomez-Bombarelli2018] In this section, we identify ordering in the top principal components in our models and go one step further by quantifying the extent of the ordering and the PCs in which it appears.

We employ a simple binary classifier trained to predict whether a given sequence falls within the experimentally validated AMP category or the unvalidated category. Although such a predictor can be of use in terms of assessing the specific properties of some given sequence of interest, in this case we use it rather as a tool to order the space such that we may expect interpolation between points in the latent space to generate sequences with desirable properties. In other words, for our case the property predictor primarily functions to ensure that the neighboring points of an experimentally-verified AMP in the latent space will display similar properties. We can consider this to be a way of enforcing a meaningful distance metric and quantifying how well we have done so. In Sec.3.4, we will also address the question of how meaningful it is for us to identify this “ordering” in a linear projection rather than in the full latent space.
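For readers unfamiliar with this construction, the following is a minimal sketch of such a jointly-trained predictor head (an assumed architecture for illustration; the paper's exact network may differ). Its loss is added to the generative objective, which is what encourages the latent space to order itself by the AMP labels.

```python
# Sketch: a binary property-predictor head on the latent code z. Its loss is
# added to the reconstruction/regularization terms during training, which
# "orders" the latent space with respect to verified-AMP labels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AMPPredictor(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # single logit: verified AMP vs. not
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)

predictor = AMPPredictor(latent_dim=128)
z = torch.randn(32, 128)                # batch of latent codes from the encoder
y = torch.randint(0, 2, (32,)).float()  # placeholder verified-AMP labels
pred_loss = F.binary_cross_entropy_with_logits(predictor(z), y)
# total_loss = recon_loss + kl_loss + lambda_pred * pred_loss  (joint objective)
```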

Additional Edits

We identified a minor error, which we regret making: data assessed for charge at pH 11 was incorrectly labeled “pH 9” (pg 13). This has been rectified.

Re-training the models resulted in slight changes to numerical values of the results which were not significant and in fact reinforced our conclusions. We outline these changes and their respective locations in the revised article below.

Changes to the reconstruction accuracy results, page 6 right column.

Models of 64 dimensions or greater generally display moderate to high performance on the reconstruction task. They exhibit an accuracy of at least 60% for tokens up to position twenty (Fig.2), which is of interest because most AMPs have sequence lengths of under twenty amino acids. The entire-sequence reconstruction accuracy (Fig.3) is between 50% and 70% for the 128-dimensional AAE, RNN, RNN-Attention, and WAE, and about 97% for the Transformer, although reconstruction accuracy diminishes with decreasing latent space dimensionality.

It is of interest to note that in all models except the 128-dimensional Transformer, we observe an almost monotonic decrease in reconstruction accuracy for tokens later in the sequence. All 32-dimensional models except the Transformer approach the accuracy of a random guess (5% for any one of twenty possible amino acids) near the 50th position. This clear length-dependent effect on accuracy is expected behavior arising from the increasing difficulty of the predictive task at each successive token position: the prediction depends primarily on the previous tokens in the sequence, leading to a compounding error effect. The difficulty of this task is also exacerbated by model training with teacher forcing, which corrects mistakes earlier in the sequence during training but not during testing. The 128-dimensional Transformer clearly has the capacity to represent and retain entire sequences in a holistic manner, as it was intended to do.

Indeed, the Transformer model achieves by far the highest accuracy (97%, 128 dimensions) but is also the most affected by diminishing latent dimensionality: the mean accuracy drops precipitously from 97% to 26% when going from 128 dimensions to 64 dimensions, whereas the AAE and the RNN both achieve nearly 50% reconstruction accuracy with 64 dimensions. All models display increasing accuracy with increasing dimensionality. All attention-based models display a greater improvement when increasing from a dimensionality of 64 to a dimensionality of 128 than when increasing from a dimensionality of 32 to a dimensionality of 64.
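A sketch of the index-based accuracy computation underlying these figures (assumed data layout, not the authors' code; padding is handled implicitly by truncating at each sequence's length):

```python
# Sketch: per-position (index-based) reconstruction accuracy over a test set.
# `originals` and `reconstructions` are assumed paired amino-acid strings.
import numpy as np

def per_position_accuracy(originals, reconstructions, max_len=50):
    hits = np.zeros(max_len)
    counts = np.zeros(max_len)
    for orig, recon in zip(originals, reconstructions):
        for k, (a, b) in enumerate(zip(orig[:max_len], recon[:max_len])):
            counts[k] += 1
            hits[k] += (a == b)
    # Avoid division by zero at positions longer than every sequence.
    return np.divide(hits, counts, out=np.zeros(max_len), where=counts > 0)

# Toy pair: the first nine positions match, the tenth does not.
acc = per_position_accuracy(["GIGKFLHSAK"], ["GIGKFLHSAA"])
print(acc[:10])
```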

We report slight changes to variance explained numerical results, Page 8 right column.

The more variance is contained within the top five PCs, the more it seems to be concentrated in the first PC: the AAE and RNN demonstrate a much more significant drop from PC1 to PC2 compared to the other three models--overall we observe a drop in FVE of between 12% (min) and 35% (max) for the AAE and between 4.5% (min) and 5.5% (max) for the RNN, versus closer to 1-2% for the other models. Trends are largely independent of latent space dimensionality. In general, the Transformer and WAE models of various sizes contain the least FVE in their top five components, whereas the AAE models contain the most.

We report slight numerical differences in pairwise Silhouette Score values, Page 8 right column and page 9 left column.

If we rank the models by maximum Silhouette score across all pairs of the top five PCs, the Transformer and AAE perform the best, reaching a maximum of SS = 0.51 ± 0.02 for the Transformer at 128 dimensions and a maximum of SS = 0.3 ± 0.03 for the AAE at 128 dimensions, in PCs [1,3] and PCs [2,3] respectively, with both of them demonstrating moderate maximal clustering ability and an increase in clustering ability with latent space dimensionality. The WAE demonstrates poorer performance, displaying a maximum of SS = 0.24 ± 0.02 in PCs [2,4] at 64 dimensions; this performance is largely independent of latent space dimensionality. The RNN and RNN-Attention demonstrate the worst abilities as measured by the Silhouette score, with a maximum of SS = 0.1 ± 0.02 in any of the top five PCs for the RNN and SS = 0.02 ± 0.02 for the RNN-Attention.

We update the captions of figure 4 Page 9, and figure 5 page 10, to match the new top principal components.

The maximal Silhouette scores, SS, for the five models at each of the three latent-space sizes. Silhouette scores are shown for the principal-component pairs featuring the largest score among the top five PCs calculated with PCA (see Fig.S4 for Silhouette scores of all pairs of the top five PCs). The PC combinations from left to right are, AAE: [2,3], [1,5], [3,4], RNN: [4,5], [1,2], [1,2], RNN-Attention: [3,4], [3,4], [2,3], Transformer: [1,5], [2,4], [1,3], WAE: [2,5], [2,4], [2,5]. The error bars are generated by bootstrapped sampling of the latent space and calculating a 95% confidence interval computed under the assumption of Bernoulli sampling. The flier points indicate outliers beyond the whiskers of the interquartile range.

Scatter plot of maximally AMP-separating PCs presented as a visualization of the different latent spaces. We embed the test sequences into the latent space, perform PCA, and illustrate the PC pair corresponding to the largest Silhouette score (cf.Fig.4). The PC combinations are as follows, AAE: [2,3], [1,5], [3,4], RNN: [4,5], [1,2], [1,2], RNN-Attention: [3,4], [3,4], [2,3], Transformer: [1,5], [2,4], [1,3], WAE: [2,5], [2,4], [2,5]. Orange points denote known AMP sequences, and blue points denote known non-AMP sequences. Different models correspond to different visible separations in the latent space, though most models demonstrate at least some separation of AMPs and non-AMPs.

We note slight differences in the shape of the curves and results for the AMPlify sampling results. Page 11 left column.

We observe little or no (ρ_AMP < 0.4 for all tested values) partitioning for the 32- and 64-dimensional AAE and RNN-Attention. We observe a semi-linear partitioning--as the PC is increased or decreased, an area of zero AMP probability is followed by a monotonic, almost linear rise in probability--for all other models except the 128-dimensional RNN-Attention. Interestingly, the 64-dimensional RNN-Attention model shows a non-linear partitioning of its space with a region of high AMP probability, even though the PCA visualisation and the Silhouette score for this model did not demonstrate a clustering of AMPs in its latent space. Thus, we show that predicting experimentally-verified antimicrobial properties, even though many of the non-hits may also be AMP-like in nature, is in many cases sufficient to partition the space into regions of high and low antimicrobial peptide probability in general.

We update numerical results for the Jaccard 2-mer and 3-mer similarity scores, Pages 11 and 12.

The size of the latent space is not seen to have a drastic impact on the sequence similarity of peptides found in a local neighbourhood for any of the models. An interesting dynamic is displayed for some models, in which the samples become slightly more diverse as the center of the latent space is approached. This is seen in the “U” shape, especially in the RNN and WAE models, and is likely due to the Gaussian nature of the space packing more peptides near the center, thus increasing sampling diversity.

The Jaccard similarities for the 2-mers and 3-mers are rather low, but still significantly higher than would be predicted by random chance. Due to the nature of the Jaccard score, this indicates a balance between shared and dissimilar fragments, particularly for the Transformer model. The 128-dimensional Transformer has significantly higher 2-mer and 3-mer similarity scores, from 0.24 to 0.34 for 2-mers and 0.11 to 0.26 for 3-mers, than all other models (< 0.22). This is a good indication that local points in the space share similar constituent sequence fragments. Interestingly, we observe heightened similarity for the RNN and WAE models in the region corresponding to heightened AMP probability (i.e. towards the PC maximum value), whereas for the Transformer-128 the trend is opposed: there is a region of heightened similarity near the minimum, corresponding to a low probability of AMP-like sequence generation, and a region of somewhat lower similarity near the maximum, corresponding approximately to the region of heightened probability of AMP-like sequence generation.

For clarity, we modify the numerical value of the 2-mer score for two random sequences to 0.06 to match the figure values. Page 11 right column.

Very roughly, for the 2-mers, a Jaccard score of 0.06 is the expected value for two random sequences of length 50, and it is significantly lower than that for the 3-mers.
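For reference, the k-mer Jaccard score used here can be computed as below (a sketch; the paper may differ in how sequence pairs are sampled and averaged):

```python
# Sketch: Jaccard similarity over the sets of k-mers of two peptide sequences.
def kmers(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard_kmer(seq_a: str, seq_b: str, k: int) -> float:
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Two similar placeholder sequences share most, but not all, of their k-mers.
print(jaccard_kmer("GIGKFLHSAKKF", "GIGKWLHSAKKF", k=2))
print(jaccard_kmer("GIGKFLHSAKKF", "GIGKWLHSAKKF", k=3))
```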

We remove a results paragraph about the Jaccard score results that is no longer relevant.

The standard deviation of sequence similarity and Jaccard scores is comparatively high. The means are quite similar for most models, but the AAE model clearly tends to produce more dissimilar sequences than the other four. We speculate that this may be related to its adversarial nature; however, we note that it has no impact on its performance as measured by any of the other metrics considered.

We improve clarity in presenting results on bridge variables and update numerical values where necessary. Pages 13 left column and 15 left column.

For all models except the Transformer, all scores are >0.5 for all four metrics, and the scores tend to hover around 0.75, with some exceptions. While the 128-dimensional Transformer performs best according to the steadiness score (~0.8), the Transformer model in general performs worse on all other metrics than the other models, particularly cohesiveness, for which all three Transformer models perform worse than any other model; in particular, the 128-dimensional Transformer displays an exceedingly poor performance of just over 0.25.

Overall we observe it is possible to construct linear projections of model latent spaces with comparatively low overall distortions; however, the model that has thus far performed particularly highly on other metrics (the 128-dimensional Transformer) has the greatest distortion, particularly in terms of cohesiveness, meaning it potentially introduces a significant number of false groupings. The Transformer's comparatively high distortion overall underscores one of the traditional trade-offs of machine learning: with greater power comes lessened interpretability.

(Page 15)

We note that although the comparatively low Silhouette score in one dimension indicates that the score in two dimensions is a more appropriate quantification of ordering in the system, we may still use the one-dimensional score to indicate which PC the models consider “most” relevant for verified-AMP ordering, and thus identify whether those PCs simultaneously correlate with physicochemical properties.

For example, the 32-dimensional AAE, the 64-dimensional WAE, RNN, and Transformer, and the 128-dimensional AAE and RNN models employ one of the top five PCs most strongly for AMP ordering (Fig.9A), and the same PC is also correlated with Charge pH=7 (Fig.9B) and Isoelectric Point (Fig.9F) for these models (Fig.9H).

We adjust our discussion section to account for our new numerical values from the updated results section. Page 16 right column, 17 left column.

We have investigated the use of the principal components with the highest clustering of verified-AMP properties for de novo AMP generation and showed that our models generate highly diverse and unique sequences, with comparatively low sequence similarity in local neighborhoods. Despite the low similarity and the use of a predictor trained on experimentally-verified AMP properties rather than direct knowledge of AMP-like-ness, all models but the 32- and 128-dimensional RNN-Attention and 64-dimensional AAE are capable of successfully partitioning a single coordinate in the latent space into regions that generate AMP-like sequences with high probability and regions that generate non-AMP-like sequences with high probability. We observe that the capacity of the model to reconstruct input sequences is not clearly linked to its ability to partition the space, and we add our voices to a number of cautions against the over-use of reconstruction accuracy as a metric for generative models.

We have evaluated the extent to which models order their latent spaces according to standard peptide physicochemical properties that they are not trained on, and we identify the principal components most strongly correlated with given properties. We find that the models will order any of the first five principal components investigated according to physicochemical properties, but when a model needs to assign a larger proportion of the variance to learning peptide length, the first component is usually correlated with length / molecular weight. Indeed, only the 128-dimensional Transformer eschews length as a consideration almost entirely (no length dependence observed in the reconstruction accuracy, small correlation between any top-five PC and molecular weight). The Transformer in general clearly has the capacity to function independently of length but demonstrates a more rapid drop in performance as the latent space dimensionality is decreased than the RNN, WAE, or AAE. We speculate that this is because discriminating between lengths can actually discourage models from overfitting--that is, from simply “memorizing” the answers--and may encourage a more meaningful lower-dimensional representation, although we also note that the 128-dimensional Transformer shows by far the most heightened local fragment-based similarity in its verified-AMP-relevant PC. This could suggest a model relying on fundamentally different information from the others.

In terms of other relevant physicochemical variables, we observe moderate correlations (0.35 <~ Pearson r <~ 0.65) between at least one PC and isoelectric point / charge at pH 7 in the AAE-64, RNN-32, RNN-64, Transformer-64, and all WAE models, and moderate correlation between at least one PC and hydrophobicity in all but the RNN-Attention-32. As these are traditional hallmarks employed for AMP design, this is desirable behavior, and it in particular aids in the interpretability of the models through a linear mapping of the latent space variables to straightforward “bridge variables” for most models. It also shows that the models are capable of identifying relevant properties from sequences alone, despite being trained only with a binary property predictor.

We may further employ the bridge variables in conjunction with the probability of AMP generation along a single PC (Fig.6) to aid in the interpretability of the models. In our case, models demonstrating a monotonic linear increase in probability, especially those reaching a prediction probability of >0.6 (the 128-dimensional AAE, 64-dimensional RNN, 64-dimensional RNN-Attention, 64- and 128-dimensional Transformer, and 32- and 64-dimensional WAE models), are arguably the easiest to interpret, since we then essentially have a single linear mapping from the latent space to AMP probability, which for the RNNs and WAEs is comparatively non-distorted from the original space (cf. Sec.3.2).

We remove a paragraph in the discussion going into too many details of numerical results which we thought reduced clarity and succinctness of the discussion. Page 18 of the original article.

We modify page 11 section 3.3 to improve the description of how sampling is performed in the reduced space.

In contrast to the previous section, where we focused on PC pairs, in this section, for ease of calculation and visualization, we employ only a single variable per model variant. We choose the principal component (PC) having the largest Silhouette score with verified-AMP/non-verified-AMP labeling (see Fig.9A and further discussion of PC identification in Sec 3.4). We identify the “width” of the latent space by calculating the mean and standard deviation over the embedded sequences. Then we sample from ten evenly spaced regions that form a “line across” the latent space, limiting the min and max values of the region centers to ±4 times the standard deviation. The random sampling performed in each region follows a normal distribution with the mean determined by the region center and the standard deviation set to 20% of the standard deviation of the latent space (see Table S1). For external validation on a large number of generated sequences--difficult to perform experimentally--we assess in each neighborhood the probability of generating AMPs, ρ_AMP (Fig.6), by employing a previously published QSAR model. AMPlify is a deep neural network AMP classifier that has been shown to achieve 93.71% accuracy on the task of AMP classification [Chen2018].
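One possible reading of this sampling scheme is sketched below with placeholder names (the embeddings, PC direction, `decoder`, and `amplify_score` are all assumptions for illustration; here samples vary only along the chosen PC):

```python
# Sketch: sample ten evenly spaced neighbourhoods along the AMP-separating
# PC, spanning ±4 standard deviations of the embedded test set, with each
# region's Gaussian width set to 20% of the latent-space standard deviation.
import numpy as np

rng = np.random.default_rng(0)
z_test = rng.normal(size=(10_000, 128))   # placeholder test-set embeddings
pc_axis = rng.normal(size=128)            # placeholder PC direction
pc_axis /= np.linalg.norm(pc_axis)        # normalise to a unit vector

proj = z_test @ pc_axis
mu, sigma = proj.mean(), proj.std()
centres = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 10)

for c in centres:
    offsets = rng.normal(loc=c, scale=0.2 * sigma, size=100)
    z_samples = z_test.mean(axis=0) + np.outer(offsets - mu, pc_axis)
    # sequences = decoder(z_samples)      # hypothetical decode step
    # rho_amp = amplify_score(sequences)  # hypothetical AMPlify classification
```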

We update the numerical values in Table S1 in the supplementary information.




Round 2

Revised manuscript submitted on 30 Jan 2023
 

09-Feb-2023

Dear Dr Mansbach:

Manuscript ID: DD-ART-08-2022-000091.R1
TITLE: Latent Spaces for Antimicrobial Peptide Design

Thank you for submitting your revised manuscript to Digital Discovery. I am pleased to accept your manuscript for publication in its current form. I have copied any final comments from the reviewer(s) below.

You will shortly receive a separate email from us requesting you to submit a licence to publish for your article, so that we can proceed with the preparation and publication of your manuscript.

You can highlight your article and the work of your group on the back cover of Digital Discovery. If you are interested in this opportunity please contact the editorial office for more information.

Promote your research, accelerate its impact – find out more about our article promotion services here: https://rsc.li/promoteyourresearch.

If you would like us to promote your article on our Twitter account @digital_rsc please fill out this form: https://form.jotform.com/213544038469056.

We are offering all corresponding authors on publications in gold open access RSC journals who are not already members of the Royal Society of Chemistry one year’s Affiliate membership. If you would like to find out more please email membership@rsc.org, including the promo code OA100 in your message. Learn all about our member benefits at https://www.rsc.org/membership-and-community/join/#benefit

By publishing your article in Digital Discovery, you are supporting the Royal Society of Chemistry to help the chemical science community make the world a better place.

With best wishes,

Dr Kedar Hippalgaonkar
Associate Editor, Digital Discovery
Royal Society of Chemistry


 
Reviewer 1

The proposed method is very interesting. The authors have addressed all my comments, suggestions, and questions. The presented work has been improved significantly and could be a very interesting approach to peptide discovery. Only a suggestion, please improve the quality of the figures.

Reviewer 2

The authors have addressed the reviewer comments in detail and produced a much improved manuscript, which can be published.




Transparent peer review

To support increased transparency, we offer authors the option to publish the peer review history alongside their article. Reviewers are anonymous unless they choose to sign their report.

We are currently unable to show comments or responses that were provided as attachments. If the peer review history indicates that attachments are available, or if you find there is review content missing, you can request the full review record from our Publishing customer services team at RSC1@rsc.org.

Find out more about our transparent peer review policy.

Content on this page is licensed under a Creative Commons Attribution 4.0 International license.