Open Access Article
Jyler Menard
and
R. A. Mansbach
*
Department of Physics, Concordia University, Montreal, QC H4B 1R6, Canada. E-mail: re.mansbach@concordia.ca
First published on 7th May 2026
Generative deep learning techniques have demonstrated an impressive capacity for tackling biomolecular design problems in recent years. Despite their high performance, however, they still suffer from a lack of interpretability and rigorous quantification of associated search spaces, which are necessary to unlock their full potential for scientific inquiry beyond efficient design. An area in which they are of particular interest is the design of antimicrobial peptides, which are a promising class of therapeutics to treat bacterial infections. Discovering and designing such peptides is difficult because of the vast number of possible sequences and comparatively small amount of experimental information. In this work, we perform an empirical investigation of latent Bayesian optimization for searching through peptide sequence spaces, with a focus on antimicrobial peptides. We investigate (1) whether searching through a dimensionally-reduced variant of the latent design space may facilitate optimization, (2) how organizing latent spaces by differing amounts of more and less relevant information may improve the efficiency of arriving at an optimal peptide design, and (3) the interpretability of the spaces. We find that employing a dimensionally-reduced version of the latent space is more interpretable and can be advantageous, while the use of less-relevant but more easily-computable physicochemical properties is advantageous to latent space organization in certain contexts and the use of more-relevant but sparser properties associated with the latent Bayesian objective function is advantageous in others. This work lays crucial groundwork for biophysically-motivated peptide design procedures, with an especial focus on antimicrobial peptides.
Design, System, ApplicationThe goal of our work is to develop optimization methods for designing antimicrobial peptides, short proteins that kill bacteria. Most evaluations of candidates are computationally or experimentally intensive tasks where a direct mapping of the peptide composition to activity is not available. Bayesian optimization is an iterative design algorithm that excels in situations where we wish to optimize a black-box function with limited evaluations, exactly the case at hand. We probe different search spaces with a focus on improving BayesOpt efficiency and interpretability. We compare BayesOpt performance in the latent space of a variational autoencoder to its performance in projected, lower-dimensional versions of that space. We ask whether organizing the search space with easily-computable, but less informative, physicochemical properties can improve optimization efficiency and search interpretability over using a limited number of very informative labels. In our toy setting of using a trained predictive model as our objective, we found that using sparse informative labels in a lower-dimensional space can provide an optimization and interpretability advantage. This work lays necessary groundwork for closed-loop antimicrobial peptide design with more realistic objective functions. |
One potentially fruitful area of research is the design and development of new and more effective antimicrobial peptides (AMPs). Antimicrobial peptides are short proteins of around 5 to 100 residues with broad-spectrum antibacterial activity.9 Many AMPs are thought to act by disrupting the bacterial membrane, through the formation of large pores, through inducing micellization of membrane patches (the so-called “carpet method”), or through a combination of such mechanisms.10 Because of the comparatively indiscriminate nature of the entire membrane as a target rather than a small protein binding pocket like most traditional small-molecule drugs, it was hypothesized that resistance to membrane-active AMPs will develop more slowly than to many traditional small-molecule antibiotics. Evidence supporting this hypothesis includes in vitro experiments across four different Gram-negative bacterial species showing less resistance developing to membrane-active antibiotics than to traditional antibiotics within the same time frame.11 In the same study, it was demonstrated that strong resistance to traditional antibiotics appeared within 120 generations of repeated exposure,11 compared to the 600 generations required for resistance to develop to an analogue of the peptide magainin-2 that was observed by Perron et al.12 Despite differences in study methodology, the evidence overall supports slower resistance development to peptides. In addition to the possibility of slowing resistance, certain AMPs can be used as part of combination drug cocktails to improve the efficacy of other antibiotics.13 Such combination therapies have also been found to have slower resistance development.14 Due to their broad-spectrum activity, slow resistance, synergism with other antibiotics, and the need for new antibiotics, AMPs are a promising therapeutic class.
While peptides are short compared to other proteins, the space of all possible peptide sequences is still vast. Considering only naturally-occurring amino acids, there are on the order of
possible sequences. Traditional mutagenesis methods are inadequate to explore more than a tiny fraction of this sequence space to find peptides with properties of interest. Traditional computational methods, including machine learning, can improve the speed of candidate identification by rapidly screening databases full of peptide and protein sequences. For example, some of the earliest predictive models for AMP identification were hidden Markov models and shallow neural networks,15,16 with logistic regression17 and SVMs18 following after. More recently, motivated by the success of deep neural networks for (natural language) sequence representations, deep neural network architectures began to be employed.19 All of these methods, however, function to identify candidates in a database and cannot suggest completely novel AMPs. Because of the number of possible peptides compared to the speed at which they can be tested, finding peptides with desired properties is difficult.
Deep learning generative models can partially obviate the need to pre-generate libraries of sequences by providing a method to sample arbitrarily many new sequences. Recently, such models have been adopted by the biomolecular design community and have demonstrated a powerful capacity to produce peptides with desirable properties. Capecchi et al.20 trained a strain-specific RNN to generate peptide sequences and filtered them using classifiers. Li et al.21 trained a generative model based on a transformer architecture intended to generate high-activity mutants from an input sequence. Zakharova et al.22 used an RNN to generate anticancer peptides. Szymczak et al.23 designed a conditional VAE – conditioned on categorical classifications of AMPness and high/low activity levels – in an attempt to directly sample from the high-activity AMP conditional probability distribution. Das et al.24 used a Wasserstein autoencoder together with a number of classifiers to generate peptides.
While impressive work is being done in improving both predictive and generative models, tying them together in a closed-loop optimization scheme is (currently) less common. Furthermore, the focus of these models has tended to be on design of peptides for specific applications rather than an understanding of the ways in which the specific model and procedure affect the underlying design space. Recently, we developed quantitative benchmarks and investigated the properties of the design space itself in the context of our own generative model for antimicrobial peptide sequence design based on a variational autoencoder (VAE).25 VAE-based models possess associated latent spaces that can be thought of as explicit continuous design space representations of the discrete peptide sequences. This formulation of the problem facilitates a focus on the quality of the design space itself. In this article, we expand on this work to investigate how the organization of the design space can affect optimization procedures in that space.
While generative deep learning models on their own do not resolve the fundamental needle-in-haystack issue of finding sequences with significant antimicrobial potency and/or other properties of interest, they can produce candidates or design spaces to be used in conjunction with optimization techniques. Broadly speaking, finding a sequence with high antimicrobial activity can be modelled as an optimization problem, with a corresponding objective function that maps from peptide sequences to antimicrobial activity. In a scenario where there is a straightforward function mapping sequences to antimicrobial activity, a litany of optimization algorithms would allow us to find the sequence (or sequences) corresponding to maximal antimicrobial activity. However, how antimicrobial activity relates to the peptide sequence is not sufficiently understood to write down a straightforward function, instead leaving us in the situation of having to treat the relationship as a black box we can probe through expensive experimental assays, machine learning models limited by their data sets, and/or long, resource-intensive MD simulations depending on our available resources.
Bayesian optimization (BayesOpt) is an iterative method of finding the optimum of a black-box function. BayesOpt fits a data-driven, simple, and flexible surrogate model to approximate the relationship between input points (such as peptide sequences) and the objective function (such as antimicrobial potency). To perform BayesOpt one must also define an acquisition function that scores to what extent a point in the design space is predicted to improve on the current best observed value. At each iteration, the acquisition function combined with the surrogate model identifies a point of interest for the next round predicted to most improve upon the optimal point found so far.
One can employ BayesOpt to search through the latent space of an associated generative model; this is referred to as latent Bayesian optimization (latent BayesOpt).26–29 Latent BayesOpt has been previously successfully applied to the design of small molecules26 and synthetic optoelectronic peptides.30 Although screening peptides sampled from a generative model to design AMPs without using Bayesian optimization also has precedent,23,24 we focus on latent BayesOpt because of its data efficiency and potential for application with higher-fidelity, more costly oracles.
Other works have explored how the organization of the latent space can impact the efficiency of latent BayesOpt. A number of ways to organize the latent space have been investigated, with particular focus on small molecule discovery where large amounts of data are readily available. In Gómez-Bombarelli et al.,26 the authors investigated whether jointly training a property predictor and a decoder can improve the efficiency of latent BayesOpt for the task of finding optimal small molecules, finding compelling evidence that it can. Lee et al.28 examined how latent BayesOpt performs in a latent space induced to be correlated with the objective they are trying to optimize, finding promising results. In Grosnit et al.27 the authors compared a number of contrastive learning methods, finding that a triplet loss can be effective. Given these works, it seems the efficiency of latent BayesOpt can be improved by organizing the underlying latent space being searched through, but this has not been previously investigated in the context of antimicrobial peptide design. One further consideration for latent BayesOpt is that often the associated neural network has a relatively high-dimensional latent space. BayesOpt can have difficulty in high-dimensional spaces. On the other hand, reducing the dimensionality of the latent space may corrupt the ability of the associated generative model to generate diverse, novel, and realistic sequences. This leaves us with a dilemma where we may expect either worsened BayesOpt performance due to a high-dimensional latent space, or worsened generative model performance due to too small a latent space.
Unlike in the small molecule design space, where large amounts of data relevant to the performance of such molecules as antibiotics are readily available, the corresponding data on antimicrobial peptides is relatively sparse. There are hundreds of thousands or even millions of examples of small-molecule data as experimental drug-target binding affinities and well-established computed properties related to drug-likeness and solubility readily available. Conversely, in the context of peptides, antimicrobial activity in the form of experimentally-measured minimum inhibitory concentration (MIC) values is scarce; specifying particular bacterial species as targets restricts the available data even further. For example, for E. coli there are just 11k entries with corresponding MIC measurements, and if we restrict ourselves to those sequences with more than one reported MIC measurement, then we are left with 4k entries.31 Experimentally-measured peptide hemolysis, which is important for the assessment of host toxicity, is available for even fewer examples, with DBAASP containing 8k sequences with reported hemolysis measurements that fall to 1.6k when restricting ourselves to more than one measurement per sequence.31 While this may be a sufficient amount for traditional machine learning methods, it is unclear how this amount may impact the effectiveness of generative deep learning models and the organization of their latent space. Moreover, it is unclear whether using easily-computed physicochemical properties as proxies may offer a benefit over the more direct but limited amount of activity data.
In this work, we determine the best practices for performing latent Bayesian optimization in the context of peptide sequence spaces, with a focus on the antimicrobial peptide context in particular. We introduce and assess a modification of latent Bayesian optimization as it has been previously applied in drug discovery applications by performing Bayesian optimization over a reduced space representation of a peptide sequence-based latent space (see Fig. 1 for a schematic overview of the versions of latent Bayesian optimization we investigate). Rather than performing Bayesian optimization over the high-dimensional latent space representation, we explore the possibility of performing Bayesian optimization over a much lower-dimensional projection of that space. The purpose of this approach is twofold: (i) we expect an increase in efficiency by using BayesOpt in a lower-dimensional space without modifying the size of the generative model's latent space, and (ii) we expect an increase in interpretability by performing optimization in a more similar space to that which we employ for visualization compared to using reduced projections solely to visualize (a proxy of) the design space. We also note that we employ a proxy objective function (an “oracle”) related to predicted AMP activity for better control of algorithmic design prior to deployment on more realistic, expensive objectives, such as experimental measurements of peptide activity.
We further investigate three important questions related to incorporation of prior knowledge to latent Bayesian optimization in peptide sequence-based design spaces. We first ask whether organizing the latent space with easily-computable physicochemical properties that are not directly related to predicted AMP activity is more informative or higher-performing than organizing the latent space with a sparse amount of predicted AMP activity data. We secondly ask whether organizing by more properties leads to a more structured, easier to optimize in, latent space. We thirdly ask whether increasing the percentage of labelled data available has a direct relationship to latent BayesOpt efficiency and latent space organization, or if there are diminishing returns. By providing an analysis of latent BayesOpt in both a semi-supervised and latent space organization context, we hope to further the discussions of how best to perform Bayesian optimization in latent spaces in a manner that informs us about the design space that is being traversed by the optimization algorithm, and introduce the technique to the therapeutic peptide design problem in the context of AMPs.
We use a Gaussian process regression (GPR) model for the surrogate objective function. A GPR is a probabilistic model where, for any point
in our design space, we assume we have a Gaussian distribution (our prior distribution), fully determined by a mean m and covariance matrix σ = (σij). The mean and covariance are parameterized by a mean function m(
) and kernel function k(
i,
j), each of which depend on data points assessed according to the true objective function. For a data set
of n points in an F-dimensional space with associated objective values
= {(
j, yj)}nj=1, where
∈
F, at iteration l, we fit the mean and covariance,
![]() | (1) |
![]() | (2) |
F×F is a diagonal matrix of fitted length-scales, which are fitted using the exact marginal log-likelihood. Here we use the BoTorch default choice of a constant mean function – specifically the sample mean of the observed objective function values – so it does not depend on the input vector. Together these give a prior distribution
(ml, σl).
| (3) |
Ml( ) = ml( ) + σl( , xD)Tσl(xD, xD)−1·(f(xD) − ml(xD))
| (4) |
∑l( ) = σl( , ) − σl( , xD)Tσl(xD, xD)−1σl( , xD),
| (5) |
i}, σl(xD, xD) is an n × n matrix with entries [σl(xD, xD)]i,j = k(
i,
j), and σl(
, xD) is a vector of length n with entries [σl(
, xD)]i = k(
,
i). We use the radial basis function kernel (default when using BoTorch's SingleTaskGP32), but other choices exist. Because the radial basis function kernel is distance-dependent, there is an implicit assumption that points near each other have similar objective values. Through the distance-dependence, points near each other in the design space will be predicted to have similar surrogate objective values. With the posterior distribution in hand, predictions are made either by sampling from the posterior distribution
( ) ∼ (Ml( ), ∑l( ))
| (6) |
As mentioned, in addition to the Gaussian process surrogate model, we must select an acquisition function, which identifies the next point to be evaluated with the true objective. A common choice is the Expected improvement, defined as,
is the best value we have observed up to iteration l,
(
) is the predicted value of the surrogate function at point
, and (
(
) −
)+ = max(0,
(
) −
) is the improvement. EI(
) encourages exploitation by searching for a point that is likely to improve on the current best point, and it encourages exploration through the standard deviation and cumulative distribution of the GPR function.
For our case, rather than using the expected improvement, we use the log expected improvement, which has been found to be nearly identical but is easier to numerically optimize.34 It is calculated as the logarithm of the expectation value of the improvement over the previous best value according to the surrogate model,
l+1 = argmax (LogEI( )) |
l+1-th point using the true objective function, we add the pair (
l+1, f(
l+1)) to the dataset
=
∪ {(
l+1, f(
l+1))} and start the next iteration.
:
X → Z encoding sequences into a latent space, and a decoder D
:
Z → X that reconstructs the sequence corresponding to a latent point. Here Z is typically the latent space of a generative model. The conventional choice is to use the latent space of a VAE, where the encoded mean
θ(
) of a trained VAE with fitted model weights θ is used in the optimization.26 More formally, latent Bayesian optimization is the process of solving![[f with combining circumflex]](https://www.rsc.org/images/entities/i_char_0066_0302.gif)
:
Z →
approximates the relationship between latent points and their corresponding objective value; for a peptide sequence
and its encoded representation
θ(
), the surrogate model approximates
θ(
) → f(
). At iteration l, the acquisition function determines the subsequent point in the latent space
l+1 to decode into a peptide sequence
l+1 we can evaluate with the objective function f(
l+1). In other words, rather than performing the Bayesian optimization on what we want to design (peptide sequences) we perform it over a (learned) representation of those sequences.
In this pilot study, we wished to assess the impact of algorithmic choices and ensure our understanding of the design space prior to engaging in time-consuming experimental work. Therefore, we use a proxy model (also called an “oracle”) based on traditional machine learning techniques (see next section) as our “ground truth” to estimate the MIC values of sequences for which no experimentally-measured MIC is available. This is analogous to proof-of-concept small-molecule discovery projects that optimize for computed estimates of drug-likeness and solubility; a discipline-relevant proxy model is used to motivate an algorithmic exploration. In future work, we plan to leverage the procedure here to inform peptide design in more realistic contexts.
:
X → Z, which takes an input sequence and projects it to a continuous-valued latent space z = Eϕ(x); and the decoder Dθ
:
Z → X modules, which maps a point in the latent space back to the original input sequence space
= Dθ(z) with ϕ and θ representing learned model weights.37 Once the model has been adequately trained, we can use the latent space and the decoder to generate new peptides, sampling from a distribution that is similar to the distribution found in the training set.
Training a VAE entails minimizing the evidence lower bound (ELBO), a combination of a reconstruction loss, which measures the difference between the input sequence and the corresponding generated output sequence, and the Kullback–Leibler divergence which functions to push the probability distribution of the latent points to be close to a Gaussian prior modeled by the latent space. We define the reconstruction loss as is typically done for sequence-based tasks38 as the cross-entropy of the probability of predicting the next token xi+1 conditioned on a context
, usually the previous tokens in the sequence:
) is the predicted probability distribution, and A is the number of different tokens in the alphabet. Because the target token values come from the training dataset, and because we are using the maximum likelihood principle, we say that pc = 1 when c is the correct class and = 0 when c is not, leading to the
We use a VAE with a transformer architecture as described previously,25,39 known therein as a TransVAE. The transformer architecture is largely similar to the architecture originally proposed in Vaswani et al.40 Input sequences were tokenized into integers then fed through a learned embedding layer. During training, sequences were predicted autoregressively, with the (n + 1)-th token predicted from a context of the previous n tokens, where the previous n tokens are given directly from the training set and the reparameterized latent point representation
. The embedding layer and a positional encoding layer then together feed into the encoder module consisting of self-attention blocks followed by convolution blocks. Each self-attention block is a sequence of layers: layer normalization, multi-head self-attention, dropout, a residual connection, then layer normalization, feed-forward, dropout, and another residual connection. After the self-attention blocks, there is a convolution block consisting of three layers of convolution, pooling, and leaky ReLU. The convolution block outputs the mean and variance for the VAE latent space. The decoder module consists of largely the same architecture in reverse: convolution blocks followed by masked self-attention blocks. In our previous results25 we found evidence suggesting that the TransVAE architecture may limit latent space interpretability compared to other architectures, however we did observe high performance at the assigned tasks. Here, we chose to use a transformer architecture because of their effectiveness at sequence-to-sequence tasks, and because traversing their latent space can correlate well with a predicted property,25 an important aspect of this work. Additionally, to limit the scope of this work, we focused on only the TransVAE architecture, but in future work will expand to other architectures and representation learning choices.
As previous work has shown, jointly training a property predictor can induce latent space organization.25,26 In this work, we investigate the impact of the choice of predictive task and the availability of data associated with that task on the performance of the latent BayesOpt algorithm. To do so, we train multiple TransVAE models jointly with a property predictor consisting of two feedforward layers (see the next section). The property predictor loss term
PropPred is a simple mean-squared error between the property predictor's outputs ypred, and the computed property values ytrue,
![]() | (7) |
To compute these physicochemical properties, we used the python package peptides.43 The properties were then normalized to have zero mean and unit standard deviation using their respective training set means and standard deviations; the test set property values were normalized using the training set means and standard deviations.
Overall, we studied the following models: (i) TransVAE trained jointly with a regressor directly predicting oracle values, i.e. log10MIC values estimated by a predictor model; (ii) TransVAE trained jointly with a regressor predicting Boman index values; (iii) TransVAE trained jointly with a regressor predicting charge values; (iv) TransVAE trained jointly with a regressor predicting hydrophobicity values; (v) TransVAE trained jointly with a regressor predicting both the Boman index and charge; and (vi), TransVAE trained jointly with a regressor predicting the Boman index, charge, and hydrophobicity values.
To assess the effect of data sparsity, we trained each model with different numbers of available labels for the associated regressor. For (i), we trained models with 100% and 2% (corresponding to 10
538 points in the training set) of the labels in the training set. For (ii)–(vi), we trained models with 100%, 75% (corresponding to 392
586 points in the training set), 50% (corresponding to 261
647 points in the training set), 25% (corresponding to 130
746 points in the training set), and 2% (corresponding to 10
538 points in the training set) of the labels. The loss function for jointly training the TransVAE with a property predictor was the sum of the ELBO loss and the mean-squared error (predictor's loss), the latter of which was simply set to 0 for any data-point without an associated value in the training dataset. Thus, training proceeded in a semi-supervised or fully-supervised manner for the predictors, depending on the extent to which labels were made available for the training dataset.
All 2 + 5 × 5 = 27 models were trained on the associated loss function for 100 epochs using PyTorch.
772 sequences. We sourced sequences from AMP-specific databases DRAMP V3.0,44 APD3,45 GRAMPA,36 dbAMP 2.0,46 StarPep,47 and general protein databases SwissProt and TREMBL.48 Using peptide sequences from databases of known antimicrobial peptides ensures that the data distribution—and therefore the distribution being learned by the TransVAE—contains examples of sequences that are AMPs. Additionally, to ensure the training distribution also contained non-AMPs, we excluded those peptides from SwissProt annotated with the keywords antifungal, antimicrobial, antibiotic, antiviral defense, antiviral protein, secreted; although the impact of keyword combinations on predictive model performance has been discussed elsewhere.49 Lastly, we clustered TREMBL sequences at 90% identity using CD-HIT,50,51 then randomly sampled 7500 peptides at each length, yielding a subset of 668
343 peptides (the smallest length in the TREMBL set after clustering was 15). Once we removed duplicates, the final dataset of unique peptides from every database was 665
772 sequences. This final amount was split into a training set of 523
848 sequences and a test set of 141
924 sequences where we ensured that the training set and test set had the same proportion of verified AMP and verified non-AMP peptides (Fig. S2).
We employ principal component analysis (PCA) for dimensionality reduction, a commonly employed linear dimensionality reduction method.52 In PCA, the centered covariance matrix of the data is decomposed into its eigenvectors, which are then ordered by their corresponding eigenvalues. The top eigenvectors—the top principal components—capture the most variance in the data points. PCA has the advantage of being a mathematically-invertible function and of being highly interpretable due to its linear nature; however, it suffers from an inability to accurately project nonlinear surfaces.
For each of the models we trained, we compare searching through their 64-dimensional latent spaces with searching through different dimensional PCA projections of their latent space. To obtain the projected space, for a given VAE, we encoded the peptide training data set into the VAE's latent space, then performed a PCA decomposition of those points to obtain the components capturing most variation in the data. We used the top-n components to make the BayesOpt search space, with n varying across 2, 5, 10, 20, 32. During a given BayesOpt iteration l, optimization of the acquisition function yields a candidate point
l+1 in the 64-dimensional latent space; alternatively if we are searching through a PCA-reduced space, then the candidate point is mapped back to its corresponding 64-dimensional latent space through a PCA inverse projection, giving
i+1. We then decode
l+1 to get a peptide sequence
l+1 that is given to the oracle to get a predicted log10(MIC) value. We take the negative of that prediction to obtain the objective value for that point in the latent space (or in the PCA projection of the latent space).
In Fig. 2, we visualize the PC components most highly correlated with the Boman index, charge, and hydrophobicity for the models trained with all three. In line with previous results,25,26 we observe latent space organization when jointly-training a decoder and a property predictor. Additionally, training a property predictor predicting multiple different properties results in an organized latent space where different properties correlate most with different pairs of dimensions (Fig. 2 rows; Table S5, particularly the Boman index and charge). Furthermore, varying percentage of property labels—thereby limiting the amount of labels the model has access to in training—still results in latent space organization, as indicated by the persistence of an orderly color gradient between the first and last rows of Fig. 2. Overall, increasing the number of properties and masking even a large proportion of the information for the predictor did not prevent the latent space from being organized (for a more detailed discussion of correlations between PC components and organizing variables, we refer the interested reader to section S2.3 in the SI).
Given that for 64-dimensional latent spaces we are visualizing just two-dimensional slices through PCA projections, it is possible that the organization is being induced by the projection. To check the extent to which PCA is distorting the original, high-dimensional manifold, and thus the extent to which it may be providing an illusion of organization when the high-dimensional cloud of points is not actually organized, we compute four manifold distortion metrics as we have done previously:25 trustworthiness and continuity,53 steadiness and cohesiveness.54
These distortion metrics quantify different ways in which arrangements of points can be disrupted when transforming between the high-dimensional manifold and the low-dimensional manifold (“reduced space”). All run from zero, indicating high distortion (low faithfulness), to one, indicating low distortion (high faithfulness). Trustworthiness measures how frequently points that are not clustered together in the high-dimensional space become clustered together in the reduced space. Conversely, continuity measures the extent to which clustered points in the high-dimensional space become no longer clustered together in the reduced space. Unlike trustworthiness and continuity, which measure the number of nearest neighbours being lost or gained, steadiness and cohesiveness estimate the amount of stretching and compressing occurring when transforming between spaces. Steadiness quantifies the extent to which groups of points that are separated in the high-dimensional space remain separated in the reduced space. Cohesiveness quantifies the extent to which groups of points near each other in the high-dimensional space remain near each other in the reduced space. A more detailed description, including algorithmic details, is available in Jeon et al.54
We compute the extent of manifold distortion between the full latent space and the 5-dimensional PCA projections using a held out test set of sequences. We observe similar levels of distortion between high-dimensional latent spaces and their PCA projections for models trained with different organizing properties and different percentages of labels (Fig. 3). Trustworthiness, continuity, and steadiness are all relatively high, with each organizing method exhibiting values >0.75 (Fig. 3a–c), suggesting high-quality projections with few artificial clusters (trustworthiness), minimal loss of clusters (continuity), and little stretching of distances between points (steadiness). Cohesiveness, lying near 0.6 for all, is moderate, suggesting that compression is occurring (Fig. 3d), as it necessarily must when moving from high to low dimensions. The similarity of the results and the comparatively low-to-moderate distortions observed provide confidence that the visualizations reflect extant trends between, and patterns in, the high-dimensional latent spaces.
We note in particular similar levels of distortion at various label percentages. Comparing within subpanels of Fig. 3, the bar plots are relatively stable in value. From these results, we infer that the original high-dimensional latent spaces are being organized even when label percentage is low, and that we are not merely seeing an illusion of organization induced from the PCA projection. In conjunction with Fig. 2, this confirms that the number of labels can be relatively sparse and still result in the latent space being organized; perhaps most surprisingly in the case of having access to only 2% of the property labels.
For a final note, each model was trained to the same stopping point (100 epochs). At the end of training, the models all had similar training and validation loss values (Fig. S3). We interpret this to mean that the peptide representations learned by the models are the essential difference between them, not their reconstruction accuracy.
Somewhat to our surprise, we observe that BayesOpt in higher-dimensional linearly-projected spaces tends to result in a higher average final objective score than in lower dimensional ones, 〈
〉 (Fig. 4). Indeed, for the BCH-organized space when 100% of the property labels were available at VAE training time (Fig. 4a), we observe a monotonic increase with dimensionality in the final objective value achieved, although scores are comparatively similar for all linearly-projected spaces for about the first 100–200 iterations. For the oracle-organized space when 100% of the property labels were available at VAE training time and the BCH-organized space when the 2% of the property labels were available at training time, we observe similar behavior for most of the searches in linearly-projected spaces but 10, 20, and 32 dimensions slightly outperform 5 dimensions by the final iteration, and all outperform 2 dimensions, which flat-lines rather early (Fig. 4b and c). The trends are less clear for the oracle-organized space at 2% available label percentage; however, we do note that 20 and 32 dimensions outperform 2, 5, and 10 dimensions (Fig. 4d).
Despite the fact that we observe a correlation between available search space dimensionality and performance in the linearly-projected spaces, better performance is available in all cases in at least one of the lower-dimensional linearly-projected spaces than in the higher-dimensional full 64-dimensional space. When just 2% of the labels were available at training time, BayesOpt in a linearly-projected space of the BCH-organized VAE outperformed BayesOpt in the original latent space for all but the 2-dimensional projection (Fig. 4b). In the case when 100% of property labels were available, we observe better performance when optimizing in a 20 or 32-dimensional PCA reduced space than when optimizing in the latent space directly (Fig. 4a) for the BCH-organized space, while we observe better performance when optimizing in a 10, 20, or 32-dimensional PCA reduced space than when optimizing in the latent space directly for the oracle-organize space (Fig. 4c). Perhaps most interesting is the behavior we observe in the case when few labels are available to organize the VAE latent space, which is closely-related to real scenarios of finding optimally antimicrobial peptides with relevant experimental labels. In this case (Fig. 4d), optimizing in 20 PCA components clearly outperforms optimizing in the corresponding latent space directly, with scores rising more rapidly to a higher final average score after 500 iterations (〈
final〉 = 0.896 ± 0.092 compared to 0.719 ± 0.061). Optimizing in 32 PCA components performs about the same as optimizing directly in the latent space, reaching a final average score of 〈
final〉 = 0.702 ± 0.063. Surprisingly optimizing in just 2 PCA components performs nearly as well: although the scores plateau at low values between ∼100–175 iterations, they rise rapidly thereafter, attaining a final average score of 〈
final〉 = 0.564 ± 0.084, mostly outperforming optimization in 10 components, and quite substantially outperforming optimization in 5 components.
When we expand to investigate all trained models, we also observe weak evidence of early higher performance in PCA-projected spaces (Fig. S10. Models are labeled using the following convention: S-prop-l%, where S is ‘id’ for searching through the full latent space and ‘PCAX’ for searching through X-dimensional PCA projections, ‘prop’ is the organizing property, and l% is the label percentage). After just 50 iterations, most models are not performing particularly well yet, though we observe more variation in performance (both positive and negative) in BayesOpt runs in PCA-projected spaces than in the full latent space. The highest-performing models at this point are PCA5-bch-25% (average score above 0.4) and PCA5-oracle-100%, PCA20-oracle-100%, PCA5-bch-2%, PCA10-bch-2%, and PCA20-oracle-2% (average score above 0.3). Since all highest-performing models are in PCA projections, this provides weak evidence that PCA projections, particularly if given more properties or more relevant information can provide an early advantage; however, given that, e.g., PCA2-bch-100% has an average score of −0.4, it is clear that this is not a reliable or easily-accessible advantage.
Overall, we observe that performing BayesOpt in a PCA projection of VAE latent spaces may provide certain overall benefits. While the optimal number of PCA components to use is not obvious (e.g. 32 vs. 10 in Fig. 4a and b), the number is less than the original latent space and in almost all cases provides a performance advantage. Additionally, it is substantially easier to visualize and interpret the search trajectory in lower dimensions, with two dimensions allowing direct visualization of the trajectory.
We consider single property organization first. There is no clear advantage of any single property when searching through the 64-dimensional latent spaces – the best-performing single-property models compared to other models with only organizing property changed are id-c-100%, id-b-75%, id-h-50%, id-h-2% at 100 iterations (Fig. 5a), and id-b-100% and id-c-100%, id-b-75%, id-d-50%, id-c-25%, and id-b/c-2% at 500 iterations (Fig. 5b). For PCA-projected spaces, we observe that charge-organized spaces tend to demonstrate better performance: at 100 iterations, PCA5-c-100% clearly outperforms b/h-100%, PCA5-c-75% clearly outperforms b/h-75%, PCA5-c-50% clearly outperforms b/h-50%, and PCA5-c-2% clearly outperforms b/h-2%. Only for the PCA5-25% models is charge not the clear winner, and in that case, it performs nearly as well as PCA5-h-25%. At 500 iterations, PCA5-c-100% clearly outperforms b/h-100%, PCA5-c-75% clearly outperforms b/h-75%, and PCA5-c-50% clearly outperforms b/h-50%. However, PCA-5-c-25% performs worse than b/h-25%, and all PCA5-2% models perform roughly the same, displaying rather poor performance at <0.3 average best scores. Thus, charge is a better organizing property than the Boman index or hydrophobicity, particularly when more labels are available. This trend matches with a similar trend in relevancy: charge correlates more strongly and has higher mutual information with the oracle values when looking at the full peptide training set (section S2.3).
Considering multi-property organization, we still do not see clear trends in when searching through the full 64-dimensional latent space. After 100 iterations (Fig. 5a), id-bch-100% is one of the best performers (with performance about equivalent to id-c-100%), and id-bch-75% outperforms other models, but the highest-performing model for lower label percentages is one or more of the single-property models. After 500 iterations (Fig. 5b), we see similar model performance across the board, though id-bch-100% is one of the highest-performing models (tied with id-b-100%) and id-bc-75% is one of the highest-performing models (tied with id-b-75%) but for lower label percentages single-property models are the best, and the performance is not too dissimilar within error bars.
When searching through a PCA projection, there is not a clear advantage to using multi-property organization. After 100 iterations (Fig. 5a), at 100% and 75% label percentage, charge clearly outperforms multi-property organization. At 50% label percentage, PCA5-bc-50% outperforms PCA5-c-50%, at 25% label percentage, PCA-bch-25% by far outperforms any other models, and at 2% the highest performing model is PCA5-bch-2%. After 500 iterations (Fig. 5b), at 100% and 75% label percentage, charge still outperforms multi-property organization. For lower label percentage, multi-property organization does outperform single-property organization (PCA5-bc-50% > PCA5-bch-50% > PCA5-c-50%, PCA5-bch-25% is clearly the best performer, PCA5-bch-2% outperforms PCA5-bc-2% and both clearly outperform the single-property organizers), but neither bc nor bch is consistently better than the other. This suggests that multi-property organization may be better at lower label percentage, an observation which is supported by the fact that the best-performing model we observed in the study was PCA-20-oracle-2%, with a final performance of 〈
final〉 = 0.896 ± 0.092, where the oracle is itself a multi-property organizer, although it was not part of the ablation study.
Overall, we find that performance depends more heavily on latent space organization when performing BayesOpt in PCA-projected latent spaces than when performing it in the full latent space. We find that for PCA-projected latent spaces, more relevant variables result in improved performance, with more types of information in the form of multiple properties or highly-relevant oracle values providing an advantage particularly in the very low percent label regime. Moreover, at iteration 500, using a linear projection of the oracle-organized latent space leads to the highest 〈Mbest〉 over all experiments we tried. This suggests that the case of having very relevant data but very low label percentage (the situation we find ourselves in with AMP datasets associated with experimentally-measured antimicrobial activity) could be a good scenario for using a linear projection of the latent space as the BayesOpt search space.
) mapping design points into a feature space that the kernel function operates on, the conventional BayesOpt kernel, k(
1,
2), is replaced by k(gθ(
1), gθ(
2)), where gθ is a neural network. The weights θ of gθ are fit simultaneously with the parameters of the kernel k(,) at each BayesOpt iteration. GP-DKL is an a promising alternative approach to deriving a low-dimensional space for BayesOpt to traverse, as it operates directly on the high-dimensional space and leverages a learnable non-linear projection (as opposed to the static, linear one we have used) enabling the GP to determine the projection necessary to find an optimal point.
Here we implemented GP-DKL, and performed BayesOpt with five-dimensional GP-DKL in five of our TransVAE spaces: unorganized, BCH-organized with either 100% or 2% of the labels, and oracle-organized with either 100% or 2% of the labels. Across three different architectures for the GP-DKL neural network (dklv1, dkl-v2, dkl-v3; see section S3.3 for more details) we find that, in all but one case, BayesOpt in a five-dimensional PCA-projected space is able to consistently perform as well as, or better than, GP-DKL (Fig. 6 and S13).
In an unorganized TransVAE model, two architectures from GP-DKL (dkl-v2, orange; dkl-v3, green) perform as well as BayesOpt in a PCA projection (red), but one GP-DKL variant (dkl-v1, blue) does slightly better than searching in the PCA projection. However, we observe that searching directly in the high-dimensional space (purple) results in substantially higher values attained than either GP-DKL or BayesOpt in the linear projection, thus pointing to fundamental limitations of searching in a lower-dimensional space without additional information to permit an informed choice of directions.
Once we move to organized spaces, there is a clearer difference. In the BCH-organized spaces, all three GP-DKL variants have average best scores below zero while the BayesOpt searching through the high-dimensional space and searching through the PCA projection are able to attain average greater-than-zero scores by at latest the 100th iteration. This trend also holds when we look at the BayesOpt runs done in the oracle-organized spaces. This may be a result of differences in data availability to the different algorithms. The PCA projector is fit to the (embedded) training set of the VAE, which is O(100k) points. This projector is then not updated throughout the BayesOpt loops. In DKL, the neural network projector is fit during the BayesOpt iterations on the fly, meaning it can search in different directions while severely restricting the amount of data available (O(100) points in our scenario) compared to the PCA projection. Further work could be done to optimize a nonlinear projection scenario.
We compute the hypervolume of the input points sampled in each BayesOpt run, which is essentially a measure of the extent to which we explore different objective function inputs. The hypervolume was computed using the pygmo python package;56 it computes the hypervolume of a set of points compared to a reference point. To choose our reference point, we used the corner of the search box with the maximum value in each dimension (10 in each dimension). When looking at the distribution of resulting hypervolumes, we observe more hypervolume is explored on average when optimizing in a PCA search space compared to the high-dimensional search space; the PCA distribution (Fig. 7a orange) is shifted to the right of the high-dimensional search space distribution (Fig. 7a, blue), with μPCA,HV = 1.1 × 10−17 > 0.55 × 10−17 = μId,HV. This suggests that a broader slice (about 50% more) of the underlying high-dimensional latent space was explored despite the PCA runs being confined to a subspace of it.
![]() | ||
| Fig. 7 Distributions of exploration quantities for optimization done in the full latent space (“Id”, blue) or the PCA projection of it (PCA, orange). Distribution of each run's: (a) computed hypervolume; (b) variance in scores sampled, Var(M); (c) best sampled point's distance from the oracle's training set (eqn (8)); (d) total path length (eqn (9)). For (c) and (d) the distances are computed in the two PCs most correlated with the oracle values. | ||
We further compute the variance in objective scores sampled throughout each BayesOpt run, which is essentially a measure of the extent to which we explore different objective function outputs. When comparing the distribution of variances in objective scores for runs searching through the PCA space (with center μPCA,Var(M) = 0.24 (±0.005)) vs. the high-dimensional latent space (with center μId,Var(M) = 0.17 (±0.004)), we observe that the BayesOpt runs done in PCA spaces tend to have more variance in objective function scores (Fig. 7b); μPCA,Var(M) = 0.24 > 0.17 = μId,Var(M). We test the hypothesis that these two means are the same using an independent samples t-test, yielding a p-value = 1.58 × 10−21, suggesting a statistically robust difference. This suggests that when optimizing through linear-projections of the high-dimensional space, more of the objective function is sampled (about 30% more).
Additionally, we compute the distance of the best point in a run from the SVR oracle's training set as an assessment of how exploration of input may be related to exploration of output and an assessment of the algorithm's ability to suggest novel optima. Because computing distances in high-dimensional spaces can be misleading, we computed these distances in a two-dimensional projection of the latent space. We construct this two-dimensional projection using the two principal components most correlated with oracle values. To do this for a run, we first encoded the oracle's training set into the high-dimensional latent space the BayesOpt run was done in. We then compute a PCA projection of the high-dimensional latent space. The oracle's training set is then projected into the PCA-reduced space. Then, we compute the Euclidean distance between the best point found during the run and each point in the oracle's training set, in the PCA-reduced space. The minimum distance found is used as the distance between the point and the oracle's training set TSVR:
d(x, TSVR) = inf{d(x, a) : a ∈ TSVR}
| (8) |
Lastly, we compute the total path length of a BayesOpt run as a way of measuring the exploration of the relevant input space. For the k-th point sampled during a BayesOpt run, we compute the Euclidean distance from it to the (k − 1)-th point sampled during the run, in the two PCs most correlated with the oracle values of the full VAE training set. We compute the distance between each consecutive pair of points in the PCA-reduced space for the run. Then, we sum the distances to find the total path length in the PCA-reduced space. That is, for a given BayesOpt run with sampled latent space points {
1,
2, …,
500}, projected into a two-dimensional reduced space {
1,
2, …,
500} we compute its total path length as
![]() | (9) |
We further examined the BayesOpt runs in their associated latent spaces qualitatively. To do this, we plotted the peptides sampled during a BayesOpt run in a two-dimensional PCA projection of the associated latent space using the two PCs most correlated to oracle values. In Fig. 8, we depict a full example trajectory showing the latent space (panel a), the structures of 17 selected peptides along the run as predicted by ESMFold (panel b), and their corresponding scores (panel c). This trajectory illustrates the best-scoring run of five that were performed in a 20-dimensional projection of the oracle-organized latent space. This figure has a clear interpretation and allows us to gain a strong physical intuition for the progress of the search. We see the BayesOpt trajectory begin at a low objective value (1) and rapidly increase by moving to the right in the latent space, towards the average direction of increased objective value (lower MIC). After attaining point 7, BayesOpt performs a large explorative jump to the left to point 8, which only marginally increases the best score. It then proceeds by generally drifting to the right and eventually jumping upwards (point 16). Concomitantly, we see that as the search proceeds, the predicted structures increase strongly in helicity. Given the over-representation of α-helical AMPs in most datasets, helicity is a reasonable proxy for anti-microbialness that the BayesOpt is exploiting; a straightforward example of reward hacking. This confirmation of reward hacking provides another reason for using the most relevant data possible (less hackable), even if less of it is available (because e.g. it is expensive to compute).
![]() | ||
| Fig. 8 Best peptides identified during a BayesOpt run. (a) PCA plot of the associated oracle-organized latent space (at 2% of labels available), with bar plot depicting an average decrease in predicted-log10(MIC) from left-to-right across the latent space. The latent space locations of the best peptides are depicted with large circlular markers; white integers represent the sequence in which these peptides were identified during a BayesOpt run, but not the iteration at which they were identified. (b) ESMFold-predicted structures57 of the best peptides found during the BayesOpt run, visualized using ChimeraX.58 In this case we note a strong increase in helicity with oracle value, a reasonable – but biophysically misleading – proxy that the oracle might learn for anti-microbialness, which may be ameliorated with higher-fidelity evaluations such as simulation. (c) The corresponding best score Mbest found throughout the BayesOpt run. | ||
In Fig. 9, we visualize the best-scoring run of five for all oracle-organized latent spaces at 100% and 2% label percentage (for more information cf. also Fig. S15–S18). As previously quantified, we visually observe that BayesOpt runs tend to cover more of the 2D visualizations when their associated search space is a PCA projection (e.g., compare Fig. 9g with Fig. 9h–l). Additionally, there is substantially greater spatial exploration when the percentage of labels is low, although as we have seen, BayesOpt is still able to converge to satisfactory optima in this case. In particular, when comparing the low-label regime (2%) with the full label regime (100%), we observe that the BayesOpt runs explore substantially more of the visualization; this is especially the case when BayesOpt runs search through the top two, five, or ten PCA dimensions (more detailed trajectories supporting this discussion may be found in Fig. S15–S24).
The previous observations are certainly due at least partly to the fact that the search procedure is more restricted to the visualized directions for the PCA projections than for the full search space. This leads us to note that interpretation of BayesOpt runs is easiest when the search space is similar to the visualization. We can also observe this in that when the organizing property is a physicochemical property the BayesOpt runs are centered, appearing to only explore a portion in the directions used to make the visualization, since the projection is not well-aligned with the directions of the search. However, when looking at the oracle-organized spaces, we can clearly see the BayesOpt runs identifying the side of the space that contains better objective function values (low value oracle predictions correspond to higher objective scores) after covering more of the visualized region, particularly when fewer labels are available.
We then compare the variance of scores during a run with the best objective value encountered during that run (Fig. 10b). For BayesOpt runs done in PCA reduced spaces, we observe a strong Pearson correlation coefficient (PCC) of 0.652 that is statistically robust with a p-value < 0.001 (Fig. 10b, crosses). For BayesOpt runs done in the 64-dimensional latent spaces, we observe a moderate PCC of 0.498 that is statistically robust with a p-value < 0.001 (Fig. 10b, dots).
Additionally, we compare the best objective score found in each run to the distance of the corresponding point from the SVR's training set, and to the total path length of a BayesOpt run in two dimensions (Fig. 10c and d). We observe no statistically significant correlation in either of these values.
Overall, sampling a greater range of objective scores results in better final scores, while exploration of the input space as measured here is not strongly correlated with performance.
In line with previous work25,26 we found that jointly-training a property-predictor and a VAE can lead to an organized latent space. Expanding on that work, we investigated whether such organization persists even in a semi-supervised scenario, finding evidence that just 2% of property labels suffice to induce organization along that property; additionally we showed that jointly-training with multiple properties can lead to organization along each property.
We employed these organized latent spaces for projected latent BayesOpt and traditional latent BayesOpt. When we varied the number of PCA components to optimize in, we found that optimizing in a PCA projection could outperform optimizing directly in the high-dimensional latent space, but it is not obvious a priori how to select the optimal number of PCA components to optimize over. In general, using projected latent BayesOpt may provide an early or late advantage and can show impressive performance; however, it is less reliable than performance in the full latent space, which demonstrates less variability in performance with latent space organization and other hyperparameters.
To better understand the use of projected latent BayesOpt, we assessed the effects of different organizing properties of the latent space. We hypothesized that if the properties provided a weak but ample signal regarding the oracle, it may improve the BayesOpt performance through either increasing the rapidity at which it found potent peptides, or through increasing the max value attained throughout an optimization run. We find that, among the single physicochemical property spaces, charge tends to give better BayesOpt performance in terms of final objective values, and how quickly the objective value increases. This occurs primarily when searching through the linear projection of the original 64-dimensional latent space, but also maintained strong performance when searching through the full high-dimensional space. As charge is the most informative of the physicochemical properties we studied, this demonstrates that BayesOpt in PCA-projected spaces works best when the space is organized by relevant information. We also demonstrated that at low label percentage, we could obtain high performance from employing multiple organizing properties or using the oracle. Overall, in a real-world situation, we expect to see good performance from projected latent BayesOpt if the space is organized by a highly-relevant property; the more types of information available at low label percentage, the better.
Despite the fact that the performance of projected latent BayesOpt can be more variable and more sensitive to latent space organization than latent BayesOpt, projected latent BayesOpt has a significant advantage in terms of interpretability. We can more easily and accurately visualize and understand the trajectories of projected latent BayesOpt. In addition, we showed that projected latent BayesOpt explores more of the search space than latent BayesOpt, in terms of both full hypervolume and exploration along relevant dimensions. Furthermore, the variability in performance may also be framed as heightened exploration of output scores sampled, which we showed to be correlated with better performance. Although projected latent BayesOpt has a weakness, we suggest that it can be ameliorated by running multiple short trajectories in spaces of different dimensionalities, and it comes with the advantage of better understanding of the search procedure. We also note that this work further underscores an important point we have previously raised,25 which is that naive use of low-dimensional projection algorithms to visualize latent spaces of generative models can be misleading. The community should not over-rely on the assumption that a two-dimensional PCA or t-SNE projection is sufficient to reveal qualities of the latent space, at least not without further analysis.
In sum, latent Bayesian optimization can provide an approach to identifying highly-potent antimicrobial peptides leveraging the continuous-valued latent spaces of generative models. We found that VAEs based on transformers can have their latent spaces organized by multiple different physicochemical properties simultaneously through the joint-training of multiple property predictors with the VAE. We further compared whether using physicochemical properties to organize the latent space is better than using few oracle values, finding evidence (section 3.3) that using few oracle evaluations can be better when restricting the search space to a linear projection of the latent space; i.e. having a more relevant organizing property with few labels may be a good use case for searching through a PCA projection of the latent space. Using the linear projection comes with the additional benefits of being more interpretable and easier to visualize.
Although we have primarily discussed our work in the context of AMP design, given that BayesOpt is an efficient, iterative algorithm for searching a design space to optimize a function under limited data, our results extend straightforwardly to any peptide design question. The only limitation is the definition of a relevant objective function. For example, one could optimize for cell-penetrating peptides through in silico measurements of maximum force experienced by a peptide as it is pulled through a lipid bilayer in constant-velocity steered molecular dynamics; alternatively, one could optimize for anticancer peptides through in vitro measurements of the minimum peptide concentration required to inhibit cancer cell growth, or through in silico approximations of binding against a target protein. More broadly, the application of BayesOpt to a particular peptide sequence design challenge is limited primarily by the availability of an oracle that can be regularly queried for estimates of the underlying objective.
Supplementary information (SI): further detailed analyses, tables and figures available in PDF form (3 sections, 10 tables, and 24 figures). See DOI: https://doi.org/10.1039/d5me00225g.
| This journal is © The Royal Society of Chemistry 2026 |