Yutao
Ma
a,
Rohan
Kapoor
a,
Bineet
Sharma‡
b,
Allen P.
Liu
bcde and
Andrew L.
Ferguson
*a
aPritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, USA. E-mail: andrewferguson@uchicago.edu; Tel: +1 773 7025950
bDepartment of Mechanical Engineering, University of Michigan, Ann Arbor, Michigan 48109, USA
cDepartment of Biomedical Engineering, University of Michigan, Ann Arbor, Michigan 48109, USA
dCellular and Molecular Biology Program, University of Michigan, Ann Arbor, Michigan 48105, USA
eDepartment of Biophysics, University of Michigan, Ann Arbor, Michigan 48105, USA
First published on 22nd September 2022
Giant lipid vesicles have been used extensively as a synthetic cell model to recapitulate various life-like processes, including in vitro protein synthesis, DNA replication, and cytoskeleton organization. Cell-sized lipid vesicles are mechanically fragile in nature and prone to rupture due to osmotic stress, which limits their usability. Recently, peptide vesicles have been introduced as an alternative chassis material for synthetic cells that are more robust and stable than lipid vesicles, and can withstand harsh conditions including pH, thermal, and osmotic variations. In this work, we combine coarse-grained molecular simulation, enhanced sampling free energy calculations, Gaussian process regression, and Bayesian optimization to construct an active learning screening for diblock amphiphilic elastin-like polypeptides capable of forming thermodynamically stable vesicular structures suitable for the self-assembly of synthetic peptide vesicles. Our computational screen identifies a number of promising sequences that form peptidic vesicles with high thermodynamic stabilities relative to isolated peptides in bulk solvent on the order of 10–15kBT per amino acid residue.
Design, System, ApplicationPeptides have become a popular class of building blocks for self-assembly due to their versatility and modulability. In this work, we combined coarse-grained simulations, free energy calculations, and black-box optimization algorithms to develop a high-throughput screening protocol to discover optimal diblock amphiphilic elastin-like polypeptides that could self-assemble into stable vesicles. We performed molecular dynamics and enhanced sampling simulations to quantify the thermodynamic stability of the vesicle formed by the polypeptides, trained Gaussian process regression surrogate models to relate polypeptide sequence to vesicle stability, and Bayesian optimization to rationally traverse the polypeptide design space to efficiently identify peptides that maximize vesicle stability. Self-assembling peptidic vesicles can serve as promising chassis materials for synthetic cells which have a wide range of applications in drug delivery and biosensing. Our high-throughput screening protocol could be extended to other self-assembling peptides or proteins to optimize the amino acid sequence for particular structural or functional goals. |
Elastin-like polypeptides (ELPs) have recently been demonstrated as a structurally robust and biocompatible chassis material for synthetic cells11,12 that can form ∼50 nm diameter unilamellar vesicles13 and can be templated to form giant vesicles with diameters in excess of 50 μm.14 ELPs are synthetic biopolymers that share structural characteristics with intrinsically disordered proteins such as tropoelastin. The general motif of ELP polymers is a pentapeptide repeat (VPGXG)n, where V is valine, P is proline, G is glycine and X can be any guest residue except proline. ELPs are intrinsically disordered polymers that exhibit temperature-triggered phase transition: below a lower critical solution temperature (LCST) the ELP adopts a random coil configuration, while above the LCST the ELP undergoes and ordering transition into β-spiral secondary structures constituted of type II β-turns.15,16 The guest residue X has a strong influence upon the LCST, where the LCST typically decreases as the hydrophobicity of the guest residue increases; proline cannot be used as a guest residue as it compromises the LCST behavior.15,17,18 Amphiphilic diblock and multiblock ELPs have drawn particular interest because their temperature-dependent ordering and self-assembly behavior can be manipulated by controlling the guest residues in each block. Micelles are the most common self-assembled structures from amphiphilic diblock ELPs, where the hydrophobic block has the lower LCST Tlo and the hydrophilic block has the higher LCST Thi. At Tlo < T < Thi, the hydrophobic blocks associate with each other to form the core of the micelles while the hydrophilic blocks form the corona.19 However, several recent experiments have indicated that amphiphilic diblock and triblock ELPs could self-assemble into large vesicular structures that are stable under harsh conditions such as extremes of pH or temperature.20 For example, Vogele et al. have utilized glass bead method to direct a diblock ELP with glutamic acid as the hydrophilic guest residue and phenylalanine as the hydrophobic guest residue to form giant vesicles.11 Schreiber et al. compared the vesicular structures formed by two kinds of diblock ELPs with the same lengths but different guest residues and concluded that guest residue composition could be a key factor in modulating the vesicle stability.21 Frank et al. demonstrated the formation of giant ELP vesicles by using solvent evaporation method.22 Very recently, we demonstrated the production of giant (>50 μm) vesicles from amphiphilic ELP diblocks using emulsion transfer techniques.14
The vast number of potential amphiphilic ELP diblock sequences means that principled approaches are required to efficiently traverse and optimize within this design space. Consider the space defined by the sequence (VPGX1G)m(VPGX2G)n, where X1 is one of twelve hydrophilic amino acid residues (except proline) categorized under the Kyte–Doolittle hydropathy scale23 {G, T, S, W, Y, H, E, Q, D, N, K, R}, X2 is one of seven hydrophobic residues {I, V, L, F, C, M, A}, and the degree of the hydrophilic m and hydrophobic n repeats can vary over the range 5–100, this defines a design space of 12 × 7 × 96 × 96 = 774144 possible sequences. The engineering of amphiphilic diblock ELPs for vesicle formation is generally conducted using chemical intuition to specify the identity of the guest residues and the lengths of the hydrophilic and hydrophobic blocks. Opportunities exist to systematize and accelerate this search using data-driven and model-guided approaches to efficiently identify sequences capable of forming stable vesicle structures and simultaneously develop fundamental understanding and design rules linking the ELP sequence to the emergent vesicle stability. Machine learning techniques such as kernel regression,24,25 support vector machines (SVM),26 and artificial neural networks,27 have provided useful tools for the prediction of chemical or physical properties of soft materials and the discovery of novel materials with specific functionalities. For example, Leslie et al. proposed a string kernel based on a tree data structure to perform SVM classification of proteins for several benchmark tasks.28 Lee et al. used SVMs to identify and discover new membrane-active and antimicrobial α-helical peptides.29 Zhou et al. used Gaussian process regression model with custom kernel to predict the antimicrobial abilities of various pentadecapeptides.24 Lei et al. built a deep learning framework based on convolutional neural networks to predict peptide–protein interactions.30 Mohr et al. trained a regularized autoencoder to embed small organic molecules onto a latent space and perform Bayesian optimization over the latent space to discover molecules capable of permeating cardiolipin-containing membranes.31 In general, machine learning techniques are powerful tools that can assist the discovery of novel functional materials by learning predictive or generative models from computational and/or experimental data.
In this work, we propose a high-throughput screening protocol that combines coarse-grained molecular dynamics simulation, enhanced sampling free energy calculations, Gaussian process regression (GPR), and Bayesian optimization (BO) to discover optimal peptides from a library of diblock amphiphilic ELPs that could form vesicular structures with high thermodynamic stability. Our screening efficiently identifies high-performing ELP sequences as good candidates for the formation of stable synthetic cell membranes, furnishes a predictive model linking ELP sequence to thermodynamic stability and provides a filtration of the vast ELP design space to identify the top candidates for experimental testing. Our computational screen is approximately 15× faster than experimental assessment of the ELP candidates, presenting a relatively higher throughput means to prospectively identify the top performing candidates for future experimental testing. The high-throughput screening protocol can be straightforwardly extended to the design of multiblock ELPs32 or ELP conjugates with other biopolymers such as collagen-like polypeptides.33,34 More generally, the approach can be applied to the design and optimization of polypeptide sequences with other desired structural or functional properties measured by computational and/or experimental assays.
All-atom representations of ELP molecules (VPGX1G)m(VPGX2G)n were constructed using PyMol36 and then coarse grained using the Martini force field version 2.2.37 We employ a coarse-grained modeling approach in order to allow us to reach the length scales necessary to model a substantial patch of the bilayer wall and the time scales needed to converge the free energy calculations (see section 2.2) by which we estimate bilayer stability. A limitation of the Martini force field is that it requires specification of the secondary structure of each region of a protein. As such, changes in secondary structure cannot be modeled over the course of the simulation, although changes in the tertiary structure are, within this approximation, accurately treated.38 For each system we assume that the LCST of the hydrophobic block Tlo lies below the ambient temperature T and therefore model the secondary structure of this block as a β-turn, whereas the LCST of the hydrophilic block Thi lies above the ambient temperature and is therefore modeled as a random coil.15,16 The precise LCSTs of various ELPs are not precisely known, so we simply conduct all of our calculations at T = 300 K and P = 1 bar under these secondary structure assumptions. This carries the benefit of enabling us to compute thermodynamic stabilities at a consistent thermodynamic state point without requiring knowledge of the LCSTs but is a significant assumption that we are impelled to make due to the absence of comprehensive LCST data for all ELP sequences and the secondary structure limitations of the Martini model. Although we do not do so here, we suggest two possible strategies to relax this assumption. First, one may consider employing all-atom simulations to estimate the LCSTs and then use this information to conduct simulations at Tlo < T < Thi. The computational burden to do so is quite high, but would lead to a more accurate coarse-grained simulation protocol. Second, one could consider employing an alternative coarse-grained model that does not require specification of secondary structure such as the SIRAH force field.39,40 In the present work, we assume that the trends in the thermodynamic stabilities calculated under our simplifying assumptions still serve as a useful ranking and filtration of ELP sequences predicted to form stable vesicles. As discussed below, post hoc validation of our protocol is provided by its identification of a top-ranked candidate that has been experimentally demonstrated form stable vesicles with lifetime of several hours.
We model an approximately 81 nm2 patch of the vesicle wall by constructing a 10 × 10 grid of fully extended amphiphilic diblock ELP chains separated in x and y directions by 0.9 nm to create the upper leaflet and another 10 × 10 grid of ELP chains to create the lower leaflet. The ELP chains in the lower leaflet were flipped upside down so that the hydrophobic blocks of these two layers lie adjacent to one another to form the hydrophobic core of the bilayer sandwich. We then solvated the bilayer patch comprising the 200 ELP chains by adding non-polarizable Martini water molecules41 at a density of 1 g cm−3 up to a distance of 34.5 nm above and 10 nm below the x–y plane of the bilayer. When creating topology, we set the solvent-exposed N-terminus to be positively charged (corresponding to VAL-NH3+) and the C-terminus buried within the solvent-excluded hydrophobic core of the bilayer to be neutral (corresponding to GLY-COOH).42 Ionization states of amino acid sidechains were specified with the most prevalent protonation state at pH 7. All guest residues in the hydrophobic X2 position are electrically neutral. A subset of those in the hydrophilic X1 position adopt charged states: {E: (−1), D: (−1), K: (+1), R: (+1)}. Where necessary, a number of monovalent Martini counterions, employing Qa beads for Cl− ions and Qd beads for Na+ ions, were randomly inserted into the water region in order to maintain charge neutrality. This initial state of the system exists in a 9 × 9 × 67 nm3 box with periodic boundary conditions applied in all three dimensions. The z-dimension of the box was chosen to be sufficiently large to provide an initial linear separation of 44.5 nm between the upper and lower leaflets of the bilayer through the periodic wall in z, thereby effectively eliminating direct interactions between periodic images of the bilayer, minimizing any artifacts associated with the periodic boundary conditions and enabling us to conduct the chain extraction free energy calculations detailed in section 2.2. An illustration of the initial bilayer setup is presented in Fig. 2a.
After setting up the initial bilayer, we first ran steepest descent energy minimization to eliminate forces in excess of 1000 kJ mol−1 nm−1 (Fig. 2b), followed by assignment of initial atom velocities from a Maxwell–Boltzmann distribution at 300 K, then a 10 ns NVT equilibration simulation at 300 K followed by a 10 ns NPT equilibration simulation at 300 K and 1 bar to allow relaxation of the extended ELPs. Finally, we performed 200 ns NPT production runs at 300 K and 1 bar after which the energy, temperature, pressure, and structure all attained stable values allowing the bilayer to attain its equilibrium configuration and place the system into a relaxed state for the enhanced sampling free energy calculations (Fig. 2c). We also confirmed that the density profiles of the solvent and bilayer no longer changed during the 200 ns simulation. For all simulations, the temperature was controlled by a stochastic velocity re-scaling thermostat with a time constant of 1 ps.43 For NPT equilibration runs we employed a Berendsen barostat44 with a time constant of 12 ps and a compressibility of 3 × 10−4 bar−1. For the NPT production runs we used a Parrinello–Rahman barostat45 with a time constant of 12 ps and a compressibility of 3 × 10−4 bar−1. In all cases we employed semi-isotropic pressure coupling, which was isotropic in x and y directions (in-plane of bilayer) but decoupled from z (normal to bilayer). The step size for all simulations was set to 20 fs and equations of motion were integrated using the leap-frog scheme.46 Lennard-Jones interactions were smoothly shifted to zero at a cutoff of 1.1 nm. Electrostatics were treated using the reaction field method with εrf = ∞ and εr = 15, as appropriate for the non-polarizable water model.41 All molecular dynamics simulations were performed using the Gromacs 2019 simulation suite.47 Calculations were performed on 10 × 2.40 GHz Intel Xeon Gold 6148 CPU cores and one NVIDIA TITAN V GPU, achieving execution speeds of ∼4.2 μs per day. Simulation trajectories were visualized using VMD.48 The input files required to perform each stage of the simulations are provided in the Github repository available at https://github.com/tommayutao/ELP-Screening.
We estimate ΔG by computing the potential of mean force (PMF) of the dissociation process N → (N − 1) + 1 along a reversible pathway connecting the associated and dissociated states. Since the path is reversible, it is of course possible to also compute this quantity along the association pathway, but it is more challenging to construct a path to efficiently insert and relax the incoming peptide within the bilayer than to extract a peptide from a pre-assembled bilayer. The free energy difference between the start and end of this path provides an estimate of ΔG.
Due to the possible existence of many local free energy minima that the system could be trapped in, enhanced sampling techniques are usually used to facilitate sufficient sampling of all relevant configurations to ensure good estimate of free energy profile.51 There exist many techniques to perform enhanced sampling, including metadynamics,52 adaptive biasing force53 and umbrella sampling.54 Trajectories produced by enhanced sampling methods that introduce artificial biasing forces to help the system escape local free energy minima must be post-processed to obtain unbiased estimates of the PMF, and various analysis techniques such as multistate Bennett acceptance ratio (MBAR)55 and weighted histogram analysis method (WHAM)56–58 have been proposed to perform that task. In this work, we use a combination of umbrella sampling and WHAM to estimate the unbiased PMF.
After creating and equilibrating the ELP bilayer as detailed in section 2.1, we randomly chose a peptide chain from the upper bilayer for extraction. We extract the chain to effect the dissociation process N → (N − 1) + 1 by constructing a reversible pathway through a space spanned by the two collective variables zhead and ztail (Fig. 4a). zhead is the z-component of the displacement from the center of mass (COM) of the bilayer to the COM of hydrophilic block of chosen chain and ztail is the z-component of displacement from the COM of the bilayer to the COM of hydrophobic block of the chosen chain. The pulling process was divided into three stages (Fig. 4b).
In stage I, the lower hydrophobic block of the chain is pulled upwards into the upper hydrophilic components of the bilayer. This pathway is generated by constructing a set of overlapping umbrella potentials in (zhead, ztail) space in which zhead = 4.81 nm is fixed to its initial value using a harmonic spring and ztail = [1.21, 5.57] nm is subjected to harmonic restraints with progressively increasing z values in each successive umbrella window to chart a vertical path in (zhead, ztail) space (the precise value of zhead and range of ztail depends on the sequence in question, so here and below we report representative values for the (VPGYG)5(VPGCG)4 sequence). The free energy change associated with the first stage is large and positive due primarily to a large unfavorable enthalpy associated with moving the hydrophobic block of the chain from a hydrophobic to a hydrophilic environment.
In stage II, the chain is extracted from the upper hydrophilic portion of the bilayer into the bulk solvent. We achieve this by laying down a set of umbrella windows with progressively increasing values of zhead = [4.86, 11.86] nm and ztail = [5.48, 12.48] nm that move in lock step to chart out a diagonal path in (zhead, ztail) space. After much trial-and-improvement, we found that extracting the chain from the bilayer in a collapsed configuration – as opposed to simpler approaches of, for example, just pulling on the chain COM or pulling first the head and then the tail – helps prevent snagging of the pulled chain on loops formed by other chains in the bilayer, avoid strong hysteresis effects associated with overextension and rapid relaxation as the chain tail exits the bilayer, and achieve good equilibration and overlap between successive umbrella windows. The free energy change associated with the second stage is also large and positive due primarily to the loss of favorable enthalpic interactions between the extracted chain and the other chains in the bilayer. We ensure that the chain is removed sufficiently far from the top of the bilayer that we observe a plateau in the free energy profile indicating that there are no longer any direct interactions between the pulled chain and the bilayer and that the chain can be approximated to exist in bulk solvent. A sufficiently large z-dimension of the simulation box is critical to enable us to reach this regime before the pulled chain begins interacting with the opposing leaflet of the bilayer through the periodic boundary.
In stage III, the hydrophobic tail is fixed in place and the hydrophilic head pulled away from it to extend the chain out in bulk solvent. We achieve this by constructing a horizontal pathway of umbrella windows in (zhead, ztail) space wherein in each window ztail = 12.47 nm is held at the same value taken on at the end of the second stage and zhead = [11.86, 15.85] nm is subjected to harmonic restraints with progressively increasing z values in each successive umbrella window. The purpose of the last stage is to allow the peptide chain to relax to its equilibrium chain length in bulk solvent and we terminate this stage after we observe a local minimum in the free energy profile. The free energy change associated with chain relaxation in the third stage is small and negative.
We employed 144, 145, and 28 equally spaced umbrella windows in each of stages I, II, and III, respectively, where the values of zhead and ztail in each window were subjected to harmonic restraints of the form,
(1) |
Initial configurations for each umbrella window were generated by non-equilibrium pulling simulations conducted as follows. In stage I, spring constants of khead = ktail = 15000 kJ mol−1 nm−2 were applied to a chain initially embedded in the bilayer and the harmonic center was fixed at the starting value of zhead. was gradually increased from the starting value with a rate of 5 × 10−5 nm ps−1 until it reached . In stage II pulling, both spring constants were set to 20000 kJ mol−1 nm−2 and both harmonic centers were increased from their starting values with a rate of 5 × 10−5 nm ps−1 until the chain left the upper layer. In stage III pulling, both spring constants were set to 5000 kJ mol−1 nm−2. was kept fixed at the starting value of ztail and was increased from the starting value with a rate of 5 × 10−5 nm ps−1 for 80 ns. In all pulling simulations, temperature was controlled by stochastic velocity re-scaling and pressure was controlled by Parrinello–Rahman barostat. For each umbrella window centered on , the configuration along the non-equilibrium pulling path closest in (zhead, ztail) was selected as the initial configuration.
In each umbrella sampling simulation, we first performed 40 ns NPT equilibration with stochastic velocity re-scaling thermostat and Berendsen barostat followed by a 100 ns production run with stochastic velocity re-scaling thermostat and Parrinello–Rahman barostat. All other simulation parameters were specified as detailed above. Simulations were performed using Gromacs 2019 (ref. 47) and the WHAM analysis was conducted using the program developed in Grossfield Lab61 to reconstruct the unbiased PMF in (zhead, ztail) collective variable space.
The free energy of association ΔG is taken as the difference between the global free energy minimum when the peptide chain is embedded in the bilayer within stage I and the local minimum free energy of the chain in bulk solvent within stage III. Uncertainties in ΔG were estimated by five-fold block averaging. We also note that the chain in bulk solvent is still harmonically restrained in the z-dimension and we can analytically estimate the free energy change associated with the release of these constraints and the COM translational entropy gain associated with the chain exploring the free volume accessible to it at a standard concentration of 1 mol L−1.62–64 To do this, we assume an effective harmonic constraint on zCOM, the z-component of the displacement from the COM of bilayer to the COM of the entire pulled chain, when the pulled chain exists in bulk solvent. We estimate the spring constant Kr of the effective harmonic restraint by collecting histograms of zCOM from the equilibrated portions of all umbrella sampling windows within bulk solvent, fit Gaussian distributions to each of these histograms, estimate the spring constants from the standard deviations of a Gaussian fit, and take the mean as our estimate of Kr. We then use the analytical correction arising from the ratio of the partition functions in the constrained and unconstrained states to estimate free energy change associated with the translational entropy loss associated with the imposition of this constraint,62
(2) |
ΔG = Gbilayer − Gsolvent,restrained + (Gsolvent,restrained − Gsolvent,free) = min(Gstage I) − min(Gstage III) + ΔGt. | (3) |
Evaluation of ΔG for a single ELP candidate requires a total of approximately 30 GPU h by running in parallel on 8× NVIDIA RTX 2080 Ti GPU cards.
Fig. 5 Active learning cycle for data-driven identification of ELP sequences capable of assembling thermodynamically stable vesicles. (a) The ELP design space comprises the 168 diblock amphiphilic ELPs of the form (VPGX1G)m(VPGX2G)n, where X1 is one of twelve hydrophilic amino acid residues (except proline) {G, T, S, W, Y, H, E, Q, D, N, K, R}, X2 is one of seven hydrophobic residues {I, V, L, F, C, M, A}, m = 5, and n = {4, 5}. (b) Enhanced sampling free energy calculations are conducted to compute the association free energy yi = ΔGi of a candidate ELP sequence xi defined by the parameters {Xi1, Xi2, mi, ni}. (c) All ELP sequences simulated to date define the training data 1:t = {(x1,y1),…, (xt,yt)} over which we train a GPR model ŷ = (x) to predict the association free energies of ELP candidates outside the training set. The green dots represent the training points, the black line the GPR predicted mean, and the blue shading the GPR predicted standard deviation (i.e., uncertainty) in the mean prediction. For visual convenience, the GPR model ŷ = (x) is represented over a 1D projection into the highest variance direction within the sequence space calculated by multidimensional scaling59 based on the Jukes–Cantor distance between biological sequences.60 As such, the plot represents a projection of the multidimensional response surface and the topography of the surface should not be over-interpreted within this low-dimensional projection. (d) The GPR model is interrogated by a BO acquisition function known as the expected improvement EI(x) that prioritizes the unsampled ELP candidates within the design space most likely to possess high values of our objective function (i.e., minimize ΔG(x)). The red line indicates EI(x) within the 1D projection and we have indicated by a purple star the location of the top performing unsampled candidate (xt+1 = argmax EI(x)). The active learning cycle is closed by identifying this candidate xt+1 and subjecting it to the next round of enhanced sampling free energy calculations to compute yt+1, retrain the GPR model over the augmented training set 1:t+1 = 1:t+1 ∪ {(xt+1,yt+1)}, and perform a new round of BO to identify the next top performing ELP sequence. The iterative loop is terminated when the GPR model ceases to improve after multiple consecutive rounds indicating that we have fitted an accurate model over the full design space and/or we cease to see improvements in the top performing candidate identified in multiple consecutive rounds. |
(4) |
(5) |
= K(X*,X*)[K(X,X) + Σ]−1, | (6) |
cov(*) = K(X*,X*) − K(X*,X)[K(X,X) + Σ]−1K(X*,X)T. | (7) |
(8) |
The key component in any GPR model is the choice of kernel function K.65,74 Valid kernels must be positive semi-definite to assure that for any set of inputs X = {x1,…, xn}, the Gram matrix K(X,X) is positive semi-definite.72 Moreover, in our case the inputs are amino acid strings, so we need kernels that operate on string data. Many string kernels have been proposed for peptide and protein data. For example, the Hamming distance kernel measures the number of positions containing the same amino acid residue within a pair of equal length strings, and the Levenshtein kernel is based on the minimal number of insertions, deletions, and replacements necessary to convert one string into another.75 The weighted degree kernel76 goes beyond single positions to compute similarity based on the co-occurrences of k-mers at corresponding positions within two equal length sequences. In our application, the size of the hydrophilic and hydrophobic blocks of the ELP can vary so it is vital that we employ a kernel capable of operating between strings of different lengths. In this work, we choose to adopt the generic string kernel of Giguère et al.,25 which defines the distance between two amino acid strings x = (x1,…, x|x|) and with, potentially unequal, lengths |x| and |x′| as,
(9) |
(10) |
At each iteration of Bayesian optimization, we trained a GPR model based on the currently explored peptides 1:t. Then, for all the unexplored peptides, we compute the GPR posterior predictions of ΔG (eqn (5)) and evaluate the acquisition function (eqn (10)). Then we select one single peptide with the maximum acquisition function value as the next peptide to explore. Approaches exist to perform batched sampling of multiple simultaneous candidates in each BO round in order to maximize utilization of screening resources.80,81 In the present case, our enhanced sampling free energy calculations for each candidate are themselves embarrassingly parallel and so we maximize computational resource usage in sampling a single new candidate. We track round-to-round performance of the GPR/BO screening loop by monitoring the coefficient of determination R2 of GPR models during the fitting process by leave-one-out cross validation (LOO-CV), the Bhattacharyya distance between posterior Gaussian distributions (eqn (5)) returned by successive GPR models,82 and the objective function value f† = miny∈(y) of the best candidate discovered to date. When we observe convergence in all three of these metrics we can surmise that the GPR model has ceased to improve with additional data collection and the optimization may then be terminated.83 The terminal GPR is then applied globally to the candidate design space to predictively rank all candidates' performance. A summary of the GPR/BO iteration process is provided in Algorithm 1.
Algorithm 1 Bayesian optimization |
---|
Initialize training data 0whileR2 or Bhattacharyya distance or f† do not plateau do Build GPR model based on 0:t−1 Find xt = argmax u(x|0:t−1) Evaluate the (noisy) target function yt = f(xt) + εt Augment training data 0:t = 0:t−1 ∪ {(xt,yt)} end while Output x* with minimum target function value in 0,1,2… |
The primary objective of this work is to discover amphiphilic diblock ELPs to maximize the thermodynamic stability of peptidic vesicles as novel chassis for synthetic cells. We assume that the vesicles are fabricated by a templated assembly mechanism such as solvent evaporation22,35 or emulsion transfer.14 Since we assume the vesicles are produced by directed assembly we need not program the molecules or environmental conditions to spontaneously self-assemble into a vesicle. Our only optimization criterion is to maximize the thermodynamic stability of peptides within the vesicle bilayer relative to an isolated peptide in bulk solvent. A deficiency of our approach is that we do not explicitly consider the relative thermodynamic stability of competing aggregates (e.g., micelles, sheets, gels). It is conceivable that alternative states not considered in our analysis may be more thermodynamically stable, but our motivating rationale is that placing the vesicle into a deep thermodynamic free energy well maximizes its lifetime by minimizing their propensity to disaggregate or transition into any alternative assembled structures due to both thermodynamic stabilization and kinetic trapping. As post hoc validation of this strategy, we find that one of the top performing (i.e., maximally thermodynamically stable) ELP sequences discovered by our screening is experimentally known and, even for such a short sequence, has been shown to form stable vesicles with lifetimes of several hours, with longer variants anticipated to form vesicles with lifetimes of months.12
We commenced our active learning screening by generating an initial set of 20 ELP sequences over the design space to serve as the initial training data for the GPR/BO models and pursued the active learning strategy described in section 2 interleaving successive rounds of enhanced sampling free energy calculations, Gaussian process regression, and Bayesian optimization (Fig. 5). We present in Table S1 in the ESI† an accounting of the full active learning screening showing the round in which each ELP candidate xi was sampled, its computed value of yi = ΔGi from enhanced sampling free energy calculations, and the predictions ŷi = (xi) of the terminal GPR model.
We present in Fig. 6 an illustrative PMF for one particular ELP sequence (VPGYG)5(VPGCG)4. In Fig. 6a we show our estimate of the 2D unbiased PMF G(zhead, ztail) computed by WHAM.56–58,61 For ease of visualization, in Fig. 6b we present a 1D projection G(zCOM) of the unbiased landscape into the center-of-mass displacement of the full ELP from the bilayer midplane.58 As expected, the value of PMF gradually rises when the chain is pulled from the bilayer out to the bulk solvent. Upon extension of the chain in bulk solvent, the PMF shows a weak relaxation as the chain is extended to its equilibrium length and then rises again as it is further extended. The ΔG value for this specific ELP sequence is calculated according to eqn (3), which, measured on a per amino acid residue basis, is ΔG = (−10.3 ± 0.4) kBT where uncertainties are estimated by five-fold block averaging. We note that the total free energy change is dominated by the first two stages of the umbrella sampling pathway (pulling the chain into the upper hydrophilic layer, extraction of the chain into bulk solvent), with the third stage (chain relaxation in bulk solvent) contributing less than 5% to the overall free energy for all ELP candidates considered within our screen. This suggests that the computational burden of the screen may be attenuated by omitting stage III of the umbrella sampling protocol without substantial loss of accuracy. Since the computational cost is dominated by the umbrella sampling calculations within the bilayer that require closely spaced umbrella windows, this would provide a modest computational savings of ∼4%.
Fig. 6 Illustrative PMF for (VPGYG)5(VPGCG)4. The hydrophilic (VPGYG)5 is shown in blue and the hydrophobic (VPGCG)4 block in red. The bilayer is made transparent to highlight the opaque pulled chain. Solvent and ions are not shown for clarity. (a) Two-dimensional unbiased PMF in the umbrella sampling collective variables G(zhead, ztail). (b) One-dimensional projection of the unbiased PMF G(zCOM) onto the center-of-mass displacement of the ELP from the bilayer midplane. The black, green and red dashed lines demarcate the three stages of the umbrella sampling pathway illustrated in Fig. 4. |
In Fig. 7 we illustrate our monitoring of the round-to-round performance of the GPR model over the course of the active learning screening by tracking the GPR coefficient of determination R2 under leave-one-out cross validation (LOO-CV), the Bhattacharyya distance between GPR posterior in successive rounds,82 and the cumulative minimum ΔG. We observe convergence of all three metrics after approximately six active learning rounds (i.e., after screening 20 + 6 = 26 ELP candidates) indicating that the trained GPR model is accurately predicting the performance of novel candidates, its posterior distribution has stabilized, and that we are no longer identifying new top performing candidates in our screen. These observations suggest that the 26 candidates over which we trained the GPR model are sufficiently representative in spanning the molecular design space that the model is not substantially changing upon the addition of more training points and that it can make quite accurate predictions over the remaining candidates within the space. We confirm that we have indeed reached convergence and the GPR model is no longer improving with successive training data by running our screening out to ten active learning rounds (i.e., 30 candidates, ∼18% of the 168-molecule design space) at which point we terminate the screen.
We present the top 10 ELP candidates identified by our active learning screening in Table 1. A full accounting of the ΔG values predicted by the terminal GPR model is provided in Table S1 in the ESI.† Our screening has identified a number of diverse ELP sequences capable of forming vesicle bilayers with large thermodynamic stabilities amounting to ΔG = 10–15kBT per amino acid residue. As illustrated in the table, we see no clear pattern in the hydrophobicity and hydrophilicity scores of the two guest residues in the top performing candidates. This appears to indicate the absence of simple single-residue design principles for amphiphilic diblock ELPs and that the active learning screening has uncovered more subtle multibody design rules for the engineering of vesicle thermostability. As an encouraging post hoc validation of our active learning screening and choice of objective function, our screening discovered as the fourth-ranked candidate the (VPGHG)5(VPGLG)4 sequence that has been experimentally demonstrated by Schreiber et al. to form stable vesicles with lifetime of several hours.12 The remaining sequences have not, to our knowledge, been previously experimentally investigated. Interestingly, our screening quite strongly favors histidine as the guest residue in the hydrophilic block although we see more diversity in the residue identity in the hydrophobic block. Possessing ΔG per residue values 0.2–1.5kBT lower than the experimentally demonstrated (VPGHG)5(VPGLG)4 sequence, we propose that these candidates may represent particularly interesting opportunities for future experimental testing and the assembly of ultrastable peptidic vesicles. For computational tractability this study has focused on relatively short 45–50 residue peptides, but we conjecture that the rank ordering of the measured thermostabilities will be preserved upon moving to the ∼350-residue peptides frequently employed in experiment.
ELP sequence | Computed ΔG per residue (kBT) | Discovery iteration | Hydropathy score of X1 | Hydropathy score of X2 | Previously known? |
---|---|---|---|---|---|
H5V5 | −14.9 ± 0.2 | 2 | −3.2 | 4.2 | No |
H5F4 | −14.2 ± 0.2 | 5 | −3.2 | 2.8 | No |
H5V4 | −13.5 ± 0.1 | 3 | −3.2 | 4.2 | No |
H5L4 | −13.3 ± 0.3 | 0 | −3.2 | 3.8 | Ref. 12 |
H5F5 | −12.6 ± 0.1 | 6 | −3.2 | 2.8 | No |
H5L5 | −11.5 ± 0.4 | 1 | −3.2 | 3.8 | No |
H5C4 | −11.3 ± 0.1 | 4 | −3.2 | 2.5 | No |
Y5F4 | −11.2 ± 0.1 | 0 | −1.3 | 2.8 | No |
H5A4 | −10.5 ± 0.2 | 7 | −3.2 | 1.8 | No |
Y5I4 | −10.4 ± 0.3 | 0 | −1.3 | 4.5 | No |
In future work, we would expand the search space to longer chains. Experimentally, the diblock ELPs usually contain dozens of hydrophobic and hydrophilic blocks that enable relatively thick vesicles to be formed.11,12,21,22 These larger vesicles structurally resemble the compartments of biological cells1 and enable encapsulation of interesting bioactivities, such as compartmentalized peptide synthesis.11 Besides diblock ELPs, triblock ELPs with hydrophilic blocks on two ends and hydrophobic blocks in the middle have also been experimentally explored to form vesicles.32 Therefore, in future work it would be interesting to expand the search space to include larger diblock and triblock ELPs. It would also be interesting to consider other thermodynamic state points in temperature, pressure, and salt concentration to determine the degree to which the ELP rank ordering computed in this work is transferably preserved under other conditions relevant to the intended deployment environments for these vesicles. Although in principle the generic string kernel25 could operate on amino acid sequences with very different lengths, the evaluation of kernel could become computationally expensive as the sequence lengths grow.25 It might be helpful to first train an encoded representation, such as variational autoencoder66 or doc2vec model,86,87 to embed all peptides into an Euclidean space and perform Bayesian optimization over this embedding. The GPR model in this case would consist of kernel functions defined in the embedded space, such as radial basis kernel or Matérn kernel,73 that are less computationally expensive to evaluate. Another potential challenge is that the umbrella sampling simulations may become inefficient in larger systems due to the slower relaxation of the longer chains. Thus, it would be interesting to explore alternative enhanced sampling techniques, such as metadynamics52 or adaptive biasing force53 with the potential for improved sampling efficiencies. We also propose that our active learning screening approach can be generically extended to other applications in biopolymer and biomolecular design where it is necessary to navigate large sequence spaces by coupling the GPR/BO approach to alternative computational and/or experimental measures of performance such as protein–ligand binding free energy88 or antimicrobial activity.89,90
Footnotes |
† Electronic supplementary information (ESI) available: Table S1 listing the round in which each ELP candidate xi was sampled within the active learning screen, computed values of yi = ΔGi from enhanced sampling free energy calculations, and predictions ŷi = (xi) of the terminal GPR model. See DOI: https://doi.org/10.1039/d2me00169a |
‡ Present address: Department of Chemistry and Chemical Biology, Rutgers University-New Brunswick, Piscataway, New Jersey 08854, USA. |
This journal is © The Royal Society of Chemistry 2023 |