Open Access Article
Kirill
Shmilovich
a,
Sayak Subhra
Panda
b,
Anna
Stouffer
b,
John D.
Tovar
b and
Andrew L.
Ferguson
*a
aPritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, USA. E-mail: andrewferguson@uchicago.edu
bDepartment of Chemistry, Johns Hopkins University, Baltimore, Maryland 21218, USA
First published on 7th June 2022
Biocompatible molecules with electronic functionality provide a promising substrate for biocompatible electronic devices and electronic interfacing with biological systems. Synthetic oligopeptides composed of an aromatic π-core flanked by oligopeptide wings are a class of molecules that can self-assemble in aqueous environments into supramolecular nanoaggregates with emergent optical and electronic activity. We present an integrated computational–experimental pipeline employing all-atom molecular dynamics simulations and experimental UV-visible spectroscopy within an active learning workflow using deep representational learning and multi-objective and multi-fidelity Bayesian optimization to design π-conjugated peptides programmed to self-assemble into elongated pseudo-1D nanoaggregates with a high degree of H-type co-facial stacking of the π-cores. We consider as our design space the 694
982 unique π-conjugated peptides comprising a quaterthiophene π-core flanked by symmetric oligopeptide wings up to five amino acids in length. After sampling only 1181 molecules (∼0.17% of the design space) by computation and 28 (∼0.004%) by experiment, we identify and experimentally validate a diversity of previously unknown high-performing molecules and extract interpretable design rules linking peptide sequence to emergent supramolecular structure and properties.
The primary goal of the present work is to discover members of the Xn-quaterthiophene-Xn (Xn-4T-Xn) family of π-conjugated peptides capable of self-assembling into pseudo-1D nanoaggregates with in-register stacking of the quaterthiophene π-cores guided by the Xn peptide wings containing n = 1–5 amino acids. Overlap of the π-cores is a structural prerequisite to supramolecular π electron delocalization and the emergence of charge transport functionality. We chose to explore the quaterthiophene π-core due to its demonstrated applications in organic electronic field-effect transistors and photovoltaics19,20 and also our previous measurements of high charge mobilities in oligopeptide–quaterthiophene conjugates.14 The peptide wing containing n = 1–5 amino acids is denoted as Xn, where each X is one of the 20 natural amino acids. We limit the wing length to a maximum of five residues in order to simplify synthetic manipulation. We place two additional design constraints on the peptide wings. First, we require the oligopeptides to be head-to-tail invariant such that they are chemically symmetric about the quaterthiophene core: one peptide wing is symmetric to the other and each molecule possesses two C-termini. This symmetry imposes parity between the left and right sides of the molecular building blocks to promote the formation of linear supramolecular aggregates. Second, we require that the Xn sequence contain at least one acidic residue (i.e., Asp, Glu) and no basic residues (i.e., Arg, His, Lys). The acidic side-chains together with the two carboxyl C-termini allows us to wield pH control of the oligopeptide protonation state such that they possess a formal negative charge of at least (−4)e at high pH and are formally charge neutral at low pH. It has been previously shown through molecular modeling calculations and fluorescence correlation spectroscopy that intermolecular coulombic repulsion and enhanced molecular solvation at high pH disfavors assembly and maintains the system as a mixture of monomers and small oligomers.21 Acidification protonates the ionizable groups to eliminate the coulombic repulsion and serves as a trigger for large-scale supramolecular assembly. Under these two constraints, the Xn-4T-Xn family comprises 694
982 unique molecules possessing oligopeptide wings containing between n = 1–5 amino acids. The design challenge is to discover the members of this design space that self-assemble into the most highly ordered linear nanoaggregates.
The large volume of sequence space means that no more than a tiny fraction of molecular candidates can be experimentally explored due to the time and labor costs associated with oligopeptide synthesis and assays. Edisonian trial-and-improvement experimental search is therefore highly inefficient and limited. Chemical intuition can help focus the search, but prior knowledge is restricted to a small number of previously studied candidates and also introduces human bias that can impede the discovery of high-performing, non-intuitive solutions. Computational modeling presents a means to conduct high throughput in silico screening of molecular space. For example, Frederix et al. identified design rules for the assembly of tripeptide sequences by exhaustively simulating all possible amino acid combinations using coarse grained molecular dynamics simulation.22 For more complicated molecules and larger molecular search spaces, exhaustive enumeration becomes intractable and it is profitable to combine computational screening with data-driven modeling and active learning. The essence of this approach is to train on-the-fly sequence–property relationships over all computational screening data collected to date and use these models to guide subsequent rounds of the computational screen within a virtuous feedback loop.23–27 For example, Li et al. used machine learning algorithms such as random forests, gradient boosting, and logistic regression to predict the assembly and formation of hydrogels from possible peptidic precursors.28 Nagasawa et al. employed artificial neural networks and random forests for the discovery of conjugated polymers for organic photovoltaic applications.29 In the context of π-conjugated peptides, we previously combined coarse-grained molecular simulation with deep representational learning and Bayesian optimization to identify molecules predicted to exhibit superior assembly into pseudo-1D linear aggregates30 and we recently synthesized and tested perylene diimide based peptide-π conjugated materials based on quantitative structure property relation models trained over molecular simulation data.17,31
A deficiency of data-driven virtual screening is the weak coupling between computation and experiment. High-throughput virtual screening using computation is used as an initial coarse filtration of the design space that identifies a manageably small number of candidates for synthesis and testing in a subsequent low-throughput experimental screen.32,33 The serial nature of this process means that there is no provision to incorporate experimental feedback into the data-driven search of the design space. This is a lost opportunity since the experimental data can serve as a source of high-quality information to better guide the search and correct for approximations and uncertainties inherent in the computational models. A hybrid data-driven search comprising parallel computational and experimental screens has the potential to offer the best of both worlds – high-throughput approximate computation to achieve broad coverage of the design space and low-throughput experimentation directed towards the most promising candidates. The enabling component of such a procedure is a data-driven model capable of constructing on-the-fly sequence–property relationships from experimental and computational screens that operate asynchronously and in parallel and measure/predict different properties of the molecular system. The trained model is then used within an active learning paradigm to select the most promising molecules for subsequent rounds of computational and experimental screening.
In this work, we develop and deploy a hybrid computational/experimental active learning approach for the data-driven design of Xn-4T-Xn π-conjugated oligopeptides capable of self-assembling into pseudo-1D linear aggregates. We perform high-throughput computational screening using all-atom molecular dynamics simulations that predict the structural morphology of the self-assembled oligopeptide nanoaggregates. We conduct low-throughput experimental oligopeptide synthesis and characterize their assembly using UV-visible spectroscopy. We integrate the computational and experimental screening results to construct on-the-fly sequence–property models that performs asynchronous on-demand selection of the next batch of samples for computational or experimental screening. After sampling only 1181 (∼0.17%) of the 694
982 molecules in the design space by computation and only 28 (∼0.004%) by experiment, we discover and experimentally validate a diversity of previously unknown high-performing oligopeptides capable of spontaneously assembling supramolecular aggregates with a high degree of H-type character and extract interpretable design rules linking peptide sequence to emergent supramolecular structure and properties.
![]() | ||
| Fig. 1 Schematic of the hybrid computational/experimental active learning workflow for the discovery of self-assembling Xn-4T-Xn π-conjugated peptides. We perform separate computational and experimental active learning loops within a shared low-dimensional latent space embedding of the molecular design space learned using regularized autoencoders (RAE). Each active learning loop then consists of three parts. (i) Evaluating the quality of a given molecule k in the Xn-4T-Xn design space by either performing high-throughput all-atom molecular dynamics simulations to measure the average number of contacts per molecule κ(k) and radius of gyration Rg(k) of the self-assembled nanoaggregate or low-throughput experimental synthesis and measurement of the blue-shift λ(k) in the mode of the UV-vis spectrum. (ii) Fitting surrogate models using Gaussian process regression (GPR) to predict the performance of untested candidates given the accumulated simulation and experimental data collected to date. Two separate GPRs are maintained for the two computational objectives GPRκ and GPRRg. We build a multi-fidelity GPR (mfGPR) as our experimental surrogate model GPRλ that also incorporates data from the computational GPRs to improve prediction accuracy beyond what would be possible from the limited experimental data alone. (iii) Employing Bayesian optimization (BO) to interrogate the GPR model and select the next most promising molecular candidates for computation as those lying on the κ − Rg Pareto frontier and for experimentation as those with large values of λ. The molecular renderings in this figure, and throughout the paper, are generated using the Visual Molecular Dynamics (VMD) software.35 | ||
We simultaneously operate a low-throughput experimental screening loop. In this loop, we synthesize Xn-4T-Xn oligopeptides and characterize their assembly by the blue-shift λ in the mode of the UV-visible spectrogram between the unassembled (high-pH) and assembled (low-pH) states as a quantitative measure of the degree of H-type (i.e., co-facial) aggregation. Again we construct a surrogate sequence–property model relating oligopeptide sequence to the spectral shift where we use the same low-dimensional embedding furnished by the RAE. Importantly, the regression model we use to fit this relationship is a multi-fidelity GPR (mfGPR)41 that we train over both the experimental measurements and the computational predictions. Despite measuring different observables of the molecular system – κ and Rgvs. λ – the mfGPR can make use of the voluminous computational predictions to supplement the scarce experimental measurements to furnish a higher accuracy sequence–property model for the spectral shifts than would be possible using the experimental data alone. The computational and experimental screening loops operate asynchronously and in parallel and the data-driven GPR models are continually updated with each new batch of screening data.
The goal of the hybrid computational/experimental screening process is to discover and experimentally validate Xn-4T-Xn molecules with unprecedentedly large values of λ indicative of exceptional in-register π-stacking and H-type character that is a prerequisite for supramolecular electronic delocalization and emergent optical and electronic functionality.
The fitness of a particular Xn-4T-Xn molecule was defined within our molecular simulations based on its ability to form pseudo-1D linear nanoaggregates with in-register stacking of the π-cores. We quantified this structurally via the average number of contacts per molecule κ and the radius of gyration Rg (ref. 47) of the self-assembled nanoaggregates averaged over the terminal 50 ns of our production runs. An intermolecular contact is defined according to our previously reported “optical distance” dopticali,j that measures the minimum intermolecular distance between any pair of thiophene rings within the π-cores of molecules i and j.30,48–50 We have previously shown that adopting a threshold of dopticali,j < 0.7 nm by which to define an intermolecular contact assures close proximity and in-register π-stacking between at least one pair of thiophene rings.30,49–51 Our computational active learning loop ultimately aims to explore the Pareto frontier of molecules and discover well assembling π-conjugated peptides that possess good thiophene π–π stacking (i.e., large κ) within high-aspect ratio linear nanoaggregates (i.e., large Rg). In general, we found maximization of either objective function alone was insufficient to promote in-register stacked pseudo-1D nanoaggregates: high κ in the absence of high Rg corresponds to globular structures with promiscuous multi-molecular π-stacking, whereas high Rg in the absence of high κ corresponds to weakly associated elongated threads lacking π-core overlaps. As we will show, the hybrid computational/experimental active learning framework learns the appropriate balance between κ and Rg that is most predictive of high-performing experimental candidates with large values of λ.
982 molecules is large, discrete, and high-dimensional. It is possible to perform active learning directly over this space using, for example, kernel models,23,52–57 but superior search efficiencies can be achieved by first projecting the molecular design space into a smooth, continuous, low-dimensional space that is more amenable to the construction of robust regression models and deployment of optimization algorithms.26,30 We learn a bespoke latent space for the Xn-4T-Xn family using a regularized autoencoder37 (RAE) trained on our entire molecular design space (Fig. S3 in the ESI†). An RAE is a deterministic variant of the popular variational autoencoder (VAE) architecture58 that possesses the potential advantage of enabling more expressive flexibility in the data distribution within the low-dimensional embedding by not assuming or enforcing a prior (generally Gaussian) distribution over the latent dimensions. We do not perform a training/validation/testing split since the purpose of our RAE model is to provide a low-dimensional latent space embedding of the complete design space of all 694
982 candidate molecules for downstream active learning, not as a predictive tool for out-of-sample inference beyond this design space. As such, we do not use the decoder to generate new candidate molecules and it is therefore not a disadvantage for our applications that the latent distribution is not approximately Gaussian distributed, which is generally desirable for efficient generative sampling. We have previously demonstrated an application of RAEs in the data-driven design of small drug-like cardiolipin-selective molecules.59 This encoder–decoder architecture consists of a message passing neural network60,61 encoder and a decoder performing explicit graph matching to ensure end-to-end invariance of node permutations. The encoder62–64 accepts graphical representations of the Xn peptide sequences the nodes of which are featurized by the 553 single amino acid physiochemical properties contained in the AAindex database and the edges of which are featurized by the 135 pairwise amino acid contact potentials and mutation matrices.65 The bottleneck layer defining the interface between the trained encoder and decoder contains a nonlinear latent embedding of the molecular design space ξ learned by the encoder that preserves relationships between molecular candidates and is sufficiently informative for the decoder to accurately reconstruct the molecular graphs (Fig. S4 in the ESI†). We determine an appropriate dimensionality d = 32 for the latent space using exploratory hyperparameter tuning optimizing the RAE reconstruction loss.66,67 This low-dimensional latent space defines a smooth and continuous representation ξ of the molecular design space over which fit (mf)GPR surrogate models and conduct BO-enabled active learning.
= f(ξ) and
that perform supervised learning of mappings from the latent space coordinates to the two structural observables. The predictions of the two GPRs are passed to a multi-objective BO routine that seeks to simultaneously maximize κ and Rg using the method of random scalarizations.39,40 The trained GPRs for κ and Rg are used to construct two independent BO upper confidence bound (UCB)68 acquisition functions defining the relative desirability of each candidate Xn-4T-Xn molecule in maximizing each of these two objectives. We then collapse these two acquisition functions into a single scalarized acquisition function constructed as a randomly weighted linear sum. The scalarized acquisition function is then used to perform univariate Bayesian optimization. A particular random scalarization corresponds to a particular choice of relative weightings between the two design objectives and defines a vector within the 2D κ − Rg space along which to maximize. Under sufficiently many repeated random scalarizations, the random vectors span the κ − Rg Pareto frontier to discover a family of Pareto optimal solutions. We perform batched selection over different random scalarizations and choices of the UCB hyperparameter balancing the BO exploit–explore tradeoff to propose 25 new candidate molecules per round. MD simulations of these 25 molecules are performed and the three part active learning cycle is iteratively repeated. We assess convergence of the iterative screen by monitoring the set of Pareto optimal points that define the κ − Rg Pareto frontier and terminate sampling once the Pareto frontier ceases to advance with additional rounds of sampling. We conducted 38 rounds of computational screening over which we considered 1181 candidate molecules. A full accounting of the molecules identified in each round of the computational active learning loop is provided in the Data availability statement.34
= h(ξ) to predict the spectral shift λ as a function of latent space coordinates. In this case we have only a single objective function λ but we wish to construct a multi-fidelity surrogate model incorporating both direct experimental measurements of λ and computational predictions of κ and Rg. The rationale is that the computational predictions for κ and Rg should be correlated with and predictive of the experimental measurements of λ. This is expected to be the case since κ and Rg are structural measures of the degree of in-register π-stacking in elongated nanoaggregates that are prerequisites for H-aggregate character manifested in measurements of λ. A multi-fidelity model trained to learn a nonlinear mapping from the low-fidelity computational predictions to high-fidelity experimental measurements can take advantage of the abundant computational data to produce a superior model than that obtained by training over only the sparse experimental data alone. Indeed, by the terminal round of experimental active learning, incorporation of computational screening data within the multi-fidelity paradigm leads to a ∼27% improvement in the predictive accuracy of our surrogate model compared to a single-fidelity model fitted only over the experimental observations (Fig. S5 in the ESI†). This predictive improvement highlights the capabilities of our mfGPR to leverage plentiful low-fidelity data from our computational κ and Rg metrics to improve our predictive performance for our target high-fidelity objective in λ. We observe that the flexibility of the mfGPR framework enables this information transfer even in situations where the low- and high-fidelity observables are measuring different properties, are only weakly correlated or even anti-correlated, and where the degree of correlation may change over the domain.71 The only requirement is that the low-fidelity response surface is informative of the high-fidelity response surface and that this relationship can be extracted from the data within the mfGPR model.
We construct multi-fidelity surrogate models fusing the computational (low-fidelity) and experimental (high-fidelity) data using the multi-fidelity Gaussian process regression (mfGPR) formalism.41 The mfGPR model is then passed to a standard BO routine39 employing an expected improvement (EI) acquisition function39,72,73 and the Kringing believer74 batched sampling. We use the BO to propose a batch of molecules for the next round of sampling, which we manually down-select to 8–9 molecules for experimental synthesis and characterization. By incorporating “human-in-the-loop” curation of the selected molecules we hope to balance purely data-driven candidate proposal with chemical intuition and thereby incorporate some degree of prior knowledge and human experience into the search process without, we hope, imposing too much bias on the search. The success of this collaborative human-machine paradigm has been previously demonstrated in the data-driven discovery of molecular organic light emitting diodes.33 The experimental measurements are fed back into the low-throughput experimental active learning loop that is executed asynchronously and in parallel with the high-throughput computational loop. We execute three rounds of experimental active learning over the course of the 38 rounds of computational active learning that are executed at computational rounds 0, 14, and 22. Given the good performance of the candidates studied in the third experimental round together with the relatively modest advances in the Pareto frontier observed over computational rounds 23–38, we elected to terminate our experimental screen after its third round. A full accounting of the molecules identified in each round of the experimental active learning loop is provided in Table 1 and have been made available as detailed in the Data availability statement.34
| Rank | Peptide wing, Xn | Measured spectral shift, λ (nm) | Discovery round | Previously known? |
|---|---|---|---|---|
| 1 | DGG | 55.06 ± 1.00 | 2 | Yes70,86 |
| 2 | DG | 53.42 ± 4.16 | 0 | N |
| 3 | ESA | 50.01 ± 0.53 | 2 | N |
| 4 | EGG | 49.97 ± 1.00 | 0 | Yes14 |
| 5 | ETGG | 45.80 ± 0.62 | 2 | N |
| 6 | DGA | 44.15 ± 1.49 | 2 | N |
| 7 | DDDAA | 42.07 ± 0.46 | 2 | N |
| 8 | DVAA | 41.65 ± 1.15 | 0 | N |
| 9 | DSG | 40.32 ± 1.15 | 1 | N |
| 10 | AEVGA | 40.15 ± 1.12 | 2 | N |
| 11 | DVAG | 35.70 ± 1.49 | 0 | N |
| 12 | DNDN | 29.58 ± 4.76 | 1 | N |
| 13 | DANN | 25.70 ± 0.32 | 2 | N |
| 14 | VEFAG | 21.75 ± 2.07 | 0 | N |
| 15 | VEVEV | 18.43 ± 0.62 | 0 | N |
| 16 | VD | 18.02 ± 2.66 | 0 | N |
| 17 | AAD | 15.98 ± 1.00 | 0 | N |
| 18 | EYIQG | 15.01 ± 7.18 | 1 | N |
| 19 | EV | 14.70 ± 1.20 | 0 | N |
| 20 | DT | 14.70 ± 1.20 | 1 | N |
| 21 | AAED | 13.68 ± 1.46 | 0 | N |
| 22 | SSD | 13.68 ± 1.15 | 1 | N |
| 23 | VEF | 11.99 ± 1.68 | 0 | N |
| 24 | DLAG | 11.49 ± 0.46 | 2 | N |
| 25 | GFGFD | 10.97 ± 1.77 | 1 | N |
| 26 | DGL | 10.25 ± 1.20 | 1 | N |
| 27 | IDSV | 7.70 ± 3.83 | 1 | N |
| 28 | EN | 4.33 ± 1.49 | 1 | N |
100 simulation snapshots harvested from the ensemble of 1181 simulation trajectories. We define these pairwise distances using the smooth overlap of atomic positions (SOAP) kernel81–84 between the heavy atoms constituting the 4T π-cores as a distance metric that is naturally invariant to rotations, translations, and permutations of atoms, and which – as a π-core-centric metric – can be applied between oligopeptides with different wing lengths. The influence of the wings is implicitly retained through their impact on the configurations adopted by the π-cores. A density-adaptive variant of diffusion maps85 is then applied to furnish embeddings of the assembly trajectories into a 2D manifold that exposes the assembly pathways and mechanisms followed by the various Xn-4T-Xn molecules. We note that our diffusion maps furnish a low-dimensional embedding of the configurational coordinate space traversed by our MD simulations, and that this embedding is completely independent of the low-dimensional embedding of the molecular design space furnished by the RAE that is employed within our active learning protocol.
:
1 MeOH
:
water solution with 1% ammonium hydroxide. ESI spectra for each synthesized peptide are provided in the ESI: ESI spectra.†
982 Xn-4T-Xn candidate molecules. We conduct 38 rounds of computational screening to simulate a total of 1181 Xn-4T-Xn molecules interleaved with three rounds of experimental screening in which we synthesize and test a total of 28 molecules. A full accounting of the molecules identified in each round of the computational and experimental active learning loops are provided as detailed in the Data availability statement.34 Round 0 of the computational and experimental screens commence simultaneously, respectively screening 228 and 11 molecules. The two screens then proceed asynchronously and in parallel. The high-throughput computational loop considers 25 candidate molecules per round and iterates more rapidly than the low-throughput experimental screen that considers 8–9 molecules per round. We track the progress of the experimental screen via the measured spectral blue shifts λ upon assembly as a measure of the prevalence of H-type co-facial π-stacking (Fig. 2a). We track the progress of the computational screen via the advancement in the κ − Rg Pareto frontier that we quantify through the mean distance from the origin of all ni molecules cumulatively simulated over the first i rounds (Fig. 2a and b),![]() | (1) |
We recall that large values of the average number of π-core contacts per molecule κ and radius of gyration of the self-assembled aggregates Rg are anticipated to lead to correlate with the formation of pseudo-1D nanoaggregates with high degrees of H-type character.
Round 1 of the experimental screen is performed upon completing computational round 14, at which time a total of 578 candidate molecules have been computationally assessed. These computational screening data are passed to the experimental surrogate model and Bayesian optimization within the multi-fidelity hybrid computational/experimental active learning framework in order to better inform the design of experimental round 1 than would be possible by analyzing the 11 experimental data points alone. Under our human-in-the-loop selection protocol, we selected nine molecules for round 1 of experimental screening by filtering a 35-molecule list outputted from our BO routine. Down-selection was performed on the basis of high anticipated performance based on our previous experimental and computational work15,30 and maintenance of a diversity of peptide wing compositions and lengths. This human-in-the-loop selection process serves as a simple means to inject prior knowledge into the data-driven search process, which can be particularly valuable in the early stages of the search where the models are trained over small numbers of data points, by directing the search process to regions of molecular design space that are anticipated to be particularly profitable.
Round 2 of the experimental screen is conducted after completing computational round 22, at which point we have simulated 780 candidate molecules. Again, the totality of these computational screening data are used to augment the multi-fidelity experimental surrogate model and used to pick eight candidate molecules for experimental testing down-selected from the 75 top candidates proposed by the BO routine. We inject one addition piece of human intuition into the down-selection process by making a single Tyr to Ala amino acid mutation of one of the predicted sequences – YEVGA to AEVGA – based on prior understanding that aromatic side chains are known to π-stack with the π-cores and therefore liable to disrupt favorable in-register 4T stacking.30 This modification is substantiated by both observing low-ranking candidates possessing bulky aromatic residues in the first two experimental rounds (EYIQG: rank 18/28, VEF: rank 23/28, GFGFD: rank 25/28) along with previous experimental work noting the presence of aromatic residues resulting in reduced UV-vis blue-shifts.15
We continue to conduct an additional 16 rounds of computational screening (rounds 23–38) while experimental round 2 is being completed in anticipation of possibly conducting a fourth experimental round. As illustrated in Fig. 2a, we observe a relatively rapid expansion of the computational exploration of the κ − Rg design space over the course of computational rounds 0–20, which we quantify by dPareto(i) defining the mean distance from the origin of all molecules cumulatively simulated over the first i rounds. This trend, however, plateaus at dPareto ≈ 3.3 by round 29 and exhibits only a 0.7% increase in dPareto relative to round 22. This observation is mirrored by a relatively modest advancement of the Pareto frontier between rounds 23–38 (Fig. 2b). Experimentally, the mean spectral shift λ in experimental round 2 is 46% better than round 0, and the top performing round 2 candidate has a 3% better spectral shift compared to that in round 0. The diminishing returns evinced by the computational dPareto and the successful discovery of a molecule with superior λ impelled us to terminate our search after experimental round 2/computational round 38.
In all, we simulated 1181 molecules comprising ∼0.17% of the 694
982 molecules constituting the Xn-4T-Xn design space, corresponding to 236.2 μs of simulation time, and requiring ∼4.97 GPU-years of parallel compute. Experimentally, we synthesized and characterized a total of 28 Xn-4T-Xn molecules over the course of the course of eight months corresponding to exploration of 0.004% of the molecular space.
We present in Fig. 3 an embedding of all 1181 simulated molecules and 28 experimentally tested molecules into the κ − Rg objective function space used to identify high-performing molecules in the computational screening loop. Molecular renderings of the self-assembled nanoaggregates provides qualitative visual conformation that Xn-4T-Xn molecules producing aggregates with large values of both κ and Rg do indeed tend to self-assemble into pseudo-1D structures with good stacking of the 4T π-cores.
![]() | ||
| Fig. 3 Embedding of the computational and experimental molecules sampled in the active learning screen into the κ − Rg objective function space. (a) Embedding of the 1181 Xn-4T-Xn molecules explored in the computational screen. Points shown in the same color were sampled during the same computational active learning round. The labels associated with each point corresponds to the Xn peptide wing sequence. (b) Highlighting the 28 experimentally tested Xn-4T-Xn molecules (colored points) superposed onto all 1181 computationally simulated points (grey points). The color indicates the round of experimental active learning within which the molecule was tested. Encircling the plot are snapshots from our molecular simulation trajectories showing the terminal self-assembled nanoaggregates. The heavy atoms constituting 4T π-cores are rendered as gold space-filling spheres and the Xn peptide wings as semi-transparent ball-and-stick representations. Molecules producing aggregates with large values of both κ and Rg tend to self-assemble into pseudo-1D structures with good stacking of the 4T π-cores. For the experimentally tested candidates we also report the measured values of the spectral blue shift λ. A full accounting of the computed κ and Rg values and measured λ values for all molecules considered in our screen have been made available as detailed in the Data availability statement.34 | ||
The primary result of our hybrid computational/experimental active learning screen are experimental measurements of 28 Xn-4T-Xn molecules reported in Table 1. The 11 round 0 molecules were selected based on human intuition and used to seed the experimental active learning. The nine round 1 and eight round 2 molecules are the result of our multi-fidelity active learning search. Of these molecules, 26 are completely novel and on par with known high-performing sequences14,70,86 while also possessing greater diversity amino acid sequences previously unknown to correlate with good spectral blue shifts λ. The high values of λ for these molecules are indicative of a high degree of H-type co-facial stacking and the potential for long-range supramolecular electronic delocalization and emergent optoelectronic functionality. Additionally, we were encouraged that our active learning procedure spontaneously discovered DGG-4T-GGD as a previously known high-performing candidate.70,86 A complete list of predicted κ and Rg values from the terminal computational surrogate model and predicted spectral shift measurements λ from the terminal experimental surrogate model for all 694
982 molecules within the Xn-4T-Xn design space have been made available as detailed in the Data availability statement.34
Second, consistent with the favorability of core-adjacent Gly and Ala residues, the non-C-terminal amino acids within the Xn peptide sequences of the top-performing molecules tend to also be enriched in small hydrophobic residues such as Ala, Gly or Val (one-sided Mann–Whitney U test, p-value = 0.004). Interestingly, residues containing polar hydroxyl groups such as Ser and Thr are also over-represented within high-performing sequences when Ser or Thr are non-terminal residues and Asp or Glu are C-terminal such as ESA: rank 3/28, ETGG: rank 5/28, and DSG: rank 9/28 (one-sided Mann–Whitney U test, p-value = 0.03). Other polar residues like Asn also perform relatively well in the π-core proximate position when Asp occupies the distal slot (DNDN: rank 12/28; DANN: rank 13/28).
Third, the presence of larger hydrophobic and bulky aromatic residues such as Leu, Ile, Phe, and Tyr at any location are correlated with poorer performing candidates such as DLAG: rank 24/28, DGL: rank 26/28, EYIQG: rank 18/28, and IDSV: rank 27/28, with the poorest-performing candidates enriched in these four amino acids (one-sided Mann–Whitney U test, p-value = 0.002). We have previously shown that favorable interactions between these large hydrophobic residues and the aromatic π-cores can disrupt supramolecular association between the cores.30 This observation is also consistent with our prior observation that oligopeptides possessing oligophenylenevinylene π-cores exhibited larger spectral blue shifts when the peptide wings contained small Gly and Ala residues compared to larger Ile and Val residues.15
Finally, although no Pro containing molecules were sampled in the experimental screen, we note that a number of simulated molecules containing a Pro residue lie at or near the κ − Rg Pareto frontier and were highly ranked in predicted blue shift λ by the terminal mfGPR surrogate model (e.g., DPG: rank 329/694
982, AEPP: rank 183/694
982, SDPD: rank 10/694
982, EAP: rank 13/694
982, DDPA: rank 23/694
982, GEPG: rank 15/694
982). This finding is somewhat surprising because Pro has been largely unexplored in previous experimental and computational π-conjugated peptide studies. Proline, with its unique conformational properties including its conformational rigidity and absence of hydrogen bond donor capacity, appears to be quite favorable in promoting good in-register stacking between the π-cores and the formation of high-aspect ratio nanoaggregates. We suggest that experimental testing and further computational exploration of the molecular mechanisms underpinning these predictions may be a profitable avenue for future investigations.
In Fig. 4c–f we illustrate the temporal evolution of the self-assembly pathways for particular Xn-4T-Xn molecules over the ψ2 − ψ3 manifold. Each molecular trajectory begins at the right-most edge of the of the manifold corresponding to the initial monodisperse state. Lateral leftward movement across the manifold corresponds to condensation of the system to smaller Rg values due to the formation of nanoaggregates. Vertical upward movement corresponds to the accumulation of inter-molecular contacts between the π-cores and an elevation in κ. The assembly pathways of the top-performing candidates typically terminate in the upper-left corner of the manifold that contains pseudo-1D nanoaggreates containing κ ≈ 3 intermolecular contacts and Rg ≈ 2 nm corresponding to elongated linear stacks. We observe that Rg necessarily decreases as the system self-assembles from dispersed monomers and that this behavior is not in conflict with the active learning goals of maximizing Rg and κ of the self-assembled nanoaggregates in order to promote high-aspect ratio linear morphologies with good π–π stacking. One prototypical class of assembly pathways for high-performing molecules is exemplified by EGG-4T-GGE (rank 4/28), which traverses the upper edge of the manifold (Fig. 4c). This assembly route corresponds to the rapid formation of small oligomeric stacks, the formation of which is likely promoted by the small size of the peptide wing, that ultimately associate into an elongated aggregate with good in-register and global stacking. DVAG-4T-GAVD (rank 11/28) is another high-performing candidate that is prototypical of a different assembly route followed by high-performing molecules (Fig. 4d). This pathway commences with an initial rapid hydrophobic aggregation of the system corresponding to a rapid leftward lateral motion over the manifold. The absence of any early upward vertical motion is indicative of no initial substantive increase in κ due to the larger peptide wings seemingly preventing good π-core stacking. This initial collapse is, however, then followed by a more gradual structural ripening as the cores do achieve good stacking and we observe late upward motion over the manifold corresponding to an increase in κ.
Trajectories that terminate within the bulk of the manifold and far from the upper-left corner typically fail to form nanoaggregates containing globally connected pseudo-1D stacks. DLAG-4T-GALD (rank 24/28) is emblematic of a poor-performing molecule that initially builds a reasonable number of intermolecular contacts, but then fails to further condense into an in-register stacked nanoaggregate (Fig. 4e). Differing only in a V to L mutation relative to the high-performing DVAG-4T-GAVD, the presence of the bulkier hydrophobic Leu residue appears to preclude structural ripening into the desired elongated stack. Finally, molecules rich in large hydrophobic side chains such as EYIQG-4T-GQIYE (rank 18/28) tend to exhibit moderate leftward motion over the manifold corresponding to hydrophobic collapse but accompanied with unfavorable downward motion indicative of the formation of very few intermolecular π-core contacts (Fig. 4f). This behavior can be attributed to the bulky aromatic hydrophobes that stack against the π-cores and prevent the formation of core–core contacts.
Whereas Fig. 4 provided anecdotal insights into the self-assembly trajectories traced out by particular representative Xn-4T-Xn molecules, in Fig. 5 we present the entire distribution of trajectory end points for all molecules considered in our active learning screen. In Fig. 5a and b we illustrate the end points of the 1181 molecules sampled in our computational screen colored by the Rg and κ values of the terminal nanoaggregates and in Fig. 5c the 28-molecule subset of these candidates that were experimentally tested colored by the measured spectral blue shift λ. Focusing on the 28 experimental molecules, we observe a clustering of 17 molecules in the upper left region of the manifold that we bound by a purple box. As anticipated by the understanding exposed by the diffusion map, these molecules tend to be high-performers comprising nine of the top 11 experimentally-tested molecules with spectral blue shifts λ ≥ 35 nm. Further, the molecules within the box possess a mean spectral shift of λ = 31 nm compared to the mean value for those outside the box with λ = 21 nm (one-sided Mann–Whitney U test, p-value = 0.07). The correlation between (Rg, κ) and λ within the diffusion map embedding provides a strong post hoc substantiation for the use of the former measures as a computational proxy for the latter within the active learning screen, and demonstrates the power and value of the high-throughput computational screen in focusing and guiding the low-throughput experimentation.
982 candidate molecules, making its exhaustive exploration impracticable by either simulation or experiment. By fusing computational and experimental data streams within an integrated computational–experimental active learning framework, we perform a data-driven efficient traversal the space of Xn-4T-Xn peptides that minimizes computational and experimental burden required to discover and validate new high-preforming candidates. Our platform employs a combination of all-atom molecular dynamics simulations, deep representational learning, single- and multi-fidelity Gaussian process regression, and single- and multi-objective Bayesian optimization. A computational active learning loop serves as a high-throughput and cheaply available experimental proxy used to refine a surrogate model that predicts the experimental performance of untested candidates. Using this platform, we discovered a diversity of high-performing new molecules experimentally validated to form pseudo-1D linear nanoaggregates after sampling only 1181 molecules (∼0.17% of the design space) by computation and 28 (∼0.004%) by experiment. Subsequent interrogation of our experimental screening data exposed molecular design rules linking sequence to the emergent structure and function of the self-assembled nanoaggregates. Analysis of the computational screening results revealed two prototypical assembly mechanisms and pathways shared by the high-performing molecules: (i) hierarchical assembly of small in-register supramolecular oligomers that undergo further assembly into a single linear aggregate with ordered π-stacking and (ii) rapid hydrophobic collapse followed by slow structural ripening and the emergence of in-register ordering of the π-cores.
This work exposes new understanding of how variation in oligopeptide sequence in π-conjugated peptides impacts assembly behavior using an integrated experimental-computational active learning platform. Our findings corroborate prior physico-chemical understanding and chemical intuition of π-conjugated peptide assembly, but also reveals new design rules and understanding of molecular assembly mechanisms. Our hybrid computational/experimental active learning platform demonstrates the power of tightly integrated collaboration between theory and experiment, and this paradigm is transferable to other generic molecular design and discovery applications.
Footnote |
| † Electronic supplementary information (ESI) available. See https://doi.org/10.1039/d1dd00047k |
| This journal is © The Royal Society of Chemistry 2022 |