Bálint Tamás‡, Pietro Luigi Willi‡, Héloïse Bürgisser and Nina Hartrampf*
Department of Chemistry, University of Zurich, Winterthurerstrasse 190, 8057 Zurich, Switzerland. E-mail: nina.hartrampf@chem.uzh.ch
First published on 3rd January 2024
Computer-assisted methods, which hold the promise to transform synthetic organic chemistry, are often limited by experimental data lacking in quality, diversity, and quantity. In solid-phase peptide synthesis (SPPS), automated flow chemistry is well-suited to deliver such data, which is key for prediction and optimization of sequence-dependent “difficult couplings”, and insights obtained in flow-SPPS can be transferred to batch-SPPS. The current data analysis techniques rely on the height and the width of fluorenylmethoxycarbonyl (Fmoc) deprotection peaks and perform well under standard conditions. Yet any deviation in parameters (e.g. temperature, flow rate, resin loading) leads to incomplete capture of information and exclusion from the dataset. Here, we present a flexible and robust processing and analysis method that is based on the Gaussian shape of the deprotection peaks to overcome these challenges, which drastically increases the interpretable size of our data set. Using this straightforward method retains the full information and data quality while the generation of hazardous dimethylformamide solvent waste is reduced by 50%. Overall, this work highlights how the interplay between synthetic and computational analysis enables the collection of high-quality data even under non-ideal, non-standard conditions.
Fig. 1 Sequence-dependent aggregation leads to “difficult couplings”, which severely impact solid-phase peptide synthesis (SPPS). A) Aggregation occurs through the formation of β-sheets between growing peptide chains.5,18 B) In-line UV–vis analysis in flow-SPPS allows for monitoring of reaction efficiency and kinetics. C) Comparison of the aggregation factor, obtained by subtracting the normalized height from the normalized width of in-line collected UV peaks, with the peak angle, which exploits the asymmetry and Gaussian character of the peak.
The phenomenon of “difficult sequences” has hampered SPPS since its conception in 1963, and flow chemistry allows for its monitoring (Fig. 1B):3,7,9 while UV–vis monitoring of fluorenylmethyloxycarbonyl (Fmoc)-deprotection steps in batch-SPPS gives indirect information about the coupling success itself (absolute amount of Fmoc removed), flow-SPPS additionally allows the detection of aggregation, which is characterized by a broadening of the Fmoc-deprotection peak (deprotection peak shape).10–12 The systematic study of aggregation, however, is impeded by the sheer number of possible sequences. For example, a short 10-mer peptide made up from the 20 naturally occurring amino acids already results in ∼10¹³ unique sequences. Additionally, 12 of the 20 amino acids carry sidechain protecting groups, required for SPPS, which can be modified without altering the final amino acid sequence itself. Finally, non-canonical amino acids can expand the sequence space in SPPS even further. The features of each building block (functional groups, protecting groups, stereocenters)13 as well as their position in the peptide sequence may impact aggregation during SPPS. A more detailed understanding and prediction of synthesis conditions or preferred building blocks to circumvent aggregation would therefore require a larger data set and advanced data analysis methods.12
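The size of this sequence space follows from simple counting. As a minimal sketch, assuming only the 20 canonical residues, a free choice at every position, and no protecting-group variants (the function name is illustrative):

```python
def sequence_space(n_building_blocks: int, length: int) -> int:
    """Number of linear sequences of `length` residues drawn freely
    from an alphabet of `n_building_blocks` building blocks."""
    return n_building_blocks ** length


n_sequences = sequence_space(20, 10)
print(f"{n_sequences:.2e}")  # 1.02e+13
```

Each additional protecting-group variant or non-canonical residue enlarges the alphabet and therefore the base of this power.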
High-throughput data analysis is becoming a powerful tool in organic chemistry, especially when coupled with automated reaction setups and in-line data collection.14,15 The latter is crucial for capturing both positive and negative results on automated synthesis platforms,16 and can even be used for the optimization of multistep reactions.17,18 For peptides, an automated fast-flow peptide synthesis (AFPS) platform developed by Pentelute and co-workers allows for the collection of in-line UV–vis data to analyze aggregation (Fig. 1B).19,20 With rapid reaction times enabled by elevated temperatures, AFPS is ideally suited to collect ample analytical data to ultimately decipher the cause of aggregation.12 Current data processing methods use the height and the width of the Fmoc-deprotection signals. However, this method is best suited to data sets showing defined UV-trace baseline resolution—without peak oversaturation—and using standardized synthesis conditions (e.g., flow rate and coupling temperature). Any data collected under non-standard synthesis conditions, or showing non-ideal raw data (e.g., saturated and non-baseline resolved peaks) therefore cannot be analyzed, leading to significant shrinking of the interpretable data set.12 Washing steps with excessive amounts of hazardous dimethylformamide (DMF) solvent are, for example, required to obtain interpretable data with baseline-resolved peaks, even though these steps are likely not required for SPPS itself. New data processing methods are therefore needed to handle non-standard and non-ideal data sets, in order to maximize workflow and solvent efficiency.
We now report the development of a new and robust data processing and analysis method for in-line UV data collected in flow-SPPS (Fig. 1C). Previous methods to analyze these data are characterized by two major limitations: (1) a normalization step is required, which leads to information loss (e.g., on the impact of temperature, solvent volume, resin, and linker) as well as error propagation, and (2) analyzing the difference between peak height and width at half maximum (“aggregation factor”) requires exact knowledge and determination of peak baseline and height, leading to exclusion of many “saturated” or unresolved peaks using previous processing methods. To overcome these limitations, we based our new method on the Gaussian shape of the deprotection peaks, rather than their height and width. Using this method, we investigate the resin loading- and temperature-dependence of aggregation and “difficult sequences” in flow-SPPS, and demonstrate that these findings also directly translate to batch-SPPS.3,4,7 The robustness of our new method furthermore enables analysis of saturated and non-baseline resolved peaks, which ultimately results in a 50% reduction of hazardous DMF used in flow-SPPS without losing analytical information.
Fig. 2 Analyzing the peak angle to obtain absolute values on Fmoc-deprotection peak broadening. i) The peak front and tail are separated at the median of its maxima. Both are mirrored and a parameterized Gaussian function (A = recorded UV absorption, t = time, and a, b, c = fitted parameters) is fitted on them separately (dashed line). ii) The function is differentiated with respect to time to give the first derivative (black line) of the fitted Gaussian (grey line). iii) The gradient is determined at the inflection point (maximum of the derivative) and the peak angles are calculated as shown in the formula. The angles belonging to the front and tail of the peak are summed to obtain the peak angle describing the full peak. For a more detailed description of peak angle fitting see ESI† chapter 2.
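The steps in this caption can be illustrated once the Gaussian parameters are known. The following is a minimal sketch and not the authors' implementation (their code is in the repository linked in the footnotes). Two things are assumed here because the formula itself is not reproduced in the text: the Gaussian is taken as A(t) = a·exp(−(t − b)²/(2c²)), and the angle of each half is taken as the angle between the tangent at the inflection point and the vertical axis, so that broader peaks give larger angles. The resulting value depends on the units chosen for both axes.

```python
import math


def inflection_gradient(a: float, c: float) -> float:
    """Steepest slope of A(t) = a*exp(-(t-b)**2 / (2*c**2)).

    The maximum of the first derivative lies at t = b - c, where
    dA/dt = (a / c) * exp(-1/2); the position b drops out.
    """
    return a / abs(c) * math.exp(-0.5)


def half_peak_angle(a: float, c: float) -> float:
    """ASSUMED definition: angle in degrees between the tangent at the
    inflection point and the vertical axis."""
    return 90.0 - math.degrees(math.atan(inflection_gradient(a, c)))


def peak_angle(front: tuple, tail: tuple) -> float:
    """Sum of the front and tail angles; each argument is an (a, c) pair
    from the separate fits of the mirrored front and tail."""
    return half_peak_angle(*front) + half_peak_angle(*tail)


# Hypothetical fit results: same height, tail three times broader.
print(peak_angle(front=(1.0, 1.0), tail=(1.0, 3.0)))
```

Working from the analytical derivative avoids numerical differentiation of noisy UV traces; only the fitted height and width of each half enter the result.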
The total amount of resin with a given loading has a major effect on the Fmoc-deprotection peak area and angle but should not have an impact on aggregation itself. Therefore, to standardize across a data set with varying resin amounts, we investigated the introduction of a mass correction factor. To confirm that the absolute resin mass (with identical loading) does not impact aggregation, Barstar[75–90] (aggregating) and NBDY[53–68] (non-aggregating) were synthesized on various resin masses (50, 100, 150, 200 mg, resin loading = 0.41 mmol g−1). By in-line UV–vis analysis, an excellent linear correlation between resin mass, the integral of the deprotection peak, as well as the peak angle was observed with no impact on crude peptide purity from each experiment (see ESI3.1.1†). Owing to the peak angle's (α) approximate linear scaling with mass (m), an arbitrary “standard mass” (mst) of 150 mg resin was chosen, and all other syntheses were scaled accordingly. The calculated mass correction factor was then implemented into the peak angle function, giving a “resin mass-independent peak angle” (αst) (see Fig. 3A). In addition to the definition of a mass correction factor, the linear correlation of deprotection integral by in-line UV–vis and the resin mass (Fig. 3B) allows for the indirect determination of resin loading and the prediction of truncation and deletion side-products (see ESI2.7†).
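The mass correction can be sketched as a simple rescaling. The exact function is given in Fig. 3A and is not reproduced here, so this example assumes strict proportionality between peak angle and resin mass (no intercept); the function and constant names are illustrative.

```python
M_STANDARD_MG = 150.0  # "standard mass" of resin chosen in the text


def mass_corrected_angle(alpha: float, resin_mass_mg: float,
                         m_standard_mg: float = M_STANDARD_MG) -> float:
    """Rescale a measured peak angle to the standard resin mass.

    ASSUMPTION: the angle is directly proportional to the resin mass,
    so alpha_st = alpha * m_st / m.
    """
    if resin_mass_mg <= 0:
        raise ValueError("resin mass must be positive")
    return alpha * m_standard_mg / resin_mass_mg


# A hypothetical angle of 20 measured on 75 mg of resin corresponds to 40
# at the 150 mg standard under this assumption.
print(mass_corrected_angle(20.0, 75.0))  # 40.0
```

If the measured relationship has a non-zero intercept, the fitted line from the calibration experiments would have to replace the plain ratio.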
Next, individual outlier peaks from temperature differences, originating from tailored synthesis conditions for the coupling of sensitive or racemization–prone amino acids (e.g., cysteine and histidine couplings) had to be filtered out to improve the detection of permanent increases in peak angle. These isolated “spikes” in the UV–vis data can mislead aggregation detection methods as they introduce a point with increased peak angle owing to the reduced reaction rate and diffusion. For couplings and deprotections performed outside of a ±20% window of the mean temperature of the whole synthesis, the value of the peak angle was therefore replaced by the mean of the two closest neighbors within the temperature range (see ESI2.4†). After applying the developed standardization functions, the unified data set was used to investigate analytical methods to define and characterize aggregation (e.g., point of onset and severity). Two methods were developed, both with different scopes and limitations: method A involves fitting of a sigmoid function onto the peak angle trace (Fig. 3C). The sigmoid was implemented because of its monotonically increasing characteristic whereby its point of inversion would be fitted onto the onset of aggregation. Its advantage and disadvantage both lie in its simplicity: outliers throughout the synthesis are ignored, avoiding their influence on the aggregation detection, and a relative value is assigned to the extent of aggregation. However, it is not suitable to detect multiple aggregation events within the same synthesis and is also misled by gradually increasing peak angles. Method B (Fig. 3D) is a pointwise summation of the slope of a peak angle with respect to all other peak angles divided by the peptide length. The major advantage of this method lies in its capability to detect multiple aggregation events, while its disadvantage is increased sensitivity to sharp non-permanent increases in angle, leading to false aggregation detection.
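The temperature filter described above can be sketched as follows. This is one possible reading, with two assumptions stated plainly: "closest neighbors" is taken to mean the two in-range cycles nearest in sequence position (at the ends of a synthesis both may lie on the same side), and the ±20% window is applied to temperatures on whatever scale they are supplied in. The function name is hypothetical.

```python
from statistics import mean


def replace_temperature_outliers(angles, temps, window=0.20):
    """Replace peak angles recorded outside mean(temps) * (1 +/- window)
    by the mean of the two nearest in-range peak angles."""
    t_mean = mean(temps)
    low, high = t_mean * (1 - window), t_mean * (1 + window)
    in_range = [low <= t <= high for t in temps]
    valid = [i for i, ok in enumerate(in_range) if ok]

    cleaned = list(angles)
    for i, ok in enumerate(in_range):
        if ok:
            continue
        # Two in-range cycles closest to position i (ASSUMED reading).
        nearest = sorted(valid, key=lambda j: abs(j - i))[:2]
        if nearest:
            cleaned[i] = mean(angles[j] for j in nearest)
    return cleaned


# Hypothetical trace: the third coupling was run at a lower temperature.
angles = [10.0, 12.0, 30.0, 14.0, 16.0]
temps = [90.0, 90.0, 60.0, 90.0, 90.0]
print(replace_temperature_outliers(angles, temps))
# [10.0, 12.0, 13.0, 14.0, 16.0]
```

Replacing rather than deleting the point keeps the trace at one value per coupling cycle, which the subsequent aggregation detection relies on.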
The ideal choice of aggregation detection method is dependent on the envisaged application: method A performs better for smaller, noisier datasets with simpler peptides, whereas higher-quality datasets with longer peptides could be preferably analyzed using method B. In the next step, the developed methods were applied to investigate the impact of various reaction parameters on aggregation.
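To make method B more concrete, the sketch below implements one plausible reading of the pointwise slope summation. It is an assumption, not the published formula: for each position i, the slopes (α_j − α_i)/(j − i) to every other position j are summed and divided by the number of peaks, which is taken to equal the peptide length.

```python
def pointwise_slope_scores(angles):
    """Score each position by the summed slope to all other positions,
    divided by the number of peaks (ASSUMED to equal the peptide length)."""
    n = len(angles)
    scores = []
    for i in range(n):
        total = sum((angles[j] - angles[i]) / (j - i)
                    for j in range(n) if j != i)
        scores.append(total / n)
    return scores


# Hypothetical peak angle trace with a permanent step after the third cycle.
trace = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
scores = pointwise_slope_scores(trace)
print([round(s, 3) for s in scores])
```

In this toy trace the score is highest at the two positions flanking the step and falls off towards both ends, while a flat trace gives zero everywhere. A single sharp spike would also raise the score of its neighbors, which matches the sensitivity to non-permanent increases noted for method B.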
Next, we investigated the impact of the total resin volume and the total number of deprotected sites on peak broadening. We previously determined that different amounts of resin (same resin loading) lead to a linear increase of the peak angle; however, it was not clear if this increase originates from an increased amount of Fmoc from the high-loading resin or from increased diffusion through the larger resin volume (see Fig. 3). We therefore synthesized Barstar[75–90] on a 1:1 mixture of SPPS resin with an unreactive, capped resin to give an average loading of ≈0.20 mmol g−1 (150 mg resin, 30.0 μmol) (Fig. 4; reactor 3). Reactors 1 and 3 have the same loading on individual resin particles and the same resin volume, but a different number of “reactive sites”. Interestingly, the peak angle determined by in-line UV–vis analysis shows identical results for high-loading and “mixed” loading resin, indicating that the total number of reactive sites with the same loading does not have a significant impact on peak broadening, whereas resin volume does. As determined during the development of the mass correction factor, this effect is only observed by in-line UV–vis analysis and has no impact on aggregation.
Fig. 5 Test of the sequence vs. parameter dependence of the aggregation characteristics of peptides. A) Temperature does not change the point of aggregation but alters its synthetic effect: the lower the temperature the more significant the drop in coupling efficiency after onset of aggregation; this also translates to synthetic purity (see ESI3.3†). B) Demonstration of sequence dependence of aggregation: 68 amino acid-long non-aggregating microprotein NBDY was synthesized on AFPS and batch-SPPS. Both syntheses show similar side-product profiles by LCMS and UHPLC.
The transferability of results from flow-SPPS to batch-SPPS (at room temperature) was investigated next. To this end, a non-aggregating protein (NBDY, 68 amino acids, see Fig. 5B)21 and a short aggregating sequence (Barstar[75–90], see ESI†)20 were prepared by both AFPS and batch-SPPS. As expected, AFPS synthesis of NBDY results in high crude purity (63%, see ESI3.4†) due to the lack of aggregation. Strikingly, batch-SPPS of NBDY using standard methods (coupling conditions: 5 eq. amino acid, 23 °C, 30 min) also showed excellent crude purity (66%). For the aggregating peptide Barstar[75–90], both batch-SPPS and AFPS syntheses show almost identical side-product profiles (see ESI3.4†), resulting in crude purities of ∼55%. Overall, these results support the notion that the onset of aggregation mostly depends on the sequence, and not on the synthesis method (batch vs. flow).3,4,7 Insights obtained from in-line UV analysis in the AFPS can therefore directly be translated to the more common method of batch-SPPS.
Fig. 6 New data processing method tolerates saturated and non-baseline resolved signals. A) i) Method for extrapolating oversaturated UV signals: the maximum value is removed, the peak is mirrored at its median, and, as for non-saturated peaks, a Gaussian function is fitted. All following steps are identical to those shown in Fig. 2. ii) Difference between analysis of increasingly oversaturated deprotection peaks: the in-line collected UV–vis signal is artificially trimmed at gradually decreasing percentages of the tallest peak. R2 is computed between the original and the oversaturated UV signal for both the peak angle and the aggregation factor. At only 20% oversaturation, R2 of the aggregation factor is significantly decreased compared to the unsaturated signal. See ESI2.8† for an example of an artificially oversaturated signal. B) Reduction of washing volume: i) reduced washing volume results in decreased baseline resolution, as indicated by the gray areas on the smaller plot (C: coupling, W: washing, D: deprotection). Effect of DMF washing volumes on peptide purity (measured by UHPLC@214 nm): reduction to 16 or 12 mL does not affect crude purity. The plot in the bottom right corner shows the significant change in the baseline resolution of the deprotection, which is the source of the data loss. ii) The peak angle shows consistent results across all applicable volumes; iii) despite similar synthetic efficiency at 12 mL washing volume per coupling cycle, the aggregation factor fails to reliably detect aggregation owing to the unresolved baseline.
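The robustness test in panel A ii can be mimicked in a few lines. The sketch below is illustrative only: it shows how a trace might be artificially trimmed at a fraction of its tallest point, and how an R2 value between a metric computed on the original and on the trimmed signal might be obtained. R2 is assumed here to be the coefficient of determination, 1 − SS_res/SS_tot; the authors may have used a different definition such as the squared correlation coefficient. Both function names are hypothetical.

```python
def clip_trace(trace, fraction_of_max):
    """Mimic detector saturation by capping the trace at a fraction
    of its tallest point."""
    ceiling = max(trace) * fraction_of_max
    return [min(value, ceiling) for value in trace]


def r_squared(reference, estimate):
    """Coefficient of determination of `estimate` against `reference`
    (ASSUMED definition: 1 - SS_res / SS_tot)."""
    mean_ref = sum(reference) / len(reference)
    ss_tot = sum((r - mean_ref) ** 2 for r in reference)
    ss_res = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 1.0 - ss_res / ss_tot


# Toy trace capped at 50% of its maximum.
print(clip_trace([0.0, 1.0, 4.0, 1.0, 0.0], 0.5))  # [0.0, 1.0, 2.0, 1.0, 0.0]
```

In such a test, `reference` would hold the per-cycle peak angles (or aggregation factors) from the original signal and `estimate` the values recomputed from the trimmed signal.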
Using the peak angle also removes the requirement for baseline resolution, which significantly reduces the amount of solvent used in AFPS and decreases total synthesis time. In the past, excessive quantities of DMF (32 mL) were required in the washing steps between couplings and deprotections to maintain good data quality. It was unclear, however, if these extended washing steps were required for synthesis success. Before comparing the two data processing methods, the effect of washing volume reduction on the purity of the synthesized peptide was tested: reducing the washing volume by 50% (to 16 mL DMF) and even 63% (to 12 mL DMF) during the synthesis of the aggregating test peptide Barstar[75–90] does not impact purity; a loss in purity occurs only at a 75% reduction (8 mL DMF) (Fig. 6Bi). We next compared the capability of the peak angle and the aggregation factor to capture peak broadening during these syntheses. At 50% solvent reduction, both peak angle and aggregation factor have sufficient accuracy. With 12 mL of washing volume, the aggregation factor could no longer capture aggregation accurately (Fig. 6Biii), while the peak angle retains trends similar to those observed with higher washing volumes (Fig. 6Bii). Through these reduced washing steps, overall DMF consumption of a synthetic cycle was decreased by 50%, and the synthesis time by approximately 33%, while retaining peptide crude purity and efficiency of UV analysis.
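The percentages quoted for the washing volumes follow directly from the 32 mL baseline, as this short check shows (the helper name is illustrative):

```python
BASELINE_WASH_ML = 32.0  # washing volume previously used per step


def percent_reduction(volume_ml: float,
                      baseline_ml: float = BASELINE_WASH_ML) -> float:
    """Reduction of the washing volume relative to the baseline, in percent."""
    return 100.0 * (1.0 - volume_ml / baseline_ml)


for volume in (16.0, 12.0, 8.0):
    print(f"{volume:g} mL -> {percent_reduction(volume):.1f}% less DMF")
# 16 mL -> 50.0% less DMF
# 12 mL -> 62.5% less DMF
# 8 mL -> 75.0% less DMF
```

The 12 mL condition is therefore a 62.5% reduction, reported as 63% in the text.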
A systematic investigation of parameters provides additional insight into their effect on aggregation. We were not only able to transfer the statements published by Kent4 and Milton3 on aggregation from batch- to flow-SPPS but also gained additional insights. Aggregation decreases coupling efficiency, resulting in lower crude purity: the non-aggregating NBDY[53–68] had a 10–15% decline in the deprotection peak integrals, but the aggregating peptide Barstar[75–90] showed a decline of 30–40%, which directly translates to their crude purity (see ESI†). It was furthermore determined that aggregation is independent of synthetic strategy, conditions, or the amino acid equivalents used, but is mainly sequence- and loading-dependent (Fig. 5B). Finally, we determined that synthesis temperature (in addition to accelerating reaction kinetics) almost exclusively had an impact on coupling efficiency past the onset of aggregation. Taking these results into consideration, the largest impact on solving “difficult sequences” in SPPS is expected from understanding the contribution of individual amino acid building blocks; however, owing to the large sequence space, this will require a large amount of data.
Organic chemistry data sets for large-scale analysis are scarce, and there is currently a disconnect between the collection of data by synthetic organic chemists and computational scientists using these analytical data. While experimental chemists optimize their workflows for ideal reaction outcomes (minimized reagents and reaction times, non-standardized analytics, lack of negative data), computational scientists require standardized, “interpretable” analytical data. Advanced data analysis methods that make use of seemingly low-quality data are therefore needed to collect a dataset that is sufficient in diversity and size. We demonstrated the importance of processing and analysis methods for the improvement of reaction time, reagent consumption, and the identification of challenging couplings. In the future, several potential applications can be envisaged: 1) expansion to other analytical techniques such as resin volume monitoring,22 IR,23 or refractive index24 for peptide chemistry, as these are either yet to be adapted or widely used. 2) Investigation of other sequence-defined polymer synthesis methods (e.g., for polysaccharides or oligonucleotides). Once established, advanced in-line analytical methods furthermore hold the potential for real-time optimization in flow thus eliminating the need for sequence-dependent trial and error optimization campaigns.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3re00494e. The Python code is available at: https://github.com/Hartrampf-Lab/AggregationAnalysis.
‡ These authors contributed equally to the project.
This journal is © The Royal Society of Chemistry 2024