Gökçe Geylan* (a, b), Mikhail Kabeshov (a), Samuel Genheden (a), Christos Kannas (a), Thierry Kogej (a), Leonardo De Maria (c), Florian David (b) and Ola Engkvist (a, d)
(a) Molecular AI, Discovery Sciences, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden. E-mail: gokcegeylan96@gmail.com
(b) Division of Systems and Synthetic Biology, Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
(c) Medicinal Chemistry, Research and Development, Respiratory & Immunology, BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
(d) Department of Computer Science and Engineering, Chalmers University of Technology, University of Gothenburg, Gothenburg, Sweden
First published on 8th September 2025
Incorporating non-natural amino acids (NNAAs) into peptides enhances therapeutic properties, including binding affinity, metabolic stability, and in vivo half-life. The pursuit of novel NNAAs for improved peptide designs faces the challenge of effectively synthesizing both these building blocks and the entire peptide. Solid-Phase Peptide Synthesis (SPPS) is an essential technology for the automated assembly of peptides with NNAAs, necessitating careful protection for effective coupling of amino acids in the peptide chain. This process requires orthogonal protection of the reactive groups in individual amino acids after synthesizing them, presenting a challenge in bridging in silico peptide design with chemical synthesis. To address this, we have developed a first-of-its-kind synthesis assistance tool, NNAA-Synth, that plans and evaluates the synthesis of individual SPPS-compatible NNAAs. Our tool unifies (i) introducing orthogonal protecting groups to NNAAs, (ii) retrosynthetic prediction to propose synthesis routes, and (iii) scoring the synthetic feasibility of these routes. We demonstrate how the tool facilitates optimal protection strategy selection for individual NNAAs. Additionally, it enables synthesizability-aware NNAA ranking and prioritization during computational screening, enhancing the quality of the in silico design by assessing the accessibility of individual building blocks.
To meet these design requirements, it is essential to explore the peptide or protein chemical space exhaustively. The traditional sequence space is vast due to the combinatorial possibilities of the proteinogenic building blocks composed of 20 natural amino acids.7 Although the large sequence space can offer potential drug candidates for molecule optimization, enhancing peptide properties in particular relies heavily on the incorporation of non-natural amino acids (NNAAs) in drug discovery and development projects.5 NNAAs expand the chemical diversity of peptides to a space theoretically similar to that of small molecules, with unique functional groups, varying sidechains, and backbone modifications.9 Common modifications to natural amino acids such as backbone N-methylation or stereocenter inversion are known to improve the oral bioavailability and permeability of peptidic drugs.10 More complex amino acid designs, such as novel sidechains that introduce hydrogen bonding between sidechain and backbone, have been incorporated into peptides to improve their passive diffusion through cell membranes and to enhance their solubility.10,11 These design strategies have recently been explored by generative models. We have developed PepINVENT, a generative model for peptide design.12 PepINVENT navigates the vast space of natural and non-natural amino acids to propose valid, novel, and diverse peptide designs.12
The novel designs and complex derivatives of commercially available amino acids introduce synthesis challenges when considering NNAAs for peptide optimization. To bridge in silico designs to the wet lab, careful consideration of synthesis planning is required. Solid-phase peptide synthesis (SPPS) is an efficient method that enables sequential addition of amino acids to a growing peptide chain anchored to a solid support.13 A peptide containing one or more NNAAs can be chemically synthesized using SPPS techniques, provided all the amino acids are available.13 The reactive fragments in the building blocks, i.e. the amino acids of the peptide of interest, need to be protected before being incorporated into the SPPS process. The amino acids must be protected with orthogonal groups on the backbone and sidechain to ensure the correct assembly of the peptide chain.13,14 During the peptide synthesis, the protection groups are selectively removed and the exposed groups are used to elongate the chain in the correct sequence by coupling reactions. The cycle of deprotection and coupling continues until the peptide sequence is completed, and the process is finalized with the removal of all protection groups from the chain.15 This deprotection and coupling strategy minimizes side reactions by ensuring the addition of amino acids to specific positions and with specific connectivity to the peptide chain. SPPS facilitates controlled synthesis of peptides, and the protected versions of many natural amino acids and NNAAs are commercially available from vendors, ready to be plugged into SPPS.13 However, any novel NNAA that is not readily available needs to be chemically synthesized and orthogonally protected in order to be compatible with SPPS. The choice of the protection strategies for the reactive groups in NNAAs can influence the yield of the SPPS-ready building block as well as the coupling reaction during production.15
Synthetic feasibility of novel and protected NNAAs poses a significant challenge owing to complex synthetic routes, the availability of starting materials, and the protection strategy. To address these challenges, we introduce a novel cheminformatics tool, NNAA-Synth, designed to evaluate the synthesizability of protected α-NNAAs. Our tool aims to provide insights into the synthetic feasibility of the protected amino acids through optimal protection strategies, retrosynthetic planning of the NNAAs and deep learning (DL)-based feasibility scoring. By approaching the chemical synthesis and protection challenges of NNAAs as a single, integrated problem, NNAA-Synth provides synthesis solutions to streamline the production of SPPS-compatible amino acids and to inform the design process with building block accessibility. To our knowledge, no existing solution combines protection, synthesis planning and feasibility assessment into an all-in-one tool for NNAA design and synthesis planning. In this paper, we demonstrate potential use cases for our tool with respect to (i) selecting the most synthetically feasible NNAA from a set of options to help medicinal chemists design drug candidates, and (ii) choosing the optimal protection strategy for a novel NNAA for efficient incorporation into peptide synthesis.
Fig. 1 Depiction of three NNAAs with their respective reactive groups annotated. The amino acids are labeled with their names as specified in Amarasinghe et al.9
Each reactive group identified by the SMARTS patterns was associated with one or more protection groups (Fig. S1). To obtain broad coverage of the large chemical space of α-amino acids, four classes of mutually orthogonal protecting groups were combined (Table S2), each of which can be cleaved by a distinct method: acid, base, hydrogenation, oxidation or fluoride. This strategy enables stepwise, independent exposure of the desired functional groups without protecting group interference during deprotection. In the amino acid backbone, carboxyl termini were masked with tert-butyl esters (tBu), which are rapidly released by strong acids such as trifluoroacetic acid. In contrast, amino termini were protected as fluorenylmethoxycarbonyl (Fmoc) carbamates, which can be selectively removed under basic, non-nucleophilic conditions with piperidine.15,17 This mirrors the classical Fmoc/tBu regime used in SPPS yet remains compatible with solution-phase couplings.17,18 In the sidechain, a range of protection groups was introduced to allow for selective deprotection under conditions orthogonal to those used for deprotecting the backbone. Benzyl-based groups (benzyl (Bn) for acids/alcohols and 2-chlorobenzyloxycarbonyl (2ClZ) for amines) are removed by hydrogenolysis, making them entirely orthogonal to both acid and base lability and stable to most peptide coupling reagents.17 For sidechain hydroxyls or thiols that must survive the foregoing manipulations, oxidatively labile p-methoxybenzyl (PMB) ethers and sulfides are introduced. These can be cleanly detached with DDQ, leaving all benzyl-type protections untouched.17 Finally, the trimethylsilyl-ethyl (TMSE) esters and ethers, when used to protect sidechain acids or alcohols, can be selectively removed with tetrabutylammonium fluoride (TBAF), while withstanding acid, base, hydrogenation and oxidative conditions.
Overall, strategic permutation of these protecting group classes gives control over the deprotection sequence during the assembly and modification of complex peptides and accommodates diverse sidechain functionalities.15,19,20 This approach enables late-stage diversification of densely functionalized NNAA scaffolds while preserving sensitive γ- or δ-heteroatoms. For example, an amino acid whose backbone is protected by Fmoc/tBu and whose sidechain reactive moieties are protected by Bn/2ClZ, PMB and TMSE can undergo deprotection in the order base → hydrogenolysis by H2/Pd → oxidation by DDQ → fluoride-mediated cleavage by TBAF, respectively (Table S2). The suggested algorithm follows the principles articulated in the protecting group monographs of Greene and Wuts and of Kocienski, and it has been exploited recently in modular syntheses of non-canonical residues for ribosomal incorporation, peptidomimetics and macrocycle design.17,19–24
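The orthogonality scheme described above can be sketched as a small lookup table. This is a minimal illustration and not the NNAA-Synth implementation: the names `CLEAVAGE`, `STEP_ORDER`, `deprotection_sequence` and `shares_condition` are hypothetical, the group-to-condition pairs are taken from the text, and placing the acid-labile tBu step last is an assumption of this sketch, since the stated order only covers base, hydrogenolysis, oxidation and fluoride.

```python
# Protecting group -> (cleavage condition, example reagent), as described in the text.
CLEAVAGE = {
    "Fmoc": ("base", "piperidine"),
    "tBu": ("acid", "trifluoroacetic acid"),
    "Bn": ("hydrogenolysis", "H2/Pd"),
    "2ClZ": ("hydrogenolysis", "H2/Pd"),
    "PMB": ("oxidation", "DDQ"),
    "TMSE": ("fluoride", "TBAF"),
}

# Order stated in the text: base -> hydrogenolysis -> oxidation -> fluoride.
# Appending "acid" at the end is an assumption made for this sketch only.
STEP_ORDER = ["base", "hydrogenolysis", "oxidation", "fluoride", "acid"]


def deprotection_sequence(groups):
    """Return the cleavage conditions needed for `groups`, in STEP_ORDER."""
    needed = {CLEAVAGE[group][0] for group in groups}
    return [condition for condition in STEP_ORDER if condition in needed]


def shares_condition(group_a, group_b):
    """True if both groups are removed under the same condition (not orthogonal)."""
    return CLEAVAGE[group_a][0] == CLEAVAGE[group_b][0]


sequence = deprotection_sequence(["TMSE", "Fmoc", "PMB", "Bn"])
```

In this sketch, Bn and 2ClZ fall into the same hydrogenolysis class, so they are not orthogonal to each other, whereas Fmoc and tBu are.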
The library of NNAAs, protected with various strategies, was input to AiZynthFinder, which returns the predicted synthetic routes. Expansion and filter models trained on data from the United States Patent and Trademark Office (USPTO)28 were used. The publicly available eMolecules building block set16 and the selected protection groups were supplied as the library of purchasable starting materials. The maximum search depth was set to 15 to allow a comprehensive exploration of potential synthetic routes. A key aspect of the runs involved applying a filter strategy to restrict the decomposition of bonds in the extensive protective groups with complex substructures such as TBDMS, Fmoc, DNP, TMSE, tBuS, Tos, DPSide. Freezing the protection groups as substructures enables them to be included in the predicted routes as purchasable starting materials, rather than requiring synthesis from scratch. The remaining parameters were set to their default values. All reactions in the proposed routes were annotated with their reaction classes using the NameRxn software.29
Chemformer, a transformer-based model pre-trained on SMILES, was fine-tuned for both product prediction from reactants (forward Chemformer) and reactant prediction from products (backward Chemformer) using 18.7 million public and AstraZeneca proprietary reactions.30,32 These models were used to assess round-trip accuracy by measuring the consistency between forward and backward predictions in retrosynthetic analysis. Having demonstrated high round-trip accuracy in both single-step (above 0.97) and multistep retrosynthesis with unseen reactions, Chemformer is suggested for use in a mixed-policy approach with the template-based AiZynthFinder for route evaluation.30 The product prediction model, or forward Chemformer, was utilized as a feasibility filter that excludes chemically implausible transformations by generating a probability-based score for individual reactant(s)-product pairs in a route.30 Each score was calculated by predicting the product at each step in the retrosynthetic tree, using the corresponding reactants provided by AiZynthFinder as input.30 For chlorination and bromination reactions, the output from AiZynthFinder was augmented with chlorine or bromine, respectively, if it was missing, because not all retrosynthesis templates generate the halogenation reagent. Chemformer, with a beam size of 10, predicted a set of potential products for the given reactants and assessed whether the true product, i.e. the one in the AiZynthFinder prediction, appeared in this set.30 If found, the probability of the prediction became the score; if not, the reaction was assigned a score of zero. Subsequently, scores across all steps in a route were aggregated into a single score by multiplying them. Routes with a feasibility score greater than zero proceeded to the second round of assessment, while those scoring zero were considered synthetically infeasible and excluded from further analysis.
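The step-wise scoring and multiplicative aggregation described above can be illustrated with a short sketch. It does not call Chemformer: the beams are passed in as plain lists of (product SMILES, probability) pairs, the function names `step_score` and `route_score` are hypothetical, and comparing SMILES by string equality assumes both sides were canonicalized in the same way.

```python
from math import prod


def step_score(beam, true_product):
    """Score one reaction step.

    beam: list of (predicted_product_smiles, probability) pairs, e.g. the
          top-10 forward predictions for the step's reactants.
    Returns the probability of the prediction matching the expected product,
    or 0.0 if the expected product is absent from the beam.
    """
    for predicted, probability in beam:
        if predicted == true_product:
            return probability
    return 0.0


def route_score(steps):
    """Multiply step scores; a single failed step makes the route score zero."""
    return prod(step_score(beam, product) for beam, product in steps)


# Illustrative, made-up beams for a two-step route.
feasible_route = [
    ([("CCO", 0.9), ("CCN", 0.05)], "CCO"),
    ([("CC(=O)O", 0.5), ("CC=O", 0.3)], "CC(=O)O"),
]
broken_route = feasible_route + [([("C", 0.8)], "CC")]
```

Because the scores are multiplied, longer routes tend to receive lower scores even when every step is individually plausible, which is consistent with the low mean Chemformer-based scores reported for multi-step routes.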
The second and final route scorer was another DL method, developed by Guo et al.,31 that is informed by chemists' expert assessments of multi-step synthesis routes. This expert-augmented method provides an overall assessment of route feasibility by combining reaction-level, route-level, and target molecule-based descriptors. The model utilized 47303 historical synthetic routes from the Journal of Medicinal Chemistry as reference routes and AiZynthFinder-generated routes for the same target molecules as proposed routes.31 The model was trained to predict the distance between proposed and reference routes based on embeddings of three features: (i) the route, i.e. cost, reaction complexity, and precursor availability, (ii) the reactions, i.e. a statistical assessment of reaction feasibility based on proprietary data and structural difference fingerprint encodings, and (iii) the target molecule represented with Morgan fingerprints.31 The distance-based scoring was further refined into an augmented score by incorporating route length, with weights determined by fitting the scores to expert insights on feasibility for a subset of proposed routes.31 The resulting score was shown to agree with expert opinion, achieving a Pearson correlation coefficient of 0.92, and was therefore used as a proxy for it.31 This score was then used as the route quality, or feasibility, score and was categorized into “Good”, “Plausible”, and “Bad” routes based on ratings from the experts.31
Following the reactive group annotation, the NNAAs were protected using a custom mapping of reactive functional groups to protection groups. As most amino acids (approximately 7000) lacked reactive functional groups in their sidechains, the protected building blocks consisted primarily of backbone-decorated molecules. In contrast, the remainder of the library, around 3000 NNAAs, required sidechain protection and was subjected to multiple strategies, generating a protected series for each individual amino acid (Fig. 4B). In total, 9985 NNAAs were expanded into 15508 protected residues for retrosynthesis planning.
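The expansion of one NNAA into several protected variants can be sketched as a Cartesian product over the options available for each sidechain reactive group. The mapping below is only an illustrative subset built from the groups named in the text (Bn, 2ClZ, PMB, TMSE); the full mapping used by the tool is given in Fig. S1 and is not reproduced here, and the names `SIDECHAIN_OPTIONS` and `enumerate_strategies` are hypothetical.

```python
from itertools import product

# Backbone protection is fixed: Fmoc on the amine, tBu on the carboxylic acid.
BACKBONE = ("Fmoc", "tBu")

# Illustrative subset of the reactive-group -> protecting-group mapping.
SIDECHAIN_OPTIONS = {
    "carboxylic_acid": ["Bn", "TMSE"],
    "alcohol": ["Bn", "PMB", "TMSE"],
    "thiol": ["PMB"],
    "amine": ["2ClZ"],
}


def enumerate_strategies(sidechain_groups):
    """List every protection strategy for the given sidechain reactive groups.

    Each strategy is a tuple: the backbone groups followed by one protecting
    group per sidechain reactive group, in the order the groups were given.
    """
    choices = [SIDECHAIN_OPTIONS[group] for group in sidechain_groups]
    return [BACKBONE + combo for combo in product(*choices)]


no_sidechain = enumerate_strategies([])
acid_sidechain = enumerate_strategies(["carboxylic_acid"])
```

An amino acid without reactive sidechain groups yields a single backbone-only strategy, which matches the observation that most protected building blocks are backbone-decorated only. This sketch does not check whether two sidechain groups end up sharing a cleavage condition.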
Metrics | Protected NNAAs | NNAAs with optimal protection |
---|---|---|
Total number of molecules | 15508 | 9985 |
Infeasible molecules | 4435 | 1692 |
Starting material availability | 89.26 ± 10.40 | 89.19 ± 11.27 |
Number of reactions | 8.95 ± 4.92 | 8.60 ± 4.98 |
Chemformer-based feasibility score | 0.05 ± 0.15 | 0.07 ± 0.16 |
Expert-augmented feasibility score | 8.63 ± 5.48 | 7.86 ± 5.40 |
While the entire protected NNAA library and the subset of this library encompassing the NNAAs with their optimal protection strategy exhibit similar distributions in route composition, the expert-augmented feasibility score demonstrates better outcomes for the subset, as reflected by lower scores (Table 1). This suggests that the synthetic challenge, influenced by the increased molecular complexity from building block protection, was reflected not directly in the route composition but in the final multi-parameter feasibility assessment of individual reactions. The multi-parameter scoring system emphasized route quality by considering, but not prioritizing, factors such as route length, reaction types or starting material availability. As a result, 40% of the protected NNAAs had routes with fewer than four reactions, compared to 33% of the individual NNAAs for which such routes were proposed. In contrast, when assessing the most feasible routes for protected molecules, 20% were classified as “solved”, i.e. all starting materials were commercially available, compared to 25% for individual NNAAs, demonstrating an inverse relationship.
Finally, 4435 protected molecules were assessed to be completely synthetically infeasible, corresponding to 1692 of the NNAAs that cannot be synthesized in the protected form (Table 1). These infeasible NNAAs can thus be excluded when conducting in silico screening methods for a better candidate selection.
The NNAA named 2G6 demonstrates how the transition from NNAA to SPPS-compatible NNAA can be made through the synthesizability assessment. 2G6 can be protected with four different sets of protection groups (Fig. 5A). NNAA-Synth uses the components in its pipeline to protect the NNAA and then to rank the protection strategies by their synthetic feasibility (SF) score. One of the protection strategies, (Fmoc, tBu, TMSE), was deemed infeasible, as its route received the highest expert-augmented feasibility (SF) score, 19.91, and was not solved to commercially available starting materials (Fig. 5A and S4). Two of the protection strategies, (Fmoc, tBu, Bn) and (Fmoc, tBu, PMB), had moderate feasibility with scores of 5.12 and 7.13, respectively (Fig. 5A and S4). Finally, protecting the amine and carboxylic acid groups in the backbone with Fmoc and tBu, respectively, and the heteroaromatic acid of a five-membered ring in the sidechain with allyl was the most synthetically feasible strategy for 2G6, achieving the lowest SF score (Fig. 5A). The score was also below 5, indicating “good” feasibility. Although 2G6 with (Fmoc, tBu, Bn) and with (Fmoc, tBu, allyl) both had 3-step syntheses with all starting materials available in the eMolecules stock, the SF score made it possible to distinguish their route qualities. In addition, all three steps of the best suggested route (esterification, carboxylic acid allylation and Buchwald–Hartwig cross-coupling) belong to established, broadly used reaction classes and are very similar to successful syntheses reported in the literature.33–35 The routes with moderate feasibility, despite having a feasible disconnection strategy, have less optimal reactants and issues with regioselectivity. The proposed route could then be followed to synthesize the NNAA and to incorporate it into the building block library of SPPS (Fig. 5B).
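Selecting a protection strategy from SF scores amounts to a sort in which lower scores are better. The sketch below uses only the three scores reported numerically in the text for 2G6; the allyl strategy is left out because its value is only described as the lowest and below 5. The function name `rank_strategies` is hypothetical.

```python
def rank_strategies(sf_scores):
    """Sort (strategy, SF score) pairs from most to least feasible.

    Lower SF scores indicate more feasible routes.
    """
    return sorted(sf_scores.items(), key=lambda item: item[1])


# SF scores reported in the text for three of the four 2G6 strategies.
reported_2g6_scores = {
    ("Fmoc", "tBu", "TMSE"): 19.91,
    ("Fmoc", "tBu", "Bn"): 5.12,
    ("Fmoc", "tBu", "PMB"): 7.13,
}

ranking = rank_strategies(reported_2g6_scores)
best_of_reported, best_score = ranking[0]
```

Among these three, (Fmoc, tBu, Bn) ranks first; in the full comparison the allyl-based strategy would be placed ahead of it.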
A 16-mer peptide derived from Nrf2, containing the 77DxETGE82 motif within the Neh2 domain, was identified for its high binding affinity to Keap1 with a solution dissociation constant (K_D^solution) of 23.9 nM.37,38 Subsequently, a 9-mer peptide, isolated from the 16-mer peptide, was shown to be the minimal peptide sequence mimicking the binding interaction of Nrf2 with Keap1 (Fig. 6A).37 The 9-mer peptide, with the sequence 76LDEETGEFL84, exhibited moderate affinity to Keap1 with a K_D^solution of 352 nM.37
Fig. 6 (A) The binding site of the Keap1-Neh2 domain of Nrf2 structure. The complex was adapted from PDB ID: 2FLU.39 The VS experiment informed by the synthetic feasibility of the screened NNAAs is visualized for the three docked positions: L76 in (B, E, H), D77 in (C, F, I), and E78 in (D, G, J). For each position, (B–D) show the docking scores of NNAAs plotted against ΔG MM-GBSA, (E–G) the same scores colored by the synthetic feasibility scores, and (H–J) all three scores, with the synthetic feasibility scores projected onto the third dimension.
Amarasinghe et al.9 focused on enhancing the binding affinity of the 9-mer peptide through a large-scale virtual screening (VS) campaign. They conducted VS by mutating the first three positions (Leu76, Asp77, and Glu78) of the peptide in the peptide-Keap1 complex (PDB ID: 2FLU).9,39 Each residue was individually mutated to all other natural amino acids as well as to the library of 10000 NNAAs, which also served as the foundational data for our study.9 Docking and molecular mechanics generalized Born surface area (MM-GBSA) scores were calculated for each mutated peptide (Fig. 6B–D).9 NNAAs with superior docking scores compared to their natural counterparts were identified as candidates for affinity improvement. While the study establishes an enumerated NNAA library to enhance peptide binding affinity, it does not investigate the practical synthesizability of these candidates for wet lab applications.
In this study, we extend this peptide site-specific mutagenesis study to include synthetic planning assessment (Fig. 6E–G). The SF score generated by our tool was integrated as an additional component in selecting candidate NNAAs (Fig. 6H–J). This incorporation allows for the selection of the NNAAs to potentially improve the peptide binding, informed by the synthesizability. The majority of the successfully docked residues showed “good” feasibility when protected according to the SF score (Table S3). Although the NNAA library was enumerated from eMolecules,16 approximately 30% of the docked NNAAs were assessed to be not SPPS-compatible as none of the potential protection strategies were deemed to be synthetically feasible.
Automated synthesis planning enables the assessment of proposed routes for the NNAAs chosen for wet lab testing to potentially improve peptide-Keap1 binding. Key qualities evaluated include the SF score, the availability of route precursors in stock, and the number of reaction steps required to complete the synthesis. Considering these features, the optimal protection strategy was selected for each building block. Among the selected protected NNAAs, 2, 91 and 17 were synthetically suitable, with fewer than three reaction steps, for mutating our peptide of interest at positions Leu76, Asp77, and Glu78, respectively (Fig. 7). Moreover, 2, 15, and 0 of those NNAAs were either already in stock or can be synthesized in a single reaction step from a similar starting material in stock for positions Leu76, Asp77, and Glu78, respectively (Fig. 7). Asp77 was shown to be the most mutatable position, with the highest number of synthesizable NNAAs with favorable docking scores. Consequently, the score can be utilized to rank NNAA candidates as well as to prioritize a mutation site in VS as a synthetically accessible position.
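The candidate selection described above can be approximated by a simple filter over per-candidate route properties, followed by a count per mutation site. This is a sketch under stated assumptions: the `Candidate` fields, the function names and the example entries are hypothetical, the step limit follows the "fewer than three reaction steps" criterion in the text, and the SF cutoff of 5 is borrowed from the "good" feasibility level mentioned for 2G6 rather than from a documented selection rule.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    position: str         # mutated residue, e.g. "Asp77"
    docking_score: float  # more negative is assumed to be better
    sf_score: float       # expert-augmented feasibility, lower is better
    n_steps: int          # reactions in the best route
    solved: bool          # all starting materials in stock


def synthetically_suitable(candidate, max_steps=2, max_sf=5.0):
    """Assumed criteria: solved route, fewer than three steps, SF below 5."""
    return (
        candidate.solved
        and candidate.n_steps <= max_steps
        and candidate.sf_score < max_sf
    )


def suitable_per_position(candidates):
    """Count suitable candidates for each mutation site."""
    return Counter(c.position for c in candidates if synthetically_suitable(c))


# Made-up candidates, for illustration only.
candidates = [
    Candidate("nnaa_1", "Asp77", -9.1, 3.2, 1, True),
    Candidate("nnaa_2", "Asp77", -8.7, 4.8, 2, True),
    Candidate("nnaa_3", "Leu76", -8.9, 2.5, 4, True),
    Candidate("nnaa_4", "Glu78", -9.4, 12.0, 2, False),
]
counts = suitable_per_position(candidates)
```

Counting suitable candidates per site gives a rough view of which position is most accessible synthetically, in the spirit of the Asp77 observation above.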
NNAA-Synth takes an NNAA as the input, protects the reactive groups, and employs retrosynthetic planning software to generate potential synthetic routes. It then utilizes two DL-based scoring methods to assess the feasibility of the protected NNAAs by evaluating these routes. The scoring algorithms are complementary: the Chemformer score validates the chemical plausibility of individual reactant–product pairs, while the expert-augmented score provides an overall route feasibility assessment that evaluates the proposed reaction classes as well as route-level features. The dual scoring enables more reliable route selection and NNAA prioritization. We apply retrosynthesis software and scoring models originally developed for small molecules to protected NNAAs, as they are essentially small molecules themselves.
The tool uses NNAAs enumerated from eMolecules by taking their potential precursors into account. The diverse and extensive collection of NNAAs enables the tool to cover a broad spectrum of reactive substructures, ensuring wide applicability to various NNAAs. It uses a custom heuristic, based on the literature, developed for mapping reactive groups to suitable protective groups in amino acid protection, along with a dedicated cheminformatics tool designed for this purpose. The pipeline was implemented in a modular structure, facilitating easy expansion with new mappings of the reactive fragments to protection groups.
NNAA-Synth can be applied across various peptide-related projects. In peptide library design, it supports building block selection, enabling efficient synthesis during peptide screening for targeted research objectives. The tool also streamlines the choice of appropriate protective strategies for individual amino acids, which is critical for the directed assembly of amino acids during chemical synthesis. Furthermore, it assesses the feasibility of amino acids in peptidomimetics, guiding optimization efforts towards practically synthesizable candidates. By integrating NNAA-Synth as a post-processing step in generative model-driven peptide design, the computer-assisted design-make-test-analyze (DMTA) cycle can be accelerated to propose optimal peptides containing synthesizable NNAAs. While adapting to the diverse demands of peptide research, our tool fundamentally evaluates the synthetic complexity of SPPS-ready NNAAs.
To illustrate our tool in practical applications, we provided two examples. The first example showed the evaluation and ranking of potential protection strategies for a specific NNAA of interest. Because the synthesis of NNAAs must be complemented by appropriate protection to maintain efficient peptide synthesis during SPPS, the protected NNAA is regarded as the target molecule. The most feasible set of protecting groups can then be identified from the corresponding proposed routes using our tool to accomplish the synthesis of SPPS-compatible NNAAs. The second example involved a virtual screening experiment in which a set of NNAAs was docked at multiple positions of a natural peptide to optimize its binding affinity. The docking experiment was reinforced with the synthetic feasibility scores of candidate NNAAs. Selecting NNAAs based not only on VS results but also on synthetic feasibility considerations can enable the prioritization of mutagenesis positions and boost the overall efficiency of hit selection and the drug discovery process.
Including protection in the synthesis challenge of NNAAs addresses the true complexity of utilizing these building blocks in SPPS. Ranking the feasibility of various protection strategies can ensure the seamless integration of NNAAs into peptides. Future studies will focus on expanding the current tool to evaluate the factors affecting SPPS throughput. While the tool evaluates SPPS-compatibility, it does not consider the specific reaction conditions or constraints inherent to the SPPS reaction cycle.40 These include amino acid instability, the influence of the reactive groups on post-synthesis purification, and the impact on the structural integrity of peptides, such as the cyclization efficiency or intramolecular cyclization.41 Thus, a straightforward scoring of the entire peptide, in which the least synthesizable NNAA is treated as the synthetic bottleneck and the most challenging step in the overall peptide synthesis process, does not fully capture the true complexity of peptide synthesis. Additionally, the current version is limited to α-amino acids, even though the current heuristics can be used for other building blocks. While the open-source solution and generalizable methodology allow for easy extension of the reactive substructure-protection group mappings, incorporating specific protection strategies for other building blocks, such as β- or γ-amino acids, could broaden its applicability. These potential improvements require aligning the tool with the evolving landscape of peptide synthesis. This is especially important given the rapid advancements in SPPS technologies aimed at achieving higher throughput and the recent initiatives toward green chemistry with more environmentally friendly reagents.41–43
Another aspect is that the performance of retrosynthesis planning depends on the reactions included in the training data for both AiZynthFinder and the DL-based scoring models. Therefore, augmenting the training data with commonly used reactions for amino acid synthesis and protection could also improve the quality of the proposed routes. Although our tool does not tackle every challenge related to the chemical synthesis of peptides, it represents, to our knowledge, the first computational tool for synthesizability assessment of NNAAs.
This journal is © The Royal Society of Chemistry 2025