Xuan Liu abc, Hongxiang Li abcd and Huimin Zhao *abcde
a NSF Molecule Maker Lab Institute, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA. E-mail: zhao5@illinois.edu
b Department of Chemical and Biomolecular Engineering, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
c Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
d Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
e DOE Center for Advanced Bioenergy and Bioproducts Innovation, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
First published on 28th July 2025
Computer-aided chemoenzymatic synthesis planning integrates the advantages of enzymatic and organic reactions to design efficient hybrid synthesis routes for a target molecule. Existing tools rely on either a step-by-step strategy or a bypass strategy. Here we introduce a synthetic potential score (SPScore) to unify these two strategies. This score is developed by training a multilayer perceptron on existing reaction databases to evaluate the potential of enzymatic or organic reactions for the synthesis of a molecule. We systematically evaluate the effectiveness of the SPScore in both single-step and multi-step hybrid retrosynthesis, demonstrating its strong ability to prioritize promising reaction types. In benchmarking various chemoenzymatic retrosynthesis algorithms guided by the SPScore, we find that an asynchronous search algorithm named ACERetro is more efficient and robust, finding hybrid synthesis routes to 46% more molecules than the state-of-the-art tool on a test dataset of 1001 molecules. We then apply ACERetro to design efficient chemoenzymatic synthesis routes for 4 FDA-approved drugs. We anticipate that the application of the SPScore will provide a new avenue for computer-aided chemoenzymatic synthesis planning, thereby advancing the synthesis of functional molecules.
Computer-aided synthesis planning (CASP) enables massive search and design of synthesis routes for a target molecule by integrating template-based8,9 or template-free10,11 single-step retrosynthesis predictors with a search algorithm.12,13 To achieve computer-aided chemoenzymatic synthesis planning, currently there are two distinct strategies to build a search algorithm (Fig. 1A) including (i) step-by-step14,15 and (ii) bypass.16,17 The step-by-step strategy combines the results from single step enzymatic/organic reaction precursor predictors to build a hybrid synthesis route,14,15 whereas the bypass strategy identifies alternative reaction types in an existing or predicted synthesis route, i.e. identify enzymatic reactions as bypasses to chemical syntheses or vice versa.16,17 Levin et al.'s tool combines precursor prediction results from two reaction template prioritizers trained on separate reaction databases,14 but template prioritizers cannot heuristically identify the bypass without proper alignment as the two prioritizer models are trained separately. Similarly, Sankaranarayanan et al.'s tool employs an exhaustive search to identify biocatalytic opportunities for intermediates in predicted synthesis routes,16 but this tool is challenging to scale in an exponentially growing search space. More recently, Zeng et al.'s step-by-step strategy tool predicts reaction types (organic or enzymatic reaction) with a template-free precursor predictor,15 while Li et al.'s bypass strategy tool builds a reaction type score (RTscore) to distinguish synthesis reactions from decomposition reactions,17 highlighting the importance of heuristic methods for advancing computer-aided chemoenzymatic retrosynthesis. However, Zeng et al.'s method and Li et al.'s method can only be applied within their own step-by-step or bypass strategies. 
Finding a simple and effective way to unify the step-by-step strategy and the bypass strategy can help us further understand the principles behind computer-aided chemoenzymatic synthesis planning and promote the development of efficient hybrid synthesis planning tools. One precedent is that the introduction of synthetic complexity in retrosynthesis can help synthesis planning tools efficiently find concise and feasible synthesis routes.18,19 The context of chemoenzymatic retrosynthesis motivates us to introduce a synthetic potential score (SPScore)—a hypothetical metric describing the suitability of enzymatic and organic reactions for synthesizing a molecule. This score not only bridges the gap between the step-by-step strategy and the bypass strategy but also enhances the step-by-step strategy algorithm, transforming it from a mere replica of traditional retrosynthesis algorithms designed for single reaction types into a more versatile and innovative approach.
In this work, we integrate the above-mentioned two strategies by using the SPScore to prioritize the reaction type (enzymatic or organic) in step-by-step chemoenzymatic synthesis planning (Fig. 1B) and to identify alternative reaction types in given synthesis routes (Fig. 1C). This score was developed by training a multilayer perceptron on existing reaction corpora (Fig. 1D). We evaluated the performance of the SPScore on both single-step and multi-step retrosynthesis and developed an SPScore-guided asynchronous search algorithm for chemoenzymatic synthesis planning. The resulting strategy, named the asynchronous chemoenzymatic retrosynthesis planning algorithm (ACERetro), can identify chemoenzymatic synthesis routes for 46% more molecules than the state-of-the-art tool when using 1001 molecules as a test dataset. To demonstrate its utility, we applied ACERetro to design promising chemoenzymatic synthesis routes for two FDA-approved drugs, ethambutol and Epidiolex. In addition, we applied ACERetro to optimize the synthesis routes of two additional FDA-approved drugs, rivastigmine and (R,R)-formoterol. A user-friendly web interface of ACERetro can be accessed at https://aceretro.platform.moleculemaker.org/search-routes.
Molecules were represented by ECFP4 (ref. 22) (extended connectivity fingerprint, up to four bonds) and MAP4 (ref. 23) (MinHashed Atom Pair fingerprint, diameter d = 4) with three different lengths (length = 1024, 2048, and 4096) and used to train several MLP models. Rather than predicting a binary label indicating the preferred catalysis type, the MLP is trained to generate two continuous values: the synthetic potential score for organic reactions (SChem) and for enzymatic reactions (SBio). These two scores are directly output by the MLP and reflect how favourable each reaction type is for a given molecule. Because the reaction corpus may not cover all possible transformations, formulating this task as a binary classification is not ideal. Instead, we use margin ranking loss as the training objective, which encourages the model to rank the more promising reaction type higher based on relative differences between SChem and SBio (see Methods). This loss function is well-suited for tasks where the goal is to learn a preference between two options. In our case, it allows the model to rank one catalytic option over the other, which better aligns with the decision-making nature of hybrid retrosynthetic planning. The SPScores range from 0 to 1, so they can act as the probability of a molecule being promisingly synthesized by each reaction type. When the difference between two SPScores of a molecule is within the margin, both reaction types are considered promising for the synthesis of that molecule. If a molecule's SPScore of one reaction type is greater than the other, and the difference is greater than the margin, the reaction type with the larger SPScore is more suitable for the synthesis of the molecule. In the training process, a margin of 0.15 was used, which helps ensure that the three regions have similar areas. 
A margin that is neither too severe nor too trivial benefits subsequent adjustments to user preferences without the need of model retraining on a different margin. An excessively large number of epochs will cause the model to overfit the distribution of the training data.18 Since reactions sourced from the USPTO are confined to patents and differ in distribution from those documented in the literature,24 the number of epochs is also considered in the evaluation criteria. Therefore, the best model was obtained by a comprehensive evaluation of precision, recall, and F1 on the validation dataset, as well as the number of epochs (Fig. S2). As a result, the model that used ECFP4 with a length of 4096 as molecular embedding will be used for the subsequent tasks.
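The margin-based decision rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `promising_reaction_types` is hypothetical, and treating a difference exactly equal to the margin as "both" is an assumption, since the text does not specify the boundary case.

```python
from typing import List


def promising_reaction_types(s_chem: float, s_bio: float, margin: float = 0.15) -> List[str]:
    """Return the reaction type(s) considered promising for one molecule.

    s_chem and s_bio are the two SPScores (each between 0 and 1).
    Hypothetical helper; boundary handling (<=) is an assumption.
    """
    diff = s_chem - s_bio
    if abs(diff) <= margin:
        # Scores are too close to call: keep both reaction types.
        return ["Chem", "Bio"]
    # Otherwise the reaction type with the larger score is preferred.
    return ["Chem"] if diff > 0 else ["Bio"]
```

Because the margin is only applied when the two scores are compared, a user could widen or narrow it at search time without retraining the model, which is the adjustment the paragraph above refers to.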
Since there is no definitive ground truth for the synthetic potential of a given molecule, the evaluation of the SPScore relies on indirect evidence on retrosynthesis results to demonstrate its practicality and robustness. To further explore whether the SPScore effectively predicts a molecule's promising reaction type, we conducted one-step retrosynthesis in each reaction type by employing RXN4Chemistry26 for organic reactions and Levin et al.'s enzymatic templates14 for enzymatic reactions. For a given target molecule, RXN4Chemistry predicts possible organic reactions ranked by a confidence score, namely backward confidence, while Levin et al.'s enzymatic templates predict possible enzymatic reactions ranked by the template score. The average backward confidence of the top-5 predictions increases with the mean synthetic potential score in organic reactions predicted by our scoring model (Fig. 2C). A similar trend between the average template score and the mean synthetic potential score in enzymatic reactions is observed (Fig. 2D). This suggests that as the predicted probability of molecules being synthesized by a specific reaction type increases, the corresponding retrosynthesis tool's confidence in its predictions also tends to increase.
To assess whether the relative value of SChem and SBio can help identify the dominant reaction type, we analyzed both the average backward confidence and the average template score for top-5 predictions versus the mean SPScore difference (SChem − SBio) (see Fig. 2E and F). The result reveals that when SBio is larger than SChem, molecules tend to have a relatively high template score to be synthesized by enzymatic reactions and low backward confidence to be synthesized by organic reactions. Collectively, these trends shown in Fig. 2 suggest that our scoring function exhibits good ability to deduce the promising reaction type for molecules.
When SPScores are used to guide the search with the margin set to 0.15, the predicted reaction types cover the actual reaction type for 85.8% of molecules in the shortest synthesis routes and for 75.0% of molecules in all synthesis routes. In this way, 40.2% of searches in the shortest synthesis routes and 33.8% of searches in all synthesis routes can be saved, because the SPScore prediction exactly matches the reaction type in the original synthesis routes (Fig. 3A and B). As the margin expands, the reaction type of a greater number of molecules is covered, but at the expense of fewer saved searches.
The reaction retention rate is defined as the proportion of reactions where the actual reaction type matches the predicted preferred reaction type for the product molecule, as determined by the SPScore (see Methods and SI). When the margin is set to 0.15, 89.9% of reactions in the shortest synthesis routes and 86.0% of reactions in the near-shortest synthesis routes (route length ≤ shortest route length + 2) can be covered (Fig. 3C). Next, we investigated whether the SPScore can provide guidance for finding the shortest synthesis route and for discovering the diversity of shortest routes. The route retention rate is determined by counting, among the 493 target molecules, those that have at least one shortest synthesis route whose actual reaction types are all covered by the SPScore predictions. With the same margin of 0.15, 393 (79.7%) of the molecules have at least one shortest synthesis route that can be fully retained, while 109 (73.6%) molecules out of 148 have at least three shortest synthesis routes that can be retained (Fig. 3D). The results indicate that, in the context of multi-step retrosynthesis, using SPScores as a guide retains the majority of favorable synthesis routes.
To explore the search spaces of these three algorithms, we conducted a comparative study on a set of 1001 molecules from ZINC, which Levin et al.'s tool had explored under identical boundary conditions including search time and buyable dataset (see Methods). FHSync found synthesis routes to 597 molecules, SPSync found synthesis routes to 683 molecules, and ACERetro found synthesis routes to 720 molecules (Fig. 5A). Compared to Levin et al.'s tool, which found synthesis routes to only 493 molecules, FHSync found synthesis routes to an additional 104 (21.1%) molecules. This improvement is mainly attributed to the incorporation of the template-free model, RXN4Chemistry, for organic reactions. Moreover, SPSync and ACERetro found synthesis routes to 190 (38.5%) and 227 (46.0%) more molecules than Levin et al.'s tool, respectively. These results underscore that the efficiency of ACERetro surpasses that of the state-of-the-art method.
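The percentages quoted above follow directly from the reported counts. The short calculation below reproduces them; the variable and function names are illustrative only, and the counts are taken from the text.

```python
# Molecules (out of 1001) for which each planner found at least one route,
# as reported in the text.
baseline = 493  # Levin et al.'s tool
found = {"FHSync": 597, "SPSync": 683, "ACERetro": 720}


def gain_over_baseline(n_found: int, n_baseline: int):
    """Return (extra molecules, percentage gain relative to the baseline)."""
    extra = n_found - n_baseline
    return extra, round(100.0 * extra / n_baseline, 1)


gains = {name: gain_over_baseline(n, baseline) for name, n in found.items()}
for name, (extra, pct) in gains.items():
    print(f"{name}: +{extra} molecules ({pct}%)")
```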
In a self-benchmarking analysis, the hybrid search algorithms with SPScore guidance (SPSync and ACERetro) outperform the algorithm without SPScore guidance (FHSync), finding synthesis routes to 86 and 123 more molecules, respectively, which emphasizes the pivotal role of the SPScore in optimizing search efficiency. In the comparison between SPSync and ACERetro, ACERetro found synthesis routes to 37 more molecules, which indicates that the asynchronous search is more efficient than the synchronous search. Unlike the synchronous search, which drops the search for a molecule's suboptimal reaction type, the asynchronous search keeps the suboptimal reaction types of all molecules in the queue for later exploration. The algorithm starts to search the suboptimal reaction type of a molecule based on a comprehensive consideration of SPScores, search depth, and molecular complexity (see Methods).
Variations in search spaces and strategies across synthesis planners lead to the prediction of different synthesis routes for molecules. A proficient planner can discover synthesis routes to a greater number of molecules than other planners are able to find. Thus, we conducted a comparative analysis to evaluate the number of molecules whose synthesis routes were exclusively identified by each of the three algorithms in comparison to Levin et al.'s tool (Fig. 5B). Each algorithm discovered synthesis routes for molecules that Levin et al.'s tool did not identify. In particular, out of 1001 molecules, synthesis routes to 466 could be found by both ACERetro and Levin et al.'s tool. While ACERetro exclusively identified synthesis routes to 254 molecules, Levin et al.'s tool could exclusively find routes to only 28 molecules, indicating that ACERetro discovered approximately nine times as many exclusive molecules as Levin et al.'s tool (254/28 ≈ 9.1). These findings imply that ACERetro achieves an expanded search space and a better heuristic search strategy than the state-of-the-art tool.
Given the limited context that synthesis planning tools can provide, their search quality can be evaluated from the number of reactions in the predicted synthesis routes, although this is not an exhaustive metric.27 A smaller number of steps usually implies the use of fewer reagents and fewer purification steps.28 We compared the length of the shortest synthesis route to the 466 molecules found by both ACERetro and Levin et al.'s tool. ACERetro found a shorter shortest synthesis route for 167 (35.8%) molecules and a shortest synthesis route of equivalent length for 260 (55.8%) molecules (Fig. 5C). This indicates that ACERetro can predict more optimized synthesis routes than Levin et al.'s tool.
To further study the difference in search space between ACERetro and Levin et al.'s tool, we compared the synthesis routes to (S)-verofylline (1), (3S)-3-hydroxy-β-ionone (2), dimenoxadol (3) (Fig. 5D–F), and seven other syntheses (Fig. S8). In the synthesis of 1, ACERetro predicted a three-step hybrid synthesis route including one enzymatic reaction, while the shortest synthesis route predicted by Levin et al.'s tool included four organic reactions (Fig. 5D). The route predicted by ACERetro first uses an enzymatic reaction to synthesize 5 from 4. The recommended enzyme is 2,5-diamino-6-(5-phospho-D-ribitylamino)-pyrimidin-4(3H)-one deaminase (Rib2; EC number 5.4.99.28). 5 is subsequently alkylated with 6, which contains a chiral center, to form 7. The final step constructs an imidazole ring using acetic acid with 7 to produce 1. Note that the route from Levin et al.'s tool uses the same strategy to introduce the chiral center and construct the imidazole ring of 1, but the difference in starting materials makes the route longer. In the synthesis routes of 2, ACERetro predicted a two-step enzymatic synthesis route, while Levin et al.'s tool predicted a four-step hybrid synthesis route (Fig. 5E). The former first uses a reductase to get the double bond starting with dihydro-beta-ionone (8) to form beta-ionone (9). Next, a hydroxylase is used to introduce the chiral hydroxyl group into 9 to form 2. The recommended enzymes are 13,14-dehydro-15-oxoprostaglandin 13-reductase (PGR; EC number 1.3.1.48) and ent-isokaurene C2-hydroxylase (CYP71Z6; EC number 1.14.14.76), respectively. The latter uses a different starting material, beta-cyclocitral (10), to form 9, and three steps to form 2 from 9. In the synthesis routes of 3, ACERetro predicted a two-step synthesis route including only chemical reactions, while Levin et al.'s tool predicted a four-step hybrid synthesis route (Fig. 5F). 
The former first constructs an ether from benzilic acid (11) to form 12 and then constructs an ester to form 3. The latter uses a similar reaction to form the final product from 12. However, it uses 1,1-diphenylethanol (13) as the starting material to synthesize 12 via a three-step hybrid synthesis route.
The routes of 1, 2, and 3 predicted by ACERetro cover three scenarios: hybrid approach, purely organic approach, and purely enzymatic approach. The results show that ACERetro can often find shortcuts to synthesize compounds compared to Levin et al.'s tool, such as the synthesis of intermediate 7 in the synthesis route of 1, the synthesis route from 9 to 2, and the synthesis of 12 in the synthesis route of 3. For the predicted enzyme reactions, although those enzymes have not been reported to use molecules in the predicted routes as substrates, the predicted reactions still provide effective guidance for future enzyme discovery and engineering. Among all routes predicted by ACERetro, the SPScore of each product except 5 is consistent with the corresponding reaction type in the synthesis route. However, note that SBio of 5 is higher than that of all other products in the route, and its SChem − SBio has the smallest value, which indicates that 5 has higher potential to be synthesized by enzymatic reactions compared to other product molecules in the synthesis routes.
We conducted retrosynthesis planning on (S,S)-ethambutol by using ACERetro. The search parameters are the same as those used in the above-mentioned benchmarking study, except that the maximum search depth is set to 5 based on existing routes. The most promising predicted synthesis route connecting to buyable compounds is shown in Fig. 6F. The synthesis route first builds the chiral center through an enzymatic reaction of aminotransferase from cheap starting material 2-butanone (20) to form (2R)-butan-2-amine (21). 22 is synthesized by the acylation reaction of 23 and 21, followed by reduction to form 24. Two steps of symmetrical hydroxylation catalyzed by the same enzyme are used to complete the synthesis of 12.
The predicted route effectively employs a single enzymatic reaction to construct the chiral portion of the molecule. Compared to chemical methods reported in the literature, the enzymatic reaction conditions are milder. The enzyme recommended for this step is L-glutamine:2-deoxy-scyllo-inosose aminotransferase (G2DOIAT; EC number 2.6.1.100). The subsequent two symmetric hydroxylation reactions form a cascade, and a one-pot method can be employed to minimize the number of purifications. The CYP124 family of cytochrome P450 enzymes (CYP124; EC number 1.14.15.14) is recommended for this cascade. Moreover, introducing the hydroxyl group in the final step avoids side reactions during the acylation process and reduces the use of protecting groups. Although the predicted enzymatic reactions have not been experimentally verified for these substrates, the prediction still provides valuable guidance for future enzyme discovery and engineering.
Epidiolex is the brand name for (−)-cannabidiol ((−)-26), which is used for the treatment of epilepsy disorders. Kobayashi et al. developed the synthesis route using olivetol dimethyl ether (27) and 30 as the starting materials.36 The chirality is constructed through the nucleophilic addition of 28 and 31 to form 29. Another synthesis route designed by Shultz et al. uses Ireland–Claisen rearrangements to build chirality starting from olivetol 32.37 Gong et al. used the Friedel–Crafts reaction to build chirality starting with phloroglucinol (35) and cis-isolimonenol (36).38 The biosynthetic route of (−)-cannabidiol using hexanoyl-CoA as the starting material has also been reported.39 The cannabidiolic acid synthase (CBDAS; EC number 1.21.3.8) uses cannabigerolic acid as the substrate to close the ring and introduce stereochemistry (see Fig. 7A–D).
The predicted route by ACERetro with a maximum search depth set to 4 and ignoring geometric isomerism in the buyable molecule database is shown in Fig. 7E. The prediction provides a concise synthesis route, starting with the alkylation of olivetol 32 with geraniol 39 to form cannabigerol 40. Then an enzymatic step is used to form the final product (−)-26 with stereoisomerism. The first alkylation reaction has literature to support it,40 whereas the recommended enzyme for the second step, CBDAS, has not been proven to work using 40 as the substrate. However, the high similarity between 40 and 38 points to the possibility of finding enzyme mutants that allow the reaction to occur.
The chemoenzymatic synthesis route for (R,R)-formoterol 42 was predicted by Levin et al.'s tool. The top three steps with opportunities for improvement (46, 48, and 49) were identified as those whose predicted SPScores deviate most from the reaction type used in the original route. In particular, the SChem values of intermediates 46, 48, and 49 are larger than their corresponding SBio values, yet enzymatic reactions were employed in the original route, which results in a high optimization score. The new organic synthesis route for intermediate 46, utilizing 45 as the starting compound, was predicted by ACERetro with a search depth capped at 1. Given that intermediates 49 and 48 are in the same branch, only the analysis for 48 is shown in Fig. 8B, which was undertaken by ACERetro with a maximum search depth of 2. The proposed route employs a one-step chemical reaction to synthesize 48, taking 50 and (+)-phenylethylamine as the precursors, which reduces the original three-step synthesis strategy to a single step. These predicted reactions for intermediates 46 and 48 have been corroborated by the literature.43,44
In addition, we capitalize on the characteristics of current organic reaction and enzymatic reaction databases. A sufficiently large organic reaction database can support the training of retrosynthesis tools based on language models, whereas a smaller-scale enzymatic reaction database is more suitable for rule-based reaction templates. Accordingly, we employ a template-free retrosynthesis tool, RXN4Chemistry, for organic reactions and a template-based retrosynthesis tool, ASKCOS, for enzymatic reactions. Free from the limitations imposed by a template prioritization system, ACERetro guided by the SPScore possesses the capability to integrate seamlessly with any existing retrosynthesis tool.
By comparing the confidence of single-step retrosynthesis and single-step retrobiosynthesis of 11003 molecules with the trend of SPScore distribution, it is shown that SPScores can effectively predict promising reaction types for molecules. The performance of SPScores in multi-step retrosynthesis was further verified by reaction type coverage, reaction retention rate, and route retention rate among predicted synthesis routes of 493 molecules. In the benchmarking study on 1001 molecules, ACERetro incorporating two single-step precursor prediction tools outperformed Levin et al.'s tool, a state-of-the-art method. Through a comparative analysis of the results obtained from FHSync, SPSync, and ACERetro, self-benchmarking reveals that the incorporation of the template-free model, the implementation of SPScores, and the adoption of asynchronous search methodologies each contribute to enhancing the performance of synthesis planning.
Examples of synthesis routes for (S)-verofylline, (3S)-3-hydroxy-β-ionone, and dimenoxadol reveal that our method can identify shortest synthesis routes with higher quality, and the predictions include not only hybrid synthesis routes, but also chemical reaction only synthesis routes and enzymatic reaction only synthesis routes. The case studies on synthesis planning of ethambutol and Epidiolex demonstrate that our approach can effectively design hybrid synthesis routes for complex molecules and find potential enzyme candidates to perform the predicted enzymatic reactions. The complementarity of the two reaction types will further broaden the scope for designing efficient synthesis routes for molecules of interest. The case studies on synthesis route optimization for rivastigmine and (R,R)-formoterol illustrate that SPScores can be effectively applied to optimize existing synthesis routes. Existing synthesis tools are often inadequate for lengthy synthesis steps. Finding steps with opportunities for improvement that may be optimized in existing synthesis routes and then conducting retrosynthetic analysis can simplify the search process and make full use of existing parts of the synthesis routes that have been experimentally verified.
The concept underlying SPScores involves inferring the most promising reaction type for a molecule based on existing catalysis data in a reaction database, employing a data-driven approach. This approach aims to differentiate the distinct reaction spaces of organic reactions and enzymatic reactions. In this work, we performed a simplified but fruitful verification of the synthetic potential with molecular fingerprint and MLP. Combining more complex models such as molecular graphs and reinforcement learning to predict the SPScore will be explored in a follow-up study. Utilizing SPScore in chemoenzymatic synthesis planning can expedite the search process by avoiding less promising reaction types. However, there remains a risk that the model might overlook viable reactions in the reaction types it avoids. Consequently, a comprehensive and high-quality dataset encompassing various types of reactions is crucial to ensure optimal model performance. It is noteworthy that the reaction spaces of organic reactions and enzymatic reactions are dynamic. The unique reaction space of each may expand or contract with the discovery of new catalysts or enzymes. In ACERetro, the SPScore is first used to identify the promising reaction type before conducting a retrosynthetic analysis. An alternative improvement strategy could be first conducting retrosynthetic analysis to identify all potential deconstruction sites and intermediates and then selecting the appropriate reaction type for each step.
In summary, this study transforms chemoenzymatic synthesis planning from a fragmented process into a unified framework by combining the concept of synthetic potential with practical algorithmic design. ACERetro overcomes the limitations of existing methods by integrating both template-based and template-free strategies, enabling comprehensive synthesis planning across diverse organic and enzymatic reaction databases. Its ability to design efficient synthesis routes and identify alternative pathways highlights its potential as a powerful tool in the field. We believe that computer-aided chemoenzymatic synthesis planning will expand the synthesis space by leveraging the complementary strengths of enzymatic reactions and organic reactions. This approach can accelerate the adoption of enzymes as eco-friendly catalysts, facilitating enzyme screening and engineering for improved catalytic performance.
• If y = 1 (i.e., the reaction type is organic), we consider the prediction correct if SChem > SBio + margin.
• If y = −1 (i.e., enzymatic), the prediction is correct if SBio > SChem + margin.
• If y = 0 (i.e., overlap), the prediction is considered correct if the absolute difference between the two scores is less than the margin, i.e., |SChem − SBio| < margin.
A weighted margin ranking loss, loss(SChem, SBio, y) = weight_i·max(0, −y(SChem − SBio) + margin) if y = ±1 and loss(SChem, SBio, y) = weight_i·max(0, |SChem − SBio| − margin) if y = 0, is applied so that a penalty is incurred only when the prediction falls outside the region of the molecule's true label. The weight is calculated based on the reciprocal of the ratio of each label.
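The loss above can be written out directly for a single molecule. The sketch below is a plain-Python illustration, not the training code: the function names are hypothetical, and computing the label weight as total count divided by label count is one reading of "the reciprocal of the ratio of each label".

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_margin_ranking_loss(s_chem: float, s_bio: float, y: int,
                                 weight: float = 1.0, margin: float = 0.15) -> float:
    """Per-sample loss: y = +1 (organic), -1 (enzymatic), 0 (overlap)."""
    diff = s_chem - s_bio
    if y == 0:
        # Overlap label: penalize only when the scores differ by more than the margin.
        return weight * max(0.0, abs(diff) - margin)
    # Single-type label: penalize unless the true type leads by at least the margin.
    return weight * max(0.0, -y * diff + margin)


def inverse_frequency_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Assumed weighting scheme: reciprocal of each label's share of the dataset."""
    counts = Counter(labels)
    total = len(labels)
    return {label: total / count for label, count in counts.items()}
```

For example, with SChem = 0.9 and SBio = 0.2 the loss is zero for y = 1, because the organic score already leads by more than the margin, and positive for y = −1 or y = 0.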
The dataset was randomly split into a training, validation, and test set (80%, 10% and 10%, respectively). We used a grid search to tune the hyperparameters including the type of the molecular fingerprints (ECFP4 and MAP4), the length of the molecular fingerprints (1024, 2048, or 4096), and the number of hidden layers (1, 3, or 5). The accuracy, F1, and recall are calculated on the validation set. To mitigate the risk of overfitting, the number of epochs is incorporated into the evaluation function to select the optimal models (see the SI). The optimal model, which utilizes ECFP4 embedding of 4096 length and comprises 3 hidden layers, trained for 10 epochs, was employed for all subsequent tasks.
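As a rough sketch of the setup described above, the split and the hyperparameter grid can be enumerated as follows. The helper `split_indices` and the random seed are illustrative assumptions; the actual model training and the epoch-aware evaluation function (described in the SI) are not reproduced here.

```python
import itertools
import random
from typing import List, Tuple


def split_indices(n: int, seed: int = 0) -> Tuple[List[int], List[int], List[int]]:
    """Randomly split n sample indices into 80% train, 10% validation, 10% test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seed is an arbitrary illustrative choice
    n_train = n * 8 // 10
    n_val = n // 10
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Grid over fingerprint type, fingerprint length, and number of hidden layers,
# using the values listed in the text.
grid = list(itertools.product(["ECFP4", "MAP4"], [1024, 2048, 4096], [1, 3, 5]))

train, val, test = split_indices(1000)
print(len(train), len(val), len(test), len(grid))
```

The grid contains 2 × 3 × 3 = 18 configurations, one of which, ("ECFP4", 4096, 3), corresponds to the model selected in the text.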
Multi-step hybrid synthesis routes were derived from the retrosynthetic predictions for 493 molecules conducted by Levin et al.'s tool within a three-minute timeframe. Out of 493 target molecules, we enumerated 26741 distinct product molecules in 397 040 synthesis routes. All the synthesis routes with the shortest length for each target molecule were collected, which contained 1544 distinct product molecules. The reaction type of a molecule (denoted as “Chem”, “Bio”, or “Both”) is assigned based on whether the molecule has been synthesized by an organic chemical reaction or an enzymatic reaction. Reaction type coverage out of all molecules counts the molecule whose SPScore-predicted reaction type includes the actual reaction type out of all molecules. Saved searches out of “Chem” and “Bio” molecules count the molecule whose SPScore-predicted reaction type exactly matches the actual reaction type for these molecules labeled with “Chem” or “Bio”, so the search algorithm does not need to search the alternative reaction type. The reaction retention rate measures how often the actual reaction type used in a pathway agrees with the preferred reaction type predicted by SPScore for the product of that reaction. In particular, for each reaction in a given dataset, we check whether the reaction type (e.g., chemical or enzymatic) matches one of the reaction types that the SPScore predicts as favorable for the reaction's product molecule. The retention rate is calculated as the percentage of reactions that meets this criterion across the entire dataset (see the SI for the formulae). The near shortest synthesis routes include synthesis routes whose lengths are less than or equal to the shortest synthesis route length plus two. Synthesis route retention rate counts the synthesis route whose reactions can be all retained (see the SI for the formulae).
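The metrics defined above can be illustrated with a small sketch. This is one reading of the verbal definitions, since the exact formulae are in the SI: all function names are hypothetical, each reaction is represented as a tuple of its actual type and the two SPScores of its product, and a target counts as retained when at least one of its routes is fully retained.

```python
from typing import Dict, List, Set, Tuple

Reaction = Tuple[str, float, float]  # (actual type "Chem"/"Bio", SChem, SBio)


def predicted_types(s_chem: float, s_bio: float, margin: float = 0.15) -> Set[str]:
    """Reaction types the SPScore marks as promising (both if within the margin)."""
    diff = s_chem - s_bio
    if abs(diff) <= margin:
        return {"Chem", "Bio"}
    return {"Chem"} if diff > 0 else {"Bio"}


def reaction_retention_rate(reactions: List[Reaction], margin: float = 0.15) -> float:
    """Fraction of reactions whose actual type is among the predicted types."""
    kept = sum(t in predicted_types(c, b, margin) for t, c, b in reactions)
    return kept / len(reactions)


def saved_search_rate(molecules: List[Reaction], margin: float = 0.15) -> float:
    """Fraction of 'Chem'/'Bio' molecules whose prediction is exactly the actual type."""
    saved = sum(predicted_types(c, b, margin) == {t} for t, c, b in molecules)
    return saved / len(molecules)


def route_retention_rate(routes_by_target: Dict[str, List[List[Reaction]]],
                         margin: float = 0.15) -> float:
    """Fraction of targets with at least one route in which every reaction is retained."""
    def retained(route: List[Reaction]) -> bool:
        return all(t in predicted_types(c, b, margin) for t, c, b in route)
    hits = sum(any(retained(r) for r in routes) for routes in routes_by_target.values())
    return hits / len(routes_by_target)
```

With a toy input such as `[("Chem", 0.9, 0.2), ("Bio", 0.5, 0.45), ("Bio", 0.9, 0.2)]`, two of the three reactions are retained, but only the first saves a search, because the second falls within the margin and both reaction types would still be explored.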
In the expansion step, the behavior differs across the three methods. In FHSync, both organic and enzymatic reaction models are applied to the selected molecule. RXN4Chemistry26 and Levin et al.'s enzymatic templates14 are used to predict single-step precursors for the selected molecule. The precursors generated from both reaction types are merged, and all precursors which are not in the buyable database are scored based on the molecular complexity function (denoted as f(P)) and the depth with a depth exploration factor (denoted as d). In SPSync, the algorithm uses the SPScore to determine which reaction type is more promising for the selected molecule. Only the predicted reaction type is used for precursor generation, and scoring is performed in the same way as FHSync.
In ACERetro, the SPScore is used to guide an asynchronous search between the two reaction types. For each selected molecule, scores are calculated for both the organic and enzymatic pathways using the formula:
$$\mathrm{Score}_i = (1 - c \cdot \mathrm{SPScore}_i) \cdot \mathrm{depth}^{d} \cdot f(P), \quad i \in [\mathrm{Chem}, \mathrm{Bio}]$$
In the update step, all newly generated precursors, along with their associated scores, are added to the priority queue, which is then re-ranked before the next iteration begins. This shared architecture enables a fair comparison on how the integration of the SPScore and asynchronous search affects planning performance.
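The asynchronous expansion and update steps can be sketched with a standard priority queue. Several points here are assumptions rather than details confirmed by the text: the score is read as (1 − c·SPScore)·depth^d·f(P), lower scores are assumed to be explored first, the values of `c` and `d` are placeholders, and the complexity f(P) is passed in as a plain number instead of being computed from a structure.

```python
import heapq
from typing import Dict, List, Tuple

QueueItem = Tuple[float, str, str]  # (score, molecule identifier, reaction type)


def acer_score(sp_score: float, depth: int, complexity: float,
               c: float = 0.5, d: float = 1.0) -> float:
    """Score for exploring one reaction type of one molecule (c, d are placeholders)."""
    return (1.0 - c * sp_score) * (depth ** d) * complexity


def push_both_types(queue: List[QueueItem], molecule: str,
                    sp_scores: Dict[str, float], depth: int, complexity: float,
                    c: float = 0.5, d: float = 1.0) -> None:
    """Queue both reaction types so the less promising one can be explored later."""
    for rtype in ("Chem", "Bio"):
        score = acer_score(sp_scores[rtype], depth, complexity, c, d)
        heapq.heappush(queue, (score, molecule, rtype))


queue: List[QueueItem] = []
push_both_types(queue, "precursor_A", {"Chem": 0.9, "Bio": 0.2}, depth=2, complexity=3.0)
first = heapq.heappop(queue)  # the more promising reaction type comes out first
```

In this toy case the organic expansion of `precursor_A` is popped first, while its enzymatic expansion stays in the queue and competes with entries from other molecules, which is the behavior that distinguishes the asynchronous search from SPSync.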
The maximum search depth and the expansion time were 10 and 180 s respectively. For a fair comparison, the above parameters together with commercially available compound database from the vendors eMolecules and Sigma-Aldrich are consistent with those used in Levin et al.'s tool (additional parameters in the SI). When the search reaches the time limit, all synthesis routes from buyable molecules to the target molecule are returned.
$$I_n = \operatorname{argsort}\big(y_i\,(\mathrm{SPScore}_{\mathrm{Chem},i} - \mathrm{SPScore}_{\mathrm{Bio},i})\big)$$
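The ranking above, used to find steps with opportunities for improvement in an existing route, can be sketched as an ascending argsort. The label convention (y = +1 for an organic step, −1 for an enzymatic step) follows the Methods; reading the most negative values as the steps most worth re-planning is an assumption, as is the function name.

```python
from typing import List, Tuple

Step = Tuple[int, float, float]  # (y, SChem, SBio) for the product of each step


def rank_steps_for_optimization(steps: List[Step]) -> List[int]:
    """Return step indices ordered from strongest disagreement to strongest agreement.

    y * (SChem - SBio) is negative when the reaction type used in the route
    differs from the type favoured by the SPScore.
    """
    agreement = [y * (s_chem - s_bio) for y, s_chem, s_bio in steps]
    return sorted(range(len(steps)), key=lambda i: agreement[i])


# Toy route: step 1 is enzymatic although SChem is much larger than SBio.
order = rank_steps_for_optimization([(1, 0.9, 0.2), (-1, 0.8, 0.1), (-1, 0.3, 0.6)])
```

Steps at the front of the returned order are the candidates to hand back to the planner with a small search depth, as done for intermediates 46 and 48 in the (R,R)-formoterol case study.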
Supplementary information is available with training parameters and results, benchmark result analysis, synthesis route examples, and searching parameters. See DOI: https://doi.org/10.1039/d5dd00008d.
This journal is © The Royal Society of Chemistry 2025