Open Access Article
Junren Li a, Lei Fang *b and Jian-Guang Lou b
aCollege of Chemistry and Molecular Engineering, Peking University, No. 5 Yiheyuan Road, Beijing, China
bMicrosoft Corporation, Building 2, No. 5 Dan Ling Street, Beijing, China. E-mail: leifa@microsoft.com
First published on 2nd February 2024
Computer-assisted methods have emerged as valuable tools for retrosynthesis analysis. However, quantifying the plausibility of generated retrosynthesis routes remains a challenging task. We introduce Retro-BLEU, a statistical metric adapted from the well-established BLEU score in machine translation, to evaluate the plausibility of retrosynthesis routes based on the analysis of reaction template sequences. We demonstrate the effectiveness of Retro-BLEU by applying it to a diverse set of retrosynthesis routes generated by state-of-the-art algorithms and comparing its performance with other evaluation metrics. The results show that Retro-BLEU is capable of differentiating between plausible and implausible routes. Furthermore, we provide insights into the strengths and weaknesses of Retro-BLEU, paving the way for future developments and improvements in this field.
Existing metrics to evaluate the retrosynthesis routes can be broadly classified into two primary categories:
• Metrics based on intrinsic properties of generated routes, e.g., route length, reactants price,7 or the coverage of the starting materials in recorded routes.8 While these metrics provide valuable information, they cannot capture the chemical plausibility or practicality of a given route.2,7 For example, protection and deprotection steps are essential for obtaining the target product by preventing undesired reactions, which increases the route length.
• Metrics based on trained models, e.g., reaction cost,9 which calculates a route-level probability score by multiplying the probabilities of each reaction step. A typical planning system generally consists of a single-step retrosynthesis model10–12 and a multi-step searching algorithm.9,13,14 The probabilities generated by single-step models represent the model's confidence derived from the underlying training data;15 they do not correspond to actual reaction probabilities, which are influenced by various factors such as reaction kinetics and the presence of catalysts. Moreover, the model's performance degrades when the size of the template library is increased, resulting in less reliable probabilities.16 The metric is also affected by the route length, because a route comprising more steps typically exhibits a lower cumulative probability.
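The length dependence of a multiplied probability score can be seen with a small numerical sketch. The per-step confidence of 0.9 and the function name `route_probability` are illustrative assumptions, not values or code from any of the cited systems.

```python
import math


def route_probability(step_probs):
    """Route-level score obtained by multiplying per-step model confidences."""
    return math.prod(step_probs)


# Hypothetical routes whose steps all receive the same confidence of 0.9.
short_route = [0.9] * 3
long_route = [0.9] * 6

print(route_probability(short_route))  # roughly 0.729
print(route_probability(long_route))   # roughly 0.531
```

Even though every step is equally trusted by the model, the six-step route scores lower than the three-step route, which is the bias described above.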
The goal of retrosynthesis planning is to provide valid routes for synthesis design. A valid route indicates that all reactions of the route can be performed in the real-world lab scenario, instead of simply applying reaction templates to arbitrary chemical environments. Nonetheless, the metrics mentioned above cannot determine the route validity, which leaves a gap between current CASP programs and actual laboratory experiments. In order to accurately determine if a reaction can take place, a theoretical evaluation or wet-lab experiment is indispensable. Such assessments necessitate substantial computational resources (starting from first principles, which are typically challenging to compute precisely) or involve considerable labor costs. These challenges motivate us to approach the problem from a statistical perspective, seeking statistical measures correlated with chemical plausibility, and thus enabling us to quantify the plausibility of chemical reaction routes.
In natural language processing (NLP), widely accepted evaluation metrics for tasks such as machine translation or text generation/summarization include Bilingual Evaluation Understudy (BLEU)17 and Recall-Oriented Understudy for Gisting Evaluation (ROUGE),18 which focus on precision and recall, respectively, when comparing generated text against human references. Both BLEU and ROUGE rely on the concept of n-grams to compute the overlap between generated text and the reference text. n-grams are sequences of “n” consecutive words. For example, unigrams represent single words, bigrams represent two consecutive words, and so on. Drawing an analogy to NLP, retrosynthetic routes (typically represented as trees) can also be considered collections of reaction sequences, as illustrated in Fig. 1. Each sequence corresponds to a specific reaction pathway connecting the target product to leaf nodes, which represent a set of individual starting materials. Similar to how consecutive words in a sentence often exhibit semantic correlations, consecutive reactions in validated synthesis routes also demonstrate interrelated synthetic strategies, reflecting the underlying logic and coherence in chemical transformations.19,20 For instance, the nitro group can be easily introduced into an aromatic ring, then reduced to an amine, followed by other substitution reactions, as a simple example of sequential reactions. Since there is no absolute best route for retrosynthesis planning, the precision of sequential reactions is more important than recall. This motivates us to modify the BLEU score for the scenario of retrosynthesis, resulting in Retro-BLEU. The key difference between the basic BLEU score and Retro-BLEU lies in the data being analyzed. While BLEU deals with text, Retro-BLEU is designed for reaction sequences, which can be obtained from retrosynthetic routes.
In this context, n-grams represent sequences of “n” consecutive reactions (or consecutive reaction templates, as we will discuss later) instead of words. This adaptation allows us to apply the concept of n-grams from natural language processing to the domain of retrosynthetic routes, enabling a more relevant and meaningful comparison between generated routes and known synthesis routes. By calculating the precision of matching reaction n-grams between generated routes and known synthesis routes, Retro-BLEU offers a quantifiable approach to assess the quality of generated synthetic pathways.
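A minimal sketch of how a route tree can be flattened into reaction sequences is given below. The nested-dictionary layout, the key names `reaction` and `children`, the reaction labels, and the target-first ordering are all assumptions made for illustration; they are not the data format used by PaRoutes or AiZynthFinder.

```python
def reaction_paths(node):
    """Yield each sequence of reaction labels from the root reaction
    (the one producing the target) down to a leaf reaction."""
    children = node.get("children", [])
    if not children:
        # Leaf reaction: all of its reactants are starting materials.
        yield [node["reaction"]]
        return
    for child in children:
        for tail in reaction_paths(child):
            yield [node["reaction"]] + tail


# Hypothetical convergent route: the final coupling joins two branches.
route = {
    "reaction": "amide_coupling",
    "children": [
        {"reaction": "ester_hydrolysis"},
        {"reaction": "nitro_reduction", "children": [{"reaction": "nitration"}]},
    ],
}

paths = list(reaction_paths(route))
for path in paths:
    print(path)
```

Each printed list is one reaction sequence; consecutive pairs within a list are the bigrams of that pathway.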
We employ two datasets, the PaRoutes21 dataset and the Retro*-190 (ref. 9) dataset, to determine if there is a noticeable relationship between the n-gram overlap found in model-generated routes and patent test routes.
• PaRoutes: following PaRoutes,21 we collected 457 447 experimentally validated routes from the US Patent and Trademark Office (USPTO) dataset.22 PaRoutes also provides two sets of 10 000 diverse, non-overlapping routes with a depth of at most 10 reactions: set-n1 and set-n5. The difference is the number of routes extracted from each patent before checking for overlapping routes: one route for set-n1 and five routes for set-n5 (see PaRoutes21 for details). Due to space limitations, we mainly report the results on set-n5 because the results on set-n1 are similar. We constructed the known n-grams from the patent dataset, excluding those patents containing the corresponding 10 000 target instances. As a result, the remaining patents are denoted as the corresponding “known routes”. This approach was taken to mimic the scenario when evaluating retrosynthesis routes for new targets, ensuring a fair and unbiased comparison. In addition, we generated 2 958 811 routes for set-n5 molecules using Monte Carlo Tree Search (MCTS)13 and 2 799 023 routes for set-n5 molecules using Retro* (ref. 9) with AiZynthFinder,23 i.e., for each target molecule, we generated approximately 300 routes. We use the default parameter settings in AiZynthFinder and employ the top-50 predictions from the single-step model in each step.
• Retro*-190: Retro*-190 (ref. 9) is a collection of 190 challenging target molecules specifically designed to test the performance of retrosynthesis search algorithms. Retro*9 provided the shortest route for each target by concatenating reactions from various patents until the starting materials are available in eMolecules§, which are considered as patent test routes. It should be noted that these routes are pseudo-routes because their corresponding reaction sequences may not be chemically logical. We employ the results of several state-of-the-art search algorithms as model-generated routes to compare the n-gram overlap with known routes. The 299 902 training routes from Retro* are considered as known routes to build the known n-grams.
On each dataset, we collected n-consecutive reactions (with n ranging from 2 to 4) from the set of corresponding known synthesis routes to construct the known reaction n-gram sequences. We utilized the SMILES (Simplified Molecular-Input Line-Entry System) representation for these reactions, as the canonical SMILES of each molecule is unique, allowing for efficient identity checking between tuples. For example, if a route is 6 steps long, we would take the first four reactions, the middle four reactions, and the last four reactions as three 4-grams. For each tested route, which includes both patent test routes and model-generated routes, we extracted n-consecutive reactions and computed their overlap ratio with the known reaction n-grams (routes from the same patents are excluded when constructing the dataset); we then averaged the ratio across all routes to obtain the overall overlap ratio. To be specific, the fraction of n-gram overlap is calculated as follows:
![]() | (1) |
We also calculated the coverage, which is the average ratio of routes having n-grams, e.g., routes shorter than 3 steps do not contain any trigram reaction sequences.
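The overlap ratio and coverage described above can be sketched as follows. This is an illustration built from the prose rather than the authors' implementation: reactions are stand-in string labels instead of canonical reaction SMILES, each route is given as a list of reaction sequences, and routes that contain no n-gram are assumed to be skipped when averaging the ratio.

```python
def ngrams(sequence, n):
    """All runs of n consecutive items, e.g. a 6-step sequence gives three 4-grams."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def overlap_and_coverage(routes, known, n):
    """Return (mean per-route fraction of n-grams found in `known`,
    fraction of routes that contain at least one n-gram)."""
    ratios = []
    for sequences in routes:
        grams = [g for seq in sequences for g in ngrams(seq, n)]
        if grams:  # assumption: routes without any n-gram do not enter the average
            ratios.append(sum(g in known for g in grams) / len(grams))
    overlap = sum(ratios) / len(ratios) if ratios else 0.0
    coverage = len(ratios) / len(routes) if routes else 0.0
    return overlap, coverage


# Hypothetical known bigrams and three tested routes.
known_bigrams = {("A", "B"), ("B", "C")}
tested_routes = [
    [["A", "B", "C"]],  # both bigrams are known
    [["A", "D"]],       # its only bigram is unknown
    [["E"]],            # too short to contain a bigram
]

overlap, coverage = overlap_and_coverage(tested_routes, known_bigrams, 2)
print(overlap, coverage)
```

In this toy case the overlap ratio is the mean of 1.0 and 0.0, and the coverage is two routes out of three.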
As shown in Table 1, on PaRoutes set-n5, nearly half of the reaction n-grams are recorded/known when evaluating the patent test routes. This observation suggests that a significant portion of the reaction sequences in set-n5 patent test routes overlap with those found in known synthesis routes, indicating that chemists often rely on familiar and well-understood reaction sequences when designing new synthesis strategies. However, the overlap ratio declined to less than 10% on generated routes for both MCTS and Retro*. On Retro*-190, which is a quite challenging dataset, the overlap ratio of the pseudo-routes decreases to approximately 10% at the bigram level, because these routes contain many reactions that were not observed in the training data. This decrease can be attributed to the sparsity of reaction sequences.27 When considering reactions as individual tokens, the space formed by continuous n-grams is extremely sparse, because encountering unseen reactions is inevitable during the synthesis of novel molecules. Nonetheless, this does not imply that we should regard unseen reactions as invalid choices.
| Route set | Reaction (n = 2) | Template (n = 2) | Coverage (n = 2) | Reaction (n = 3) | Template (n = 3) | Coverage (n = 3) | Reaction (n = 4) | Template (n = 4) | Coverage (n = 4) | Avg. length |
|---|---|---|---|---|---|---|---|---|---|---|
| PaRoutes set-n5 | 49.0% | 70.0% | 100% | 47.6% | 52.3% | 92.9% | 46.4% | 48.7% | 51.3% | 3.84 |
| MCTS-1ᵃ | 7.9% | 24.0% | 94.3% | 7.3% | 5.5% | 56.7% | 7.4% | 1.6% | 24.8% | 3.75 |
| MCTS-10 | 3.9% | 21.2% | 99.0% | 2.2% | 2.9% | 76.7% | 1.0% | 0.5% | 41.2% | 4.50 |
| MCTS-all | 1.4% | 29.1% | 100% | 0.3% | 3.4% | 98.6% | 0.1% | 0.3% | 92.6% | 8.65 |
| Retro*-1 | 4.5% | 14.9% | 94.3% | 3.4% | 2.1% | 57.5% | 2.4% | 0.7% | 22.4% | 3.23 |
| Retro*-10 | 3.1% | 16.1% | 99.0% | 1.7% | 1.6% | 75.0% | 1.0% | 0.4% | 35.2% | 3.79 |
| Retro*-all | 2.3% | 25.1% | 100% | 0.6% | 2.4% | 98.0% | 0.1% | 0.2% | 84.6% | 5.38 |
| Retro*-190 (ref. 9)ᵇ | 10.1% | 42.9% | 100% | 4.6% | 28.9% | 90.5% | 1.5% | 21.5% | 77.9% | 6.67 |
| Retro* (165) (ref. 9) | 6.0% | 31.2% | 100% | 3.4% | 16.9% | 88.5% | 2.1% | 14.9% | 75.2% | 6.35 |
| Retro*+ (183) (ref. 24)ᶜ | 3.6% | 29.5% | 100% | 1.6% | 13.6% | 90.7% | 0.9% | 10.2% | 79.8% | 6.82 |
| EG-MCTS (183) (ref. 25) | 1.2% | 13.9% | 100% | 0.5% | 5.2% | 90.1% | 0.1% | 3.1% | 72.7% | 5.69 |
| RetroGraph (189) (ref. 26) | 2.1% | 20.1% | 100% | 0.9% | 7.6% | 90.5% | 0.3% | 4.5% | 75.1% | 6.40 |

ᵃ The numbers represent how many routes are used in the evaluation, i.e., top-1 predicted routes, top-10 predicted routes, and all predicted routes (approximately 300 for each target). ᵇ The number in the parentheses denotes the solved routes among the 190 targets. ᶜ We use the variant of Retro*+ without value functions.
The sparsity of the reaction space encourages us to develop a more flexible evaluation of generated routes, emphasizing the underlying chemical transformations. In the context of chemical reactions, templates can be considered as an induction and generalization form of reactions. Therefore, we conducted a similar analysis on template sequences, using the same approach as in analyzing the overlap of reaction sequences. Atom-mapping information is a prerequisite for extracting templates. The patent routes already contain atom-mapping information; for the test routes on Retro*-190, we employed the commonly used tool RXNMapper28 to map the atom numbers. Afterwards, the reaction templates are extracted with the rxnutils29 package, and we use SMARTS (SMILES Arbitrary Target Specification) strings to represent these templates.
We tested reaction templates with radii ranging from 0 to 2. The chosen radius determines the extent of the chemical environment captured around the reaction center, which in turn influences the sparsity of the chemical space formed by the bigrams. A template with radius r includes the atoms within r bonds of the reaction center; a template with a radius of 0 covers only the atoms undergoing change at the reaction center. A radius of 0 proves to be insufficiently specific, as a single template might correspond to multiple reactions. However, selecting a large radius can lead to overly restrictive template coverage. At a radius of 2, the template bigram overlap ratio for patent routes drops to a mere 34.8%, resulting in bigrams too sparse for effective evaluation. Therefore, we set the radius to 1 when evaluating template sequences, offering a meaningful compromise between specificity and coverage.
Herein, we present the results for a radius of 1; results for other radii can be found in ESI Table 2,‡ and indicate that a radius of 1 is an optimal choice for evaluating template sequences. As shown in Table 1, the patent-extracted routes on PaRoutes set-n5 contain a large portion of known consecutive template sequences (70.0% of template bigrams, compared with 49.0% of reaction bigrams). Across the other route sets, the template bigram overlap ratio is likewise considerably higher than the reaction bigram overlap ratio. Similarly, the test routes on Retro*-190 have 42.9% of recorded template bigrams, while model-generated routes exhibit lower overlaps.
It is important to note that coverage is closely related to the average route length. When more generated routes are examined for each target, the average length increases, resulting in higher coverage. However, only the bigram coverage consistently remains near 100%. Taking the coverage into account, we propose that the bigram overlap ratio should be considered when assessing the chemical plausibility. Furthermore, it should be noted that the template bigram overlap ratio increases when the average route length increases. This might be due to randomly paired sequences as the route extends, which may contain unproductive steps, such as performing unnecessary protection before converting functional groups. This observation implies that route length should also be considered when evaluating the plausibility of generated routes.
![]() | (2) |
We compare Retro-BLEU with four other baselines:
• The route score by Badowski et al.7 This score takes into account route length and convergence. However, due to insufficient experimental data, the cost of each reaction and the yields can only be set using heuristics. We adapted the original implementation from PaRoutes in our comparisons:21
![]() | (3) |
![]() | (4) |
• Cumulative probability: we recursively add the logarithmic probability obtained from the single-step retrosynthesis model NeuralSym31 for each reaction in the route. Note that for reactions in patent test routes that cannot be predicted by the single-step model, we set the probability to 1 × 10⁻¹⁰ when calculating the cumulative probability.
![]() | (5) |
• Length: we use the number of reactions in the route as a metric, with shorter routes being preferable.
$$\mathrm{Score}_{\mathrm{length}}(r) = N_x(r) \qquad (6)$$
• Bigram ratio: we rank the routes based on the bigram overlap ratio. As we discussed earlier, a higher bigram ratio suggests that the route more closely resembles known successful routes, and is therefore considered better.
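Two of the baselines above are simple enough to sketch directly from their descriptions. The code below is an illustration under assumptions, not the authors' implementation: per-step probabilities are passed in as plain numbers rather than queried from NeuralSym, `None` is used here to mark a reaction the single-step model could not predict, and the natural logarithm is assumed.

```python
import math

# Fallback probability stated in the text for reactions the model cannot predict.
UNPREDICTED_PROB = 1e-10


def cumulative_log_prob(step_probs):
    """Sum of per-step log-probabilities; higher (closer to 0) is better."""
    return sum(
        math.log(UNPREDICTED_PROB if p is None else p) for p in step_probs
    )


def length_score(step_probs):
    """Number of reactions in the route; shorter is better."""
    return len(step_probs)


# Hypothetical two-step routes: one fully predicted, one with an unpredicted step.
predicted_route = [0.5, 0.5]
partly_unpredicted_route = [0.5, None]

print(cumulative_log_prob(predicted_route))
print(cumulative_log_prob(partly_unpredicted_route))
print(length_score(predicted_route))
```

A single unpredicted step dominates the cumulative score, which is one reason patent routes can rank poorly under this baseline.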
For each set of routes, we compute the route score using the aforementioned baselines and Retro-BLEU score. Then, we calculate the rank of the patent-recorded route among all the tested routes for the same target, leading to our top-k metric. Since multiple routes may share the same scores (e.g., the same length under the route length metric), we assess the routes in terms of both best-case and worst-case scenarios. These scenarios represent instances where the patent route is identified either first or last among routes with the same score, respectively.
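The best-case and worst-case ranking with tied scores can be sketched as follows. The function names and the convention that the generated scores exclude the patent route itself are assumptions made for this illustration.

```python
def rank_bounds(patent_score, generated_scores, higher_is_better=True):
    """Return (best_rank, worst_rank) of the patent route among all tested routes.

    Best case places the patent route first among equally scored routes,
    worst case places it last.
    """
    if not higher_is_better:
        # Flip signs so that "larger is better" holds for every metric.
        patent_score = -patent_score
        generated_scores = [-s for s in generated_scores]
    better = sum(s > patent_score for s in generated_scores)
    tied = sum(s == patent_score for s in generated_scores)
    return better + 1, better + tied + 1


def top_k_accuracy(ranks, k):
    """Fraction of targets whose patent route is ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Route-length metric (lower is better): patent route has 4 steps,
# hypothetical generated routes have 3, 4, 4 and 6 steps.
best, worst = rank_bounds(4, [3, 4, 4, 6], higher_is_better=False)
print(best, worst)
```

Here one generated route is strictly better and two are tied with the patent route, so its rank lies between 2 and 4; a metric with many ties therefore shows a wide gap between the two scenarios.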
Fig. 2 shows the results on set-n5 for MCTS and Retro*, and the results on set-n1 can be found in ESI Fig. 1,‡ demonstrating a similar outcome to the one discussed here. In Fig. 2, the gap between the best- and worst-case scenarios is marked with diagonal lines. Retro-BLEU achieves the best overall ranking accuracy with a relatively small gap between the best- and worst-case when compared with other evaluation metrics on both MCTS and Retro* generated routes.
Fig. 3 Most frequent positive (highlighted in green) and negative (highlighted in red) template bigrams.
The first positive bigram illustrates a common process for generating an amide, which involves initially hydrolyzing an ester and then coupling it with another amine. Considering the difficulty of amidation reactions and the low reactivity of esters, hydrolyzing esters into more reactive carboxylic acids is often necessary. Following this step, amidation can be completed with the help of condensing agents. Similarly, in the second positive bigram, the product is deconstructed into a primary amine and a nitrogen-substituted heterocycle. The primary amine is derived from a nitro group through a reduction process. This bigram demonstrates an excellent strategy for linking two molecular fragments together, which is commonly employed in drug-like molecule synthesis. The third positive bigram comprises a Sonogashira coupling reaction followed by deprotection to form an exocyclic triple bond. Trimethylsilyl-based protection prevents the formation of side products from excessive coupling. The subsequent deprotection process provides an opportunity for coupling on the other side of the triple bond. These positive bigrams represent well-established reaction strategies, whereas negative bigrams often contain redundant reactions that are not practical in synthesis applications.
In the first negative bigram, the overall reaction involves the hydrolysis of acyl chloride. However, the negative template bigram uses two steps to complete the entire process: an alcoholysis and a hydrolysis on the ester intermediate. In common practice, this reaction can be simply executed by adding acyl chloride into water. When performing the initial alcoholysis step, the search algorithm is unaware that the molecule will ultimately be converted into a carboxylic acid, leading to a redundant step. The second negative bigram aims to convert the R-substituted nitro compound into a primary amine, which could be achieved by directly using reductants to reduce the nitro group. However, the template bigram incorporates extra reagents, resulting in an unnecessarily extended reaction sequence. Similarly, the third negative bigram, which involves converting fluorobenzene to the more easily accessible aminobenzene, can be accomplished in a single step using the Schiemann reaction. These negative bigrams reveal an inherent limitation in the current retrosynthesis planning approaches, in which consecutive reactions were not considered, consequently resulting in the potential generation of redundant steps. We can potentially build these negative bigrams using various data mining techniques, which can help in early stopping unnecessary searches during route finding.
It is worth mentioning that for approximately 20% of target molecules, the patent-extracted routes are not ranked in the top-1 positions. We selected another example where the Retro-BLEU score of the patent route is lower than some of the generated routes, as shown in the second case in Fig. 4. The patent route synthesizes the target molecule within four steps, primarily modifying the substituent on the benzene ring; however, the starting material remains relatively expensive||. In our comparison, we selected the generated route with the highest Retro-BLEU score (5.42), which surpasses the patent route's score of 4.84. Notably, this route was originally ranked in the 25th position by AiZynthFinder's searching process. The template bigrams in the generated route have all been recorded, and it can be considered an alternative synthesis route by first synthesizing the two aromatic systems and then coupling them together using a Suzuki–Miyaura coupling reaction. The generated route is shorter than the one previously reported in the patent, and the starting materials are also simplified. We also include 5 more randomly selected targets where the top-ranked generated route has a higher Retro-BLEU score than the patent test route in the ESI Section 5,‡ providing a more comprehensive view of the Retro-BLEU suggested routes. This comparison indicates that model-generated routes can also serve as valuable supplements to existing routes if carefully selected based on Retro-BLEU scores. Therefore, we believe that Retro-BLEU can serve as a valuable metric to distinguish plausible routes from a vast number of model-generated routes, ultimately enhancing the efficiency and effectiveness of synthetic route selection.
In conclusion, we introduce Retro-BLEU as a metric to evaluate and rank retrosynthetic routes generated by computer-aided synthesis planning tools. Retro-BLEU offers a statistical approach to assess chemical plausibility by analyzing reaction template sequences, and it significantly outperforms other baselines in selecting experimentally validated patent test routes. This accelerates the utilization of retrosynthesis planning tools and enables researchers to identify feasible routes more efficiently. We encourage further research for evaluating model-generated retrosynthetic routes, which will support synthetic chemistry progress and facilitate the discovery and synthesis of novel molecules, benefiting the broader scientific community.
Footnotes

† This work was done when Junren Li was an intern at Microsoft.

‡ Electronic supplementary information (ESI) available: We provide the reaction/template n-gram overlap analysis under different partition settings, template n-gram overlap analysis under different radii, and the relationship between Retro-BLEU and filtering strategies in the ESI. See DOI: https://doi.org/10.1039/d3dd00219e

§ https://www.emolecules.com/products/building-blocks

¶ Except for the errors during text extraction.

|| The price was $280 per g on https://www.biosynth.com/ accessed in July, 2023.

This journal is © The Royal Society of Chemistry 2024