Samuel Genheden a and Gareth P. Howell *b
a Molecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Pepparedsleden 1, SE-431 83 Mölndal, Sweden
b Chemical Development, Pharmaceutical Technology & Development, Operations, AstraZeneca, Macclesfield SK10 2NA, UK. E-mail: gareth.howell@astrazeneca.com
First published on 21st August 2025
With the aim of providing new tools for the design and assessment of synthetic routes, we describe an approach that mimics human interpretation whilst being highly amenable to machine implementation. The representation of molecular structures as 2D-coordinates derived from molecular similarity and complexity allows individual transformations to be viewed as vectors (reactant to product) where the magnitude and direction of travel can be used to assess and quantify efficiency. Using a dataset comprising 640k literature syntheses and 2.4m reactions taken from six journals between 2000 and 2020, we show that vectors derived in this way follow logical patterns when grouped by reaction type. Similarly, complete synthetic routes can be visualised as sequences of head-to-tail vectors traversing the range between starting material and target, allowing the efficiency with which this range is covered to be quantified. Three applications of the methodology are demonstrated: a comparison of CASP performance between two versions of AiZynthFinder for generating synthetic routes to 100k ChEMBL targets, analysis of predicted routes to a specific target molecule and, finally, a perspective on how the efficiency of published synthetic routes has changed over the last two decades.
To do the same assessment on hundreds or thousands of synthetic routes, our suitably trained chemist quickly becomes the rate-limiting step. If empirical information (e.g., yield or waste) is available, then automation is trivial. If such information is either unavailable or unreliable, for example at the route design stage, the task becomes significantly more challenging due to the sparsity of meaningful or generally accepted metrics.
Step count – either longest linear sequence (LLS) or total – is by far the most common gauge against which synthetic routes are assessed. It is easy to conceptualise, machine-interpretable and a reasonable predictor of the quantitative metrics we are ultimately interested in. If defined and counted consistently, it is a reliable tool for comparing synthetic routes to a specified target – with fewer steps usually being better. Unfortunately, step counting is subject to major inconsistencies within the synthetic-organic community. It is clear that step-counting should stop when the desired target structure is reached, but there is no accepted convention for when to begin. Typically, step-counting begins at the first material (working backwards) that can be purchased, regardless of cost or availability. Alternatively, counting begins at the first material whose synthesis has been reported and deemed “simple”. These approaches are practical, since exhaustively step-counting back to hydrocarbons and biomass feedstock is unrealistic. The result is a high degree of inconsistency, however, with unspecified steps upstream of the starting materials being unaccounted for.
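As an illustration of the two step-count conventions, the following minimal sketch counts total steps and LLS for a small convergent route. The nested-tuple route representation and all compound names are hypothetical, chosen only for this example; a node with no precursors is treated as a purchased starting material, which is exactly the point at which the counting convention becomes arbitrary.

```python
# Hypothetical route representation: (compound_name, [precursor nodes]).
# A node with an empty precursor list is a purchased starting material,
# so the reaction count depends on where we choose to stop expanding.

def total_steps(node):
    """Count every reaction in the route tree."""
    _, precursors = node
    if not precursors:
        return 0
    return 1 + sum(total_steps(p) for p in precursors)


def longest_linear_sequence(node):
    """Count reactions along the longest branch only (LLS)."""
    _, precursors = node
    if not precursors:
        return 0
    return 1 + max(longest_linear_sequence(p) for p in precursors)


# A convergent route: two branches joined in the final step.
route = (
    "target",
    [
        ("intermediate_A", [("starting_material_1", [])]),
        ("intermediate_B", [("intermediate_C", [("starting_material_2", [])])]),
    ],
)

print(total_steps(route))              # 4 reactions in total
print(longest_linear_sequence(route))  # 3 reactions in the longest branch
```

If starting_material_2 were instead expanded back to its own precursors, both counts would rise without any change to the chemistry, which is the inconsistency described above.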
Assessment of a synthetic route based on its constituent reaction types is often informative and can be automated using commercial software such as NameRxn1 or InfoChem.2 Certain reaction types, for example redox manipulations or functional group interconversions (FGI), can be penalised in favour of “constructive” steps where bonds present in the target skeleton are formed. This strategy remains challenging since the classification of reactions is prone to failure, particularly when considering novel or tandem/cascade transformations. Furthermore, the binary assignment of transformations as productive or non-productive is somewhat limiting for the purposes of comparison or ranking.
Other metrics, such as atom economy,3 step economy,4–6 redox economy,7 ideality8 and convergence9 have been reported to assess aspects of efficiency relating to synthetic routes. These concepts are all eminently logical and automatable, provided fully atom-mapped synthetic sequences are available (including reagents, which can be far from accurate);10 none are routinely used, however, when assessing or reporting synthetic routes. In computer-aided synthesis planning (CASP) and in the development of retrosynthesis analysis algorithms, there is no general agreement on how to assess route predictions, especially with respect to quality.11,12
With the above in mind, we have an interest in novel, automatable strategies for the assessment of synthetic routes, primarily at the route design stage, that are less reliant on step count and do not require empirical data (yield, etc.), atom mapping or reaction class assignment.
To quantify similarity, we will use two approaches. Molecular fingerprints, as illustrated in Fig. 1, are widely used in cheminformatics for applications including structure activity relationship (SAR) analysis, virtual library screening and computer-aided synthesis planning (CASP). Amongst the various fingerprint types, Morgan fingerprints21 are popular for fast comparison of small molecules and can be easily generated from the Simplified Molecular Input Line Entry System (SMILES) strings of any pair of molecules using RDKit.22 As a mathematical measure of the similarity between two fingerprints, the Tanimoto coefficient is routinely used and yields values (SFP) between 0 (no similarity) and 1 (identical).23
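The Tanimoto coefficient itself is simple to state. The sketch below computes it for two fingerprints represented as sets of "on" bit indices; the bit indices are invented for illustration, and in practice they would come from a Morgan fingerprint generated with RDKit rather than being written by hand.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical on-bit indices, for illustration only.
fp_target = {3, 17, 42, 128, 511}
fp_intermediate = {3, 17, 99, 128}

s_fp = tanimoto(fp_intermediate, fp_target)
print(s_fp)  # 3 shared bits / 6 distinct bits = 0.5
```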
For comparison, we used an alternative measure of similarity: Maximum Common Edge Subgraph (MCES),24 which can also be generated using SMILES strings.22 This approach, summarised in Fig. 2, compares two molecular structures (graphs) and finds the largest fragment, or MCES, that is present in both. Tanimoto similarity was again used, this time to compare the number of atoms and bonds in the MCES with those in the two comparator molecules. This metric also yields values (SMCES) between 0 (no similarity) and 1 (identical).
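One plausible reading of this Tanimoto comparison is sketched below, where the "size" of a structure is its number of atoms plus bonds. The exact formula is an assumption made for illustration, not one confirmed by the text, and the MCES search itself is not shown: the function takes the sizes as given.

```python
def mces_tanimoto(size_a, size_b, size_common):
    """Tanimoto-style similarity from atom+bond counts.

    size_a, size_b: atoms + bonds in each molecule.
    size_common:    atoms + bonds in their maximum common edge subgraph.
    Assumed form: common / (a + b - common).
    """
    if size_common < 0 or size_common > min(size_a, size_b):
        raise ValueError("common subgraph cannot exceed either molecule")
    denominator = size_a + size_b - size_common
    if denominator == 0:
        raise ValueError("both structures are empty")
    return size_common / denominator


print(mces_tanimoto(20, 20, 20))  # identical structures -> 1.0
print(mces_tanimoto(10, 30, 10))  # small fragment fully contained in a larger molecule
print(mces_tanimoto(10, 10, 0))   # nothing in common -> 0.0
```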
For a given synthetic route, similarity measures (SFP, SMCES) between the target and all preceding intermediates can be generated, as shown in Fig. 3, for the synthesis of cell division cycle 25B (CDC25B) phosphatase inhibitor 6.25 The starting fingerprint similarity value (SFP 0.35) for naphthalene 1 is lower than the corresponding MCES similarity (SMCES 0.50). In step (a) (Boc-protection), there is a drop in both similarity metrics (ΔSFP, ΔSMCES = −0.07). This is logical since six heavy atoms (COC(CH3)3) have been introduced in an arrangement that does not feature in the target structure 6. In step (b) (C–N coupling), aniline 3 is added to give compound 4, which has a significant positive impact on target similarity (ΔSFP = +0.12, ΔSMCES = +0.28). In the following Boc-deprotection step (c), positive changes are again observed (ΔSFP = +0.25, ΔSMCES = +0.21) since the previously added COC(CH3)3 fragment, which does not feature in target 6, is now being removed. The final transformation (d, ester hydrolysis/deprotection) shows a much lower change in MCES similarity (ΔSMCES = +0.08) compared to fingerprint similarity (ΔSFP = +0.36).
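The per-step changes quoted above are simple differences between consecutive similarity values. In the sketch below, the SMCES trajectory is reconstructed from the starting value (0.50) and the quoted ΔSMCES values, so the intermediate numbers are back-calculated for illustration rather than read from Fig. 3.

```python
# S_MCES to the target for compounds along the route to 6.
# Reconstructed from the starting value and the quoted deltas (an assumption
# of this sketch); the final entry is the target compared with itself.
s_mces = [0.50, 0.43, 0.71, 0.92, 1.00]
step_labels = ["a", "b", "c", "d"]


def similarity_changes(values):
    """Change in target similarity for each consecutive pair of compounds."""
    return [round(after - before, 2) for before, after in zip(values, values[1:])]


deltas = dict(zip(step_labels, similarity_changes(s_mces)))
print(deltas)  # {'a': -0.07, 'b': 0.28, 'c': 0.21, 'd': 0.08}

# Steps that move away from the target on similarity alone.
moving_away = [step for step, change in deltas.items() if change < 0]
print(moving_away)  # ['a']
```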
As synthetic chemists, we would identify only one of the four transformations as being “productive” (step b); the other three protecting group manipulations would be considered non-ideal.8 The two similarity metrics both yield negative values (−ΔS) for the first step (a) and we would interpret this as non-productive since structurally, we are moving further away from our desired target. By this interpretation, however, both similarity metrics suggest steps (c and d), two deprotection operations, to be productive (+ΔS). In the case of SFP, both deprotections have a larger, positive ΔS value than the C–N coupling reaction (b).
We can obtain a more logical interpretation by adding a second descriptor to describe the magnitude of structural change taking place in each transformation. The changes in molecular weight for each step along the route might suffice here but, for our purposes of route assessment, we are interested in more than mass variation. Ideally, we are aiming to generate some measure of route efficiency related to cost, waste, time, etc. and, since this information is seldom available directly, we will use a molecular complexity metric as a surrogate. There is an important assumption to recognise here: we are assuming that the “complexity” of a molecule reflects how difficult it is to obtain or synthesise, and therefore the implicit cost, time and waste. For the most part this seems reasonable in that “complex” molecules, where there is a variety of atom types, bond orders and ring systems, are generally more challenging to obtain than “less complex” molecules. We must be mindful, however, that this assumption does not hold for complex molecules that are readily available (e.g., steroids, carbohydrates).
In the same way that we have used similarity (SFP and SMCES) to measure progress along a synthetic route, we will use CM* to gauge the magnitude of structural change and show that similarity and complexity can operate in opposing directions, providing complementary information. After cleaning, the observed dataset values for similarity (SFP and SMCES) necessarily lie in the range 0 to 1 whilst the observed values for CM* vary between 3.5 and 12.0. To ensure equal weight is given to all metrics, CM* values were normalised (nC) to the range 0–1.‡
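A minimal sketch of this normalisation follows, assuming a linear (min-max) rescaling between the observed bounds of 3.5 and 12.0. The linear form is an assumption of this example; values outside the bounds are clipped to 0 or 1, in line with the suggestion in footnote ‡ for other datasets.

```python
CM_MIN = 3.5   # lowest CM* observed in the cleaned dataset
CM_MAX = 12.0  # highest CM* observed in the cleaned dataset


def normalise_complexity(cm_star):
    """Rescale CM* to nC in [0, 1]; linear scaling is assumed here."""
    scaled = (cm_star - CM_MIN) / (CM_MAX - CM_MIN)
    # Clip out-of-range values, as suggested for application to other datasets.
    return min(1.0, max(0.0, scaled))


print(normalise_complexity(3.5))   # 0.0
print(normalise_complexity(7.75))  # 0.5
print(normalise_complexity(12.0))  # 1.0
print(normalise_complexity(15.0))  # clipped to 1.0
```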
Still considering the synthesis of 6 (Fig. 3), two route plots can be generated using similarity (SFP and SMCES for comparison) and complexity as Cartesian co-ordinates, as shown in Fig. 4. Each transformation is now represented as a vector with magnitude and direction. There are various observations to be made here:
Fig. 4 Vector plots derived from similarity and complexity for the synthesis of phosphatase inhibitor 6 (Fig. 3).
• The starting material must necessarily be to the left of the route target, since it must be less similar to the target than the target itself.
• The starting material is of lower complexity than the target (routes that do not satisfy this criterion were removed) and so the overall direction of travel from starting material to target will be in the direction +S (right) and +C (up).
• Transformations that are “productive” should therefore travel in the direction +S (right) and +C (up).
This 2D-coordinate system gives more information about the individual transformations than similarity alone. Step (a) (Boc-protection) is non-productive since complexity is added without an increase in target similarity, giving a vector that moves up and left. Step (b) (C–N coupling) is productive since it increases complexity and similarity toward that of the target (the vectors move up and right). The final deprotection operations (c and d) increase target similarity, as demonstrated in Fig. 3, but are associated with a decrease in complexity. The vectors move down and right which signifies “wasted” complexity (molecular structure that was not needed) and can be viewed as non-productive. Compared to similarity alone, the use of similarity and complexity gives a more logical assessment of the individual transformations in the route.
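The directional reading described above can be sketched as a simple sign test on each vector. In the example below, the ΔS values are the ΔSMCES values quoted for the route to 6, whereas the ΔnC values are hypothetical numbers chosen only to reproduce the directions described in the text. The "down and left" label and the handling of axis-aligned moves are also assumptions of this sketch.

```python
def classify_step(delta_s, delta_c, tol=1e-9):
    """Label a transformation vector by its direction on (similarity, complexity) axes."""
    if abs(delta_s) <= tol and abs(delta_c) <= tol:
        return "no change"
    right = delta_s > 0  # moving towards the target in similarity
    up = delta_c > 0     # adding complexity
    if right and up:
        return "productive (up and right)"
    if right:
        return "wasted complexity (down and right)"
    if up:
        return "non-productive (up and left)"
    # Not discussed in the text; labelled here by assumption.
    return "moving away (down and left)"


# (delta S_MCES from the text, hypothetical delta nC) for steps a-d.
steps = {
    "a": (-0.07, +0.05),  # Boc-protection
    "b": (+0.28, +0.10),  # C-N coupling
    "c": (+0.21, -0.05),  # Boc-deprotection
    "d": (+0.08, -0.02),  # ester hydrolysis
}

for label, (ds, dc) in steps.items():
    print(label, classify_step(ds, dc))
```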
Analysis of dataset reactions is shown in Fig. 6. The distributions of values for ΔSFP, ΔSMCES, ΔnC and ηT (derived from SMCES values) are grouped by ten main reaction super-classes,37 which do indeed display characteristic patterns.
• Carbon–carbon bond forming reactions are predominantly associated with +ΔS and +ΔC changes, which indicates a productive reaction. The efficiency ηT values are the highest of all ten reaction super-classes.
• Heteroatom alkylation/arylation, heteroatom acylation and heterocycle formation show the same +ΔS and +ΔC mean values although the efficiency ηT values are slightly lower.
• Deprotection reactions yield −ΔnC values in conjunction with +ΔS values. The mean efficiency value ηT is positive but significantly less than for the four super-classes mentioned above.
• Redox, functional group interconversion (FGI) and addition (FGA) reactions show very low (close to zero) mean values for ΔS, ΔC and ηT.
• Protection reactions are unique (and in keeping with our earlier observations) in that they are associated with −ΔS and +ΔC values. This is the only reaction super-class with a negative mean ηT value.
Crucially, this definition of transformation efficiency ηT is not binary, in that reaction super-classes are not wholly designated as productive or non-productive. An acylation reaction, for example, can have low or negative ηT if the fragment being introduced represents a small proportion of the target structure (e.g., a methyl ester) or is not present at all in the target structure (e.g., a methyl ester that is functioning as a protecting group).
The distribution of ΔSMCES values is generally narrower than that of ΔSFP, particularly so for redox, FGI and FGA reactions. This is due to subtleties in the way that the fingerprint and MCES similarity algorithms operate. The fingerprint similarity approach is more sensitive to changes in bond order and can be significantly impacted by relatively small, single atom changes.38 This can be illustrated by examining how mean ΔS values change depending on proximity (in terms of step count) to the route target, as shown in Fig. 7. The fingerprint-based metric SFP is much more sensitive to the final changes taking place in the synthetic route (i.e., in the final step) than the alternative SMCES metric. It is possible to tune the way atom and bond types are distinguished in the fingerprint algorithm (or use a different fingerprint type), which would likely alleviate this problem, but we would suggest the SMCES metric to be more suitable and will use it from here on.
Fig. 7 Comparison of similarity metrics vs. step location for 553k reactions. Whiskers show 5th and 95th percentiles, boxes show 25th, 50th and 75th percentiles.
Further to this, if we accept that (i) vmin represents the shortest possible route (a single transformation) from starting material to route target and (ii) productive transformations should move towards the route target when plotted on similarity, complexity axes, then we might assess the efficiency of a route using simple path comparison, as shown in Fig. 8.
Fig. 8 Route efficiency ηR demonstrated with the synthesis of pyridol 12. Tabulated values show transformation efficiency ηT and magnitude ‖v‖ for each step.
Pyridol 12 is synthesised in four steps from 7.39 The synthetic range of the route ‖vmin‖ is relatively large since the starting material is of lower complexity than the route target and bears limited similarity to it. The actual path of the route (a–d) is very close to the “ideal” path vmin (note the y-axis scale) and, as shown, a simple ratio of the magnitudes (lengths) of the ideal to actual path gives a measure of the efficiency ηR, where values can range from 0 (low efficiency) to 1 (perfect efficiency). This measure is easy to conceptualise and carries no inherent need to examine or chemically interpret the actual transformations taking place. Routes that feature significant detours (e.g., protecting group manipulations, auxiliaries, complex leaving groups) are immediately obvious both mathematically and visually, as demonstrated in Fig. 9. To O-methylate 13 chemo- and regioselectively,40 global protection and deprotection were necessary, eventually yielding O-GlcNAcase inhibitor 18. The synthetic range of the route is very small (‖vmin‖ = 0.06) since little overall structural change is taking place whilst the actual path length (∑‖v‖) is very large due to the significant “detours”; the result is a low efficiency value (ηR = 0.05).
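The ratio described above can be sketched in a few lines, with each compound along the route given as a (similarity, complexity) point. The coordinates below are hypothetical and chosen only to contrast a direct route with one containing a detour.

```python
import math


def route_efficiency(points):
    """Ratio of the ideal path length (start to target) to the actual path length.

    points: (similarity, complexity) coordinates from starting material to target.
    """
    v_min = math.dist(points[0], points[-1])
    path_length = sum(math.dist(p, q) for p, q in zip(points, points[1:]))
    if path_length == 0:
        raise ValueError("route has zero path length")
    return v_min / path_length


# Hypothetical coordinates: every step travels straight towards the target.
direct_route = [(0.0, 0.0), (0.3, 0.4), (0.6, 0.8)]

# Hypothetical coordinates: a large excursion in complexity that is later removed.
detour_route = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]

print(round(route_efficiency(direct_route), 2))  # 1.0
print(round(route_efficiency(detour_route), 2))  # 0.33
```

Because only path lengths enter the ratio, the function needs no knowledge of the reaction types involved.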
This definition of route efficiency is wholly agnostic towards step count: ηR is not influenced by the number of steps, only the direction in which they travel. Whilst this partially satisfies our original aim, it introduces a significant shortcoming: transformations with very small or zero magnitude (e.g., redox, FGI, FGA) or those which offer small, incremental progress in a direction similar to vmin are not penalised. Whilst we are trying to develop a theoretical analysis that is not governed by step count, we cannot ignore the fact that fewer steps is generally better. As shown in Fig. 10, it is possible for the real path of a route to be identical to vmin but also comprise low-efficiency transformations. Conversion of alcohol 19 to alkene 23 is achieved with very high stereoselectivity.41 In terms of introducing the heavy atoms present in the target skeleton, the entirety of the work is achieved in step (b) where iodide 20 is reacted with 1-trimethylsilylpropyne. Regarding similarity (SMCES) and complexity, steps a (FGI), c (alkyne migration) and d (hydrogenation) effect no change and display zero efficiency ηT values. The resulting route efficiency ηR is perfect, however, since our analysis ignores zero-magnitude vectors and effectively treats this as a single step transformation (19–23).
Fig. 10 Penalised route efficiency ηPR demonstrated by the synthesis of 23. Tabulated values show transformation efficiency ηT, magnitude ‖v‖ and penalised magnitude ‖vp‖ for each step.
This shortcoming can be remedied by using a penalised function for route efficiency (ηPR) where the minimum path length for any single step is set to 0.1. As shown in Fig. 10, this significantly reduces the efficiency for the synthesis of 23 (ηR = 1, ηPR = 0.51) but would have much less impact on the route to 12 (Fig. 8, ηR = 0.94, ηPR = 0.84). The minimum value of 0.1 (a somewhat arbitrary value chosen by inspection of the data in Fig. 6) can, of course, be tuned to vary the extent to which low efficiency transformations are penalised. This penalised efficiency metric might be considered as an amalgam of atom economy3 and ideality8 since wasted molecular structure and inefficient transformations are disfavoured.
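A minimal sketch of the penalised function is given below, assuming the penalty is applied as a floor of 0.1 on each step's path length whilst ‖vmin‖ is left unchanged. That reading is an interpretation of the description above, and the coordinates are hypothetical: one step covers the whole synthetic range and three steps have zero magnitude.

```python
import math

MIN_STEP_LENGTH = 0.1  # tunable floor on the path length of any single step


def penalised_route_efficiency(points, min_step=MIN_STEP_LENGTH):
    """||v_min|| divided by the sum of step lengths, each floored at min_step."""
    v_min = math.dist(points[0], points[-1])
    penalised_path = sum(
        max(math.dist(p, q), min_step) for p, q in zip(points, points[1:])
    )
    return v_min / penalised_path


# Hypothetical (similarity, complexity) coordinates for a four-step route in
# which only the second step changes either coordinate.
route = [(0.5, 0.2), (0.5, 0.2), (0.8, 0.2), (0.8, 0.2), (0.8, 0.2)]

print(round(penalised_route_efficiency(route), 2))  # 0.3 / (0.1 + 0.3 + 0.1 + 0.1) = 0.5
```

Raising or lowering min_step changes how heavily zero-magnitude steps are penalised, which corresponds to the tuning mentioned above.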
Regardless of this inconsistency, Fig. 11 (left) shows that for all route lengths, convergent routes show higher penalised route efficiency values ηPR than linear routes and (Fig. 11, right) coupling reactions occurring at branch points are significantly more efficient (ηT) than those occurring at non-branch points. This is logical since the branch point in a convergent route, where two significant skeletal fragments come together, is likely to feature a large increase in similarity and complexity. The same efficiency analysis we have described for the LLS could also be carried out for any parallel synthetic branch, between its starting material and the branch point, to give a second efficiency measure for the route. As a final, expected observation, Fig. 11 shows a steady decrease in route efficiency with route length.
As shown in Fig. 12 (top), if we consider only the best routes (one per target) by step count, there is negligible difference in the output of the two software versions. The output from the two software versions becomes distinguishable, however, using both synthetic range ‖vmin‖ (Fig. 12, middle) and penalised route efficiency ηPR (Fig. 12, bottom). The later version predicts routes that are, on average, of wider synthetic range. Since the collection of route targets is the same for both software versions, we can deduce that the route starting materials identified by version v4 are generally simpler and less similar to the route targets compared to version v1. The later version also predicts routes with slightly higher penalised route efficiency, which indicates a reduced reliance on non-productive transformations (redox, FGI, FGA, etc.) compared to version v1.
We can also use this methodology to assess and rank CASP routes, using androgen receptor antagonist 26 from the ChEMBL dataset as an example.46 Construction of the central thiohydantoin unit is the main challenge and four predicted routes are shown in Fig. 13. Route A is the joint shortest (two steps) and the strategy here is to purchase the thiohydantoin core. The starting material is therefore complex and of very high similarity to the route target, meaning the synthetic range ‖vmin‖ is very small (0.04, notice the difference in scale for the vector plot of route A). The two proposed transformations are both of low efficiency (oxidation and FGI) resulting in low route efficiency (ηR = 0.36) and even lower penalised route efficiency (ηPR = 0.19).
Route B is also two steps but, in comparison to route A, covers a much wider synthetic range since the starting material 27 is much less complex and less similar to 26. Both transformations are of high efficiency and the resulting route efficiency is almost perfect (ηR = ηPR = 0.99). We might speculate on the viability of the thiohydantoin-forming step (29 + 30) or whether the primary amide might interfere but, if successful, route B would be a highly efficient synthesis.
The remaining two routes are both longer (4 steps). Route C has wider synthetic range ‖vmin‖ but comprises two protecting group manipulations resulting in a lower route efficiency (ηPR = 0.77). Route D has narrower synthetic range but, with three construction reactions and one FGI, has a higher route efficiency (ηPR = 0.88).
We propose that the overall analysis is in keeping with human assessment: route B would be the most efficient (if viable), routes C and D are less efficient (but perhaps more viable) and route A is unlikely to be of use (the cost and availability of starting material 24 is unlikely to be significantly different to that of target 26).
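The ranking implied by this analysis can be reproduced directly from the step counts and ηPR values quoted above for routes A to D. The dictionary layout is simply a convenient structure for this sketch.

```python
# Step counts and penalised route efficiencies quoted in the text for routes A-D.
routes = {
    "A": {"steps": 2, "eta_pr": 0.19},
    "B": {"steps": 2, "eta_pr": 0.99},
    "C": {"steps": 4, "eta_pr": 0.77},
    "D": {"steps": 4, "eta_pr": 0.88},
}

# Rank by penalised route efficiency, highest first.
ranked_by_efficiency = sorted(routes, key=lambda r: routes[r]["eta_pr"], reverse=True)
print(ranked_by_efficiency)  # ['B', 'D', 'C', 'A']

# Step count alone cannot separate A from B, nor C from D.
shortest = min(v["steps"] for v in routes.values())
joint_shortest = sorted(r for r, v in routes.items() if v["steps"] == shortest)
print(joint_shortest)  # ['A', 'B']
```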
To demonstrate the use of this methodology, we have shown how (i) large sets of CASP output from differing sources can be rapidly compared and contrasted and (ii) the automated assessment of individual CASP-derived routes to a given synthetic target can be achieved in a similar way to human interpretation.
Our mathematical approach to route analysis is highly amenable to further analysis. The impact of ordering in the sequence of transformations along a synthetic route ought to be of interest. We would instinctively suggest that similarity and complexity values should continually increase from starting material to target such that wasted complexity (due to yield losses) is minimised. Similarly, we might expect low-efficiency transformations to be better situated at the start of a route and high-efficiency transformations towards the end; these properties could be assessed using correlation metrics (e.g., Pearson,47 Spearman48).
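As one way such an ordering analysis might look, the sketch below computes a Spearman coefficient between step position and transformation efficiency ηT in plain Python. The ηT values are hypothetical, and the choice of Spearman over other correlation measures is an assumption of this example.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    spread_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (spread_x * spread_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


step_position = [1, 2, 3, 4]
eta_t_increasing = [0.1, 0.4, 0.6, 0.9]  # hypothetical: efficiency rises along the route
eta_t_decreasing = [0.9, 0.6, 0.4, 0.1]  # hypothetical: efficiency falls along the route

print(round(spearman(step_position, eta_t_increasing), 2))  # 1.0
print(round(spearman(step_position, eta_t_decreasing), 2))  # -1.0
```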
It should be stressed that the methodology described here, derived only from chemical structures and route topography, is wholly theoretical and will always be inferior to real, empirical data such as cost, time and waste. Obtaining reliable empirical data for known transformations is problematic, however, and in the case of unknown or theoretical transformations, Hendrickson's observation made in 1976 (ref. 9) (“when planning an organic synthesis it is presently impossible to predict the yields of individual reactions, or indeed even whether they will succeed or fail”) remains pertinent today. Thus, we believe this methodology will be useful wherever assessment of synthetic transformations and routes is required.
CASP | Computer-aided synthesis planning
ECFP4 | Extended-connectivity fingerprint (diameter 4)
FGA | Functional group addition
FGI | Functional group interconversion
LLS | Longest linear sequence
MCES | Maximum common edge subgraph
PMI | Process mass intensity
SAR | Structure activity relationship
SMILES | Simplified molecular-input line-entry system
Also provided are the Python scripts and an accompanying Jupyter notebook that allow the user to search, retrieve data and generate vector plots for individual routes within the patent dataset.
• Vectors_scripts.py: Python file containing all the necessary code (except the file below) to convert the patents datafile into a useable dataframe with the necessary metrics.
• proudfoot_complexity.py: Python file containing the necessary code for generating CM* complexity metrics.
• Vectors_Notebook.ipynb: Jupyter notebook containing instructions on how to process data, search, retrieve route information and generate vector plots.
Complementary analyses using alternative complexity metrics and the code files mentioned above are available. See DOI: https://doi.org/10.1039/d5sc06089c.
Footnotes
† A list of protecting groups with structures is provided in the SI.
‡ For application with other datasets, we would suggest using nC values of 0 and 1 respectively for compounds where CM* is <3.5 or >12.0.
§ Version v1 corresponds to the first public version of AiZynthFinder with models trained on USPTO data in 2019, whereas version v4 corresponds to the latest major release of AiZynthFinder with models trained on USPTO data in 2022.
This journal is © The Royal Society of Chemistry 2025