Hanyu
Gao
a,
Connor W.
Coley
a,
Thomas J.
Struble
a,
Linyan
Li
b,
Yujie
Qian
c,
William H.
Green
a and
Klavs F.
Jensen
*a
aDepartment of Chemical Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA. E-mail: kfjensen@mit.edu
bHarvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA 02115, USA
cMIT Computer Science & Artificial Intelligence Laboratory, 32 Vassar St, Cambridge, MA 02139, USA
First published on 2nd January 2020
The access to essential medicines remains a problem in many low-income countries for logistic and expiration limits, among other factors. Enabling flexible replenishment and easier supply chain management by on demand manufacturing from stored starting materials provides a solution to this challenge. Recent developments in computer-aided chemical synthesis planning have benefited from machine learning in different aspects. In this manuscript, we use those techniques to perform a combined analysis of a WHO essential medicines list to identify synthetic routes that minimize chemical inventory that would be required to synthesize the all the active pharmaceutical ingredients. We use a synthesis planning tool to perform retrosynthetic analyses for 99 targets and solve a mixed-integer programming problem to select a combination of pathways that uses the minimal number of chemicals. This work demonstrates the technical feasibility of reducing storage of active pharmaceutical ingredients to a minimal inventory of starting materials.
The expiration of stockpiles of pharmaceuticals is one significant contributor to drug shortages,6 in particular in low-income countries, where drug donations can include drugs that are either expired or close to expiration.7 Another reason for expired stockpiles is poor forecasts of future demands8 or stockpiling for emergencies that fortunately do not occur. Drug expiration often results from the final formulated drug having a much shorter storage life than the chemical starting materials. Thus, it is of potential interest to identify the minimum number of starting materials needed to produce a desired set of medicines, and store the starting materials instead of the final drug products. Recent advances in computer-aided synthesis enable fast planning of large numbers of possible pathways to individual targets.9–14 Tools have also been developed to evaluate proposed pathways and identify which pathway suggestions are likely to be feasible.15–23 This progress makes it possible to combine synthesis planning for multiple targets. Molga et al. performed retrosynthetic analysis for multiple targets on the same graph and penalized new reaction types that appeared during the search, to promote the use of common intermediates.36 In this application, we explicitly consider the criteria of maximizing the overlap in the chemicals needed to realize a set of molecules. An approach of different syntheses sharing chemicals would have the advantage of allowing more flexible replenishment since starting materials tend to have more sources and be easier to obtain than the final drug products. Additionally, minimizing the total number of chemicals also promotes sharing common reaction conditions such as solvents and catalysts. It then becomes more likely to have reactions that share the same or similar conditions and thus the possibility of using common reactors, which facilitates modular process development. On the other hand, minimizing the number of chemicals may lead to pathways that are suboptimal for each molecular target individually. It increases the risk of choosing more practically challenging reactions or longer synthetic pathways, which are both undesirable.
We develop a framework to discover and select an optimal combination of pathways for molecules in the WHO essential medicines list accessible by chemical synthesis (more specific criteria in the Methods section). A machine learning based retrosynthesis platform24 is used to find candidate synthetic routes for all the targets, and then a multi-objective mixed-integer programming problem is solved25 to minimize the number of chemicals used in all the syntheses (including those used as solvents, reagents and catalysts), while maximizing the perceived chemical feasibility of the pathways. It is worth noting that this is a hypothetical study focusing on the conceptual feasibility of the problem. Building such a system for producing the WHO EML requires many other considerations, including the quantity and demand for different drugs, the expertise for conducting chemical synthesis and developing chemical processes, and cost associated with separation and purification of the product.
1) Molecular weight not greater than 500;
2) No fewer than 10 carbon atoms;
3) No greater than 5 hydrogen donors;
4) No greater than 10 hydrogen acceptors;
5) Estimated octanol–water partition coefficient, logP ≤ 5.
6) No more than three chiral centers.
The final list of 127 targets used in our analysis is provided in the Table S1,† along with the full list of the 420 molecules and information about what filters caused the exclusion of particular targets.
When using the in-scope filter in the search, the threshold for filtering out a reaction suggestion was when its score is below 0.75. In practice, this value helped achieve a good balance between filtering out false negative reactions and including false positive reactions, leading to a good number of reasonable pathways from which to choose.
For the stopping criteria (or how to determine whether a chemical is a “starting material”), we defined a molecule that has no more than 10 carbon atoms, 3 nitrogen atoms and 5 oxygen atoms as a terminal node in the tree search. Different criteria could be chosen for the retrosynthetic analysis. Specifically, we did not use the price of the chemicals (which seems intuitive) for a few reasons: a) many of these essential medicines, including the intermediates in the syntheses, are likely to be identified as commercially available at a cheap price, and b) there lacks a complete catalog that has accurate price estimates for all commonly used chemicals. In the meantime, these small molecules as defined here are likely to be obtained in a sufficiently affordable manner.
![]() | (1) |
![]() | (2) |
![]() | (3) |
![]() | (4) |
xij, yk ∈ {0, 1} | (5) |
As a common strategy to solve a multi-objective optimization problem, we used a weighting parameter, λ, to balance between the two objectives. The first constraint also needed reformulation to a pure mathematical form. The final mixed-integer multi-objective optimization is presented as follows:
![]() | (6) |
![]() | (7) |
y ≥ 1 − M(1 − dk) | (8) |
![]() | (9) |
xij, yk, dk ∈ {0, 1} | (10) |
The optimization problem is solved using the CPLEX solver with the Python API.25 Note that the optimization problem is linear, so there is no need to supply an initial guess and it is guaranteed that the solution found is global optimal regardless of the starting point.
We changed the value of λ to test the tradeoff between the two objectives. The value of λ varied from 0 to 1. They represented two extremes. With λ being one, solving the optimization problem was essentially trying to find the minimal number of chemicals required without considering the feasibility of the pathways. λ being zero indicated that we were trying to find pathways for individual targets that had the maximum score without considerations of the overlap of starting materials.
As described in the Methods section, we used machine learning models to predict the top-five most plausible sets of conditions23 and to assess the probability that the desired product is the major outcome given the reactants and each of the five sets of reaction conditions.19 The condition set that yields the highest reaction evaluation score was retained as the final condition for each reaction. The pathways with more than one low-probability (<0.05) step are filtered out, and targets that include more than one molecules are also excluded at this stage, resulting in 193597 pathways for 99 targets. These remaining pathways involved 3340 chemicals used as starting materials and 570 chemicals used in the reaction conditions (i.e., as solvents, catalysts or reagents).
When comparing the chemicals used in the above two scenarios, it was found that 106 chemicals were present in both scenarios. 130 chemicals were used only in the separate analysis and 81 chemicals were used only in the combined analysis. As exemplified in Fig. 1, a starting material was shared by different targets in the combined analysis but was used for fewer targets in the separate analysis. In the combined analysis, 1-bromo-4-chlorobenzene was shared by three different targets (pyrimethamine, halopericol and chloropromazine), whereas in the separate analysis only halopericol used it as a starting material (Fig. 1A). Di-tert-butyl dicarbonate was shared by three different targets as a protecting agent in the combined analysis, while only one target used it in the separate analysis (Fig. 1B). Dimethylamine was another example that was used by more syntheses in the combined analysis than in the separate analysis (4 vs. 2, Fig. 1C). The full list pathways for all the targets in the combined and separate analyses are in the ESI† (Fig. S2 and S3). We manually examined the first fifty targets in both analyses, and pointed out potential problematic reactions. Common issues include missing reaction conditions (e.g., solvents and bases), or ambiguous selectivity. These topics are further discussed later in the Limitations section.
The entire list of chemicals along with the number of times each chemical was used by all the syntheses are provided in ESI† (Fig. S4, S5 and S6). The chemicals were generally shared by more syntheses in the combined analysis, with the most frequent chemical (methanol) used for 10 different targets. We also found that, by visual inspection, in the separate analysis, chemicals that are small variations of each other are used in different synthesis, which indicates some redundancy, and in the combined analysis, we found fewer similarly-structured chemicals.
To further examine how the reduction in the number of chemicals was achieved, we counted the number of chemicals used in each synthesis for both the combined analysis and separate analysis. Then when we calculated the average number of chemicals used in each synthesis. Interestingly enough, the combined analysis and separate analysis were very close (8.28 vs. 8.41). This indicated that the combined analysis was not mainly selecting syntheses that used fewer chemicals, but identifying pathways that had overlap in the chemicals they used.
For some pairs of targets, we found that their syntheses shared more common chemicals in the combined analysis. For example, the combined analysis recommended performing the syntheses of both propyliodone and probenecid in one step (Fig. 2). Whereas the targets were structurally different, the analysis find commonalities in their syntheses: both used 1-bromopropane to introduce the propyl group and both reactions happened in dimethylformamide (DMF). The recommended pathways in the separate analysis syntheses did not share any common chemicals. The scores given in the combined analysis, although slightly lower than the separate analysis, still indicated that the reactions were likely to occur.
For probenecid, the combined analysis proposed alkylation of a sulfonamide in the presence of a carboxylic acid using sodium hydroxide. However, multiple evidence in the literature has shown reactivity in the reverse order – the carboxylic acid was more reactive than the sulfonamide.29,30 To ameliorate the side reactivity, either the acid was protected,31 or it was alkylated and later hydrolyzed back to the acid (in the reaction workup).32 It is worth noting here that the model-suggested reactions are not necessarily single step operations. There might be some human bias in the practice of recording reactions in the database – some workup steps might be omitted and only the desired overall transformation is recorded. For this reaction (and likely many other similar reactions), alkylation of the sulfonamide is the desired outcome over alkylation of the carboxylic acid, but due to cross-reactivity in basic conditions, a workup is sometimes used to hydrolyze the resulting ester back to the carboxylic acid. However, the recorded reaction (in most datasets) does not typically indicate a separate workup step and the carboxylic acid would appear unchanged. This example illustrates that the previously developed data-driven models used to make condition and reactivity predictions (before solving the optimization problem) can have bias based on reaction datasets.
Another example target pair shown here is neostigmine and tetracaine (Fig. 3). In the combined analysis, the synthesis of neostigmine and the second step for tetracaine synthesis uses a common substrate (dimethylamine) and the same solvent (benzene). While in the separate analysis, for both targets pathways with higher perceived likelihood were chosen, but at the expense of not having any overlapping starting materials or reaction conditions.
It was, however, not universal to have improved pairwise similarity for all target pairs. Fig. S7† visualizes the pairwise similarity between the syntheses of different targets using a heatmap. The similarity was calculated as the ratio of the number of overlapping chemicals to the total number of unique chemical used in the two syntheses. Mathematically this is referred to as the Tanimoto similarity, and it is a common metric to calculate similarity for molecular fingerprints.33 We calculated this value for each target pair to create the heatmap. It was not clear that the combined analysis had improved pairwise similarity over the separate analysis, and when we calculated the difference, on average the pairwise similarity was only 0.0006 higher in the combined analysis than the separate analysis, which was not a significant margin.
It turned out that for many chemicals, they did not specifically overlap with another single target, but shared chemicals with several other targets. Although some targets needed more starting materials in the combined analysis than the separate analysis, many of the starting materials were shared with other targets, so that they would not add additional requirements to the overall chemical inventory. Fig. 4 and 5 show the entire pathways for synthesizing procainamide and ephedrine, together with how the starting materials can be shared with other targets (only one example target is shown for a starting material; there can be other targets that also share the starting material). In the combined analysis, both syntheses were longer and used more chemicals than in the separate analysis, but more of the starting materials were shared with some other targets (procainamide 5/5 and ephedrine 2/3).
We also explored the tradeoff between the two objectives, minimizing the number of chemicals and minimizing the penalty of the pathways, by adjusting the weighting parameter, λ, between the two goals (eqn. (6)). The results are shown as Pareto frontiers in Fig. 6, with λ taking values from an increasing sequence from right to left: 0, 0.01, 0.1, 0.25, 0.33, 0.50, 1. The points on the Pareto curve represent the states from which one objective cannot be improved without worsening another objective. As shown in Fig. 6, reducing the total number of chemicals leads to decreased average scores of the pathways. Emphasizing the minimum number of chemicals (λ = 1) without regard to pathway feasibility led to 175 starting materials and an average pathway score of 0.230. At the other extreme (λ = 0), when pathways with the highest scores were chosen for every individual target, the total number of chemicals needed was 236, more than 30% higher than the other extreme, and the average pathway score was 0.663. The sharp change in the Pareto frontiers on the right hand side of Fig. 6 revealed that the number of chemicals decreased significantly with only a small decrease in the average pathway score. When λ = 0.1, the number of starting materials was 199 with an average pathway score of 0.648. Decreasing λ to 0.5 reduced the number of starting materials to 187 and the average pathway score to 0.566. The sharp curvature at the right end of the figure shows that reductions of the number of chemicals can be achieved without significantly sacrificing the predicted quality of the pathways.
In Fig. 7, we show several representative examples with different types of problems. Fig. 7a and b can be classified as false negative predictions by the reaction evaluation model, as literature precedence of the exact or very similar reactions can be found.34,35 These reactions usually involve many bond changes or even multiple reaction steps (usually recorded in the database as a single step), which can be out of scope for the reaction evaluation model. If we can expand the scope of the reaction evaluation model, or break down the retrosynthetic templates to reflect the multiple reaction steps separately, these reactions should be predicted viable. Fig. 7c–e are examples where the primary issue was reaction conditions – an atom source is missing so the reaction is chemically nonsensical. Fig. 7c only predicted a solvent and base as reagent, and most importantly there is no methylating agent. Fig. 7d does not provide a carbon source for the epoxidation. Fig. 7e missed water which is needed for a workup step and also supply the oxygen. These reactions cannot be possible as they are presented, but if the condition recommendation model can be improved to propose the necessary conditions, these reactions could become viable suggestions. For example, if methyl iodide is supplied as an additional reagent, the score of the reaction in Fig. 7d increases to 0.41, which can make it favorable than the current scheme presented. Fig. 7f and g represent examples where there were fundamental reactivity concerns. These are the true negative reactions that we would like to be eliminated from the optimization. With further improvement of the underlying models used before the optimization, using the reaction score as a filter will help us more exclusively filter out reactions that have fundamental reactivity concerns.
![]() | ||
Fig. 7 Examples of low scoring reactions. a) and b) Represent cases where the reaction evaluation model was not able to recognize feasible reactions.34,35 c)–e) Represent cases where a necessary reagent was missing that was supposed to supply atoms to the product. f) and g) Represent cases where reactivity was problematic. |
We highlighted all low probability steps in Fig. S2 and S3† to make the pathways more clear to readers. As we explained, these are not straightforward one-step operations, and their inclusion can be attributed to limitations of the retrosynthetic model, condition recommendation model or reaction evaluation model.
In addition, in order to present a more realistic collection of pathways, we performed an analysis in which we filtered out all pathways that have low-probability reactions. This results in 54204 pathways for 78 targets. We ran the same combined (λ = 0.5) and separate analysis (λ = 0) and the total number of chemicals for these two cases were 250 and 301. The reduction is smaller (17%) since there are fewer options to choose from, but it still demonstrates the benefit of planning synthesis in a combined manner for multiple targets. The full pathways for this sensitivity analysis are shown in Fig. S8 (combined analysis) and Fig. S9† (separate analysis). As computer-aided synthesis planning programs improve, we expect that the number of targets with pathways found and the quality of proposed pathways will improve; our optimization framework for multi-target planning demonstrates the value even now.
We understand the current model-suggested pathways are not perfect and improvements in multiple aspects can be made. In the ESI† we manually examined the first 50 pathways and made notes about reactions that might be problematic.
A large fraction of cases can be attributed to missing reaction conditions, like solvents and bases (e.g. Fig. S2:† T8, T10-2; Fig. S3:† T7, T8, T10-3). This is largely related to the quality of the data used to train the condition recommendation model. Some common solvents/reagents are omitted in the database (due to human error in data entry); a typical example is that water is frequently missing when the model recommends a reagent like NaOH or HCl. This needs to be addressed in future improvement of data and model for the condition recommendation model.
Some reactions seem to have ambiguous selectivity (e.g. Fig. S2 and S3:† T0–1, T14). While the reaction evaluation score should reflect this aspect, the reactivity/selectivity can only be confirmed via experiments. Models that are more focused and robust on selectivity prediction are currently under development that can improve the confidence in this task.
Sometimes the pathways seem redundant and circular (e.g. Fig. S2 and S3:† T5). This either requires improving the tree retrieval procedure, or the suggestions can be avoided by changing the optimization objective to favor shorter pathways.
As mentioned in the previous section, a more serious flaw is that some reactions are missing an necessary reagent that contributes atoms (e.g. T41 in Fig. S2 and S3†). The alkylating agent is recorded as a reactant in some cases and reagents in others in the training dataset. This type of error occurs when the retrosynthetic template omits the alkylating agent in the reactant but the condition recommendation model fails to predict it as a reagent. Solving this issue requires cleaning the data to fix the role of the alkylating agent.
These are also common issues with the rest of reactions shown Fig. S2 and S3,† and will need to be addressed in future improvements of the dataset quality and model performance. However, these issues appear similarly in both the combined and separate analysis, which indicates that the optimization method can stay useful with improved tools for retrosynthetic analysis.
We are also limited by retrosynthetic analysis settings, as we set the maximal number of trees for a target and maximal depth of the trees. Also, we chose only one set of condition for every reaction, which limits the ability to further reduce the number of solvents, reagents, etc. These can be solved by improved optimization algorithms to optimize on a graph instead of enumerate distinct tree options.
The code for running the retrosynthetic analysis and optimization is available at https://github.com/Coughy1991/Combined_synthesis_planning.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9re00348g |
This journal is © The Royal Society of Chemistry 2020 |