Similarity based functionalization for enumeration of synthetically plausible chemical libraries surrounding a target

Functionalization of lead compounds to create analogs is a challenging step in discovering new molecules with desired properties and it is conducted throughout the chemical industry, including pharmaceuticals and agrochemicals. The process can be time-consuming and expensive, requiring expert intuition and experience. To help address synthesis planning challenges in late-stage functionalization, we have developed a molecular similarity approach that proposes single-step functionalization reactions based on analogy to precedent reactions. The developed approach mimics reaction strategies and suggests co-reactants defined implicitly by a corpus of known reactions. Using ca. 348 k reactions from the patent literature as a knowledge base, the recorded products or close analogs are among the top 20 proposed products in 74% of ∼44 k test reactions. The combinatorial growth inherent in recursive applications of the tool allows the enumeration of chemical libraries surrounding a target compound of interest. Moreover, each step of the resulting library synthesis leverages common chemical transformations reported in the literature accessible to most chemists.


Note: These iterations were not parallelized, and computations were performed on a single core.
3. Iterate through the precedent reactions from the knowledge base in order of decreasing reactant similarity. For computational efficiency, only the 100 most similar reactants are considered. For each of these reaction precedents, extract a localized reaction template based on the atom-mapped transformation, using RDChiral (modified).
4. Still iterating through the precedent reactions, apply the extracted template to the target molecule to get candidate products.
5. For each candidate product generated in the previous step, compute the candidate product's Morgan fingerprint. Then, compare it to the fingerprint of the reaction precedent's products to get a second similarity score, s_prod. This score reflects how similar the products of the known reaction are to the proposed products of this theoretical reaction.
6. Still for each candidate product, multiply the reactant similarity score s_reac by the product similarity score s_prod to get the overall similarity s = s_prod · s_reac. This score represents the extent to which the proposed enumeration reaction is analogous to the precedent reaction.
7. Rank all enumerated products by their overall scores, s. Remove any duplicates in the candidate product list as determined by their isomeric SMILES strings, retaining only the highest score when there are multiple entries.
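Steps 6 and 7 can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `rank_candidates` and the tuple layout `(product SMILES, s_reac, s_prod)` are assumptions made for the example, and the two similarity scores are taken as already computed.

```python
from typing import Dict, Iterable, List, Tuple


def rank_candidates(
    candidates: Iterable[Tuple[str, float, float]]
) -> List[Tuple[str, float]]:
    """Combine scores, drop duplicate products, and rank (steps 6 and 7).

    Each candidate is (isomeric product SMILES, s_reac, s_prod); this tuple
    layout is a hypothetical convention chosen for the sketch.
    """
    best: Dict[str, float] = {}
    for smiles, s_reac, s_prod in candidates:
        s = s_prod * s_reac  # overall similarity
        # Keep only the highest score when a product appears more than once.
        if smiles not in best or s > best[smiles]:
            best[smiles] = s
    # Highest overall similarity first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Toy input: "CCO" is proposed twice via different precedents.
example = [("CCO", 0.8, 0.5), ("CCN", 0.9, 0.9), ("CCO", 0.6, 0.9)]
ranked = rank_candidates(example)
```

Keying the dictionary on the SMILES string only removes duplicates reliably if every candidate has been written as a canonical isomeric SMILES beforehand.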
Top-k accuracy evaluation
1. Calculate the Morgan fingerprint (radius = 2, using features) of the recorded reactant and product.
2. Calculate the baseline Tanimoto similarity score, s_baseline, between the recorded reactant and product.
3. Iterate through the ranked enumerated product list in order of increasing rank. For each of these enumerated products, compute the enumerated product's Morgan fingerprint. Then, compare it to the recorded product's fingerprint to get the enumerated product Tanimoto similarity score, s_enumprod.
4. If the baseline similarity score s_baseline is greater than or equal to the highest enumerated product similarity score s_enumprod, then no solution was found. Otherwise, the rank associated with the highest s_enumprod score was recorded.
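The decision rule in step 4 can be written as a small function. The sketch below assumes the similarity scores s_enumprod are already available as a list ordered by rank. The function name and the tie-breaking choice (report the best-ranked entry among equal top scores) are assumptions for illustration, not details stated in the procedure.

```python
from typing import Optional, Sequence


def solution_rank(
    s_baseline: float, s_enumprod: Sequence[float]
) -> Optional[int]:
    """Return the 1-based rank of the best enumerated product, or None.

    `s_enumprod[i]` is the Tanimoto similarity between the recorded product
    and the enumerated product at rank i + 1.
    """
    if not s_enumprod:
        return None  # nothing was enumerated
    best = max(s_enumprod)
    # No solution if the recorded reactant is already at least as similar
    # to the recorded product as every enumerated product is.
    if s_baseline >= best:
        return None
    # Assumption: on ties, the first (best-ranked) occurrence is reported.
    return list(s_enumprod).index(best) + 1


found = solution_rank(0.6, [0.5, 0.9, 0.9, 0.7])
not_found = solution_rank(0.9, [0.5, 0.9])
```

Comparing against s_baseline ensures that an enumerated product only counts as a solution when it is closer to the recorded product than the starting reactant was.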

Template extraction
RDChiral was used for template extraction and application [2]. By design, the problem was formulated as an inverse of the retrosynthesis problem. This allowed us to take advantage of the techniques that have been developed for retrosynthetic template extraction and application. The roles of the reactants and products of the pseudo reaction were reversed during the retrosynthetic template extraction. That is to say, the reactants of the pseudo reaction were fed as products to RDChiral and vice versa. This approach enabled us to employ RDChiral with minimal modifications.
One minor change was made to RDChiral to ensure successful template extraction and application. The 'MAXIMUM_NUMBER_UNMAPPED_PRODUCT_ATOMS' parameter was changed from the default setting of '5' to '5000'. This was necessary because the original reactions from the USPTO-500k dataset do not contain information about by-products. This minor setting change ensured that we were able to extract templates from reactions containing reactants that contributed more than 5 atoms towards by-product formation.
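The role reversal described above amounts to swapping the two sides of the pseudo reaction before template extraction. The sketch below illustrates the idea with plain string handling. The helper name `reversed_roles` is hypothetical, and the dictionary keys (`_id`, `reactants`, `products`) follow our understanding of the input that RDChiral's template extractor accepts; that layout should be treated as an assumption and checked against the installed version.

```python
from typing import Dict


def reversed_roles(reaction_id: str, pseudo_reaction: str) -> Dict[str, str]:
    """Build a template-extraction input with reactant/product roles swapped.

    `pseudo_reaction` is an atom-mapped reaction SMILES of the form
    'reactants>agents>products'. Agents, if present, are dropped.
    """
    reactants, _agents, products = pseudo_reaction.split(">")
    # The pseudo reaction's reactants are supplied as "products" and
    # vice versa, so a retrosynthetic extractor yields a forward template.
    return {"_id": reaction_id, "reactants": products, "products": reactants}


# Toy atom-mapped pseudo reaction (alcohol to acetate; co-reactant removed).
entry = reversed_roles("toy-1", "[CH3:1][OH:2]>>[CH3:1][O:2]C(C)=O")

# With RDChiral installed, `entry` could then be passed to its template
# extraction routine after raising MAXIMUM_NUMBER_UNMAPPED_PRODUCT_ATOMS
# as described above (not executed in this sketch).
```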

Buyability of co-reactants
For a reaction in the dataset with multiple reactants, we constructed a corresponding pseudo reaction that contained the most complex reactant and the original products; the other, less complex reactants were removed from the reaction (henceforth, these removed reactants are referred to as 'co-reactants'). These co-reactants can be likened to building blocks. Here, we evaluate the commercial availability of these compounds.

Method
For every reaction, co-reactants were searched using the ASKCOS buyable database [3,4]. If all co-reactants associated with a given reaction are listed as commercially available in the ASKCOS buyable database, then the reaction is classified as having co-reactants that are buyable. If at least one co-reactant is not a buyable compound on ASKCOS, then the reaction is classified as having co-reactants that are not buyable. These co-reactants would have to be synthesized or searched using other buyable compound databases.
For co-reactants that were not listed on ASKCOS as buyable, we randomly sampled ten compounds and used the ASKCOS tree-search algorithm to evaluate synthesizability [3,5].

Result
There are a total of 435,246 reactions in the dataset. Of these, 382,086 reactions (~88% of the dataset) had co-reactants that were listed as buyable in the ASKCOS buyable database or simply did not need any co-reactants. Further, nine of the ten randomly sampled non-buyable compounds had many ASKCOS-predicted synthesis pathways (57-200 trees).
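The classification rule used in this analysis can be sketched as follows. This is an illustrative sketch only: the buyable database is stood in for by a plain Python set of SMILES strings, and the function names and toy data are assumptions. A real lookup would require that both sides use the same canonical SMILES form.

```python
from typing import Dict, Iterable, List, Set


def coreactants_buyable(coreactants: Iterable[str], buyable: Set[str]) -> bool:
    """True if every co-reactant is in the buyable set.

    A reaction with no co-reactants is counted as buyable, because
    all() of an empty sequence is True.
    """
    return all(smiles in buyable for smiles in coreactants)


def buyable_fraction(reactions: Dict[str, List[str]], buyable: Set[str]) -> float:
    """Fraction of reactions whose co-reactants are all buyable."""
    flags = [coreactants_buyable(c, buyable) for c in reactions.values()]
    return sum(flags) / len(flags)


# Toy stand-in for the buyable database and for four reactions.
toy_buyable = {"CN", "CC(=O)Cl"}
toy_reactions = {
    "rxn1": ["CN"],
    "rxn2": ["CN", "OB(O)c1ccccc1"],  # second co-reactant not in the toy set
    "rxn3": [],                       # no co-reactant needed
    "rxn4": ["CC(=O)Cl"],
}
fraction = buyable_fraction(toy_reactions, toy_buyable)
```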

Top-N Accuracy: Chemical sensibility analysis using a graph convolutional neural network model
To complement the analysis performed using the fast-filter predictor, we also employed the graph convolutional neural network model trained by Coley and co-authors [6]. This serves as a second approach to understand the chemical sensibility of the reactions considered successful in the Top-N accuracy analysis.
In the test set comprising 44k reactions, the algorithm recovered the recorded product or a close analog in roughly 90% of the cases (38,738 reactions); these successful cases are analyzed further. For 22,233 reactions, our algorithm recovered the exact reaction recorded in the test set. This was determined by an exact SMILES string match between the reaction proposed by the algorithm and the recorded reaction in the test set (i.e., from the U.S. patent literature). The remaining 16,505 reactions were further analyzed using the trained graph convolutional neural network model.

Method
First, any duplicate reactions were removed. There was a total of 12,804 unique reactions in the 16,505 total reactions. We used a graph convolutional neural network model trained by Coley and co-authors and currently implemented on askcos.mit.edu [6,7]. The settings employed for the forward predictor were 'wldn5' and 'uspto_500K'. These model settings match closely with the settings used in the original publication. Performance was measured using top-N accuracy for N = {1,3,5,10,20,50}; this is defined as the fraction of the 12,804 reactions for which the product suggested by the similarity algorithm is predicted to be chemically sensible by the trained graph convolutional neural network model with a rank of N or better.
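The top-N accuracy defined above can be computed directly from the rank that the forward predictor assigns to each suggested product. The sketch below assumes those ranks have already been collected, with `None` marking a product that the predictor did not return at all. The function name and the toy ranks are illustrative assumptions, not data from this study.

```python
from typing import Dict, Optional, Sequence, Tuple


def top_n_accuracy(
    ranks: Sequence[Optional[int]],
    cutoffs: Tuple[int, ...] = (1, 3, 5, 10, 20, 50),
) -> Dict[int, float]:
    """Fraction of reactions whose suggested product has rank <= N."""
    total = len(ranks)
    return {
        n: sum(1 for r in ranks if r is not None and r <= n) / total
        for n in cutoffs
    }


# Toy ranks for five reactions; None means the product was not predicted.
toy_ranks = [1, 2, 7, None, 30]
accuracy = top_n_accuracy(toy_ranks)
```

Because each cutoff counts all ranks at or below N, the resulting accuracies never decrease as N grows.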

Result
Table S2

Filtering analogs using a property constraint
The library of analogs generated by recursive application of similarity-based enumeration in Figure 7 was evaluated using the 'QED' property filter [8]. 'QED' is a property that was originally described in 'Quantifying the chemical beauty of drugs' by Bickerton and co-authors [8]. QED evaluates the 'drug-likeness' of a molecule based on an analysis of the observed distribution of physical-chemical properties of approved drugs. QED scores range from 0 to 1 (higher scores are more drug-like). QED is used here as a model property for illustrative purposes only.
The QED score implementation in RDKit was used for this analysis [9]. The QED score was calculated for all ~2.5 million analogs. The analogs were filtered to identify molecules with QED scores greater than that of the input molecule (QED_input = 0.80235). A selection of molecules with improved QED scores is shown in Figure S9.
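A property filter of this kind can be sketched as below. The sketch is an assumption-laden illustration, not the script used for this analysis: the function names are hypothetical, and the scoring function is passed in as a parameter so that any property predictor can be substituted. The default scorer calls RDKit's `QED.qed` and imports RDKit lazily, so the toy example at the end runs without RDKit by using a stand-in scorer.

```python
from typing import Callable, Iterable, List, Tuple


def rdkit_qed(smiles: str) -> float:
    """QED score from RDKit (imported lazily; requires RDKit when called)."""
    from rdkit import Chem
    from rdkit.Chem import QED

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles}")
    return QED.qed(mol)


def filter_improved(
    analogs: Iterable[str],
    threshold: float,
    score_fn: Callable[[str], float] = rdkit_qed,
) -> List[Tuple[str, float]]:
    """Keep analogs scoring strictly above `threshold`, best first."""
    scored = [(smiles, score_fn(smiles)) for smiles in analogs]
    kept = [(smiles, score) for smiles, score in scored if score > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy example with placeholder labels and made-up scores (not real QED values).
toy_scores = {"analog_A": 0.90, "analog_B": 0.50, "analog_C": 0.85}
improved = filter_improved(list(toy_scores), 0.80235, toy_scores.__getitem__)
```

Passing the scorer as an argument reflects the point that QED is only a model property here; another property predictor could be supplied without changing the filter.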

Figure S2: Different fingerprint settings and similarity metrics for one-step enumeration are evaluated using the validation dataset. 'TverskyA' and 'TverskyB' are Tversky similarity metrics with (α = 1.5, β = 1.0) and (α = 1.0, β = 1.5), respectively (see equation 3). 'Morgan2Feat' and 'Morgan2noFeat' refer to Morgan fingerprints of radius = 2 with and without features, respectively. 'Morgan3Feat' and 'Morgan3noFeat' refer to Morgan fingerprints of radius = 3 with and without features, respectively. The top-N accuracy is not a strong function of the fingerprint settings and similarity metrics tested. As a result, the Morgan2Feat fingerprint and Tanimoto similarity metric were used for this study.

Figure S3: Randomly selected example from the test set (Example 1). For every proposed reaction, the co-reactant is shown in blue. The product is shown in black and any structural change resulting from the reaction is highlighted in pink. The rank and overall similarity score are labeled above and below every suggestion, respectively. The similarity-based approach is unable to find an analog of the recorded product.

Figure S4: Randomly selected example from the test set (Example 2). For every proposed reaction, the co-reactant is shown in blue. The product is shown in black and any structural change resulting from the reaction is highlighted in pink. The rank and overall similarity score are labeled above and below every suggestion, respectively. The similarity-based approach proposes an analog of the recorded product with rank 2.

Figure S5: Randomly selected example from the test set (Example 3). For every proposed reaction, the co-reactant is shown in blue. The product is shown in black and any structural change resulting from the reaction is highlighted in pink. The rank and overall similarity score are labeled above and below every suggestion, respectively. The similarity-based approach proposes the recorded product with rank 1.

Figure S6: Randomly selected example from the test set (Example 4). For every proposed reaction, the co-reactant is shown in blue. The product is shown in black and any structural change resulting from the reaction is highlighted in pink. The rank and overall similarity score are labeled above and below every suggestion, respectively. The similarity-based approach proposes the recorded product with rank 1.

Figure S7: Randomly selected example from the test set (Example 5). For every proposed reaction, the co-reactant is shown in blue. The product is shown in black and any structural change resulting from the reaction is highlighted in pink. The rank and overall similarity score are labeled above and below every suggestion, respectively. The similarity-based approach proposes a close analog of the recorded product with rank 8.

Figure S8: Example diversification schemes from the literature (associated with Figure 6). (A) A selected reaction scheme for lead compound diversification by Mann and co-authors. Reactive intermediate 13 was treated with amines to obtain N-substituted sulfonamides. Exemplary analogs generated in the study are described here. (B) A selected reaction scheme for lead compound diversification by Lagisetty and co-authors. Reaction intermediate 14 was diversified using different chemistries at different sites (marked with different colors). Each reaction described here is likely to contain its own distinct SMARTS pattern.

Figure S9: Analogs with improved QED scores were identified using our algorithm and a 'QED' property prediction algorithm. The core structure is shown in black, and proposed structural modifications are shown in blue.

Table S2: The trained graph convolutional neural network model ranks the 12,804 reactions proposed by our algorithm highly, indicating chemical sensibility of the proposed transformations.