Open Access Article
Simon Viet
Johansson
*ab,
Morteza
Haghir Chehreghani
b,
Ola
Engkvist
ab and
Alexander
Schliep
bc
aMolecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden. E-mail: simon.johansson@astrazeneca.com
bDepartment of Computer Science and Engineering, University of Gothenburg, Chalmers University of Technology, Gothenburg, Sweden
cFaculty of Health Sciences, Brandenburg University of Technology Cottbus-Senftenberg, Cottbus, Germany
First published on 27th November 2023
Artificial intelligence (AI) contributes new methods for designing compounds in drug discovery, ranging from de novo design models suggesting new molecular structures or optimizing existing leads to predictive models evaluating their toxicological properties. However, a limiting factor for the effectiveness of AI methods in drug discovery is the lack of access to high-quality data sets leading to a focus on approaches optimizing data generation. Combinatorial library design is a popular approach for bioactivity testing as a large number of molecules can be synthesized from a limited number of building blocks. We propose a framework for designing combinatorial libraries using a molecular generative model to generate building blocks de novo, followed by using k-determinantal point processes and Gibbs sampling to optimize a selection from the generated blocks. We explore optimization of biological activity, Quantitative Estimate of Drug-likeness (QED) and diversity and the trade-offs between them, both in single-objective and in multi-objective library design settings. Using retrosynthesis models to estimate building block availability, the proposed framework is able to explore the prospective benefit from expanding a stock of available building blocks by synthesis or by purchasing the preferred building blocks before designing a library. In simulation experiments with building block collections from all available commercial vendors near-optimal libraries could be found without synthesis of additional building blocks; in other simulation experiments we showed that even one synthesis step to increase the number of available building blocks could improve library designs when starting with an in-house building block collection of reasonable size.
Library design has traditionally aimed to optimize the selection of molecules for either molecular diversity41–43 or molecular properties like high activity towards a target or low lipophilicity, i.e. a focused library design.44–48 A diverse library design provides a larger coverage of the chemical space and is often viewed as more ‘informative’, since similar molecules hypothetically would have redundancy in the information gained.41,49 Focused libraries on the other hand might aim to optimize a selected lead compound50,51 by lowering the structural diversity and exploring similar structures to the lead compound to improve a specific property.
There are several methods for producing combinatorial libraries with different throughput, ranging from parallel synthesis robotically generating libraries of size ∼103, to DNA-encoded chemical libraries (DECLs) enabling synthesis up to size 109.52–55 The limitations of the DECLs are a restriction on the type of building blocks that can be attached to a DNA tag and the possibility of the encoding oligonucleotide affecting the binding affinity of the building block.55 Hence, it is primarily used in hit identification but recently also hit confirmation,56 while lower throughput methods are used during lead optimization.
Following the generation of a library, screening methods are employed to search the produced chemical space for desired interactions with the drug target.
Traditional High-throughput screening (HTS) has the capability to physically test approximately 106 compounds. Consequently, virtual compound libraries became the focus as the computational resources became large enough to store their chemical structures.23,57,58 The virtual library CH/PMUNK59 consists of 95 million compounds by enumerating products using common reactions from combinatorial chemistry. The virtual library REAL60 has over 6 × 109 molecules for virtual screening that obey Lipinski's rule of 5.61 The GDB-17 library of small molecules enumerated by Ruddigkeit et al.62 contains 160 billion virtual compounds with up to 17 heavy atoms. The ChemSpace Atlas63 is a collection of 4 × 104 Generative Topographic Maps64 which accommodate up to 5 × 108 compounds. The ZINC library65 is a readily updated free database for which the latest collection contains “1.4 billion compounds, 1.3 billion of which are purchasable”.
A problem with virtual libraries is that the hits produced can require specific synthesis expertise to produce the compounds with real chemistry. As such, compound suppliers offer “synthesis on demand” building blocks of which the largest is MADE,66 a catalogue of 770 million building blocks that can be ordered and made with “over 76% success rate”.
Generative models for de novo design offer an alternative to virtual screening or HTS, by instead generating focused selections with a smaller size.3,67 Generally, the binding affinity of a small molecule to a drug target is decided by a scaffold, a common structure for the library design, with variations in building blocks attached to the scaffold to accommodate for other desired molecular properties.68 Several deep learning models have been proposed to generate chemical libraries in a focused manner, in particular decorating a scaffold69 by suggesting which building blocks to attach to this scaffold. The Mol-GPT model showed capability to both optimize a lead, as well as decorate a scaffold.70 STRIFE emphasized pharmacophore information to decorate and optimize proteins.71 Domenico et al. adapted the REINVENT3 architecture to create focused libraries towards inhibiting NA, AChE and SARS-CoV-2.72 LibINVENT73 uses reinforcement learning to generate reaction-constrained decorations to input scaffolds. We will not cover generative models that are not designed to be applied to library design here, but refer others whom have addressed recent developments.74–76 These methods can generate building blocks for combinatorial library design, but do not inherently offer an optimized combinatorial selection. Given a limited experimental budget, there is motivation to develop workflows for optimizing combinatorial design for novel de novo generated building blocks.
Methods that simultaneously optimize both diversity and molecular properties of a library have been used in several previous studies, using for example simulated annealing77 (SA) or genetic algorithms (GA).78–80 These approaches provide optimization over lists of provided building blocks, or virtual libraries but cannot determine whether novel generated building blocks can be acquired or if they are only hypothetical structures impossible to synthesize in practice. As such, a design made by these models on de novo generated building blocks is limited by the “synthesis on demand” success rate. More recently, multi-objective optimization (MOO) has been approached in the chemical discovery field mostly using methods based on pareto ranking (PR).81,82 These methods do not need a weighing between different objectives, instead they keep all solutions that are non-dominated, i.e., have at least one dimension where the solution is optimal, thereby making a model of the pareto front. The most common optimization algorithms using PR are the genetic algorithms NSGA-II83 (for two to three objectives) and NSGA-III84 (for higher dimensions). However, the former is limited in scalability to large solution spaces when including diversity as it requires computation of all pair-wise distances, and the latter alleviates this by computing diversity in relation to fixed reference points that are forced to be included in the selection. This works well for feature-based diversity, such as physiochemical properties, but provides no guarantees to improving the structural diversity, as this measure is defined on the selection as a whole and not in relation to individual members of the selection.
A model that has proven to perform well for modelling the trade-off between quality and diversity is the Determinantal Point Process (DPP).85–87 DPPs are probabilistic models that have been argued to represent repulsion between items.88 They are used in other application areas for text summarization,87 pose estimation86 and diverse image selection,85 but have not yet been investigated for library design. While common methods for selecting diversity are maximizing the sum of pairwise distances41,80 or minimizing average pairwise similarity,79 the determinant of the similarities captures the interaction between multiple molecules simultaneously.89 Additionally, the max-sum or min-average methods scale in time complexity quadratically with the number of building blocks in the optimization space. While the DPP has a cubic scaling, it is instead dependent on the size of the sampled library rather than the number of options.
We propose a library optimization workflow for de novo generated building blocks in a combinatorial fashion applying recombination.90,91 Using LibINVENT,73 we generate and filter building blocks that can attach to an example scaffold using specified reactions. These building blocks can be both novel or previously existing in eMolecules,92 a platform aggregating in-stock commercial building blocks from “over 140 suppliers”. We then use the Computer Aided Synthesis Prediction (CASP) tool AiZynthFinder13 to evaluate all generated building blocks and query their availability in the eMolecules building block platform, or estimate the number of reaction steps needed to synthesize them using template-based retrosynthesis prediction.9,10 We simultaneously explore and optimize the library selection for Quantitative Estimate of Drug-likeness (QED),93 Quantitative Structure–Activity Relationship (QSAR)1,67,94,95 and structural diversity (measured by the similarity in the compounds' extended connectivity fingerprint (ECFP) representation)96 using Gibbs sampling,97 conditioned on a constant size, thus sampling from a determinantal point process of constant size k (k-DPP).98 The workflow is model-agnostic and can be applied to any list of building blocks and any CASP tool that break down the building blocks into stock-available precursors. We apply this workflow to optimize a library from all available building blocks from eMolecules, as well as compare them to libraries including generated building blocks available from varying number of synthesis reaction steps. We also simulate an in-house building block store by optimizing over a subset of the available building blocks and explore the differences in optimized libraries between using available building blocks and commercially available building blocks.
The main contributions of this framework are as follows. We
• extend combinatorial library design to score de novo designed building blocks,
• propose the use of DPPs, in particular k-DPPs, to sample libraries that optimize the trade-off between quality and diversity, and
• estimate the difference in score between libraries using available building blocks and total pool of generated reactants, and estimate the potential gain from expanding the available building blocks.
![]() | ||
| Fig. 2 Scaffold used as input for the generation of building blocks. This figure is adapted from ref. 1. | ||
905 inactive compounds. The compounds were represented by the extended connectivity fingerprint with 2048 bits and radius 3 (ECFP6), computed using the RDKit morgan algorithm.96,111 The model was trained using a random 80%/20% training/test data split, with 4974 active compounds in the training set (out of 280
967, 1.77% actives), and 1330 active compounds in the test set (out of 70
242, 1.89%). The model, as well as the script to generate the model, is provided in the repository. The data is imbalanced with most of training points labelled as inactive compounds, resulting in AUC-ROC score of 0.995 by having a bias towards predicting most input as negative. The model is trained for binary classification to predict the product label, and the metric used to evaluate a compound is the classification probability of having the “active” label. This model was used both as part of the LibINVENT reinforcement learning run and during Library selection.
A total of 104
991 unique molecules (82%) were generated, of which 94
808 (74%) matched the reaction filters. All molecules for which QSAR model assigned a probability of being active lower than 0.8 were removed in post-processing to reduce the optimization space. This yielded 45
928 remaining products, from which the building blocks were extracted. 32
159 unique carboxylic acids and 2084 unique aromatic halides were identified, corresponding to AC and BH reactions, respectively. The runtime was approximately 2 hours using a Nvidia 2080Ti. The output building blocks as a function of the training is provided in ESI Fig. S.1.†
The parameters chosen both for generative modelling and retrosynthesis let both models run for a longer time, 1000 epochs compared to 100 during generation and 5 minutes instead of 2 for retrosynthesis evaluation, than previous uses of the same architectures.13,73 This yields more output building blocks and solves more routes than previous use in demonstrated studies, and potentially include LibINVENT output that could be a result of over-exploiting the QSAR model. This was done intentionally to increase the size of the search space and provide a larger diversity of building blocks with respect to quality properties to showcase the effect of the different strategies.
Discrete DPPs are probability distributions first used by Odile to model fermions,113 and have been increasingly popular within machine learning for capturing the trade-off between diversity and quality.85 Let L ∈
n×n be a positive semi-definite (PSD) matrix. A discrete DPP with kernel L is a probability distribution μ: 2[n] →
+defined by
| μ(S) ∝ Det(LS), ∀S ⊆ [n]. | (1) |
Kulesza and Taskar85 demonstrate that the quality of terms can be incorporated into DPPs by decomposing the kernel into
| Li,j = qiϕiTϕjqj, | (2) |
![]() | (3) |
We denote by SY the symmetric matrix of pairwise Tanimoto similarities, and the diversity thus defined as the determinant of SY. Note that this measure is inherently hard to compare between two sets of different sizes. Furthermore, the similarity matrix consists of values in [0, 1] on the off-diagonal elements, and diagonal elements of 1, resulting in a determinant with values in [0, 1]. This will for larger kernels, i.e., larger library selections, result in determinants that approach 0 at a rapid pace. For numerical stability, we use the logarithm of the determinant (denoted by log
Det) as the measure for diversity. Two examples of (non-combinatorial) building block selections can be found in ESI Fig. S.2. and S.3.† They illustrate selections that minimize diversity and maximize diversity, respectively. The selections were made using an offline greedy selection algorithm,85,115 similar to the algorithm proposed by Nakamura et al.89
000 in minutes. For the sampled number of possible products, n = 32
159 × 2084 = 67
019
356, it is computationally infeasible to compute n − k determinants for selecting item (k + 1), especially for designing larger libraries. Recent research in usage of DPP has primarily been focused on the algorithmic efficiency while keeping a close-to-greedy performance in selecting diverse items.118,120–123 For scenarios such as ours, the only selections of relevance are sets of fixed size, such as the same sizes as screening plates, i.e., 96, 384 or 1536 for parallel library synthesis, or other fixed sizes determined by a project demand. k-DPPs are an extension of general DPPs that are conditioned to selected sets of size exactly k. Gharan and Rezaei123 introduced a computationally efficient method for sampling k-DPPs using a Gibbs sampling scheme shown to have fast mixing properties. Here, the proposal distribution samples suggestions only from exchange operations between one element and one non-element of the current k-set. This ensures that the size of selection always remains constant. Moreover, at time step t during sampling, it requires only computation of the transition probability![]() | (4) |
Since the pairwise similarity values of SX are all in [0, 1], the determinants may become too small for double precision with relevant choices of k. For numerical stability, the logarithm of the right hand side of eqn (3) is used in step 7. The logarithm of the determinant become negative, where a greater value represents a more diverse set. In the numerical experiments we let m = 12, n = 8, corresponding to the generated building blocks of carboxylic acids and aromatic halides respectively, and used k = 96 as it is a common plate size.
The acceptance ratio, α controls the probability to accept solutions that are worse than the current selection, in order to “escape” a local maximum. For these experiments, the change between two neighbouring solutions, i.e., differing only by one building block, will typically be low and as such, the transition probabilities are very high. A low α is needed for a faster model convergence. We chose to conduct experiments for α = 0 such that we only accept strict improvements (hill climbing, which is a greedy search). An extensive exploration of α was considered out of scope for this study. The selections of the model for different optimization strategies were examined, see Table 1. To explore the mixing time, the termination criteria were set as a patience parameter, sampling the distribution until 10
000 samples were drawn without finding a better solution. We compare the results against the average result of 100 random selections and the top 96 cherry-picked compounds by QSAR values from the LibINVENT run.
Det is the logarithm of determinant of the kernel matrix, or matrix of all pairwise Tanimoto similarities in the current selection, and a measure of diversity. A value closer to 0 is more diverse. Random selection is the average values of 100 combinations selected for each reaction step availability. For each optimization strategy, we show the results of stock-available building blocks (0 reaction steps) and building blocks up to 4 reaction steps away
| Selection strategy | N reaction steps | Avg QSAR (SD) | Avg QED (SD) | Avg log Det (SD) |
|---|---|---|---|---|
| QSAR | 0 | 0.993 (0.000482) | 0.370 (0.0184) | −206.8 (7.07) |
| 1 | 1.000 (0.0) | 0.281 (0.0299) | −192.9 (7.87) | |
| 2 | 1.000 (0.0) | 0.278 (0.0249) | −195.9 (7.27) | |
| 3 | 1.000 (0.000104) | 0.281 (0.0281) | −193.1 (10.4) | |
| 4 | 1.000 (0.000133) | 0.277 (0.0385) | −192.1 (6.00) | |
| QED | 0 | 0.676 (0.0100) | 0.785 (0.00105) | −155.9 (3.62) |
| 1 | 0.677 (0.0126) | 0.782 (0.00122) | −154.4 (2.23) | |
| 2 | 0.685 (0.0114) | 0.781 (0.00221) | −155.5 (5.11) | |
| 3 | 0.682 (0.0122) | 0.782 (0.00173) | −153.5 (2.47) | |
| 4 | 0.675 (0.0124) | 0.781 (0.00189) | −155.1 (3.58) | |
| Diversity | 0 | 0.698 (0.00939) | 0.244 (0.0155) | −101.3 (0.239) |
| 1 | 0.699 (0.0115) | 0.138 (0.00704) | −95.88 (0.474) | |
| 2 | 0.688 (0.00741) | 0.110 (0.0105) | −94.12 (0.256) | |
| 3 | 0.687 (0.0122) | 0.103 (0.00766) | −94.13 (0.574) | |
| 4 | 0.686 (0.00947) | 0.099 (0.00717) | −93.65 (0.266) | |
| Simultaneous optimization | 0 | 0.852 (0.00676) | 0.703 (0.00620) | −126.8 (0.995) |
| 1 | 0.848 (0.00980) | 0.701 (0.00651) | −126.8 (1.45) | |
| 2 | 0.845 (0.00498) | 0.704 (0.00540) | −127.3 (1.02) | |
| 3 | 0.843 (0.0102) | 0.699 (0.00570) | −126.2 (1.47) | |
| 4 | 0.851 (0.00690) | 0.699 (0.00956) | −127.2 (2.00) | |
| Random selection | 0 | 0.765 (0.0230) | 0.354 (0.0361) | −128.3 (4.26) |
| 1 | 0.781 (0.0223) | 0.231 (0.0314) | −126.7 (4.57) | |
| 2 | 0.778 (0.0214) | 0.213 (0.0294) | −125.6 (3.88) | |
| 3 | 0.779 (0.0219) | 0.215 (0.0297) | −126.3 (4.97) | |
| 4 | 0.781 (0.0227) | 0.213 (0.0315) | −125.9 (4.419) | |
| LibINVENT top 96 | — | 1.000 | 0.43 | −88.44 |
The 32
159 unique carboxylic acids and 2084 unique aromatic halides generated through LibINVENT were analysed using AiZynthFinder. The retrosynthetic prediction found that 88.7% of the generated carboxylic acids and 98.3% of the aromatic halides could be synthesized within 2 steps of reactions from the base eMolecules stock. Of the building blocks, 6203 carboxylic acids (19.3% of the generated building blocks) and 763 aromatic halides (36.6%) were directly available in stock; i.e., required no synthesis. The full distribution of reaction availability can be seen in Fig. 3.
The compound selection was performed on the criteria of only QSAR, only QED, only diversity and all the metrics simultaneously with equal weight. For the rest of this section, we will refer to the strategy of optimizing the metrics simultaneously with Simultaneous Optimization (SO). The single-objective strategies were performed by setting the weights ωk in Algorithm 1 for the ignored metrics to 0. This was performed for building blocks available from 0 to 4 reaction steps, as extending the search to the remaining compounds added few additional options (see Fig. 3). At each step, the new building blocks were added to the existing pool of available blocks to model the marginal gain for the chemist to perform synthesis for acquisition of new building blocks. We repeated 10 runs for each level of reaction step for Algorithm 1 from different randomized initializations. The full distribution of QSAR and QED values by combination of the considered building blocks from reaction steps 0–4 can be seen in ESI Fig. S.4.†
The results for single-objective search, cf.Table 1, show that the average QSAR values while optimizing for the other objectives tended to stay between 0.6 and 0.7, indicating that an arbitrary re-combination of building blocks from LibINVENT compounds of high QSAR values does not always result in a product that also has a high QSAR value.
Expanding the search to building blocks available by 1–4 reaction steps resulted in samples of slightly lower diversity as average QSAR value went from very close to 1.0 to selections that had each compound with a value of exactly 1.0. Optimizing for diversity maintained the average QSAR value in the observed selections. The results of SO did not improve as the number of available building blocks increased. This indicates that the set of purchasable building blocks that is already available covers optimal solutions given our scoring parameters. For the single-objective optimization strategies, the QED value tended to decrease as the size of the search space increased. A possible explanation could be that the building blocks corresponding to several steps of reactions are more complex, which tend to have a negative effect on the QED value.93 The difference between the selections from baseline available building blocks and selections of building blocks one reaction step away represent the largest change in QED score, while further expansions of the building block availability resulted in much smaller or no changes for all metrics. This observation is likely explained by the distribution of building blocks we previously observed in Fig. 3; one reaction step represents a change from a space of 6203 × 763 products to a space of 23
034 × 1926, almost ten times larger. The next reaction steps increase the size of the product space relative to the previous step by 31.7% and 4.9%, respectively. The sampling process thus selects building blocks from a pool that is very similar between these three selections, and as such the distributions are similar.
The top 96 compounds by predicted activity generated by LibINVENT had an average QSAR value of 1.0 and average QED of 0.43. While these compounds are more diverse than any selection found in our combinatorial selection, they achieve this by breaking the combinatorial constraint. The selection had 96 different carboxylic acids and 3 different aromatic halides. 95 carboxylic were evaluated by AiZynthfinder to be synthesizable, in at most four reaction steps. The 3 aromatic halides were all available directly in stock.
To compare these results against random selection, we sampled 100 combinatorial selections of size 12 × 8, where each building block for the respective AC and BH reactions was sampled with equal probability. This was repeated for building block availability from each level of reaction steps up to 4 reaction steps from the stock. The random selections consistently had worse QSAR values and QED values than SO, while having diversity values that were not noticeably different from the optimized selections. The average QED value among the random selections is <0.25, which is significantly lower than the average of an “attractive drug”.93 In addition, the average QSAR value is lower than 0.8, which means many products in the selection are not very likely to be bioactive. This validates the need for optimizing these selections.
The selected products of the single-objective optimizations as well as the SO were also compared visually. Fig. 4 shows a small sample of 2 × 2 combinatorial examples from the different selections for visual clarity. The single-objective selections leave plenty of room for improvement. QED-optimized and diversity-optimized selections both have QSAR values around 0.7, but while the QED-optimized compounds are small, the diversity optimized compounds promote larger building blocks with several rings and side chains. QSAR-optimized selections have the lowest diversity and cover a range of low QED-scores, favouring building blocks with 1–2 rings each and are generally too large still for being druglike. It is likely that the QSAR score of 1.0 indicates that LibINVENT finds exactly which bits in the fingerprint representation that exploit the QSAR model. SO yielded a balanced selection of smaller building blocks that still yielded a high average QSAR value of ∼0.848.
To provide a reference frame for the diversity in relation to the log determinant, we conducted two experiments for “cherry-picking” molecules among the products at each level of building block availability, using both random selection and the RDKit MaxMin Picker. In the first experiment we filter all products to be in the same ranges as the SO selections in terms of QSAR ∈ [0.6–1.0] and QED ∈ [0.54–0.82] and sampled 100 samples using both methods. The results of this procedure can be found in Table 2.
| N reaction steps | Selection strategy | Avg QSAR (SD) | Avg QED (SD) | Avg log Det (SD) |
|---|---|---|---|---|
| 0 | MaxMin | 0.729 (0.00377) | 0.570 (0.00258) | −37.29 (0.1784) |
| Random | 0.740 (0.00747) | 0.606 (0.00563) | −53.59 (1.650) | |
| 1 | MaxMin | 0.712 (0.00381) | 0.563 (0.00198) | −30.95 (0.103) |
| Random | 0.745 (0.00732) | 0.598 (0.00461) | −45.7 (0.838) | |
| 2 | MaxMin | 0.712 (0.0455) | 0.562 (0.00184) | −30.55 (0.105) |
| Random | 0.744 (0.00739) | 0.599 (0.00470) | −45.64 (0.931) | |
| 3 | MaxMin | 0.712 (0.00445) | 0.563 (0.00192) | −30.462 (0.107) |
| Random | 0.744 (0.00700) | 0.598 (0.00485) | −45.466 (0.837) | |
| 4 | MaxMin | 0.712 (0.00411) | 0.562 (0.00221) | −30.4 (0.123) |
| Random | 0.744 (0.00668) | 0.598 (0.00495) | −45.39 (0.933) |
The MaxMin picker also illustrate the competing objectives, as optimizing diversity without considering the QSAR and QED results in average values that are close to the minimum possible QED value according to the filter, while the QSAR tends to be slightly below the median (0.76) of the distribution. Another observation is that the diversity noticeably increases as reaction steps increase from step 0 to 1, but do not change significantly for subsequent steps. This was mirrored by the DPP optimization when targeting diversity. A second experiment was thus conducted where we filtered the minimum ranges to match the SO mean values, QSAR ∈ [0.84–1.0] and QED ∈ [0.7–0.82], to explore how diverse selections can be with similar values to our selections. These results are displayed in Table 3. During this study, we confirmed that the set of purchasable building blocks covered most of the highest scoring products. The number of products passing the filter that could be made from in-stock eMolecules building blocks were 1822, while 1 reaction step only increased this number to 1947. From the building blocks available in 1 reaction step to 4, this number only increases to 1984, meaning that 91% of the highest quality products were available using in-stock products. This small change is also reflected in the minor changes to log
Det, which shows a small increase in diversity for the larger pools of products.
| N reaction steps | Selection strategy | Avg QSAR (SD) | Avg QED (SD) | Avg log Det (SD) |
|---|---|---|---|---|
| 0 | MaxMin | 0.850 (0.00125) | 0.717 (0.00105) | −46.99 (0.298) |
| Random | 0.866 (0.00309) | 0.725 (0.00210) | −72.62 (2.397) | |
| 1 | MaxMin | 0.850 (0.00124) | 0.716 (0.000990) | −46.20 (0.338) |
| Random | 0.866 (0.00357) | 0.724 (0.00185) | −72.01 (2.080) | |
| 2 | MaxMin | 0.850 (0.00121) | 0.719 (0.000871) | −46.18 (0.308) |
| Random | 0.865 (0.00378) | 0.724 (0.00178) | −71.80 (2.303) | |
| 3 | MaxMin | 0.850 (0.00114) | 0.716 (0.00188) | −46.03 (0.314) |
| Random | 0.865 (0.00355) | 0.723 (0.00188) | −71.66 (1.871) | |
| 4 | MaxMin | 0.850 (0.00121) | 0.715 (0.000864) | −46.02 (0.291) |
| Random | 0.866 (0.00350) | 0.724 (0.00198) | −71.82 (1.931) |
To evaluate the selection strategies in a more practically relevant setting, we restricted our building block stock availability to a subset of 3% of the original size (∼45k building blocks) simulating an approximate availability of building blocks available for a pharmaceutical company. The distribution of solved retrosynthesis routes for the building block subsets are shown in Fig. 5. The unsolved routes on average were 26
504 with a standard deviation of 526.6 and 1072 with a standard deviation of 132.9 for AC and BH reactions, respectively. It is noteworthy that the proportion of building blocks added per reaction step relative to the current available size is larger for these limited availability subsets, i.e., as 1745 and 385 building blocks are added for AC and BH after one reaction, compared to 16
831 and 1163 building blocks added for the full stock. The general trend continues as the selection space is expanded to more reaction steps and in the first four reaction steps almost half of the total number of aromatic halides and more than half of the carboxylic acids become available.
The same four selection strategies were used for building blocks available from 0 to 4 reaction steps with ten starting randomized initializations each. Here, the selection from stock-available (zero reaction steps), seen in Table 4, shows that the highest achievable values are drastically lower than after acquiring more building blocks by synthesis. For this smaller space the algorithm is likely to result in the same optimum for the given stock with multiple initializations.
Det is the logarithm of determinant of the kernel matrix, or matrix of all pairwise Tanimoto similarities in the current selection, and a measure of diversity. A value closer to 0 is more diverse. Random selection is the average values of 100 combinations selected for each reaction step availability. For each optimization strategy, we show the results of stock-available building blocks (0 reaction steps) and building blocks up to 4 reaction steps away
| Selection strategy | N reaction steps | Avg QSAR | Avg QED | Avg log Det |
|---|---|---|---|---|
| QSAR | 0 | 0.909 (0.0108) | 0.373 (0.0408) | −152.3 (7.32) |
| 1 | 0.984 (0.00718) | 0.386 (0.0493) | −189.5 (13.5) | |
| 2 | 0.992 (0.00534) | 0.349 (0.0340) | −196.3 (9.80) | |
| 3 | 0.992 (0.00533) | 0.341 (0.0290) | −195.4 (10.8) | |
| 4 | 0.993 (0.00464) | 0.335 (0.0331) | −196.9 (11.9) | |
| QED | 0 | 0.734 (0.0188) | 0.701 (0.00638) | −143.2 (2.24) |
| 1 | 0.685 (0.0152) | 0.775 (0.00234) | −151.4 (3.08) | |
| 2 | 0.691 (0.00809) | 0.781 (0.00334) | −155.1 (3.27) | |
| 3 | 0.687 (0.0117) | 0.782 (0.00256) | −155.2 (2.96) | |
| 4 | 0.686 (0.0133) | 0.772 (0.00231) | −155.3 (3.22) | |
| Diversity | 0 | 0.722 (0.0127) | 0.305 (0.0205) | −108.3 (0.796) |
| 1 | 0.704 (0.0133) | 0.237 (0.288) | −102.7 (0.759) | |
| 2 | 0.707 (0.0153) | 0.200 (0.0174) | −100.8 (0.783) | |
| 3 | 0.708 (0.0124) | 0.186 (0.0168) | −100.3 (0.860) | |
| 4 | 0.716 (0.0421) | 0.187 (0.0241) | −101.8 (12.6) | |
| Simultaneous optimization | 0 | 0.789 (0.0182) | 0.650 (0.00599) | −127.9 (2.33) |
| 1 | 0.836 (0.00652) | 0.700 (0.00709) | −127.1 (1.08) | |
| 2 | 0.842 (0.00771) | 0.700 (0.00718) | −126.2 (1.24) | |
| 3 | 0.846 (0.00742) | 0.703 (0.00705) | −127.2 (1.19) | |
| 4 | 0.846 (0.00722) | 0.703 (0.00793) | −127.3 (1.18) | |
| Random selection | 0 | 0.758 (0.0210) | 0.382 (0.0391) | −128.8 (3.90) |
| 1 | 0.767 (0.0240) | 0.365 (0.0488) | −131.2 (4.44) | |
| 2 | 0.770 (0.0250) | 0.330 (0.0458) | −129.8 (4.60) | |
| 3 | 0.776 (0.0231) | 0.304 (0.0414) | −130.1 (4.63) | |
| 4 | 0.772 (0.0235) | 0.312 (0.0424) | −129.4 (4.35) |
The results show that optimized selections approach their respective values from the full eMolecules availability already after extending the selection space to building blocks available within one reaction, and that the stock-available selections score similar in average QSAR and diversity to the random selection of previous experiment. The standard deviations for each sample are in the same range as for the previous experiments, and occasionally, such as for the log
Det during the diversity selection strategy, significantly higher. This is expected since there is variation in both available building blocks in addition to the variation across repeat samples due to initialization. There are smaller improvements in selections with building blocks available within two reaction steps and no improvements with further reactions. We can draw parallels with the distribution of available building blocks in Fig. 4 to the distribution of the previous experiment, and note that the improvements occur when a relatively large number of new building blocks are added to the selection space. When the relative expansion of the space is low the probability of finding a new improved solution is also low.
Unlike the previous experiment, however, the QED score remains at a similar level or, in some cases, improves as the number of reaction steps increase. It is likely that the number of added building blocks through reactions that are “too large”, e.g., heavier than 500 Da. We base this hypothesis on analysing the average number of heavy atoms in the building blocks as a function of reaction steps, which correlate with molecule weight. The average number of heavy atoms for the different number of reaction steps are shown for both cases in Table 5.
| Building block | In stock | 1 reaction step | 2 reaction steps | 3 reaction steps | 4 reaction steps |
|---|---|---|---|---|---|
| AC, full | 14.34 | 18.52 | 20.54 | 21.87 | 22.11 |
| AC, 3% subset | 14.09 | 14.20 | 14.66 | 16.49 | 17.42 |
| BH, full | 12.23 | 17.27 | 20.58 | 22.91 | 20.67 |
| BH, 3% subset | 11.86 | 12.76 | 14.26 | 15.97 | 16.59 |
For both the full datasets and the subsets, we compared the simultaneous optimization strategy against random selection, using the Kullback–Leibler divergence124 to measure how difficult it is to pass a selection as sampled from the random distribution as one sampled from the simultaneous optimization. The results are shown in Table 6. This show that the distributions are distinctly different from each other.
| Kullback–Leibler divergence, KL(A, B) | ||
|---|---|---|
| N reaction steps | Dataset | A = simultaneous optimization |
| B = random | ||
| 0 | Full | 79.04 |
| 1 | 138.9 | |
| 2 | 188.8 | |
| 3 | 162.3 | |
| 4 | 168.6 | |
| 0 | 3% subset | 32.20 |
| 1 | 48.57 | |
| 2 | 65.68 | |
| 3 | 79.60 | |
| 4 | 77.64 | |
The methodology of comparing the optimization results between two different stocks of availability might be useful to estimate the prospective gain from synthesizing new building blocks compared to buying available compounds or simply using the current stock by comparing the optimization results with different selection spaces. This can assist the decision-maker in designing efficient libraries in a combinatorial manner. The number of building blocks estimated to be available through synthesis shows a substantial/relevant increase in search space as the number of reaction steps increases. In practice, only stock-available building blocks or building blocks that can be synthesized in one reaction step will often be used. Alternatively, one could introduce a constraint on the total number of reaction steps used for the selected library, which could be accounted for using e.g., reaction sampling.
000–100
000 samples depending on selection space, initialization and number of metrics, which could take between 20 minutes and 4 hours on a single CPU with the QSAR model being the biggest bottleneck. However, since the evaluation of a random forest model is linear in the number of new products between two samples (12 or 8 depending on the exchanged building block) and determinant calculations have the time complexity of O(k3) with total number of products, the method will eventually be limited by evaluations of diversity rather than QSAR. This appears feasible with size 1536 as here n = products2 = (u × v)2. The termination criterion for 10
000 samples without improvement was chosen after initial experimentation. For larger library dimensions, it is possible that more samples are more suitable to find convergence. The increase in number of building blocks to choose results in more decision variables to determine for an optimal solution. Additionally, larger dimensions generally mean the marginal change of exchanging one building block on the average values in the selection is smaller, which implies the acceptance ratio becomes closer to 1. On an Intel Xeon W-2125 CPU @ 4.00 GHz machine with 8 threads the 12 × 8 configuration required approximately 0.11 s for the QSAR computations compared to 0.04 s for computing diversity for each sample, while a 48 × 32 configuration required 0.14 s for the QSAR and 4.0 s for computing the diversity. A full exhaustive search was never considered even for the smallest subsets as e.g., the size of the average 3% subset at stock-availability in a 12 × 8 configuration results in ∼2 × 1027 different possible combinations. For the same reasons, hyperparameter optimization of acceptance ratio α and score weights ω was not performed, as this scaffold is hypothetical and that a marginally better selection would not lead to generalizable guidelines for these parameters.
Footnote |
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00095h |
| This journal is © The Royal Society of Chemistry 2024 |