Jeff Guo‡a, Franziska Knuth‡ab, Christian Margreitter§a, Jon Paul Janetc, Kostas Papadopoulos§a, Ola Engkvistad and Atanas Patronov§*a
aMolecular AI, Discovery Sciences, R&D, AstraZeneca, Gothenburg, Sweden. E-mail: patronov@gmail.com
bDepartment of Physics, Norwegian University of Science and Technology, Trondheim, Norway
cMedicinal Chemistry, Research and Early Development, Cardiovascular, Renal and Metabolism (CVRM), BioPharmaceuticals R&D, AstraZeneca, Gothenburg, Sweden
dDepartment of Computer Science and Engineering, Chalmers University of Technology, Gothenburg 41756, Sweden
First published on 4th February 2023
In this work, we present Link-INVENT as an extension to the existing de novo molecular design platform REINVENT. We provide illustrative examples on how Link-INVENT can be applied to fragment linking, scaffold hopping, and PROTAC design case studies where the desirable molecules should satisfy a combination of different criteria. With the help of reinforcement learning, the agent used by Link-INVENT learns to generate favourable linkers connecting molecular subunits that satisfy diverse objectives, facilitating practical application of the model for real-world drug discovery projects. We also introduce a range of linker-specific objectives in the Scoring Function of REINVENT. The code is freely available at https://github.com/MolecularAI/Reinvent.
Recently, the application of DL-based methods to join two molecular subunits via a chemical linker has gained considerable interest.16–23 Generating suitable linkers is important for fragment-based drug discovery (FBDD)24,25 and scaffold hopping,26 and fundamental for the design of proteolysis targeting chimeras (PROTACs).27–29 The former two techniques are avenues to discover and optimize novel small molecule drugs, while the latter is a relatively new therapeutic modality able to achieve targeted protein degradation. Therefore, linker design represents a relevant problem in drug discovery.
FBDD is an alternative to traditional high-throughput screening (HTS) and virtual screening (VS), which screen ‘Lipinski compliant’ small molecules. In contrast, FBDD screens ‘fragments’, typically with a molecular weight (MW) under 300 Da. Although ‘fragment’ hits typically exhibit weaker binding affinities than small molecules, they often form polar interactions with the receptor and possess favourable lipophilicity, limiting entropically driven binding.24,25,30 Thus, ‘fragments’ can be an advantageous starting point for drug design, and techniques to optimize their potency and physico-chemical properties include fragment growing and fragment linking.24,25,31,32 The latter is of particular interest, as linking two ‘fragments’ such that the linked molecule does not perturb the constituents' interactions can lead to a significant potency gain. This gain is attributed to favourable entropic effects and is known as ‘super-additivity’. In practice, fragment linking is challenging and ‘super-additivity’ is rarely achieved, owing to incompatible linkers disrupting the fragments' binding poses.31,32 Thus, improvements in linker design are critical to unlock the full potential of FBDD.
Scaffold hopping refers to modifying the core structure of a molecule to improve physico-chemical properties while retaining potency.26 The task can be formulated as a linker design problem if the scaffold itself is defined as the linker between two molecular subunits. Scaffold hopping is challenging as retaining potency requires 3D structural awareness of the interactions formed between the molecule and its receptor. Similar to fragment linking, improvements in linker design can enhance the ability to generate novel scaffold ideas.
PROTACs are heterobifunctional molecules in which a linker joins a ligand binding to a protein of interest (POI), conferring specificity, and a ligand recruiting an E3 ubiquitin ligase. The formation of the ternary complex leads to subsequent ubiquitination, achieving POI degradation and thus, targeted knockdown.27–29 While the unique mechanism of action provides promising therapeutic applicability beyond traditional small molecules, PROTAC design is challenging. PROTACs are comparably large molecules, typically existing beyond ‘Lipinski's rules’ and thereby posing a design challenge since experience is limited.33–35 Moreover, linker design is challenging due to the relatively high conformational flexibility present in longer linkers and has mostly deferred to empirical structure–activity relationship (SAR) studies, often necessitating numerous iterations of design-make-test-analyze (DMTA) cycles.36 Therefore, better linker design methods are needed to advance PROTAC design as a whole.
Previously developed computational tools for linker design involve searching a database, making the generalizability of proposed linkers inherently limited.37–40 While success has been demonstrated when using these methods combined with filtering steps, one would ideally want to generalize the task such that plausible linker ideas can be proposed given any molecular subunits.37–40 Recently, DL-based linker design models have been proposed that circumvent database searches.16–23 DeLinker is a graph-based model proposed by Imrie et al. which explicitly incorporates 3D information via the distance and angle between the molecular subunits to augment the feature vector.16 Imrie et al. further improved DeLinker and introduced DEVELOP, which couples DeLinker with a convolutional neural network (CNN) operating on the 3D structure of the starting fragments.20 SyntaLinker is a conditional transformer model proposed by Yang et al. which treats linker generation as a natural language processing (NLP) task using SMILES.15,17 SyntaLinker was further extended by Hu et al. to perform kinase scaffold hopping after focusing the model via transfer learning.18 Similarly, Feng et al. introduced the SyntaLinker-Hybrid workflow, which performs transfer learning on a base SyntaLinker model using known active compounds to focus the generative model.23 Moreover, Langevin et al. proposed the Scaffold Constrained Molecular Generation (SAMOA) algorithm based on recurrent neural networks (RNNs), where one of the capabilities of the model is linker generation.19 Recently, equivariant models including 3DLinker21 and DiffLinker22 operating on the coordinates of fragments have been applied for linker generation. Equivariance enforces that symmetry operations applied to the input transform the output in the same way, and thus model performance is independent of the initial coordinates.
However, while these models are capable of generating linker ideas, a major drawback is the limited support to optimize explicitly for desired physico-chemical properties. The current models only allow users to control for the desired linker length16–19 and a select number of physico-chemical properties, e.g., number of hydrogen-bond donors (HBDs).17 To encourage wide adoption of DL-based linker design, increased flexibility to define tailored MPO objectives and better generalizability are needed.
In this work, we present Link-INVENT as an extension to the existing de novo design platform REINVENT, which has previously identified experimentally validated nM potent inhibitors.6,41 The suggested algorithm shares some similarities with the SAMOA algorithm as proposed by Langevin et al.19 in that the code builds upon REINVENT's existing codebase and uses policy-based RL for MPO.6 However, our algorithm has three crucial differences compared to earlier work. Firstly, the prior trained by Langevin et al. is based on ChEMBL compounds and follows the protocol as reported for REINVENT, which was purposed to sample small molecules as SMILES.6,19,42 Consequently, in their linker generation solution, linkers are sampled when the “*” token (the model's internal representation of characters in a SMILES string), denoting the attachment point, is reached, and based on the conditional probabilities of the SMILES sequence so far. The limitation is that linkers should be generated in the context of both molecular subunits. In the extreme case, the SAMOA algorithm may struggle to generate plausible linkers if the SMILES sequence was “CC*C…” where the length of the SMILES on the right side of the “*” token is greater than that on the left side, as the conditional probabilities for linker generation would only be based on the sequence so far, i.e., “CC”. In contrast, Link-INVENT is trained based on the conditional probabilities of observing a linker given both molecular subunits, similar to the SyntaLinker model reported by Yang et al.17 Secondly, the data preparation to train the Link-INVENT prior was based on reaction-splicing of the ChEMBL compounds similar to the Lib-INVENT library design model we reported previously.42,43 Our training set contains linkers that join molecular subunits ranging from a few atoms in size to larger moieties with rings. As a result, a single Link-INVENT prior is suited for diverse linker generation tasks. 
Finally, Link-INVENT was built on the latest version of REINVENT (3.2) and supports an extensive selection of physico-chemical properties that can be optimized through RL. Moreover, we have implemented additional linker specific properties that can be optimized (in the form of additional Scoring Function components), ranging from physico-chemical properties to flexibility and rigidity, allowing one to explicitly optimize linker properties. We demonstrate the use of Link-INVENT in fragment linking, scaffold hopping, and PROTAC design case studies. Through RL, the Link-INVENT agent learns to generate favourable linkers connecting molecular subunits that satisfy diverse MPO objectives, facilitating practical application of the model for real-world drug discovery projects. The code is freely available at https://github.com/MolecularAI/Reinvent.
1. Initial filtering: filter the raw ChEMBL data (version 27) to keep ‘drug-like’ compounds only (see the ESI† for details). Lenient filtering criteria were applied such that the training data are effective for PROTAC applications where the warheads can be larger in size compared to traditional ‘fragments’.34,35
2. Reaction-based slicing: slice the filtered ChEMBL compounds following the protocol from our Lib-INVENT work using the reaction SMIRKS.43 The result is a dataset of tuples with the structure: (linker, warheads pair, full molecule).
3. Sliced data filtering: filter the tuples to remove unrealistic data points, e.g., linkers with a molecular weight greater than 500 Da.
4. Generate training and validation sets: a validation set containing 287 Bemis–Murcko scaffolds was held out.46
5. SMILES randomization: data augmentation for the training and validation sets was performed via SMILES randomization. At each training epoch, the model is provided with datasets composed of the same sliced tuples (linker, warheads pair, full molecule) but with a different SMILES representation. The purpose was to improve the chemical space generalizability of the generative model as shown by Arús-Pous et al.47
For full details of the data preparation, see the ESI.†
![]() | (1) |
1. Agent sampling: sample a batch of linkers (batch size 128 in this work) conditioned on an input pair of warheads. Thus, 128 linked molecules were generated at each epoch.
2. Assess linked molecules' desirability: combine the warheads and linkers to form the linked molecules and compute their desirability based on the satisfaction of the Scoring Function.
3. Update the agent policy: compute the loss and update the agent's policy to steer sampling towards favourable linkers. The loss function used in Link-INVENT was previously introduced by Fialková et al. in our Lib-INVENT work and is defined as the difference between the augmented and posterior likelihoods (DAP).43 It is constructed by first defining the augmented log likelihood:
$$\log \pi_{\text{augmented}} = \log \pi_{\text{prior}} + \sigma\, S(x) \tag{2}$$
$$J(\theta) = \left(\log \pi_{\text{augmented}} - \log \pi_{\text{agent}}\right)^{2} \tag{3}$$
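The policy update in step 3 can be illustrated numerically. The following is a minimal sketch, assuming the DAP form used in REINVENT and Lib-INVENT, in which the augmented log likelihood is the prior log likelihood plus a scaled score and the loss is the squared difference between the augmented and agent (posterior) log likelihoods. The function names and the value of `sigma` are hypothetical and chosen for illustration only; a real implementation would operate on batched tensors and back-propagate through the agent.

```python
def augmented_log_likelihood(prior_ll: float, score: float, sigma: float) -> float:
    """Prior log likelihood shifted upward in proportion to the score (assumed form)."""
    return prior_ll + sigma * score


def dap_loss(prior_ll: float, agent_ll: float, score: float, sigma: float = 100.0) -> float:
    """Squared difference between augmented and agent log likelihoods.

    sigma = 100.0 is a hypothetical default, not a value taken from the paper.
    """
    augmented = augmented_log_likelihood(prior_ll, score, sigma)
    return (augmented - agent_ll) ** 2


def batch_dap_loss(prior_lls, agent_lls, scores, sigma: float = 100.0) -> float:
    """Mean DAP loss over a batch of sampled linkers."""
    losses = [dap_loss(p, a, s, sigma) for p, a, s in zip(prior_lls, agent_lls, scores)]
    return sum(losses) / len(losses)


# A linker scoring 0.5 that prior and agent find equally likely:
# augmented = -30 + 100 * 0.5 = 20, loss = (20 - (-30))^2 = 2500
example_loss = dap_loss(prior_ll=-30.0, agent_ll=-30.0, score=0.5, sigma=100.0)
```

Under these assumptions, a linker with a score of zero that is equally likely under prior and agent contributes no loss, whereas a high-scoring linker yields a large loss until the agent raises its likelihood.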
Steps 1–3 are repeated until the permitted number of epochs has elapsed. All favourable linkers (and the corresponding full molecules) that achieve a total score (computed by aggregating the scores achieved on each composite objective defined in the Scoring Function) exceeding a user-defined threshold (typically 0.4) are outputted. In this work the threshold was set to 0 to store all molecules generated. The purpose of this was to compare the profiles of molecules generated towards the beginning of the experiment and how RL gradually guides the generation of favourable molecules.
1. Linker effective length: the number of bonds between the attachment atoms.
2. Linker maximum graph length: the number of bonds encompassed in the longest molecular graph traversal path.
3. Linker length ratio: the ratio of the “linker effective length” over the “linker maximum graph length”.
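The three length properties above can be sketched on a plain graph representation of the linker. This is an illustration only, not the Link-INVENT implementation: the linker is given as a list of bonds between integer atom indices, and the "longest molecular graph traversal path" is assumed here to mean the largest shortest-path distance (in bonds) between any two linker atoms. All function names are hypothetical, and the ratio is expressed as a percentage.

```python
from collections import deque


def build_graph(bonds):
    """Adjacency sets from a list of (atom_i, atom_j) bonds."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def bond_distances(graph, start):
    """Breadth-first search: number of bonds from `start` to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in graph[atom]:
            if neighbour not in dist:
                dist[neighbour] = dist[atom] + 1
                queue.append(neighbour)
    return dist


def effective_length(graph, attach_a, attach_b):
    """Number of bonds between the two attachment atoms."""
    return bond_distances(graph, attach_a)[attach_b]


def max_graph_length(graph):
    """Largest shortest-path distance between any two atoms (assumed definition)."""
    return max(max(bond_distances(graph, atom).values()) for atom in graph)


def length_ratio(graph, attach_a, attach_b):
    """Effective length over maximum graph length, as a percentage."""
    return 100.0 * effective_length(graph, attach_a, attach_b) / max_graph_length(graph)


# Four-atom chain 0-1-2-3 (attachment atoms 0 and 3) with a two-atom branch 1-4-5.
linker = build_graph([(0, 1), (1, 2), (2, 3), (1, 4), (4, 5)])
```

For this example the effective length is 3 bonds, the maximum graph length is 4 bonds (from atom 5 to atom 3), and the ratio is 75. An unbranched linker gives a ratio of 100, and the ratio falls as branches grow longer than the direct path.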
Moreover, one can control linker flexibility through the “linker ratio of rotatable bonds” component, which is defined as the number of rotatable bonds (as calculated by using RDKit51) over the total number of bonds (Fig. 2b). We note that this treatment of flexibility is not the only valid definition and inherent limitations exist, such as being completely agnostic to intra-molecular hydrogen bonds. Furthermore, RDKit's calculation of rotatable bonds does not consider bonds to terminal atoms rotatable as it depends on the hybridization of the atom they are attached to. Consequently, bonds to attachment points are always considered non-rotatable. This is exemplified in Fig. 2b where the butane linker receives a ratio of 60/100 (60%). A linker can therefore never achieve a ratio of rotatable bonds of 100%, and to achieve a higher ratio, linkers must become increasingly longer, which can lead to unrealistic ideas being proposed. In practice, this is not a limitation in guiding Link-INVENT towards flexible/rigid linkers as one can introduce appropriate score transformations that provide meaningful agent feedback (discussed in the Results section). For a full list of properties available in the Link-INVENT Scoring Function, see the ESI.†
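The butane example can be reproduced with simple bond counting. The sketch below covers unbranched, saturated carbon chains only and follows the counting convention described above: the two bonds to the attachment points are never rotatable, while every internal C–C single bond is assumed to be rotatable. The function name is hypothetical and no cheminformatics toolkit is involved.

```python
def alkane_linker_rotatable_ratio(n_carbons: int) -> float:
    """Ratio of rotatable bonds to total bonds for a linear alkane linker.

    Assumes an unbranched saturated chain with two attachment points.
    """
    if n_carbons < 1:
        raise ValueError("a linker needs at least one carbon atom")
    attachment_bonds = 2              # bonds to the attachment points, never rotatable
    internal_bonds = n_carbons - 1    # C-C single bonds, assumed rotatable
    total_bonds = internal_bonds + attachment_bonds
    return internal_bonds / total_bonds


# Butane linker: 3 rotatable bonds out of 5, i.e. 60%.
butane_ratio = alkane_linker_rotatable_ratio(4)
```

For a chain of n carbons the ratio is (n − 1)/(n + 1), which approaches but never reaches 100% as n grows. This is why demanding a very high ratio drives the agent towards long linkers.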
1. Illustrative example: a simple experiment to illustrate how Link-INVENT gradually learns to satisfy MPO objectives.
2. Experiment 1a: fragment linking: link two fragment hits and satisfy a hydrogen-bond molecular docking constraint.
3. Experiment 1b: comparison fragment linking: link two fragment hits and satisfy a core constrained molecular docking protocol. Results are compared to the existing DL-based linker design tools DeLinker and SyntaLinker.16,17
4. Experiment 2: scaffold hopping: generate new scaffold ideas to improve physico-chemical properties while retaining potency by satisfying a hydrogen-bond molecular docking constraint.
5. Experiment 3: PROTACs: demonstrate the flexibility of Link-INVENT to generate linkers with diverse properties. The focus in this section is to showcase the linker specific properties implemented for the Link-INVENT Scoring Function.
The same prior was used for all the experiments and demonstrates the versatility of the single trained generative model in addressing diverse tasks.
Illustrative example. As an initial illustrative example, we devise an experiment to link two benzene rings with the objective of limiting the number of HBDs and the linker possessing exactly one ring (Fig. 3). Correspondingly, the Scoring Function contains two components:
1. Linker number of hydrogen bond donors: maximum reward is given if the linker contains no HBDs. See ESI Fig. S1† for the score transformation.
2. Linker number of rings: reward is only given if the linker contains exactly one ring.
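The two components above can be combined in a toy scoring function. This is a sketch under stated assumptions: the HBD transformation used in the paper is given in ESI Fig. S1 and is not reproduced here, so a simple hypothetical decay `1 / (1 + n_hbd)` stands in for it, and the aggregation is assumed to be an unweighted geometric mean. All function names are illustrative.

```python
def hbd_component(n_hbd: int) -> float:
    """Maximum reward for zero HBDs; hypothetical decay for each additional donor."""
    return 1.0 / (1.0 + n_hbd)


def ring_component(n_rings: int) -> float:
    """Reward only if the linker contains exactly one ring."""
    return 1.0 if n_rings == 1 else 0.0


def total_score(n_hbd: int, n_rings: int) -> float:
    """Unweighted geometric mean of the two components (assumed aggregation)."""
    return (hbd_component(n_hbd) * ring_component(n_rings)) ** 0.5


# Linker with multiple HBDs and no ring, versus a linker meeting both objectives.
early_score = total_score(n_hbd=2, n_rings=0)
late_score = total_score(n_hbd=0, n_rings=1)
```

With a multiplicative aggregation, a linker that misses the ring requirement scores zero regardless of its HBD count, so the agent is rewarded only when both objectives are addressed together.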
Fig. 3 shows the Link-INVENT training progress over 20 epochs. The average score over triplicate runs shown in the curve is gradually increasing. Example molecules generated over the course of training are superimposed on the plot. The first molecule on the left possesses multiple HBDs and the linker does not contain a ring. Consequently, this molecule receives low reward. As training progresses, the example molecules start to satisfy our MPO objective. Towards the end of the 20 epochs, the example molecule not only possesses no HBDs, but the linker also has exactly one ring. The purpose of this experiment was to illustrate how the Link-INVENT agent learns via RL to generate molecules that increasingly satisfy the target objective.
Fig. 4 Experiment 1a: fragment linking strategy for casein kinase 2 inhibitors for the alpha catalytic site (CK2α). (a) Initial fragment hits. The fragment structures are colour-coded: gray fragment PDB ID: 5CSV and green fragment PDB ID: 5CSH. The gray fragment binds by forming hydrogen-bond interactions with Lys68 and Asp175 while the green fragment binds via hydrophobic interactions. The fragment linking strategy was to leverage the nitrogen atoms on both fragments to design a linear linker, separated by 9.9 Å. (b) Fragment linking led to the discovery of the linked molecule, CAM4066 (PDB ID: 5CU4). The constituent fragments are circled in the structure. The linear linker features amide bonds that modulate the linker flexibility and rigidity which the authors attribute to its binding potency.52,53
In this section, we adopt the fragment linking strategy devised by Fusco and Brear, et al. (Fig. 4a) and task Link-INVENT with generating plausible linked molecules that retain the Lys68 hydrogen-bond interaction.52,53 Moreover, while Fusco and Brear, et al. exclusively evaluated linear linker ideas, we allow Link-INVENT to explore linkers with rings and branching (to a certain extent). Correspondingly, we devise a Scoring Function composed of the following components (see ESI S2 and S3† for Scoring Function transformations):
1. DockStream: this component is a molecular docking package that is fully compatible with Link-INVENT. DockStream supports docking using a variety of backends. In this work, we use Glide and LigPrep which we previously identified to yield the best average performance over a variety of receptor targets.54–59 A docking constraint was enforced to retain the Lys68 hydrogen-bond interaction.52,53
2. Linker length ratio ≥70: this component prevents linkers with branching that is significantly longer than the effective length (number of bonds between the linker attachment atoms).
3. Linker molecular weight ≤200 Da: this component also prevents linkers with extensive branching but more importantly, prevents the Link-INVENT agent from exploiting the weaknesses of molecular docking, e.g., generating linkers that possess a large number of HBDs which may achieve a favourable docking score but at the expense of limited permeability.60
The fragment linking experiment was run in triplicate and the results are shown in Fig. 5 (see ESI Fig. S4† for all training plots). Over the course of 100 epochs, the average Glide docking score of the batch of molecules generated by Link-INVENT gradually becomes more favourable (Fig. 5a). The docking score distributions of the triplicate runs are essentially identical and demonstrate a reproducible experimental outcome (Fig. 5b). The relatively few molecules that possess a docking score of 0 do not satisfy the docking constraint and were generated towards the beginning of the Link-INVENT run, at a timestep where the agent had received minimal feedback. Furthermore, some molecules proposed by Link-INVENT exhibit a more favourable docking score than the reference ligand (−15.20 kcal mol−1, black dotted line in Fig. 5b). The majority of the remaining molecules dock similarly to the reference ligand (approximately −14 kcal mol−1), demonstrating that Link-INVENT at the very least proposes chemical ideas that can satisfy the docking constraint. The interplay between the agent and the diversity filter (DF) is exemplified in Fig. 5c. The DF encourages balance between agent exploration and exploitation by penalizing repeated sampling of identical Bemis–Murcko scaffolds.46 The triplicate runs yield a large number of unique scaffolds with minimal overlap, demonstrating diversity in the results and showing that replicate experiments explore different areas in chemical space (Fig. 5c). Next, the plausibility of generated molecules was investigated by comparing their binding poses with the reference ligand. Fig. 5d shows the binding pose of an example top scoring molecule (based on the satisfaction of the composite Scoring Function) superimposed with the reference ligand (see ESI Fig. S5† for more examples). Firstly, the proposed linker is similar to the ground-truth linker, differing only by a single atom shift of an amide bond and the presence of an additional nitrogen.
It is important to note that information about the reference ligand was not available to the Link-INVENT agent during the generative process and is not present in the training set (see ESI† for more details). Fusco and Brear, et al. posited that the flexibility and rigidity of the reference ligand linker are crucial to its potency.52,53 The similarity in the linker proposed by Link-INVENT suggests that the docking constraint implicitly guides the agent towards 3D structural awareness, in agreement with our previous results.54 This is further supported by the predicted polar interactions of the generated molecule (Fig. 5d turquoise dotted lines) being mostly identical to those of the reference ligand (Fig. 5d yellow dotted lines) with the only exception being His160. Consequently, the structural similarity between the linkers naturally results in significant overlap of the binding poses and is exemplified in the docking score in which the generated molecule is predicted to dock more favourably than the reference ligand. Taken together, the results in this section demonstrate that Link-INVENT is able to generate plausible chemical ideas spanning diverse minima and is easily tuned for bespoke applications via the Scoring Function.
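The role of the DF in this experiment, penalizing repeated sampling of identical Bemis–Murcko scaffolds, can be sketched as a scaffold memory with a fixed bucket size. This is an illustrative assumption about the mechanism rather than the REINVENT implementation: the class name, `bucket_size`, and `min_score` are hypothetical, and scaffolds are assumed to be supplied as precomputed SMILES strings.

```python
from collections import Counter


class ScaffoldDiversityFilter:
    """Zero the score of a scaffold once it has been rewarded `bucket_size` times."""

    def __init__(self, bucket_size: int = 25, min_score: float = 0.4):
        self.bucket_size = bucket_size   # hypothetical default
        self.min_score = min_score       # scores below this are not stored
        self._counts = Counter()

    def apply(self, scaffold: str, score: float) -> float:
        """Return the (possibly penalized) score for a molecule with this scaffold."""
        if score < self.min_score:
            return score                 # low scorers do not occupy a bucket
        if self._counts[scaffold] >= self.bucket_size:
            return 0.0                   # bucket full: penalize repeats
        self._counts[scaffold] += 1
        return score

    def unique_scaffolds(self) -> int:
        """Number of distinct scaffolds stored so far."""
        return len(self._counts)


df = ScaffoldDiversityFilter(bucket_size=2)
scores = [df.apply("c1ccccc1", 0.8) for _ in range(3)]  # third repeat is penalized
```

Once a scaffold's bucket is full, further samples of it receive no reward, which pushes the agent towards unexplored scaffolds and is consistent with the large number of unique scaffolds reported in Fig. 5c.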
Fig. 6 Experiment 1b: comparison fragment linking inosine 5′-monophosphate dehydrogenase (IMPDH) inhibitors for tuberculosis (TB). (a) Initial fragment hits (PDB ID: 5OU2). Trapero et al. linked copies of the fragment hit separated by 4.6 Å via a linear linker.61 (b) Fragment linking led to the discovery of the linked molecule shown in gray (PDB ID: 5OU3) and possessing significantly enhanced in vitro potency. The constituent fragments are circled in the structure. The methyl substituent on the imidazole ring of the right fragment in (b) is not present in the initial fragment hit structure and was added post linker design.61 A Link-INVENT generated molecule is superimposed with the reference ligand (green), showing excellent pose overlap of the constituent fragments and with a comparable docking score.
This specific case study was also investigated in the DeLinker and SyntaLinker DL-based linker design studies.16,17 To assess the prospective compatibility with the protein, DeLinker and SyntaLinker dock their generated molecules post hoc with AutoDock Vina62 and MOE docking,63 respectively, as proxies for binding affinity. However, an important criterion in any fragment linking campaign is good agreement between the constituent fragment poses of the generated molecule and the reference fragment poses. The DeLinker study does not show any binding poses of its generated molecules, while the SyntaLinker study shows only three example poses in which none of the fragment poses overlap with the reference fragment poses, even though the reference ligand is recovered.16,17 This suggests that neither docking protocol was able to capture the constituent fragments' binding poses. To address this problem, we task Link-INVENT with generating linker ideas by adopting the strategy envisioned by Trapero et al.61 In particular, we perform core constrained docking with Glide to enforce that the binding pose of at least one fragment is within 0.3 Å of the reference fragment pose (see the ESI† for full details).55–59 This is in contrast to the docking protocol applied in experiment 1a (fragment linking), as π-interactions contribute extensively to the binding affinity of the fragment hit in the IMPDH binding pocket (Fig. 6a). For a fair comparison, we apply our core constrained docking protocol to the example generated molecules provided in the DeLinker and SyntaLinker studies.16,17 We note that the training data used in DeLinker, SyntaLinker, and Link-INVENT are different, which can contribute to differences in performance. We devise a Scoring Function composed of the following components (see ESI S6 and S7† for Scoring Function transformations):
1. DockStream: core constrained docking in Glide was applied through DockStream to prevent significant binding pose deviation of the constituent fragments in the linked molecule compared to the reference fragment pose (see the ESI† for full details).54–59
2. 3 ≤ linker effective length ≤ 5: this component enforces linkers to possess an effective length between 3 and 5 bonds. The specific interval was chosen so that proposed linkers generally span 4.6 Å, capturing the a priori knowledge from fragment screening (Fig. 6a).
3. Linker length ratio ≥70: this component prevents linkers with branching that is significantly longer than the effective length (number of bonds between the linker attachment atoms). In contrast to Trapero et al. where only linear linkers were evaluated, we allow Link-INVENT to explore moderately branched linkers.61
4. Linker molecular weight ≤150 Da: similar to the Scoring Function in the previous fragment linking experiment, this component prevents the Link-INVENT agent from exploiting the weaknesses of molecular docking;60 the only difference is that the upper limit of the linker molecular weight is 150 Da instead of 200 Da. The rationale is that the constituent fragments here possess a greater MW compared to the previous fragment linking case study and thus, a lower upper limit is enforced to keep the linked molecules within a reasonable MW range.
The fragment linking experiment was run in triplicate for 70 epochs, generating a total of 8960 SMILES, which is similar to the 9000 molecular graphs generated in the DeLinker work, and facilitating a fair comparison.16 The training plots are shown in Fig. S8.† An example binding pose of a generated molecule is shown in Fig. 6b (green). The 4-aminopyridine linker facilitates extensive overlap of the binding poses of the constituent fragments with the reference poses. Moreover, the docking score is comparable to that of the reference ligand, demonstrating that Link-INVENT is able to generate plausible ideas within a relatively narrow solution space (linkers were enforced to possess an effective linker length between 3 and 5 bonds). DeLinker and SyntaLinker example molecules also show good pose agreement when docked with our protocol (Fig. S10†). SyntaLinker also recovers the reference ligand. However, we note that in the experimental design for SyntaLinker, the authors introduced bias by providing their model with information from the reference ligand. Specifically, in one of the fragments, Trapero et al. included a methyl substituent on the imidazole ring due to synthetic accessibility and the linker with the greatest in vitro potency featured an ether linkage (Fig. 6b reference ligand shown in gray).61 Correspondingly, the methyl substituent and the ether linkage information was provided to the SyntaLinker model during the generative process.17
Next, we assess the docking scores of the generated ideas by Link-INVENT, DeLinker, and SyntaLinker (Table S2†). Across triplicate runs, Link-INVENT generates molecules with a more favourable docking score than the reference ligand (see Fig. S9a† for an example binding pose). By contrast, none of the molecules provided in the DeLinker study and only one from the SyntaLinker study (the recovered reference ligand) dock better than (or equal to) the reference ligand. We acknowledge that some DeLinker and SyntaLinker proposed molecules may indeed possess more favourable docking scores than the reference ligand; the analysis performed is based on what the authors have provided (20 and 3 example molecules for DeLinker and SyntaLinker, respectively). We note that the reference linker is present in the training data. However, Link-INVENT generates a large number of ideas with comparable docking scores to the reference ligand and also possesses high diversity as shown by the number of unique Bemis–Murcko scaffolds in the generated molecules (Fig. S8f†). Specifically, on average, of the 5000 generated molecules by Link-INVENT, there are around 3000 unique Bemis–Murcko scaffolds. We further note that since the fragments are held constant, the unique scaffolds pertain to the linker itself. Therefore, Link-INVENT generates diverse linker ideas that satisfy the core constrained docking protocol.
Accordingly, we show that the Link-INVENT Scoring Function steers the agent to generate molecules that satisfy the desired MPO objective. By including docking explicitly as a learning objective, Link-INVENT is able to generate molecules with favourable docking scores and outperforms DeLinker and SyntaLinker which dock their generated ideas post hoc.
Fig. 7 Experiment 2: scaffold hopping strategy for dual leucine zipper kinase (DLK) inhibitor optimization. (a) Initial inhibitor possessing poor physico-chemical properties causing in vivo high clearance (PDB ID: 5CEO). The two hydrogen bonds in the hinge region with Cys193 are crucial for potency. The goal was to replace the pyridine core while retaining the Cys193 interactions. (b) Scaffold hopping led to the discovery of a DLK inhibitor with a pyrazole core and with demonstrated in vivo efficacy (PDB ID: 5CEQ).64,65 The retained molecular sub-units are circled in the structure.
In this section, we adopt the scaffold hopping strategy devised by Patel et al. and task Link-INVENT with generating novel core ideas with a focus on improving CNS properties. A docking constraint to enforce the Cys193 hydrogen-bond interactions is applied to retain predicted potency and the following specific physico-chemical properties, adopted from Patel et al., were enforced:65 the number of HBDs must be less than 2, the topological polar surface area (tPSA) must be less than 90 Å2, and the CNS MPO score must be greater than or equal to 4. The CNS MPO is an algorithm developed from analysis of CNS drugs and candidates as a predictor for CNS efficacy and encompasses six physico-chemical properties (ClogP, ClogD, MW, tPSA, number of HBDs, and pKa).66 In the devised experiment, we do not account for all six CNS MPO properties and only enforce logP, MW, tPSA, and number of HBDs. Correspondingly, we define the Scoring Function with the following components (see ESI S11 and S12† for Scoring Function transformations) and note that the reference linker is not present in the training set:
1. DockStream: this component is identical to the usage described in the fragment linking section. The only exception was that the docking constraint was enforced to retain the Cys193 hydrogen-bond interactions in the hinge region.65
2. Number of hydrogen bond donors <2: this component is included in the CNS MPO algorithm and enforces the overall linked molecule to possess less than two HBDs. This quantity was specifically desired by Patel et al.65
3. Molecular weight ≤450 Da: this component is included in the CNS MPO algorithm and is enforced to be in an interval in agreement with CNS penetration but with some leniency to allow more Link-INVENT exploration of chemical space.66
4. 3 ≤ SlogP ≤ 4: this component is included in the CNS MPO algorithm and is enforced to be in an interval in agreement with CNS penetration.66
5. tPSA ≤ 90 Å2: this component is included in the CNS MPO algorithm and is enforced to be in an interval in agreement with CNS penetration.66 The interval was also specifically desired by Patel et al.65
6. 1 ≤ linker number of aromatic rings ≤2: this component was specifically desired by Patel et al. as the binding site geometry is most compatible with a planar ring present in the core scaffold.65
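As an illustration of how the six components listed above could be combined into a single score, the following minimal sketch maps each property onto [0, 1] and aggregates the results. It is not the REINVENT implementation: the hard step transformation, the unweighted geometric mean, and all names (`step_interval`, `CNS_COMPONENTS`, `composite_score`) are assumptions made for this example, and the docking component is assumed to arrive already transformed to [0, 1]. The transformations actually used are those given in ESI S11 and S12†.

```python
import math


def step_interval(value, low=-math.inf, high=math.inf):
    """Hard step transformation (assumed): 1.0 inside [low, high], else 0.0."""
    return 1.0 if low <= value <= high else 0.0


# Hypothetical component table: property name -> (low, high) bounds,
# taken from the component list in the text.
CNS_COMPONENTS = {
    "hbd": (0, 1),                    # number of HBDs < 2
    "mw": (0.0, 450.0),               # molecular weight <= 450 Da
    "slogp": (3.0, 4.0),              # 3 <= SlogP <= 4
    "tpsa": (0.0, 90.0),              # tPSA <= 90 A^2
    "linker_aromatic_rings": (1, 2),  # 1 <= aromatic rings in linker <= 2
}


def composite_score(props, docking_score_01):
    """Aggregate all components with an unweighted geometric mean (assumed).

    props: dict of raw property values keyed as in CNS_COMPONENTS.
    docking_score_01: docking component already transformed to [0, 1].
    """
    scores = [
        step_interval(props[name], low, high)
        for name, (low, high) in CNS_COMPONENTS.items()
    ]
    scores.append(docking_score_01)
    # With a geometric mean, any component scoring 0 zeroes the total.
    return math.prod(scores) ** (1.0 / len(scores))


example = {"hbd": 1, "mw": 420.0, "slogp": 3.5, "tpsa": 80.0,
           "linker_aromatic_rings": 1}
print(composite_score(example, 0.8))
```

A multiplicative aggregation is used here because it rewards only molecules that satisfy every constraint at once, which matches the multi-parameter nature of the objective. Smooth transformations would give the agent a more graded learning signal than the hard step used in this sketch.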
The scaffold hopping experiment was run in triplicate and the results are shown in Fig. 8 (see ESI Fig. S13† for all training plots). Over the course of 100 epochs, the average Glide docking score of the batch of molecules generated by Link-INVENT gradually becomes more favourable (Fig. 8a) and the similarity in the docking score distributions demonstrates a reproducible experimental outcome (Fig. 8b). In contrast to the fragment linking experiment, relatively few molecules possess a more favourable docking score than the reference ligand (shown by the black dotted line). Instead, the majority of molecules score slightly worse (approximately −9.5 kcal mol⁻¹). This is not completely unexpected as the MPO objective is significantly more challenging than that of the previous fragment linking case study. Consequently, the solution space is much narrower. It is important to note, however, that the objective of the scaffold hopping experiment is not strictly to propose novel cores that dock better than the initial inhibitor (Fig. 6a). Patel et al. noted that their initial inhibitor, while potent, exhibits high in vivo clearance.64,65 Therefore, an inhibitor with sufficient binding affinity and good CNS penetration could achieve in vivo efficacy. The narrower solution space in the scaffold hopping experiment is further supported by Fig. 8c, where the absolute count of unique Bemis–Murcko scaffolds is lower than in the fragment linking experiment.46 This is not a limitation of Link-INVENT but rather reflects the nature of the MPO objective. Nonetheless, the absolute count of generated scaffolds is still high and demonstrates that Link-INVENT samples diverse minima. Similar to the fragment linking results, minimal overlap between replicate runs shows that replicate experiments explore different areas of chemical space (Fig. 8c). The plausibility of the proposed scaffolds was investigated by comparing their binding poses with that of the reference ligand. Fig. 8d shows the binding pose of an example top scoring molecule (based on the satisfaction of the composite Scoring Function) superimposed with the reference ligand (see ESI Fig. S14† for more examples). Firstly, the proposed scaffold features planar aromatic rings, as enforced by the Scoring Function, and as desired by Patel et al.65 Secondly, the Cys193 hydrogen-bond interactions are retained, as enforced by the docking constraint. The proposed ligand is predicted to form an additional hydrogen bond with Gln195, owing to the hydrocarbon chain that extends the spatial occupancy of the overall molecule (Fig. 8d). This suggests that the application of a docking constraint can guide the Link-INVENT agent towards 3D structural awareness, learning to exploit the binding site geometry and electronics. Finally, the binding poses of the generated ligand and the reference ligand overlap significantly, supporting plausibility. Taken together, the results in this section demonstrate the flexibility of the Link-INVENT Scoring Function to optimize relatively complex MPO objectives and that the agent learns to propose plausible chemical ideas.
In this section, we use the PROTAC design strategy by Wang et al. to demonstrate Link-INVENT's linker-specific components for the Scoring Function. In select experiments, a fixed set of physico-chemical properties was enforced, based on observed values from compiled PROTAC databases.34,35 Correspondingly, we define the Scoring Function with the following components (see ESI Fig. S10† for the Scoring Function transformations):
1. tPSA ≤ 250 Å².
2. 3.5 ≤ logP ≤ 6.0.
3. Number of hydrogen bond acceptors ≤ 16.
4. Number of hydrogen bond donors ≤ 6.
5. Number of rotatable bonds < 25.
We demonstrate control over the properties of generated linkers while keeping physico-chemical properties of the PROTAC within the specified intervals described above. Subsequently, we devise three Sub-Experiments:
1. Sub-Experiment 1: fix physico-chemical properties and control the linker length. We show that Link-INVENT can generate linkers within a specified narrow length interval. In addition to including the physico-chemical properties listed above, the Scoring Function contains the following components:
(1) Linker effective length = [4, 6], [7, 9], [10, 12], or [13, 15]: this component enforces linkers to possess an effective length within the specified intervals. See ESI S15† for the Scoring Function transformation.
(2) Linker length ratio = 100; this component prevents linker branching.
The combination of components 1 and 2 enforces Link-INVENT to generate linkers without branching.
2. Sub-Experiment 2: fix physico-chemical properties and the linker length within the interval [7, 9], and control linker linearity, i.e., linkers with and without rings. We show that Link-INVENT can generate linkers within a specified narrow length interval and control for the presence of rings. In addition to including the physico-chemical properties listed above, the Scoring Function contains the following components (see ESI Fig. S22† for the Scoring Function transformations):
(1) Linker effective length = [7, 9]: this component enforces linkers to possess an effective length within the specified interval of [7, 9].
(2) Linker length ratio = 100; this component prevents linker branching.
(3) Linker number of rings = 0; this component enforces linkers to possess no rings, i.e., the linker is linear. In the experiment where we want to generate linkers with rings, we simply omit this component in the Scoring Function.
Similar to Sub-Experiment 1, components 1 and 2 enforce Link-INVENT to generate linkers without branching.
3. Sub-Experiment 3: in this Sub-Experiment, neither the linker length nor the physico-chemical properties are enforced. Instead, we task Link-INVENT with generating linkers with variable flexibility, which is defined by the “linker ratio of rotatable bonds” component, i.e., the ratio of the number of rotatable bonds to the total number of bonds. Correspondingly, the Scoring Function contains only one component:
(1) Linker ratio of rotatable bonds = [0, 30], [40, 60], and [70, 100]: the defined intervals correspond to “Low”, “Moderate”, and “High” flexibility (see ESI S26† for the Scoring Function transformation).
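The “linker effective length” and “linker length ratio” components of Sub-Experiments 1 and 2 can be illustrated with a small graph calculation. The sketch below assumes that the effective length is the number of bonds on the shortest path between the two attachment atoms, and that the length ratio is the effective length divided by the longest shortest path between any two linker atoms, expressed as a percentage. These definitions, the adjacency-dictionary representation, and the function names are assumptions made for this example and are not taken from the REINVENT code.

```python
from collections import deque


def bfs_distances(adjacency, start):
    """Bond-count distance from `start` to every reachable atom."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbour in adjacency[atom]:
            if neighbour not in distances:
                distances[neighbour] = distances[atom] + 1
                queue.append(neighbour)
    return distances


def linker_lengths(adjacency, attach_a, attach_b):
    """Return (effective length, maximum graph length, length ratio in %).

    adjacency: dict mapping each linker atom index to its bonded neighbours.
    attach_a, attach_b: indices of the two attachment atoms (assumed known).
    """
    effective = bfs_distances(adjacency, attach_a)[attach_b]
    max_graph = max(
        max(bfs_distances(adjacency, atom).values()) for atom in adjacency
    )
    # A single-atom linker has no bonds; treat it as unbranched.
    ratio = 100.0 if max_graph == 0 else 100.0 * effective / max_graph
    return effective, max_graph, ratio


# Linear five-atom chain 0-1-2-3-4, attached at atoms 0 and 4.
linear = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
# Same chain with a three-atom branch (5-6-7) on atom 2.
branched = {0: [1], 1: [0, 2], 2: [1, 3, 5], 3: [2, 4], 4: [3],
            5: [2, 6], 6: [5, 7], 7: [6]}

print(linker_lengths(linear, 0, 4))    # (4, 4, 100.0)
print(linker_lengths(branched, 0, 4))  # (4, 5, 80.0)
```

Under these assumed definitions, a ratio of 100 means that the path between the attachment points is also the longest path in the linker, so requiring a ratio of 100 penalises side chains that extend beyond it. Short substituents that do not lengthen the longest path would still pass, which is a limitation of this simplified picture.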
PROTAC Sub-Experiment 1: controlling the linker length. Link-INVENT was tasked with generating linker ideas of variable length while keeping physico-chemical properties within a specified range (Fig. 10a, see ESI Fig. S17–S21† for all training plots). The baseline experiment does not enforce a specific effective linker length interval and the distribution of lengths spans a large range (Fig. 10a). In contrast, one can constrain the Link-INVENT agent to explore effective linker lengths within a certain interval, as shown by the enrichments observed in Fig. 10a, e.g., the ‘enforce 4–6’ experiment enforced effective linker lengths in the interval [4, 6] and the corresponding bar is enriched relative to other lengths. The purpose of this Sub-Experiment is to show the ease with which one can control effective linker length exploration, mimicking a real-world PROTAC linker design campaign.36,67
PROTAC Sub-Experiment 2: controlling linker linearity. Link-INVENT was tasked with generating linker ideas with an effective length in the interval [7, 9], while keeping physico-chemical properties within a specified range and controlling linearity (Fig. 10b, see ESI Fig. S23–S25† for all training plots). The baseline experiment does not enforce linearity and the resulting ratio of linear linkers to cyclic linkers, i.e., linkers containing at least one ring, is approximately 1:2. In contrast, one can enforce the Link-INVENT agent to explore linear linkers or cyclic linkers, shown by the enrichments observed in Fig. 10b. The purpose of this Sub-Experiment is to further showcase the user flexibility in specifying desired linker properties.
PROTAC Sub-Experiment 3: controlling linker flexibility. This Sub-Experiment showcases Link-INVENT's “linker ratio of rotatable bonds” component which can be specified in the Scoring Function. We note that while the component itself is meant to be a descriptor of linker flexibility, inherent limitations exist, e.g., not accounting for intra-molecular hydrogen-bonding interactions which would rigidify the linker. Link-INVENT was tasked with generating linker ideas with variable ratios of rotatable bonds, where we define ‘Low’, ‘Moderate’, and ‘High’ as the intervals [0, 30], [40, 60], and [70, 100], respectively (Fig. 9c, see ESI Fig. S27† for all training plots). Examples of linkers possessing variable degrees of flexibility are shown in Fig. 10c. The agent implicitly learns that linkers containing rings and sp2 hybridized atoms are more rigid. A clear transition from “Low” flexibility to “High” flexibility is marked by increasing linearity and sp3 hybridized atoms. Without any length constraints enforced, the proposed linkers become progressively longer to achieve a high “linker ratio of rotatable bonds” value. This is exemplified by the linker shown for the “High” experiment (Fig. 10c). Naturally, the linker shown is likely unrealistic and this Sub-Experiment was an extreme example to showcase the flexibility of Link-INVENT's Scoring Function. In practice, one could constrain the linker length within a specified interval, as was done in Sub-Experiments 1 and 2, and explore variable flexibility. In this regard, the “linker ratio of rotatable bonds” provides some control over the conformational entropy of proposed linker ideas.
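The “linker ratio of rotatable bonds” descriptor and its three flexibility intervals can be written down in a few lines. This is a minimal sketch that assumes the ratio is expressed as a percentage, as the intervals [0, 30], [40, 60], and [70, 100] suggest, and that the rotatable and total bond counts have already been obtained from a cheminformatics toolkit. The function names and the handling of values falling between intervals are assumptions made for this example.

```python
# Flexibility intervals as defined in the text (percent of rotatable bonds).
FLEXIBILITY_INTERVALS = {
    "Low": (0.0, 30.0),
    "Moderate": (40.0, 60.0),
    "High": (70.0, 100.0),
}


def rotatable_bond_ratio(n_rotatable, n_bonds):
    """Percentage of linker bonds that are rotatable."""
    if n_bonds <= 0:
        raise ValueError("linker must contain at least one bond")
    if not 0 <= n_rotatable <= n_bonds:
        raise ValueError("n_rotatable must lie between 0 and n_bonds")
    return 100.0 * n_rotatable / n_bonds


def flexibility_class(ratio):
    """Return 'Low', 'Moderate', or 'High', or None between intervals."""
    for label, (low, high) in FLEXIBILITY_INTERVALS.items():
        if low <= ratio <= high:
            return label
    return None


# A ring-rich linker: 2 rotatable bonds out of 10.
print(flexibility_class(rotatable_bond_ratio(2, 10)))  # Low
# A long alkyl-like chain: 9 rotatable bonds out of 10.
print(flexibility_class(rotatable_bond_ratio(9, 10)))  # High
```

Because the descriptor is a ratio, adding more rotatable bonds to a linker always pushes it towards the “High” interval, which is consistent with the observation above that unconstrained linkers grow longer in that experiment.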
We demonstrate the application of Link-INVENT in three case studies encompassing fragment linking,24,25 scaffold hopping,26 and PROTAC design.27–29 The Scoring Functions for the experiments were devised based on the corresponding fragment linking,52,53,61 scaffold hopping,64,65 and PROTAC design67 studies. We illustrate the practical adoption of Link-INVENT to real-world drug discovery projects by showcasing how to translate experimental insights into an informative Scoring Function for Link-INVENT. Subsequently, the agent learned to satisfy the desired MPO objective via RL. Specifically, in Experiment 1a: fragment linking,52,53 we showed that Link-INVENT can propose plausible linker ideas that satisfy a molecular docking constraint with an additional constraint over the permitted linker spatial occupancy by controlling for branching. More than 5000 unique Bemis–Murcko scaffolds were generated by the Link-INVENT agent, demonstrating that diverse linker ideas were explored.46 Similarly, in Experiment 1b: comparison fragment linking, we showed that the Link-INVENT agent can learn to generate molecules that satisfy a core constrained docking protocol. Furthermore, by including docking explicitly as a component in the Scoring Function, Link-INVENT is able to generate molecules that possess generally more favourable docking scores than DeLinker and SyntaLinker which are previously reported DL-based methods for linker design.16,17 In the scaffold hopping experiment,64,65 we showed that Link-INVENT can simultaneously optimize a relatively complex MPO objective encompassing a molecular docking constraint and favourable central nervous system (CNS) compatible physico-chemical properties. 
In this experiment, Link-INVENT navigated a narrow solution space and proposed plausible scaffold ideas which satisfy all desired properties and are diverse, as shown by the number of unique Bemis–Murcko scaffolds.46 In the PROTAC experiment,67 we further showed the extensive user control that Link-INVENT offers over linker properties. We demonstrated the ability to constrain the Link-INVENT agent to explore effective linker lengths within a specified interval while keeping physico-chemical properties within a specified range. Moreover, linker linearity can be controlled, constraining the agent to explore only linear linkers or linkers containing rings. Finally, we showed that linker flexibility can be controlled via the “linker ratio of rotatable bonds” component, which provides users with the ability to modulate the conformational entropy of proposed linker ideas. This series of PROTAC Sub-Experiments mimics real-world PROTAC linker design, which typically investigates linkers of variable length and flexibility.36,67
Link-INVENT is a ready-to-use generative model for linker design with the capability to optimize bespoke MPO objectives via the flexible Scoring Function. The case studies in this work show how Link-INVENT can be applied to real-world drug discovery projects and that the agent proposes plausible and diverse linker ideas. The code is freely available at https://github.com/MolecularAI/Reinvent.
The Link-INVENT code is publicly available in the following GitHub repository: https://github.com/MolecularAI/Reinvent. Molecular docking was performed using proprietary software licensed by Schrödinger (version 2019-4): LigPrep and Glide.55–59 Reproducing experiments 1 and 2 requires a Schrödinger license.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2dd00115b
‡ These authors contributed equally.
§ Present address: Odyssey Therapeutics, Cambridge, MA, USA.
This journal is © The Royal Society of Chemistry 2023