Yossra Gharbi and Rocío Mercado*
Department of Computer Science and Engineering, Section for Data Science and AI, Chalmers University of Technology, Chalmersplatsen 4, 412 96 Gothenburg, Sweden. E-mail: yossra@chalmers.se; rocio.mercado@chalmers.se
First published on 27th September 2024
Targeted protein degradation (TPD) is a rapidly growing field in modern drug discovery that aims to regulate the intracellular levels of proteins by harnessing the cell's innate degradation pathways to selectively target and degrade disease-related proteins. This strategy creates new opportunities for therapeutic intervention in cases where occupancy-based inhibitors have not been successful. Proteolysis-targeting chimeras (PROTACs) are at the heart of TPD strategies, which leverage the ubiquitin–proteasome system for the selective targeting and proteasomal degradation of pathogenic proteins. This unique mechanism can be particularly useful for dealing with proteins that were once deemed “undruggable” using conventional small-molecule drugs. PROTACs are hetero-bifunctional molecules consisting of two ligands, connected by a chemical linker. As the field evolves, it becomes increasingly apparent that traditional methodologies for designing such complex molecules have limitations. This has led to the use of machine learning (ML) and generative modeling to improve and accelerate the development process. In this review, we aim to provide a thorough exploration of the impact of ML on de novo PROTAC design – an aspect of molecular design that has not been comprehensively reviewed despite its significance. Initially, we delve into the distinct characteristics of PROTAC linker design, underscoring the complexities required to create effective bifunctional molecules capable of TPD. We then examine how ML in the context of fragment-based drug design (FBDD), honed in the realm of small-molecule drug discovery, is paving the way for PROTAC linker design. Our review provides a critical evaluation of the limitations inherent in applying this method to the complex field of PROTAC development. Moreover, we review existing ML works applied to PROTAC design, highlighting pioneering efforts and, importantly, the limitations these studies face. 
By offering insights into the current state of PROTAC development and the integral role of ML in PROTAC design, we aim to provide valuable perspectives for biologists, chemists, and ML practitioners alike in their pursuit of better design strategies for this new modality.
Fig. 1 (a) A PROTAC is a hetero-bifunctional molecule, consisting of a ligand (blue triangle) that recruits an E3 ubiquitin ligase, a warhead (orange circle) that binds to the POI, and a linker (blue curve) that connects the two binding moieties. The PROTAC functions by simultaneously binding to the POI and the E3 ligase, thus bringing them into close proximity and inducing the formation of a ternary complex. (b) The PROTAC MoA begins with an E1–ubiquitin-activating enzyme that activates ubiquitin (Ub) in an ATP-dependent manner. This activated Ub is then transferred to an E2–Ub-conjugating enzyme. Subsequently, a PROTAC simultaneously binds to the POI and an E3 ubiquitin ligase, bringing them into close proximity. This facilitates the transfer of Ub from the E2 enzyme to the POI, catalyzed by the E3 ligase. The polyubiquitinated POI is then recognized and degraded by the proteasome into smaller peptides, and the PROTAC is released back into the cellular environment where it can be reused, initiating the process again with another instance of the same POI. (c) Visual representations of dBET6 and its respective ternary complex: left – a 2D skeletal formula of the PROTAC molecule dBET6; middle – a close-up of the dBET6 degrader's three-dimensional (3D) structure in complex with CRBN and BRD4 (PDB ID: 6BOY), emphasizing the importance of the PROTAC's spatial orientation in forming a good ternary complex; and right – a space-filling model for the same complex, involving BRD4, CRBN, DNA damage-binding protein 1 (DDB1), and dBET6. Color key: BRD4 (green), CRBN (cyan), and DDB1 (dark blue).
In 2003, the same group synthesized a PROTAC using estradiol, a form of estrogen, as part of its structure.14 This PROTAC was designed to target and promote the destruction of the estrogen receptor alpha (ERα), which, when activated by estrogen, can promote the growth of some breast cancers.15 The estradiol-based PROTAC was shown to effectively enforce the ubiquitination and subsequent degradation of the α isoform of ER in vitro.14 Similarly, they created a PROTAC that incorporates dihydrotestosterone (DHT) to target and degrade the androgen receptor (AR). When activated by androgens like DHT, the AR can stimulate the growth of prostate cancer cells.16 The DHT-based PROTAC showed efficacy in promoting the rapid ubiquitination and proteasome-dependent degradation of AR in cellular tests.14 These PROTACs provided proof of concept that the modality can selectively degrade key proteins involved in cancer, opening up potential treatment benefits by TPD in hormone-responsive cancers.17,18
While first-generation PROTACs were capable of degrading target proteins, they suffered from poor cell permeability and chemical stability stemming from their high molecular weight.19 They generally exhibited low potency, requiring micromolar concentrations rather than the nanomolar concentrations typical of more potent drugs, which means that higher doses are needed for efficacy.1 Notably, early PROTACs were peptide-based and commonly used β-TrCP or Von Hippel-Lindau (VHL) as E3 ligases. One significant drawback of peptide-based therapeutics is their high molecular weight, which affects their ability to cross cell membranes. This poor permeability is a critical limitation because it means that even if a PROTAC is theoretically effective, its inability to enter cells renders it ineffective in practice.1 These limitations promoted the need to develop second-generation PROTACs, motivating a transition from peptide-based to small-molecule PROTACs. The use of small molecules expanded the range of potentially targetable proteins by taking advantage of a more extensive array of E3 ligases beyond β-TrCP and VHL, such as mouse double minute 2 homologue (MDM2), inhibitors of apoptosis proteins (IAPs), and cereblon (CRBN).19 In 2008, the Crews lab developed the first small-molecule PROTAC that could degrade a target protein within cells, in this case, targeting AR.20 This PROTAC was composed of nutlin-3A, a ligand for MDM2, and a non-steroidal androgen receptor ligand (SARM) for AR, connected by a polyethylene glycol (PEG) linker.20 The SARM-nutlin PROTAC induced the degradation of AR in a proteasome-dependent manner with enhanced cell penetration in vitro.
Since the first PROTAC was reported in the literature, the field of PROTACs has experienced remarkable growth3,21 and has led to the design of compounds with improved drug-like properties, demonstrating effectiveness both in vitro and in vivo.22–25 In 2013, the first in vivo success of PROTACs occurred with the development of phosphoPROTACs. PhosphoPROTACs are a particular form of PROTACs that exploit phosphorylation-dependent binding interactions.26 This modification was made to improve the selective targeting of proteins involved in signaling pathways. These compounds were able to inhibit tumor growth in mouse models. This was a major breakthrough, as it proved that PROTACs could be used not only in cell-based assays but also in living organisms to exert therapeutic effects.
In 2019, the first PROTACs to enter clinical trials were ARV-110 (ref. 27 and 28) and ARV-471,29 which target AR and ER, respectively. ARV-110 was tested in a heavily pre-treated population with metastatic castration-resistant prostate cancer (mCRPC). Results from a phase I trial showed that ARV-110 could reduce the levels of AR in cancer cells by at least 95%, a reduction that hampers the cancer cell's ability to grow and survive. Notably, its effectiveness in enzalutamide (ENZ)-resistant models offers a potential treatment option for patients who no longer respond to ENZ, addressing a critical gap in prostate cancer therapy. ARV-110 advanced to phase II clinical trials in 2020 based on initial phase I data that demonstrated the drug's good oral availability, safety, and tolerability in patients.30 ARV-471, on the other hand, is designed for oral administration in patients with hormone receptor-positive (HR+) and HER2-negative metastatic breast cancer. In a phase I clinical study involving breast cancer patients who had undergone multiple prior treatments, ARV-471 significantly reduced the expression level of ER in tumor tissues. It was also reported that ARV-471 was well tolerated across all tested doses (30–700 mg) and maintained a high level of ER degradation (89%).10 ARV-471 advanced to phase III clinical trials for breast cancer in 2024.
Following the lead of ARV-110, PROTAC technology has advanced significantly. Approximately 29 PROTAC drugs have entered clinical trials, which marks their successful translation into the clinic.26 Notably, this rapid expansion includes treatments targeting previously undruggable proteins, such as transcription factors and RNA-binding proteins. Additionally, these trials primarily focus on oncology, targeting cancers with poor prognoses, including metastatic prostate cancer, breast cancer, and solid tumors.
In some of the latest generations of PROTACs, additional elements have been introduced to give another dimension of control over PROTAC activity.31,32 These classes of controllable PROTACs aim to address off-tissue effects by controlling PROTAC action in a spatiotemporal manner.32 Some are designed to be activated or deactivated by specific wavelengths of light, allowing for controlled degradation processes in target cells, with potentially reduced side effects and enhanced therapeutic index. These PROTACs include phospho-dependent PROTACs that degrade targets with activated kinase-signaling cues, and light-controllable PROTACs that use light as an external cue to trigger target degradation. Notable light-controllable PROTACs, also commonly referred to as PHOTACs, include photo-caged and photo-switchable PROTACs.33 Photo-caged PROTACs are designed to be inactive in their initial form and activated by light exposure, which removes the photo-cage group and enables the degradation of the POI. Photo-switchable PROTACs, on the other hand, are designed to reversibly control the degradation process via the incorporation of photoswitchable groups such as azobenzene, which can switch between active and inactive states under different wavelengths of light. In-cell click-formed proteolysis-targeting chimeras (CLIPTACs) share similar ambitions to PHOTACs and have been used to degrade two key oncology targets successfully.34 The reader is referred to these excellent reviews for a more detailed analysis of milestones in PROTAC development.1,19,26,32,35
E3 ubiquitin ligases are categorized into two main types based on their mechanism of action (MoA) for transferring Ub's to their target proteins: HECT-domain and RING-type E3 ligases. HECT-domain E3 ligases first form a thioester bond with Ub. This means that Ub is temporarily attached to the E3 ligase itself. Subsequently, the E3 ligase transfers the Ub from itself directly onto the substrate protein that is to be tagged for degradation.46 Unlike HECT-domain ligases, RING E3 ligases do not form a direct bond with Ub. Instead, they facilitate the transfer of Ub directly from an E2 enzyme (which is conjugated with Ub) to the substrate protein.47 In essence, RING-type E3 ligases act as mediators that bring the E2–Ub conjugate close to the substrate, enabling the direct transfer of Ub. This ubiquitination cycle repeats, leading to the transfer of multiple Ub's and the polyubiquitination of the substrate. Once a protein is polyubiquitinated, it is tagged for degradation. The proteasome recognizes the tagged protein, binds to it, unfolds it, and breaks it down into smaller peptides.48 The PROTAC is then recycled for additional ubiquitination rounds of additional substrates.49
Furthermore, a key advantage of PROTACs lies in their substoichiometric catalytic activity, which operates on an event-driven basis.76 This means that PROTACs do not need to fully occupy their target proteins to be effective, in contrast to traditional inhibitors that function in an occupancy-driven manner.77 In SMDs, the effectiveness of the drug is often dependent on stoichiometrically occupying the target binding site.78 This means that a significant portion of a target protein must be bound by an inhibitor molecule for the desired therapeutic effect to be observed. This often requires relatively high concentrations of the drug to achieve sufficient occupancy, since the effects are proportional to the extent of binding.78 PROTACs operate differently: they bind transiently to their targets and, after facilitating ubiquitination, dissociate. This allows them to cycle through multiple rounds of activity, repeatedly initiating the degradation of additional instances of the same POI.1 In contrast to SMDs that act in a dose-dependent manner, this catalytic feature allows PROTACs to achieve potent effects at possibly lower doses, offering potential advantages in terms of efficacy, safety, negative side effects, and off-target effects.79
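The occupancy-driven picture can be made concrete with the standard single-site binding relation, in which the fraction of bound target is [L]/([L] + Kd). The following is a minimal sketch that assumes simple equilibrium binding with no cooperativity; the function names and example values are illustrative and are not taken from the cited studies.

```python
def fractional_occupancy(ligand_conc_nM: float, kd_nM: float) -> float:
    """Fraction of target bound at equilibrium for single-site binding."""
    if ligand_conc_nM < 0 or kd_nM <= 0:
        raise ValueError("concentration must be >= 0 and Kd must be > 0")
    return ligand_conc_nM / (ligand_conc_nM + kd_nM)


def conc_for_occupancy(target_occupancy: float, kd_nM: float) -> float:
    """Free ligand concentration (nM) needed to reach a given occupancy."""
    if not 0.0 <= target_occupancy < 1.0:
        raise ValueError("target occupancy must be in [0, 1)")
    if kd_nM <= 0:
        raise ValueError("Kd must be > 0")
    # Rearranged from occupancy = L / (L + Kd).
    return kd_nM * target_occupancy / (1.0 - target_occupancy)


if __name__ == "__main__":
    kd = 10.0  # hypothetical dissociation constant in nM
    for occupancy in (0.5, 0.9, 0.99):
        print(occupancy, round(conc_for_occupancy(occupancy, kd), 1))
```

Under these assumptions, reaching 90% occupancy requires a free ligand concentration of about nine times Kd. This is the constraint that an event-driven degrader, which can act over repeated binding cycles, is not subject to in the same way.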
One final advantageous characteristic of PROTACs worth mentioning is that they are able to selectively target and induce the degradation of specific protein isoforms. These are distinct forms of the same protein arising from a single gene. The ability to selectively target them is significant because it implies that PROTACs can be used to differentiate between closely related forms of a protein and target only the isoform(s) associated with a disease without affecting others that may have essential functions in normal cellular processes.6,69,76
These attractive characteristics make PROTACs a prime focus of drug design endeavors. To maximize the potential of this innovative class of compounds, researchers are increasingly turning to data-driven approaches for design strategies. Machine learning (ML) has thus demonstrably advanced drug discovery and development by enhancing target identification, small-molecule design, predictive biomarker discovery, and the prediction of clinical trial success.80,81 ML methods can help researchers analyze large amounts of data to identify potential drug targets, optimize compound properties, and predict how patients will respond to treatments. This makes drug development more efficient and increases the likelihood of success in vitro.80,81 Given the complexity of designing PROTACs due to the large chemical space they span and their multivalent nature, leveraging ML will likely be crucial in making the development of this new modality more feasible. Despite numerous reviews on PROTACs, there is a notable gap in the literature: an in-depth review that delves into the use of ML for PROTAC design is still lacking. In this comprehensive literature review, we explore the impact of ML on de novo PROTAC design to date. First, we delve into the distinct characteristics of PROTAC linker design, underscoring the features required to create effective bifunctional molecules capable of TPD. We then examine how ML in the context of fragment-based drug discovery (FBDD; Fig. 2a), honed for small-molecule drug discovery, is paving the way for PROTAC linker design. Our review provides a critical assessment of the obstacles inherent in the application of these methods to PROTAC development. This assessment seeks to shed light on the pressing need for specialized algorithms, enhanced data quality, and the adaptation of ML models to address the multifaceted nature of PROTAC engineering. 
Moreover, we review existing ML works that have been tailored to PROTAC design, highlighting pioneering efforts as well as the limitations associated with these existing approaches. We also offer perspectives on potential avenues for future ML research in this field.
Furthermore, the length of the linker affects the range of spatial configurations accessible to potential ternary complexes during formation, restricting which protein interfaces are accessible for interaction. Smith et al.55 demonstrated how differences in linker lengths and attachment points enable selective degradation of closely related kinase isoforms using PROTACs. The study developed isoform-selective PROTACs for the p38 mitogen-activated protein kinase (MAPK) family using the same warhead and E3 ligase but varying the linker features (linker attachment points and lengths). Two different linker attachment points (an amide and phenyl series) and varying linker lengths (10, 11, 12, and 13 atoms) were used to create distinct PROTACs that differentially recruit VHL. This selective recruitment controls the degradation of either the p38α or p38δ isoforms. For instance, PROTACs with 12- and 13-atom linkers in the amide series became highly selective for p38α degradation, showing much higher degradation efficacy compared to degraders with shorter linkers, which were also less selective. Conversely, a 10-atom linker in the phenyl series led to selective degradation of p38δ, with very minimal impact on other isoforms. This selective degradation ability is attributed to how variations in linker lengths and attachment points influence the formation of the ternary complex. By fine-tuning the linkers, PROTACs can achieve selective degradation profiles – in this particular study, shorter linkers may bring the E3 ligase into a position that is optimal for ubiquitinating p38δ but not p38α.
• Linker length is an important factor in determining the spatial configuration necessary for effective ternary complex formation. Adequate length ensures optimal potency. Both too-long and too-short linkers can negatively impact the potency of PROTACs.
• Small changes in linker length can shift the degradation selectivity between closely related protein isoforms.
The composition of the linker can also improve the PK properties of PROTACs, such as metabolic stability and biodistribution.82 These properties influence how the drug is absorbed, distributed, and eventually metabolized inside the body. However, the majority of linkers in PROTACs have been based on a limited set of chemical motifs, with PEG and alkyl chains being the most common. Approximately 55% of linkers utilize PEG, while about 30% use alkyl chains of various lengths.57 These motifs are favored due to their versatility, ease of synthesis, and ability to modulate the solubility and permeability of PROTAC molecules. Around 65% of published PROTAC structures incorporate both alkyl and PEG segments within their linkers. This combination aims to leverage the beneficial properties of both motifs, such as the flexibility and hydrophilicity provided by PEG, and the structural simplicity and modifiability of alkyl chains. A further 15% of linkers involve modifications to the basic glycol units in PEG, such as adding methylene groups.57 Such modifications are typically done to explore different chain lengths and thus influence the potential structural configurations accessible to PROTACs.
• Amide-to-ester substitution can benefit the optimization of PROTACs, and potentially other compounds, falling beyond the Rule of Five.
• Modifications in PROTAC linker composition, such as altering chemical groups and combining different motifs, directly influence the physicochemical properties of PROTACs.
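To illustrate what "beyond the Rule of Five" means in practice, the sketch below counts violations of Lipinski's four criteria (molecular weight ≤ 500 Da, LogP ≤ 5, at most 5 hydrogen bond donors, at most 10 hydrogen bond acceptors) from precomputed descriptors. The `Descriptors` class and the two example entries are hypothetical and serve only as illustration; in a real workflow the descriptor values would come from a cheminformatics toolkit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Descriptors:
    mol_weight: float       # molecular weight in Da
    logp: float             # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def ro5_violations(d: Descriptors) -> List[str]:
    """Return the list of Rule-of-Five criteria that the molecule violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("MW > 500")
    if d.logp > 5:
        violations.append("LogP > 5")
    if d.h_bond_donors > 5:
        violations.append("HBD > 5")
    if d.h_bond_acceptors > 10:
        violations.append("HBA > 10")
    return violations


# Hypothetical descriptor values, chosen only to illustrate the check.
small_molecule = Descriptors(mol_weight=350.0, logp=2.5,
                             h_bond_donors=2, h_bond_acceptors=5)
protac_like = Descriptors(mol_weight=950.0, logp=5.8,
                          h_bond_donors=4, h_bond_acceptors=14)

if __name__ == "__main__":
    print(ro5_violations(small_molecule))
    print(ro5_violations(protac_like))
```

A linker modification such as an amide-to-ester substitution removes one hydrogen bond donor, so its effect on a count like this is small even when its effect on permeability may not be.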
Another study describes how the ability of PROTACs to induce selective protein degradation is enhanced by the plastic nature of the binding interactions between CRBN and BRD4 bromodomains.54 Plasticity here means that the proteins can adopt multiple conformations at the binding interface depending on the linker length, composition, and linkage position. It was shown that different linkers can promote different binding conformations between the CRBN and BRD4. This plasticity allows the PROTAC to effectively bring the proteins into proximity in orientations that are conducive to ubiquitination. Using X-ray crystallography and molecular docking, the authors shed light on how different linker configurations lead to distinct low-energy binding conformations between CRBN and BRD4. The varying conformations accessible to PROTACs in this study illustrate how linker-induced flexibility directly impacts biological outcomes.
• The flexibility of a PROTAC's linker can be tuned by adjusting, for instance, the length and chemical composition of the linker.
• Flexibility can allow for conformational adaptability and access to multiple binding orientations.
One recent study that nicely illustrates these points used a combination of crystallographic data and mathematical modeling to explore the conformational dynamics of protein–protein interactions induced by PROTACs, to understand how these dynamics influence ubiquitination and eventual protein degradation.90 Interestingly, the authors found that the stability of the ternary complex did not necessarily correlate with increased protein degradation efficiency, suggesting that excessive stability might inhibit degradation efficiency. Notably, the spatial arrangement and kinetic properties of the ternary complex were crucial in this context: effective PROTACs brought lysine residues on the POI close to the active site of the E2 enzyme, facilitated by the E3 ligase within the complex. Lysine residues are the most common sites for ubiquitination in proteins. The authors also confirmed that the kinetics of the ternary complex, especially its dissociation rate, play a role in determining the degradation efficiency. Salt bridges and the hydrophobicity of the interactions within the ternary complex were found to contribute positively both to the cooperativity and to the half-life of the interaction. These findings suggest prioritizing compounds that can induce the necessary conformational dynamics without overly stabilizing the ternary complex, highlighting how valuable insights can be gained using computational tools.
Fragments can be an ideal starting point for drug design, with fragment growing and linking strategies allowing for the optimization of their potency and physicochemical properties. Fragment linking in particular gives the possibility for significant potency gains by ensuring that the linked molecule maintains the interactions of the original fragments, a phenomenon known as super-additivity.93 However, achieving this is in practice very challenging, as a bad linker can instead lead to the disruption of fragment binding poses.
Despite its success in drug discovery, FBDD may fall short when applied to PROTAC linker design. PROTACs are substantially larger and more complex than the small fragments typically dealt with in FBDD. The linker in a PROTAC must connect two distinct binding moieties, facilitating the formation of a stable ternary complex, and does not simply focus on improving the binding affinity. To reiterate, the linker in a PROTAC must be flexible enough to allow the formation of a ternary complex but rigid enough to maintain the correct spatial arrangement of the ligands. This balance is difficult to tackle using traditional FBDD approaches, which focus on optimizing single-binding interactions rather than complex multi-protein assemblies. The unique challenges posed by the size, complexity, and spatial requirements of PROTACs necessitate more advanced methodologies. While direct application of typical fragment-linking strategies used in FBDD is not generally feasible in PROTAC design, a modular approach can certainly be beneficial. As we show in the next section, researchers are already taking inspiration and lessons learned from FBDD and applying them to PROTAC design.
Link-INVENT93 is an extension to the existing de novo molecular design platform REINVENT; it uses policy-based reinforcement learning (RL) for multi-parameter optimization, and can be applied to both fragment linking and scaffold hopping given a desired property profile. Via RL, the Link-INVENT agent learns to generate linkers connecting molecular fragments while satisfying diverse objectives, facilitating the practical application of the model for real-world drug discovery projects. In the original study, Link-INVENT used the drug-like compound SMILES extracted from ChEMBL for training. Lenient criteria were applied to ensure the dataset's effectiveness for PROTAC applications (e.g., larger warheads). ChEMBL compounds were sliced using reaction SMIRKS to create triplets (linker, warheads, full molecule). Unrealistic data points were removed, and datasets were augmented via SMILES randomization for improved generalizability. Link-INVENT is trained based on the conditional probabilities of observing a linker given both molecular subunits, similar to SyntaLinker. The agent is initialized with the same parameters as the prior and is updated via RL to generate linkers that increasingly satisfy the desired multi-parameter optimization (MPO) objectives. The scoring function combines various components (physicochemical properties, structural features, predictive models, and binding energy approximations) to evaluate the desirability of generated linkers. Link-INVENT was tested in various experiments, demonstrating its capability to generate linkers that meet specific criteria. Notably, Link-INVENT has also been demonstrated to be effective in PROTAC linker design, successfully optimizing the properties of generated linkers, including effective length, the presence of rings, and flexibility.
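To make the idea of a multi-component scoring function concrete, the sketch below aggregates several component scores into one reward. The text above does not state how Link-INVENT combines its components, so the weighted geometric mean used here is an assumption, chosen because it is a common aggregation for multi-parameter optimization; the component names and weights are hypothetical.

```python
import math
from typing import Dict


def weighted_geometric_mean(scores: Dict[str, float],
                            weights: Dict[str, float]) -> float:
    """Aggregate component scores in [0, 1] into a single value in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    log_sum = 0.0
    for name, score in scores.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
        if score == 0.0:
            # A single failed component zeroes the whole reward.
            return 0.0
        log_sum += weights[name] * math.log(score)
    return math.exp(log_sum / total_weight)


if __name__ == "__main__":
    # Hypothetical component scores for one generated linker.
    scores = {"linker_length": 0.9, "ring_count": 0.6, "flexibility": 0.8}
    weights = {"linker_length": 2.0, "ring_count": 1.0, "flexibility": 1.0}
    print(round(weighted_geometric_mean(scores, weights), 3))
```

With a product-style aggregation, a linker that fails any single objective receives no reward. A weighted sum would instead allow a strong component to compensate for a weak one, which may or may not be desirable depending on the design goal.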
Due partly to the surprising effectiveness of 2D representations like SMILES, the majority of molecular generative models used for de novo molecular design and FBDD have made limited use of 3D structural information, including SyntaLinker and Link-INVENT. Nevertheless, the PROTAC MoA suggests that incorporating 3D information may come to play an important role in designing PROTAC structures, which lead to favorable ternary complexes. In the next subsection, we cover ML models that seek to incorporate structural information into their molecular design workflows.
3DLinker101 is a conditional generative model for designing molecular linkers using 3D spatial information and is capable of generating linker graphs along with their 3D structures and anchor atoms. This is achieved through an E(3)-equivariant graph VAE, addressing challenges such as the conditional generation of linkers based on two input ligands and the requirement for 3D structural awareness to avoid atom clashes. It predicts both the graph (2D) structure of the linker and its 3D coordinates while ensuring the model's outputs are equivariant with respect to E(3) group symmetries (i.e., rotation, translation, and reflection). The training data was derived from the ZINC database,102 from which 3D conformers were generated for each molecule using RDKit and the lowest-energy conformation chosen as the reference structure. The final curated dataset contains ∼366k (fragment, linker, coordinate) triplets and was roughly divided into 99.8%/0.1%/0.1% training/validation/testing splits. Using this generous training split, the model outperforms other baselines, including DeLinker and other 2D graph generative models (coupled with ConfVAE103 for 3D structure generation), in recovering molecular graphs and accurately predicting the 3D coordinates of atoms. Nevertheless, it is unclear if the reported metrics are for the training, validation, or test set. While 3DLinker demonstrates improved performance in generating 3D molecular structures with accurate geometry, precise connection of molecular fragments, and higher recovery rates, the authors observed that this comes with the trade-off of lower uniqueness and novelty in sampled molecules compared to the benchmarked approaches.
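The E(3) symmetries mentioned above can be illustrated with a small numerical check: rotating, reflecting, and translating a set of atomic coordinates changes the coordinates but leaves all interatomic distances unchanged. The sketch below demonstrates only this symmetry property and is not a description of the 3DLinker architecture; the coordinates and helper names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def rotate_z(p: Point, angle: float) -> Point:
    """Rotate a point about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def reflect_x(p: Point) -> Point:
    """Mirror a point through the yz-plane."""
    return (-p[0], p[1], p[2])


def translate(p: Point, shift: Point) -> Point:
    """Shift a point by a fixed offset."""
    return (p[0] + shift[0], p[1] + shift[1], p[2] + shift[2])


def transform(points: List[Point], angle: float, shift: Point) -> List[Point]:
    """Apply a rotation, then a reflection, then a translation."""
    return [translate(reflect_x(rotate_z(p, angle)), shift) for p in points]


def pairwise_distances(points: List[Point]) -> List[float]:
    """All interatomic distances, in a fixed pair order."""
    return [math.dist(a, b)
            for i, a in enumerate(points) for b in points[i + 1:]]


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.2, 0.3)]
    moved = transform(coords, angle=0.7, shift=(3.0, -2.0, 5.0))
    print(pairwise_distances(coords))
    print(pairwise_distances(moved))
```

An equivariant generative model is built so that applying such a transformation to the input fragments yields the correspondingly transformed linker coordinates, rather than a different molecule.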
DiffLinker104 is an E(3)-equivariant 3D-conditional diffusion model for the design of molecular linkers. This approach uniquely generates molecular linkers for a set of input fragments represented as 3D atomic point clouds, overcoming the limitations of previous methods by not being restricted to linking pairs of fragments. DiffLinker automatically determines the number of atoms in the linker and its attachment points to the input fragments. Like the previous approaches, DiffLinker was trained and evaluated on a dataset derived from ZINC-250k, but the authors also benchmarked it on two additional datasets: one derived from CASF-2016, and another derived from GEOM.105 The molecules derived from GEOM can be decomposed into three or more fragments with one or two linkers connecting them, creating a more challenging benchmark that better approximates real-world usage. DiffLinker demonstrates an ability to generate diverse and synthetically accessible molecules with minimal clashes, especially when conditioned on target protein pockets. It represents a significant advancement in FBDD, providing a powerful tool for the generation of chemically relevant molecules in a flexible and efficient manner. Nevertheless, the authors did not apply their fragment-linking approach to PROTAC design.
Building upon the success of SyntaLinker, DRlinker106 is a similar approach that incorporates RL, and, indirectly, 3D information, for the generation of linkers with specific 2D and 3D attributes. It was trained and evaluated for FBDD on datasets derived not only from ChEMBL, but also from CASF-2016.107 On tasks like optimizing bioactivity, it achieves a 91.0% and 93.9% success rate in generating compounds with desired linker length and LogP, respectively. Despite being based on 2D SMILES representation, DRlinker can also perform scaffold-hopping in a way that generates molecules with high 3D similarity but low 2D similarity to lead inhibitors. Two years later, the same team followed up with another model for FBDD, which aims to better incorporate 3D information. GRELinker108 combines a gated-graph neural network (GGNN109) with RL and curriculum learning (CL) to design linkers with desirable property profiles. Its architecture is very similar to that of GraphINVENT.110 It outperforms DRlinker in tasks such as controlling LogP, optimizing synthesizability and bioactivity, and generating molecules with high 3D similarity but low 2D similarity to lead compounds. It has also been evaluated in scenarios representative of real-world use-cases, where the aim is to optimize for molecular affinity using docking scores. The authors found that the use of CL improved its efficiency in generating complex linkers.
Despite the successes of the aforementioned works in FBDD, and, in particular, of DeLinker and Link-INVENT in PROTAC linker design, the methods reviewed above all face a key limitation – they were all trained and optimized on small-molecule binders rather than on an actual PROTAC dataset. Although careful filtering was done to make the datasets more generalizable beyond small-molecule binders, due to the fact that warheads can be much larger than the typical fragments used in FBDD, we argue that the training of these models may not fully capture the unique features and complexities of larger, multivalent molecules like PROTACs, nor their unique chemistry. As previously discussed, PROTACs not only have larger sizes but also exhibit different biophysical and chemical properties compared to the small molecules typically found in drug discovery databases (Fig. 3). This training limitation can affect the applicability of these methods for designing effective PROTAC linkers, as the chemical space and design strategies for PROTACs diverge significantly from those of small molecules. This underscores the necessity for specialized tools for PROTAC linker design that can accommodate their unique size, complexity, and 3D structural requirements. The next section reviews ML models specifically tailored for PROTAC design.
Fig. 3 The distributions of various molecular descriptors in PROTACs versus small molecules. PROTACs were downloaded from PROTAC-DB and PROTACpedia, while small molecules were randomly sampled from ZINC-250k,102 a popular database used in drug discovery containing commercially-available compounds for virtual screening (e.g., drug-like compounds). This comparative analysis of their chemical and physical properties highlights the differences between both classes of molecules. The descriptors include molecular weight, partition coefficient (LogP), number of rotatable bonds, number of hydrogen bond donors (HBDs) and acceptors (HBAs), and normalized atom counts for carbon.
PROTAC-RL is a deep generative model combining an augmented transformer architecture with memory-assisted RL capable of generating PROTACs with favorable PK properties, including solubility, stability, and bioavailability.112 Notably, the authors experimentally validated their model by testing the synthetic feasibility of six of their designs. To address the challenge of limited training data, the model was pre-trained using a large dataset of PROTAC-like structures, termed quasi-PROTACs, followed by fine-tuning on actual PROTAC data. Given a pair of E3 ligand and warhead SMILES, the model generates linkers that aim to optimize the PK attributes of the returned PROTACs. PROTAC-RL achieved a recovery rate of 43.0%, much higher than the recovery rates of the baseline models, DeLinker and SyntaLinker, even after these were retrained using the PROTAC training datasets. After retraining, DeLinker and SyntaLinker achieved recovery rates of 4.8% and 10.4%, respectively. This stark contrast in recovery rates between PROTAC-RL and the benchmarked models after retraining further strengthens the argument that models designed and trained for small molecular fragments cannot adequately capture the unique aspects of PROTACs, as the design strategies and principles for these two classes of molecules are fundamentally different. Because the RL component allows for the conditional generation of PROTACs with specific properties, such as a desired protein target, the authors applied PROTAC-RL to the design of BRD4-targeting PROTACs. To this end, they generated 5k compounds, which they then filtered through a combination of ML classifiers and molecular simulations to identify candidates with favorable PK properties and synthetic accessibility. Of the six candidate PROTACs, which were synthesized and experimentally tested, three showed inhibitory activity against BRD4 in cell-based assays. One lead candidate demonstrated high anti-proliferative potency and a favorable PK profile in mice.
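In fragment-linking benchmarks, a recovery rate is commonly computed as the fraction of test cases for which the reference linker appears among the model's samples. The following is a minimal sketch of that bookkeeping, assuming the linkers have already been canonicalized to comparable strings. The function name and the toy linker strings are hypothetical and are not taken from the PROTAC-RL code base, and the exact evaluation protocol used in that study may differ.

```python
from typing import Iterable, Sequence


def recovery_rate(references: Sequence[str],
                  generated: Sequence[Iterable[str]]) -> float:
    """Fraction of test cases whose reference linker appears among the samples."""
    if len(references) != len(generated):
        raise ValueError("need one collection of generated linkers per reference")
    if not references:
        return 0.0
    hits = sum(ref in set(samples) for ref, samples in zip(references, generated))
    return hits / len(references)


# Toy example with made-up linker strings (assumed to be canonicalized already).
refs = ["CCOCC", "CCCCCC", "C#CCN", "CCOCCOCC"]
samples = [
    ["CCOCC", "CCCC"],      # reference recovered
    ["CCCCC", "CCCCCCC"],   # reference not recovered
    ["C#CCN"],              # reference recovered
    [],                     # nothing generated for this test case
]
rate = recovery_rate(refs, samples)
print(f"recovery rate: {rate:.1%}")
```

Because the metric depends on exact string matches, consistent canonicalization of both the reference and the generated linkers is a precondition for comparing models fairly.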
AIMLinker is a GGNN109 model for autoregressive PROTAC linker generation at the atomic/bond level.111 Like GRELinker, it seeks to improve upon previously-developed graph-based deep generative models like DeLinker, CGVAE,113 and GraphINVENT110 via the incorporation of 3D information. AIMLinker was trained on a dataset combining molecules from ZINC and PROTAC-DB.114 The training focused on predicting viable 2D linker structures from fragment–molecule pairs. Generated molecules were then validated via molecular docking and simulations to verify binding to the target proteins via binding affinity and conformational predictions. AIMLinker was used to successfully generate a diverse library of novel PROTACs. The model demonstrated superiority over other fragment-linking methods (DeLinker and DiffLinker) in generating molecules with favorable PK properties and high binding affinities, with a few designed PROTACs even outperforming the reference compound dBET6 in binding affinity and structural alignment (Fig. 1c). Despite promising performance in PROTAC linker design, AIMLinker has two current limitations, namely the focus on a single PROTAC target (BRD4) and the reliance on docking predictions, which are known to be inaccurate.115
Finally, ShapeLinker116 is a model based on Link-INVENT, but with an important shape alignment contribution to the scoring function, and less significant but still important contributions from the ratio of rotatable bonds and the linker length ratio. The authors train on PROTAC-DB114 data, as well as on ten well-known ternary complexes from the Protein Data Bank (PDB): 5T35, 7ZNT, 6BN7, 6BOY, 6HAY, 6HAX, 7S4E, 7JTP, 7Q2J, and 7JTO. All of these complexes have binding PROTACs that were optimized in individual structure-based drug studies and cover a diverse range of PROTAC (and linker) “shapes”. These are included in the training of the shape alignment model. Nevertheless, it is not clear whether these additions indeed improve the performance of ShapeLinker over that of the base Link-INVENT. The results suggest that perhaps larger changes to the model architecture are required for step-changes in performance.
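To make the idea of a multi-component scoring function concrete, the sketch below combines three component scores into a single reward using a weighted geometric mean, one aggregation style used in RL-based molecular design frameworks. The component names mirror the terms mentioned above, but the weights, the score values, and the function itself are illustrative assumptions and do not reproduce ShapeLinker's actual implementation.

```python
import math
from typing import Mapping


def aggregate_score(components: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Weighted geometric mean of component scores, each expected in [0, 1]."""
    total_weight = sum(weights[name] for name in components)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    log_sum = 0.0
    for name, score in components.items():
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
        if score == 0.0:
            return 0.0  # a zero component vetoes the whole molecule
        log_sum += weights[name] * math.log(score)
    return math.exp(log_sum / total_weight)


# Hypothetical weights: shape alignment dominates, the two ratios contribute less.
weights = {"shape_alignment": 3.0, "rotatable_bond_ratio": 1.0, "linker_length_ratio": 1.0}

# Two made-up candidates: one aligns well in shape, the other only scores well on the ratios.
well_aligned = {"shape_alignment": 0.9, "rotatable_bond_ratio": 0.5, "linker_length_ratio": 0.5}
poorly_aligned = {"shape_alignment": 0.5, "rotatable_bond_ratio": 0.9, "linker_length_ratio": 0.9}

print(round(aggregate_score(well_aligned, weights), 3))
print(round(aggregate_score(poorly_aligned, weights), 3))
```

With these weights the well-aligned candidate receives the higher reward, which is the behavior one would want if shape alignment is meant to be the dominant term.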
Equally important is the choice of an E3 ligase and a corresponding ligand. The selection of the E3 ligase often involves considering a range of factors, such as the ubiquitination efficiency, the specificity of the ligase, and its expression levels within the relevant cells or tissues. For instance, if a PROTAC is being developed for cancer therapy, the chosen E3 ligase should be highly expressed in cancer cells and less so in healthy cells to minimize off-target effects.118 Furthermore, different E3 ligases can induce varying degrees of degradation even with the same POI ligands and linkers. The selection of an effective E3 ligase and ligand is thus a critical aspect of PROTAC design, and structural knowledge of the E3 ligase–POI interaction can significantly aid this process. Interestingly, neither the binding affinity of the warhead nor that of the E3 ligase ligand seems to directly influence the degradation efficiency of the PROTAC.117
Once the individual ligands have been selected, the next challenge is to design a molecule where these components work in concert. This harmony is essential for the formation of an effective ternary complex between the PROTAC, the target protein, and the E3 ligase. The spatial and temporal dynamics of this complex formation are critical. It's not just about bringing these entities into proximity; it's also about ensuring that they interact in a manner that facilitates the transfer of ubiquitin from the E3 ligase to the target protein. For instance, the spatial arrangement in a potential ternary complex needs to allow the POI's ubiquitination site to be accessible to the E3 ligase once the complex is formed. This may involve tweaking the linker length, rigidity, or chemical composition to achieve the optimal orientation.
Nevertheless, some ML-driven methods for PROTAC design seek to tackle the problem in a more holistic manner. Though less common than modular approaches, which focus heavily on PROTAC linker optimization, comprehensive PROTAC design strategies can be advantageous for a few reasons. They allow, in principle, for the simultaneous optimization of multiple parameters, such as flexibility, cell permeability, and degradation efficiency—factors determined not only by the linker composition, but also by that of the warhead and E3 ligand. A holistic approach may also better account for the complex interactions between the different PROTAC components, leading to the design of more effective and specific PROTACs; nevertheless, this remains to be rigorously demonstrated. In the next section, we examine the only study which, to our knowledge, has tackled the problem of engineering PROTACs in a holistic fashion, challenging traditional FBDD principles.
Although diffusion models have not yet been applied to PROTAC design (only FBDD, as in DiffLinker104), we believe they present a promising direction in PROTAC engineering, both for linker-only and holistic design strategies. Firstly, diffusion models excel at generating high-quality molecular structures by gradually transforming simple distributions into complex data distributions.123 Secondly, diffusion models can naturally integrate 3D information, which allows for the design of PROTACs that account for the spatial arrangement and interactions between the POI, the PROTAC, and the E3 ligase. Diffusion models are also known for their robustness in handling noisy data,124 and they can be integrated with existing generative and predictive frameworks in an online setting.125 These capabilities of diffusion models make them natural choices to explore further for generating diverse and novel PROTAC structures.
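As a rough illustration of how a diffusion model relates a simple distribution to a complex one, the sketch below implements only the forward (noising) half of a standard denoising diffusion process on 3D atom coordinates: x_t is drawn from a Gaussian centered at sqrt(alpha_bar_t) * x_0 with variance (1 - alpha_bar_t). The linear noise schedule, the three-atom toy "linker", and all function names are assumptions made for illustration. This is not DiffLinker's implementation, and the learned reverse (denoising) network that actually generates molecules is omitted.

```python
import math
import random
from typing import List, Tuple

Coord = Tuple[float, float, float]


def alpha_bar(t: int, betas: List[float]) -> float:
    """Cumulative product of (1 - beta_s) over the first t steps (1.0 for t = 0)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noise_coordinates(x0: List[Coord], t: int, betas: List[float],
                      rng: random.Random) -> List[Coord]:
    """Sample x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a_bar = alpha_bar(t, betas)
    signal, sigma = math.sqrt(a_bar), math.sqrt(1.0 - a_bar)
    return [tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in atom)
            for atom in x0]


# Linear schedule over 1000 steps (a common default; an assumption here).
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]

# Hypothetical three-atom fragment laid out along the x axis (coordinates in angstroms).
linker_atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]

rng = random.Random(0)
for t in (0, 100, 500, 1000):
    noisy = noise_coordinates(linker_atoms, t, betas, rng)
    print(t, f"alpha_bar={alpha_bar(t, betas):.4f}", noisy[1])
```

At t = 0 the coordinates are returned unchanged, and by the final step almost no signal from the original geometry remains. A generative model is trained to invert this corruption step by step, which is where conditioning on the POI and E3 ligase pockets could enter.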
For a detailed summary of all models surveyed in this work, please see Table 1.
Model | Year | Data | Type | Focus |
---|---|---|---|---|
SyntaLinker96 | 2020 | SMILES | Transformers | Fragment linking |
PROTAC-RL112 | 2022 | SMILES | Transformers + RL | Fragment linking & PROTAC linker design |
Link-INVENT93 | 2023 | SMILES | LSTM + RL | Fragment linking & PROTAC linker design |
ShapeLinker116 | 2023 | SMILES (w/3D coords) | Link-INVENT | PROTAC linker design |
DRlinker106 | 2022 | SMILES (w/3D coords) | Transformers + RL | Fragment linking |
DeLinker98 | 2020 | 2D graphs (w/3D coords) | JT-VAE | Fragment linking |
DEVELOP100 | 2021 | 2D graphs + 3D coords | JT-VAE + CNN | Fragment linking |
3DLinker101 | 2022 | 2D graphs + 3D coords | E(3) eq. graph VAE | Fragment linking |
Nori et al.119 | 2022 | 2D graphs | GraphINVENT110 (GGNN + RL) | Full (“holistic”) PROTAC design |
AIMLinker111 | 2023 | 2D graphs (w/3D coords) | GGNN | PROTAC linker design |
GRELinker108 | 2024 | 2D graphs (w/3D coords) | GGNN + RL + CL | Fragment linking |
DiffLinker104 | 2024 | 2D graphs + 3D coords | 2D GNN + E(3) eq. 3D diffusion | Linker size prediction & fragment linking |
Nori et al.119 | 2022 | Morgan fingerprints | XGBoost | Degradation activity prediction |
DeepPROTACs120 | 2022 | SMILES + 3D graphs | GCNs + LSTMs | Degradation activity prediction |
Ribes et al.121 | 2024 | Morgan fingerprints | MLP | Degradation activity prediction |
PROTAC-DB is a public database designed to support the research and development of PROTACs.114 It offers an online repository of structural and experimental data related to these molecules. Data in the database is manually extracted from the literature or calculated using specific programs. The second release expanded the number of PROTACs to 3270 and featured ∼360 warheads, ∼1500 linkers, and ∼80 E3 ligands. As of June 2024, PROTAC-DB contains 5388 entries. It also includes ternary complex structures for PROTACs. PROTAC-DB covers key aspects of PROTAC activity, including degradation capacity, quantified by metrics like DC50 and Dmax; binding affinities between PROTACs (or PROTAC ligands) and target proteins and E3 ligases; cellular activities such as IC50, EC50, GI50, and GR50; and PAMPA and Caco-2 permeability data. Nevertheless, entries are not necessarily complete and much data is missing from PROTAC-DB, often because the original source does not report all of the aforementioned metrics.121
PROTACpedia is a curated database focused on PROTACs, containing detailed entries on 1190 PROTAC molecules as of the latest update (October 2022).122 It contains high-quality data that has been carefully curated by experts, including information on ∼202 warheads, ∼65 E3 ligands, and ∼806 linkers. This platform facilitates the sharing and dissemination of critical PROTAC-related data to help expand PROTACpedia. Its collaborative nature encourages contributions to ensure that the database remains an up-to-date resource for researchers exploring PROTACs.
There is a significant overlap in the activity distributions of structures deposited in PROTACpedia with those in PROTAC-DB. As of June 2024, there are 807 PROTACs present in both databases, identified via string comparison following canonicalization of PROTAC SMILES from both databases. In other words, roughly 68% of PROTACs in PROTACpedia and 25% of PROTACs in PROTAC-DB are present in both databases. We did not explore what fraction of the duplicate PROTAC structures correspond to duplicate entries between the two databases, as it is possible that a PROTAC may be present in both databases but still contain information about different sets of experiments.
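The overlap analysis described above amounts to canonicalizing the SMILES from each database and intersecting the resulting sets. Here is a minimal sketch of that procedure, written so that the canonicalization routine is passed in as a function. In practice one would likely use a cheminformatics toolkit for this step (for example, RDKit's `Chem.MolToSmiles(Chem.MolFromSmiles(s))`); the whitespace-stripping stand-in, the function names, and the toy SMILES below are assumptions made so that the example runs without external dependencies.

```python
from typing import Callable, Dict, Iterable, Set


def canonical_set(smiles: Iterable[str],
                  canonicalize: Callable[[str], str]) -> Set[str]:
    """Canonicalize every SMILES string and drop duplicates."""
    return {canonicalize(s) for s in smiles}


def overlap_report(db_a: Iterable[str], db_b: Iterable[str],
                   canonicalize: Callable[[str], str]) -> Dict[str, float]:
    """Count structures shared by two databases and the fraction of each they cover."""
    a = canonical_set(db_a, canonicalize)
    b = canonical_set(db_b, canonicalize)
    shared = a & b
    return {
        "shared": len(shared),
        "fraction_of_a": len(shared) / len(a) if a else 0.0,
        "fraction_of_b": len(shared) / len(b) if b else 0.0,
    }


# Stand-in canonicalizer: the toy strings are assumed to be canonical already,
# so only stray whitespace is removed. Swap in a real toolkit call for real data.
def toy_canonicalize(smiles: str) -> str:
    return smiles.strip()


database_a = ["CCO ", "c1ccccc1", "CCN"]
database_b = ["CCO", "CCN", "CCC", "CCCl"]
report = overlap_report(database_a, database_b, toy_canonicalize)
print(report)
```

With the counts reported above, 807 shared structures out of 1190 PROTACpedia molecules corresponds to 807/1190 ≈ 68%.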
As PROTACs represent a relatively new therapeutic modality, there is a relative scarcity in the number of publicly available crystal structures, especially for ternary complexes. Structures that are available in the PDB have most frequently been determined using cryogenic electron microscopy (cryo-EM), a technology that has revolutionized the field of protein structural biology. Nevertheless, because the PROTAC MoA is not particularly well-understood, generalizations are being made across a range of PROTACs based on limited mechanistic data. Researchers would be wise to exercise caution when generalizing too far beyond the scope of their models or experiments.
Part of the challenge in ML-driven PROTAC engineering stems from the limited amount of structured data available. While public databases such as these have been greatly influential thus far in driving the development of ML tools for PROTAC design, without more comprehensive datasets, data-driven models will only be able to access a fraction of the vast chemical space accessible with PROTACs. Data scarcity becomes even more of a concern when considering factors like bioactivity, PK properties, and 3D structure in PROTACs. Low-data and low-resource learning can provide valuable strategies in the current scarce data landscape,127 but, ultimately, more high-quality, structured data will need to be systematically generated and deposited following FAIR data-sharing principles for researchers to truly harness the powers of ML in PROTAC design. We hope that, just as ML has become an invaluable tool for identifying hits and optimizing leads in small-molecule drug discovery pipelines, it will also transform the current paradigm of PROTAC engineering, making us wonder how we ever managed without it.
• Molecular weight: the small-molecule MW distribution peaks around 250–500 Da. This is the typical range expected for drug-like small molecules as it is considered optimal for oral bioavailability according to Lipinski's Rule of Five. The PROTAC distribution peaks around 750–1000 Da, highlighting how much larger they are than traditional small molecules. The small-molecule distribution is relatively narrow and sharply peaked, indicating a more uniform range of MWs, while the PROTAC distribution is broader, reflecting greater variability in the size of these complex molecules. The clear separation between the two distributions highlights a key difference between PROTACs and traditional small molecules: their size.
• Partition coefficient: the LogP distribution for small molecules peaks around 2–3. This is consistent with drug-likeness criteria, where a LogP value between 1 and 3 is typically considered favorable for oral bioavailability. The LogP distribution for PROTACs is broader and peaks around 5. Higher LogP values indicate that PROTACs are generally more hydrophobic than small molecules, which can affect their solubility and cellular permeability. The higher LogP values for PROTACs may pose challenges for their solubility in aqueous environments like the extracellular environment and the cytosol, and may require formulation strategies to enhance their solubility and bioavailability.
• Rotatable bonds: the distribution in the number of rotatable bonds peaks around 1–5 rotatable bonds for small molecules. Fewer rotatable bonds are associated with greater rigidity. For PROTACs, the distribution instead peaks around 15–20 rotatable bonds. The higher number of rotatable bonds can be largely attributed to the flexible linker regions in PROTACs.
• Hydrogen bonds: the HBD distribution of small molecules peaks around 0–2. This is in line with the drug-likeness criteria that suggest a limited number of hydrogen bond donors to ensure good membrane permeability. On the other hand, the HBD distribution of PROTACs peaks around 3–5. Similar trends are observed for the HBA distributions.
• Carbon composition: both PROTACs and small molecules have a high normalized carbon count, peaking between 0.7–0.8, with the peak being slightly lower for PROTACs. Small molecules display a slightly broader distribution in normalized carbon atom count. No significant differences in normalized nitrogen, oxygen, or fluorine atom composition were observed, although small molecules do display marginally broader distributions for all these atom types.
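The descriptor ranges above can be related to Lipinski's Rule of Five with a simple violation count (molecular weight above 500 Da, LogP above 5, more than 5 hydrogen bond donors, more than 10 hydrogen bond acceptors). The sketch below applies that check to two hypothetical descriptor profiles chosen to sit near the distribution peaks described in the list. The specific numbers, including the HBA counts, are illustrative assumptions rather than values taken from Fig. 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    mol_weight: float  # molecular weight in Da
    logp: float        # partition coefficient
    hbd: int           # hydrogen bond donors
    hba: int           # hydrogen bond acceptors


def rule_of_five_violations(d: Descriptors) -> int:
    """Number of Lipinski criteria that the molecule violates (0 to 4)."""
    return sum([d.mol_weight > 500, d.logp > 5, d.hbd > 5, d.hba > 10])


# Hypothetical profiles near the peaks discussed above (HBA values are assumed).
typical_small_molecule = Descriptors(mol_weight=350.0, logp=2.5, hbd=1, hba=4)
typical_protac = Descriptors(mol_weight=900.0, logp=5.5, hbd=4, hba=14)

print("small molecule violations:", rule_of_five_violations(typical_small_molecule))
print("PROTAC violations:", rule_of_five_violations(typical_protac))
```

A PROTAC-like profile violates several criteria at once, which is one way of seeing why drug-likeness filters tuned for small molecules transfer poorly to this modality.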
This comparative analysis highlights the unique challenges and opportunities facing ML models for PROTAC design. Furthermore, it should be evident that models trained on small fragments will not capture the distinct features of PROTACs, as small molecule fragments and PROTACs exist in largely non-overlapping areas of chemical space. Extending ML models developed for small molecules to PROTACs requires modifications; this could entail small changes, like re-training on larger molecules and/or more diverse datasets that include PROTACs, or changes to fundamental design principles. It is well-known in deep learning, including generative modeling, that the predictive accuracy of a model depends heavily on the availability of high-quality datasets.130 However, in drug design, datasets often have various forms of inconsistencies and missing data.131 These challenges are even more pronounced when focusing on PROTAC design; here, experimental data on the efficacy and specificity of PROTAC molecules is even scarcer. The multifaceted nature of PROTACs necessitates detailed and high-quality datasets to uncover the subtle patterns underlying their biological activity. This is especially crucial for turning predictive models into tools for robust PROTAC optimization and design.
Models that fail to incorporate structural, or even dynamical, information regarding PROTACs and their target proteins might not effectively capture the feasibility of ternary complex formation. We know that the specificity and activity of PROTACs towards a specific POI are influenced by precise 3D interactions at the molecular level, more so than by the binding affinity. Without detailed 3D structural data, ML models may not be accurate enough to generalize to new PROTAC structures or even new POIs, a concern reflected in the changing landscape of ML models for PROTAC design: while early work focused primarily on 2D representations or simplified 3D information, all work we surveyed from the past two years involved the incorporation of more complex 3D data. We don't believe this change is due solely to advances in computing hardware and software. Rather, models capable of handling 3D data offer a superior capability to capture the interplay of molecular shapes and complex spatial arrangements especially relevant in PROTAC function.
Another big challenge with PROTAC design is getting them into the cell.59 As they depend on the proteasome for degrading their target proteins, PROTACs can only be used to target proteins found in the cytosol or with cytosolic domains (for membrane proteins), thus excluding as targets any proteins found outside the cell. According to the Human Protein Atlas, ∼25% of all protein-coding human genes have been shown to encode proteins that localize to the cytosol and its substructures,132 though this estimate does not include proteins that transiently reside in the cytosol. How to improve cellular permeability in PROTACs is thus an active area of research, as it imposes hard constraints on their efficacy. Notwithstanding, exactly which mechanism PROTACs use for entering the cell is not fully understood and may very well vary depending on the specific molecule and cell type, adding another layer of complexity to the task of cell permeability prediction.
To overcome the many challenges facing de novo PROTAC design, future ML methods must place a greater emphasis on accurately modeling the 3D structures of PROTACs and the corresponding ternary complexes they form. This could involve the development of molecular dynamics or physics-based approaches that leverage ML to simulate important molecular interactions and conformational changes at a coarse-grained or even atomistic level, and it could also involve experimental advances that allow us to better isolate and characterize these complexes, possibly with the assistance of active learning or other ML-driven strategies. Scientists leveraging ML have undeniably driven many recent breakthroughs in protein structure prediction133 and conditional protein structure generation.134 Perhaps it's time to apply similar guiding principles to PROTAC engineering (e.g., systematic data collection and accessibility, better algorithms harnessing biological knowledge), and see what breakthroughs we can achieve in this domain.
In this comprehensive review, we have highlighted the significant impact of ML on PROTAC design. The complexities involved in PROTACs make traditional ML in the context of FBDD less effective. These complexities include the unique mechanism of action of PROTACs, the delicate spatial configuration required for effective protein degradation, and the need for favorable PK profiles for drug-like compounds, which are not adequately captured by models designed for small molecular fragments. Advanced ML techniques, such as generative models tailored to PROTAC peculiarities, offer promising solutions for optimizing PROTAC design.
We have prepared this review on ML for PROTAC design in the hope of spurring more research in what we view as a hugely impactful but formidable research direction. We hope that the insights described in it serve as a guide to researchers looking to apply their deep ML knowledge to the design of an exciting “new” therapeutic modality, and that they conversely enable biologists to venture into the rewarding world of deep generative models. The synergy between ML and PROTAC design holds immense potential, and we encourage further research in this pioneering domain.
This journal is © The Royal Society of Chemistry 2024