Yuyao
Yang‡
ab,
Shuangjia
Zheng‡
a,
Shimin
Su
ab,
Chao
Zhao
a,
Jun
Xu
*a and
Hongming
Chen
*b
aResearch Center for Drug Discovery, School of Pharmaceutical Sciences, Sun Yat-Sen University, 132 East Circle at University City, Guangzhou 510006, China. E-mail: junxu@biochemomes.com
bCenter of Chemistry and Chemical Biology, Guangzhou Regenerative Medicine and Health Guangdong Laboratory, Guangzhou 510530, China. E-mail: chen_hongming@grmh-gdl.cn
First published on 22nd July 2020
Linking fragments to generate a focused compound library for a specific drug target is one of the challenges in fragment-based drug design (FBDD). Hereby, we propose a new program named SyntaLinker, which is based on a syntactic pattern recognition approach using deep conditional transformer neural networks. This state-of-the-art transformer can link molecular fragments automatically by learning from the knowledge of structures in medicinal chemistry databases (e.g. ChEMBL database). Conventionally, linking molecular fragments was viewed as connecting substructures that were predefined by empirical rules. In SyntaLinker, however, the rules of linking fragments can be learned implicitly from known chemical structures by recognizing syntactic patterns embedded in SMILES notations. With deep conditional transformer neural networks, SyntaLinker can generate molecular structures based on a given pair of fragments and additional restrictions. Case studies have demonstrated the advantages and usefulness of SyntaLinker in FBDD.
In recent years, fragment-based drug design (FBDD) has gained considerable attention as an alternative drug discovery strategy due to its lower cost in running the assay and potential advantages in identifying hits for difficult targets.9–11 The concept of FBDD dates back to the pioneering work of William Jencks in the mid-1990s.12 It usually starts from screening low molecular weight molecules (for example, MW < 300 daltons; binding affinity of the order of mM), which have weak, but efficient interactions against a target protein.13 The fragment screening is usually carried out at high concentration and a typical fragment collection is around a few thousands of compounds in contrast to millions of compounds in HTS.14 The effective use of fragments as starting points for step-wise optimization has shown capability to overcome the major obstacles for further drug development, such as limited chemical space, low structural diversity, and unfavourable drug absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties.15 Therefore, the popularity of fragment-based drug design has grown at a remarkable rate in both industry and academic institutions.16
There are two key factors for successfully utilizing FBDD in drug discovery: (i) finding suitable fragments (ii) growing and optimizing these fragments to develop lead-like molecules. Many experimental and computational efforts for finding the fragments have been developed in the past decade,17 such as nuclear magnetic resonance (NMR), X-ray crystallography, surface plasmon resonance (SPR) and virtual screening.18 Fragment growing, merging, and linking are three main techniques to convert fragments to leads.19–22 Linking fragments is still challenging because it is difficult to retain the binding modes of the fragments after the linking. Thus, linking fragments is a key problem to be resolved to improve the ligand efficiency of the fragments.23–25 Conventional fragment linking techniques26 are database search27,28 and quantum mechanical (QM) calculations (such as fragment molecular orbital (FMO)).29–31 These techniques are limited by the size of the database or computational complexity.
Recently, advances in the development of deep generative models have spawned a mass of promising methods to address the structure generation issue in drug design.32–34 The deep generative models have been applied in de novo molecular design35–38 and lead optimization.39–41 Many generative architectures, such as RNNs,42 autoencoders,43,44 and generative adversarial networks45 (GANs), have been proposed to generate desired molecules, which are either represented in chemical structure linear notations46 (such as SMILES) or connection tables. Recently, Imrie and co-workers reported a fragment linking technique with a generative model47 (aka DeLinker), which can link fragments while keeping the relative positions of the fragments intact.
In the current study, we propose a new technique (aka SyntaLinker) for fragment linking. The algorithm works by recognizing syntactic patterns embedded in the SMILES representation. Inspired by the machine translation task in natural language processing (NLP), we regard the fragment linking as an NLP-like task (sentence completion48), and develop a new conditional transformer architecture (SyntaLinker) for linker generation in a controllable manner. In the current study, we divide ChEMBL compounds into terminal fragments and linkers, which are used to train our transformer models to learn syntactic patterns for linking fragments. Thus, it is unnecessary to use any rigid transformation rule.
Our model takes terminal fragments and linker constraints, such as the shortest linker bond distance (SLBD), the existence of the hydrogen bond donor, hydrogen bond acceptor, rotatable bond and ring etc., as the input and generates product compounds containing input fragments. Compared to DeLinker, our approach achieved higher recovery rate in terms of rational linker prediction. Finally, through a few case studies, we demonstrate the effectiveness of our approach on some common drug design tasks such as fragment linking, lead optimization and scaffold hopping.
There are two types of linking constraints used as control codes during training. One is the SLBD between two anchor points, which are used to maintain the relative position of a pair of fragments. The other one is the constraint with multiple features, such as the presence (1) or absence (0) of hydrogen bond donors (HBDs), hydrogen bond acceptors (HBAs), rotatable bonds (RBs) and rings, which can be regarded as additional pharmacophoric constraints.
This process can be regarded as an end-to-end sentence completion process, where a pair of fragments is the input signal and the full compound is the output signal. Schwaller et al.49 used encoder–decoder neural network architecture to predict reaction products in an end-to-end manner, where the reactants serve as the source sequence and the reaction product as the target sequence, whereas in our case, a pair of fragments and the constraints of the SLBD (as a prepended token) are defined as the source sequence, and the full molecule as the target sequence. Examples of the sequence expression are shown in Fig. 1. A training example with SLBD as the constraint is described as “[L_4]c1ccccc1[*].[*]C1CCOCC1 ≫ c1ccc(CNCC2CCOCC2)cc1”, where “[L_4]c1ccccc1[*].[*]C1CCOCC1” is the source sequence, “c1ccc(CNCC2CCOCC2)cc1” is the target sequence, and “[L_4]” is the SLBD (equal to four bond distance) as the control code. In the other example, “[L_4 1 1 1 0]” represents the multiple constraints, where “1 1 1 0” represents the presence of HBD, HBA, RB and the absence of a ring.
Throughout the study, SLBD only and SLBD plus multiple pharmacophore constraint model are named SyntaLinker and SyntaLinker_multi model respectively. Additionally, a reference model without using any constraint, SyntaLinker_n, was also trained for comparison.
All source and target sequences of our data set were first tokenized to construct a vocabulary. For each example sequence containing n tokens, where a token is referred to as an individual letter of the SMILES string, it was first encoded into a one-hot matrix by the vocabulary and then transformed into an embedding matrix Ms = (e1, …, en) by a word embedding algorithm53 as we did in our previous work.54Ms was composed of n corresponding vectors in Rd, where d is the embedding dimension.
The core architecture of SyntaLinker consists of multiple encoder–decoder stacks. The encoder and decoder consist of six identical layers. Each encoder layer has a multi-head self-attention sub-layer and a position-wise feedforward network (FFN) sub-layer. A multi-head attention consists of several scaled dot-product attention functions running in parallel and concatenates their outputs into final values, which allows the model to focus on information from different subspaces at different positions. An attention mechanism computes the dot products of the query (Q) with all keys (K), introduces a scaling factor dk (equal to the size of weight matrices) to avoid excessive dot products, and then applies a softmax function to obtain the weights on the values (V). Formally,
(1) |
For better understanding of the attention mechanism, we convert the mathematical expression in formula (1) into a chemically meaningful expression via a graphical illustration (as shown in Fig. S2†).
The FFN sub-layer adopts Rectified Linear Unit (ReLU) activation.55 Then, layer normalization56,57 and a residual connection58 are introduced to integrate the above two core sub-layers. Each decoder layer has three sub-layers, including two attention sub-layer and an FFN sub-layer. The decoder self-attention sub-layer uses a mask to preclude attending to future tokens, while the encoder–decoder attention sub-layer helps the decoder to focus on important characters in the source sequence, and capture the relationship between the encoder and decoder.
For a given source sequence, its input embedding is processed by encoder layers into a latent representation L = (l1, …, ln). Given L, the decoder output is normalized with a softmax function, yielding a probability distribution for sampling a token, and then generates an output sequence Y = y1, …, ym of one token at a time until the ending token “〈/s〉” is generated. Finally, the cross-entropy loss between the target sequence Mt = (e1, …, ek) and the output sequence Y is minimized during training.
(2) |
To mimic the fragment linking scenario, we constructed the data set using the matched molecular pairs (MMPs) cutting algorithm62 proposed by Hussain et al. Firstly, each molecule was deconstructed using the MMPs cutting algorithm, which executed double cuts of non-functional groups, acyclic single bonds in every compound and this will transform the compound into a quadruple form like “fragment 1, linker, fragment 2, molecule”, which corresponds to the two terminal fragments, a linker and the original compound. In total 5873503 fragment molecule quadruples (FMQs) were enumerated; secondly, the FMQs were further filtered using “Rule of three”63 criteria, i.e. an FMQ will be removed if any of its terminal fragment violates the “Rule of three” criteria; considering that the requirement of linking fragments in reality is to connect two close fragments using a linker as simple as possible, the remaining FMQs were then filtered by the SLBD of the linker (SLBD less than 15) and SAscore according to eqn (1) to ensure the terminal fragments have reasonable synthesis feasibility (SAscore is less than 5) and the SAscore of the linker is lower than the sum of the fragments, which can prevent the generation of highly complicated linkers. In our experience, applying these filters does help the generation of lead-like molecules.
(3) |
In the end, only terminal fragments and original compounds were kept as the fragment molecule triplet (FMT) for model training and all chemical structures in the FMTs were transformed to the canonicalized SMILES format46,64 with RDKit.65
Our ChEMBL data set was further divided into three sets with a ratio of 8:1:1 for training, validating, and testing, respectively. All FMTs were grouped by the corresponding SLBD. When splitting the ChEMBL set into those three sets, a random sampling strategy was adopted to make sure the distribution of SLBD is similar among all three sets.
In addition, for further evaluating the generalization capability of our model, we also considered an external validation set derived from the CASF-2016 data set,66 which consists of 285 protein–ligand complexes with high-quality crystal structures.
Moreover, the SyntaLinker method was validated in three case studies derived from the literature to demonstrate the capability of fragment linking, lead optimization, and scaffold hopping. To make sure diverse structures are generated, only the SyntaLinker model was used in the case studies. It is worthwhile to mention that the ground truth compounds in these examples were not included in the training set of our SyntaLinker models.
(4) |
(5) |
(6) |
The quality of compounds generated by the SyntaLinker model at the 3D level was also examined. In this case, the 3D conformations of compounds are generated and docked into protein binding pockets; the root mean square deviation (RMSD), and shape and colour combo-similarity score (SC) to the X-ray bound conformation of the actual ligand are generated to evaluate the model performance for selected cases. Here, RMSD is merely calculated among a pair of fragments of the X-ray and generated structures. The SC score is calculated by the pharmacophoric feature similarity69 and the shape similarity70 between the X-ray conformation of the actual ligand and the docking pose of the generated structure. The SC score is a real number in the range of [0,1] and the higher its value, the more similar the generated molecule is to the original ligand. For each molecule, the best similarity score among all docking conformers was taken as the SC score. In the current study, converting SMILES to the 3D conformation and docking were done by using the Molecular Operating Environment (MOE) software.71 The RMSD and SC score for each conformation were calculated via RDkit.65
Metrics | Models | ||
---|---|---|---|
SyntaLinker (%) | SyntaLinker_multi (%) | SyntaLinker_n (%) | |
Validity | 97.2 | 97.8 | 96.0 |
Uniqueness | 88.1 | 84.9 | 86.7 |
Recovery | 84.7 | 87.1 | 80. |
Novelty | 91.8 | 92.3 | 90.3 |
Metrics | Models | |||
---|---|---|---|---|
DeLinker (%) | SyntaLinker (%) | SyntaLinker_multi (%) | SyntaLinker_n (%) | |
Validity | 95.5 | 96.4 | 96.5 | 86.8 |
Uniqueness | 51.9 | 69.9 | 63.8 | 65.6 |
Recovery | 53.7 | 62.7 | 60.2 | 55.4 |
Novelty | 51.0 | 75.4 | 77.2 | 71.3 |
These metrics indicate that the performance of our models is significantly improved over the DeLinker model on the CASF validation set. Especially, our model improves the linker novelty of the DeLinker model by a margin of 20% without losing the recovery, meaning SyntaLinker can sample a diverse range of linkers more effectively.
SLBD controlling efficiency of three SyntaLinker models on the ChEMBL test set in top 10 and top 250 are shown in Fig. 3a and b. They contain the percentage of structures with correct SLBD and structures whose SLBD variation is less than one bond distance. 79.2%, 78.1% and 39.9% of structures have exactly the same SLBD as the control for SyntaLinker, SyntaLinker_multi and SyntaLinker_n models respectively. That is, the models with length constraints (SyntaLinker, SyntaLinker_multi) outperform the model without constraints (SyntaLinker_n). Moreover, if SLBD is allowed to vary within one bond, then the percentage of structures complying the criteria from three models are 96.1%, 96.2% and 69% respectively.
For the CASF validation set, the same trend was observed in Fig. 3c. When we increase the beam search width from 10 to 250, the control efficiency of the shortest bond length for all three models will decrease (Fig. 3b and d). However, the conditional model still demonstrates superior performance to the one without condition restrictions.
The percentage of structures with exactly equivalent pharmacophore properties to their ground truth from all three SyntaLinker models on the ChEMBL test set in top-10 and top-250 is depicted in Fig. 4. As expected, the model with pharmacophore constraints (SyntaLinker_multi) outperforms the model without this constraint (SyntaLinker, SyntaLinker_n), where 36.5%, 55.2% and 35.0% of structures have exactly the same pharmacophore properties as the control for SyntaLinker, SyntaLinker_multi and, SyntaLinker_n models on the ChEMBL test set. For the CASF validation set also the same trend was observed. Furthermore, when the beam search width was increased from 10 to 250, the control efficiency of the pharmacophore for all three models decreased. Due to the combination of multiple constraints, the control efficiency of pharmacophore constraints is lower than the one with the bond length constraint (Fig. 3). Nevertheless, the model with the constraints is superior to the model without constraints.
Fig. 5 Distribution of chemical properties for ChEMBL molecules and the molecules generated from SyntaLinker models. |
In other words, the SyntaLinker model is able to identify the SMILES substrings for the terminal fragments of the input sequence and rearrange them correctly in the output sequence. Meanwhile, the breaking position in the input sequence can also be identified and the linker tokens are successfully added in, even though they are not arranged sequentially. As we can see that, in this example, an additional ring is created as the linker, ring number tokens can still be correctly assigned. This suggests that the SyntaLinker model has learned the implicit rules from the ChEMBL dataset and can nicely deal with both the local and global information during the structure generation.
Fig. 7 Fragment linking case study. (a) The binding poses of the pair of starting fragments (PDB: 5OU2). (b) The binding mode of the molecule generated by SyntaLinker (PDB: 5OU3). (c) The starting fragments and the linked molecule. |
Starting from the fragments in 5OU2 with the SLBD condition of 3 to 5, we generated 500 candidates with SyntaLinker. The native molecule (Fig. 7b) was successfully recovered and after 3D conformer generation and docking, most of the generated structures actually have a better docking score than the native molecule (Table 3). Three generated molecules with the highest 3D fragment similarity and favourable MOE docking scores are depicted in Fig. 8, where the native ligand binding poses are recovered by SyntaLinker (Fig. 8c).
Metrics | Cases | ||
---|---|---|---|
Fragment linking | Lead optimization | ||
Top 100 | Top 500 | Top 100 | |
Unique structures | 47 | 341 | 808 |
MOE score < lead | 35 | 306 | 107 |
MOE score < −8.0 | 7 | 153 | 462 |
MOE score < −9.0 | 1 | 22 | 22 |
RMSD < 2.0 Å | 22 | 152 | 179 |
SC > 0.5 | 9 | 36 | 44 |
Fig. 8 Overlays of the native ligand (PDB: 5OU3, green carbons) and three generated molecules (pink carbons, chemical structures shown in blue boxes) with the highest 3D fragment similarities and high MOE docking scores. |
Fig. 9 Optimizing leads by linking fragments. (a) Binding mode of chitinase A and dequalinium (PDB: 3ARP). (b) Binding modes of the two fragments derived from dequalinium. (c) Dequalinium structure. (d) Two fragments derived from dequalinium (marked in blue). |
To optimize dequalinium, we used the MMPs algorithm to derive fragment pairs from the molecule. This resulted in 45 non-redundant fragment pairs. They were used as terminal fragments for running SyntaLinker and 100 candidates were generated for each pair. The original linker bond length in dequalinium is set as the SLBD constraint for each pair. Dequalinium (native ligand) was recovered in these trials, and after removing redundant molecules, the RMSDs of two fragments were calculated. Details are also listed in Table 3.
SyntaLinker can generate a large number of novel molecules. Among these molecules, 179 molecules have RMSD values less than 2 Å comparing with dequalinium, and 30 molecules have an RMSD value less than 1 Å comparing with dequalinium. For comparison, the best RMSD value for the top-5 molecules generated from DeLinker was 2.5 Å. Some molecules similar to dequalinium (the lead) are depicted in Fig. 10. Fig. 11 lists three molecules generated by SyntaLinker and their RMSD values in comparison with dequalinium structure. Our SyntaLinker model seems to give chemically reasonable compounds (three examples are shown in Fig. 11) and also better RMSD and SC metrics. This may be because chemically more attractive linkers existed in our ChEMBL training set and also the ease of learning chemical syntactics by a sequence generation model comparing with the graph generation model.
Fig. 11 Three molecules generated by SyntaLinker and their RMSD values in comparison with dequalinium structure. |
Fig. 12 Scaffold of JNK3 inhibitors. (a) Overlay of the indazole (PDB: 3FI3, green) and aminopyrazole (PDB: 3FI2, pink) structures. (b) Chemical structure and Murcko scaffold of the indazole (upper) and aminopyrazole (down) compounds. The highlighted substructures are fragmental pairs for linking. |
SyntaLinker generated 2500 molecules with the restriction of SLBD ranging from 6 to 8. This resulted in 2138 non-redundant molecules, in which more than thousand molecules had RMSD values less than 1 Å and SC values greater than 0.5. 634 linkers are novel (not found in the training set) covering 186 unique Murcko scaffolds.78 Both indazole (Fig. 13a) and aminopyrazole native molecules (Fig. 13b) were also recovered. This result highlights that our SyntaLinker model is well generalized to design novel linkers by combining the chemical information of starting fragments, not merely remembering the linkers in the ChEMBL training set.
Fig. 13 Overlay of the indazole inhibitor (PDB: 3FI3, green) and several example structures (yellow) with high 3D similarity. The linker structures are shown (novel scaffolds are coloured yellow, the recovered ground truth scaffolds are coloured green) in the upper left, and the order numbers (sorted by the SC score) are shown in the bottom right. |
Fig. 13 demonstrates the binding modes for eight generated structures with novel scaffolds. These new molecules are superimposed nicely with the native ligand.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: 10.1039/d0sc03126g |
‡ These authors contributed equally. |
This journal is © The Royal Society of Chemistry 2020 |