Open Access Article
Jie Yue‡a, Bingxin Peng‡ac, Yu Chen‡c, Jieyu Jin‡b, Xinda Zhaoc, Chao Shenbc, Xiangyang Jid, Chang-Yu Hsiehbc, Jianfei Song*c, Tingjun Hou*bc, Yafeng Deng*cd and Jike Wang*bc
aCollege of Information Engineering, Hebei University of Architecture, Zhangjiakou 075132, Hebei, China
bCollege of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China. E-mail: tingjunhou@zju.edu.cn; jikewang@zju.edu.cn
cCarbonSilicon AI Technology Co., Ltd, Hangzhou 310018, Zhejiang, China. E-mail: songjianfei@carbonsilicon.ai; dengyafeng@carbonsilicon.ai
dDepartment of Automation, Tsinghua University, Beijing 100084, China
First published on 29th July 2024
Molecular generation stands at the forefront of AI-driven technologies, playing a crucial role in accelerating the development of small molecule drugs. The intricate nature of practical drug discovery necessitates a versatile molecular generation framework that can tackle diverse drug design challenges. However, existing methodologies often struggle to encompass all aspects of small molecule drug design; approaches rooted in language models, in particular, falter in tasks such as linker design because of their autoregressive nature. To empower a language model for a wider range of molecular design tasks, we introduce an unordered simplified molecular-input line-entry system based on fragments (FU-SMILES). Building upon this foundation, we propose FragGPT, a universal fragment-based molecular generation model. Initially pretrained on extensive molecular datasets, FragGPT utilizes FU-SMILES to facilitate efficient generation across various practical applications, such as de novo molecule design, linker design, R-group exploration, scaffold hopping, and side chain optimization. Furthermore, we integrate conditional generation and reinforcement learning (RL) methodologies to ensure that the generated molecules possess multiple desired biological and physicochemical properties. Experimental results across diverse scenarios validate FragGPT's superiority in generating molecules with enhanced properties and novel structures, outperforming existing state-of-the-art models. Moreover, its robust drug design capability is further corroborated through real-world drug design cases.
Recent advancements have yielded noteworthy methodologies for handling individual tasks. Within the realm of de novo molecule generation, Bagal et al. utilized a language model to interpret simplified molecular-input line-entry system (SMILES)1 character sequences, ultimately leading to the development of MolGPT, a novel framework that leverages the self-attention mechanism with masking.2 Moreover, Juan-Ni et al. introduced a fragment-based approach for de novo molecular design, significantly enhancing both the effectiveness and uniqueness of the synthesized molecules.3 Considering the importance of generating molecules with desired pharmaceutical properties in lead discovery, Wang et al. presented MCMG,4 a novel methodology that facilitates the generation of molecules compliant with multiple constraints. Additionally, other frameworks, such as REINVENT2,5 MIMOSA,6 and Mol-CycleGAN,7 have achieved remarkable results in the generation of molecules with specific property constraints.
In 2020, Imrie et al. introduced the groundbreaking linker design paradigm, DeLinker, rooted in the variational autoencoder (VAE) framework.8,9 This innovative model was designed to amalgamate two fragments or partial structures, thus orchestrating the synthesis of a molecule that embodies both components. Later, Igashov et al. pioneered the development of 3D equivariant molecular features using equivariant networks, subsequently employing a diffusion process for linker synthesis.10 This model enables accurate prediction of the linker's size, facilitating the generation of a diverse array of linkers and exhibiting state-of-the-art (SOTA) performance across various benchmark datasets such as ZINC,11 CASF,12 and GEOM.13 Then, Imrie et al. further refined the DeLinker framework to develop DEVELOP,14 a versatile tool applicable to linker and R-group design,15,16 scaffold hopping,17,18 and PROTAC design,19,20 all demonstrating promising results. More recently, Jin and colleagues introduced FFLOM,21 a model that utilizes molecular graphs to represent molecular fragments. By integrating node and edge flow layers to regulate atom and bond sampling, this model enhances crucial metrics such as traceability and molecular binding affinity across diverse molecular generation applications.
While the achievements of these molecular design methodologies are undoubtedly noteworthy, it is crucial to recognize the inherent complexity of drug research and development. The current methodologies tend to focus on modeling specific generation contexts, thus limiting their adaptability to the wide range of challenges encountered in drug design. However, in recent years, substantial progress has been achieved in the development of large-scale general natural language models.22–25 These models have demonstrated remarkable efficacy across diverse domains, attributed to their utilization of pretraining and fine-tuning methodologies.26–29
Utilizing textual representations enables full use of the modeling capabilities offered by pre-trained language models, making SMILES-based pre-trained models more adaptable and effective in molecular generation than graph-based methodologies.30,31 Moreover, SMILES-based language models, built on architectures like the transformer, are capable of handling lengthy molecular sequences.32–35 Studies reveal that these models outperform graph generation models in capturing complex molecular distributions and possess superior generative capabilities.36 Traditional autoregressive language models, typically applied to SMILES or SELFIES, generate molecules sequentially, atom by atom, from left to right. However, this approach is prone to exposure bias, where the accuracy of subsequent atom generation hinges heavily on the preceding fragment, potentially leading to error accumulation. Additionally, it cannot handle molecular design tasks such as linker design, which require filling gaps within the molecular structure.
In response to this challenge, our study introduces FU-SMILES, a novel molecular representation that identifies disconnection points among molecular fragments, enabling their seamless integration into whole molecules. Unlike the traditional left-to-right sequential representation, FU-SMILES incorporates fragment details from any part of the molecule into the context. Building upon FU-SMILES, we propose FragGPT, an innovative and comprehensive fragment-based drug design large language model. By employing FU-SMILES, FragGPT proficiently handles fragment generation tasks, efficiently mitigating the error accumulation issues associated with atom-by-atom generation, thereby enhancing the efficiency of molecule construction.
Following a methodology akin to general language models, FragGPT undergoes initial pretraining on an extensive molecular dataset to enhance its generalization capabilities, followed by fine-tuning tailored for specific downstream tasks. To fulfill the requirement for drug molecules that need to meet multiple biophysical properties, the proximal policy optimization (PPO) algorithm37 is utilized to steer the fine-tuning process of our model across specific case studies. We propose a comprehensive evaluation reward model for the generated molecules, encompassing several key metrics such as docking score, synthetic accessibility (SA) score,38 penalized log P (p log P) score, and quantitative estimation of drug-likeness (QED) score. To evaluate the efficacy of FragGPT, we conducted assessments across a wide range of drug design scenarios, achieving performance comparable to SOTA methods across all tasks. Additionally, to examine the performance of FragGPT in real-world drug design settings, we executed case studies covering diverse aspects such as de novo design, fragment linker design, R-group exploration, PROTAC design, side chain optimization, and scaffold hopping. Our experimental findings highlight that FragGPT not only accelerates drug design in various scenarios but also demonstrates significant effectiveness in handling multi-constraint generation tasks.
To enforce constraints on the pharmacological properties of generated molecules, we augment the pre-training process by incorporating information about the absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties of molecules, leading to the development of FragGPT-ADMET. Unlike FragGPT, FragGPT-ADMET requires specifying a set of ideal ADMET values during the generation process. Typically, we provide the ADMET properties of a reference molecule, enabling the model to generate molecules with similar properties, thus adhering to the specified constraints. For a thorough understanding of the methodology, please refer to the Methods section.
FragGPT underwent a rigorous evaluation involving various metrics. Initially, benchmark assessments were conducted across multiple test datasets, covering five distinct tasks: de novo design, linker design, R-group exploration, side chain optimization, and scaffold hopping. Subsequently, we employed reinforcement learning (RL) based on FragGPT to delve deeper into specific case studies.
000 molecules and evaluated them on the MOSES40 benchmark, utilizing multiple metrics, including validity, uniqueness, novelty, SNN, Frag, and IntDiv. Our model was benchmarked against several baseline models, including cMolGPT,41 MolGPT,2 CharRNN,42 VAE,43 AAE,44 JT-VAE,45 and LatentGAN.46 The evaluation results are shown in Table 1. Since the original literature for cMolGPT did not report results for IntDiv and novelty, these two metrics were excluded from the comparison.
| Model | FragGPT | CharRNN | VAE | AAE | LatentGAN | JT-VAE | MolGPT | cMolGPT |
|---|---|---|---|---|---|---|---|---|
| Validity↑ | 0.983 | 0.975 | 0.977 | 0.937 | 0.897 | 1.000 | 0.994 | 0.988 |
| Unique@1K↑ | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| Unique@10K↑ | 0.999 | 0.999 | 0.998 | 0.997 | 0.997 | 0.999 | 1.000 | 0.999 |
| Novelty↑ | 0.994 | 0.842 | 0.695 | 0.793 | 0.949 | 0.914 | 0.797 | — |
| IntDiv1↑ | 0.862 | 0.856 | 0.856 | 0.856 | 0.857 | 0.855 | 0.857 | — |
| IntDiv2↑ | 0.861 | 0.851 | 0.852 | 0.852 | 0.851 | 0.849 | 0.851 | — |
| Frag/Test↑ | 0.994 | 1.000 | 0.999 | 0.991 | 0.999 | 0.997 | — | 1.000 |
| SNN/Test↓ | 0.547 | 0.601 | 0.626 | 0.608 | 0.538 | 0.548 | — | 0.619 |
As shown in Table 1, FragGPT notably outperformed all other models in terms of uniqueness, novelty, and diversity. This exceptional performance may be attributed to its extensive data-driven training and its full utilization of comprehensive and diverse molecular representations and properties. Most models, particularly MolGPT, cMolGPT, and FragGPT, demonstrated high validity, surpassing 98%. However, LatentGAN lagged behind in validity due to its reliance on molecular latent vectors for training, which posed challenges in reconstructing molecules from the latent space. Besides, all models generated molecules with nearly 100% uniqueness and Frag scores.
Beyond traditional novelty metrics, we utilized multiple additional metrics including IntDiv and SNN to highlight the models' capability to generate molecules with diverse and distinct structures. Notably, FragGPT outperformed the other models in novelty, IntDiv and SNN, further demonstrating its efficacy in generating novel molecules. In comparison, VAE and AAE showed lower novelty among the models, likely due to their design strategy of reducing the latent space dimensions, leading to higher similarity to the training dataset and lower novelty.47
log P, QED, and recovery. The detailed results are shown in Table 2.
| Metric | FragGPT | FragGPT-LoRA | DeLinker | DiffLinker | 3DLinker | DEVELOP |
|---|---|---|---|---|---|---|
| ZINC | | | | | | |
| Validity↑ | 90.75% | 97.34% | 98.40% | 94.80% | 98.67% | — |
| Uniqueness↑ | 65.47% | 37.93% | 44.20% | 50.90% | 29.42% | — |
| Novelty↑ | 98.61% | 98.16% | 39.50% | 47.70% | 32.48% | — |
| Recovery↑ | 21.25% | 24.75% | 79.00% | 77.50% | 93.58% | — |
| SA↓ | 3.14 | 2.93 | 3.10 | 3.24 | — | — |
| p log P↑ | 0.74 | 0.75 | 0.32 | −0.24 | — | — |
| QED↑ | 0.56 | 0.66 | 0.64 | 0.65 | — | — |
| CASF | | | | | | |
| Validity↑ | 90.00% | 91.36% | 95.50% | 68.40% | — | — |
| Uniqueness↑ | 24.00% | 23.00% | 51.90% | 57.10% | — | — |
| Novelty↑ | 99.21% | 99.00% | 51.00% | 56.90% | — | — |
| Recovery↑ | 25.42% | 26.00% | 53.70% | 48.80% | — | — |
| SA↓ | 3.91 | 3.84 | 4.05 | 4.12 | — | — |
| p log P↑ | −0.36 | −0.36 | −0.91 | −0.41 | — | — |
| QED↑ | 0.43 | 0.42 | 0.36 | 0.40 | — | — |
| PDBbind | | | | | | |
| Validity↑ | 93.20% | 94.56% | 96.90% | — | — | 93.10% |
| Uniqueness↑ | 39.60% | 35.30% | 86.10% | — | — | 77.30% |
| Novelty↑ | 98.33% | 99.00% | 84.00% | — | — | 88.70% |
| Recovery↑ | 19.80% | 14.10% | 1.90% | — | — | 22.40% |
| SA↓ | 3.73 | 3.60 | 4.05 | — | — | 4.05 |
| p log P↑ | −0.89 | −0.83 | −2.00 | — | — | −1.93 |
| QED↑ | 0.41 | 0.45 | 0.37 | — | — | 0.37 |
FragGPT exhibited an overall satisfactory performance in terms of validity, achieving a validity rate above 90% through autoregressive fragment generation without conducting valence checks during the pre-training and fine-tuning phases. Fine-tuning FragGPT with LoRA further enhanced its validity, making it comparable to DeLinker and 3DLinker. DeLinker, which employs a masking mechanism to enforce simple valence rules, achieved the top results on the CASF and PDBbind datasets, closely trailing 3DLinker on ZINC. The validity of the molecules generated by DiffLinker varied significantly across the ZINC and CASF datasets, possibly owing to its implicit handling of atom coordinates.
Remarkably, FragGPT surpassed all models in generating over 98% novel molecules across datasets, exceeding the second-best model by a substantial margin of approximately 50% on ZINC, 42% on CASF, and 9% on PDBbind, showcasing its exceptional capability to generate novel molecules and explore a wider chemical space.
However, FragGPT displayed a relatively weaker performance in recovery metrics compared to other models, partly attributed to mismatches between fragment tokens and the specific connections of test molecules. Nevertheless, FragGPT still achieved notable recovery rates even without fine-tuning, and its recovery performance was only slightly inferior to that of DEVELOP on PDBbind, reflecting a balance between recovery and a reasonable training data segmentation strategy.
The performance of FragGPT on three drug-likeness metrics corroborated our initial hypothesis. FragGPT, especially FragGPT-LoRA, excelled in drug-like properties across all datasets, achieving superior SA, QED, and p log P score performances. This success can be attributed to our autoregressive fragment-by-fragment molecule generation strategy, which effectively circumvented the generation of chemically infeasible structures, especially in complex ring systems. FragGPT's outstanding performance in SA, QED, and p log P scores surpassed all other atom-by-atom generation models across all test sets, highlighting its efficacy in generating molecules with desired properties.
| Metric | FragGPT | FragGPT-LoRA | DeLinker | DEVELOP |
|---|---|---|---|---|
| CASF | | | | |
| Validity↑ | 91.12% | 91.09% | 100.00% | 99.80% |
| Uniqueness↑ | 35.66% | 40.80% | 74.20% | 39.70% |
| Novelty↑ | 98.18% | 98.56% | 55.10% | 43.40% |
| Recovery↑ | 39.40% | 40.00% | 33.60% | 58.70% |
| SA↓ | 3.26 | 3.21 | — | 3.39 |
| p log P↑ | −0.26 | −0.24 | — | −0.54 |
| QED↑ | 0.39 | 0.54 | — | 0.52 |
| PDBbind | | | | |
| Validity↑ | 95.86% | 95.52% | 100.00% | 99.50% |
| Uniqueness↑ | 40.00% | 43.00% | 87.80% | 76.20% |
| Novelty↑ | 99.19% | 99.61% | 71.10% | 78.20% |
| Recovery↑ | 16.78% | 13.00% | 1.00% | 15.30% |
| SA↓ | 3.48 | 3.41 | — | 3.87 |
| p log P↑ | −0.81 | −0.78 | — | −1.57 |
| QED↑ | 0.45 | 0.52 | — | 0.42 |
Due to the constrained modification space and the comparatively larger building blocks in the R-group exploration task, FragGPT's uniqueness on CASF was intermediate between DeLinker and DEVELOP, and it slightly trailed them on PDBbind. However, in terms of novelty, FragGPT exhibited robust capabilities, generating over 95% novel molecules and surpassing the other models on both test datasets. Particularly noteworthy, FragGPT's pre-trained model achieved a remarkable 98.18% novelty on the CASF test dataset, surpassing the second-highest model, DeLinker, by 43.08%.
In terms of recovery, FragGPT exhibited performance comparable to the other models. On the CASF test dataset, FragGPT performed better in the R-group exploration task, achieving a recovery rate of 39.40%, compared with 25.42% in the linker design task. This discrepancy may stem from the distinct linking methodologies employed in the two tasks. Linker design involves the intricate connection between two breakpoints, followed by the integration of the generated linker fragments, whereas R-group exploration simplifies this process by only considering the connection at a single breakpoint. The variability in molecules generated from different linking points, even with identical linker fragments, may explain the lower recovery observed in the linker design task.
Regarding drug-likeness properties, FragGPT excelled in SA, QED, and p log P on both test datasets, surpassing the other two models. Its consistently outstanding performance across both the linker design and R-group exploration tasks underscores FragGPT's capacity for generating chemically reasonable molecules with substantial potential for lead design.
| Metric | FragGPT | DiffHopp | DiffHopp-EGNN | GVP-inpainting | EGNN-inpainting |
|---|---|---|---|---|---|
| Validity↑ | 85.30% | 91.40% | 75.70% | 65.20% | 79.30% |
| Uniqueness↑ | 82.80% | 59.20% | 0.64% | 66.80% | 66.70% |
| Novelty↑ | 99.80% | 99.80% | 100.00% | 99.70% | 99.90% |
| QED↑ | 0.48 | 0.61 | 0.51 | 0.55 | 0.47 |
log P, and QED metrics of the molecules generated by FragGPT, FragGPT-LoRA (fine-tuned on the PDBbind training set), and FragGPT-ADMET. The pre-trained FragGPT model, without the constraints imposed by the PDBbind training set, achieved the highest scores in validity (92.81%), uniqueness (72.73%), and novelty (99.99%) among the three models. However, its recovery rate was comparatively lower, at 3.9%. Upon fine-tuning with the PDBbind training set, FragGPT-LoRA demonstrated a decrease in validity and uniqueness but an increase in recovery. This observation could be attributed to the limited size of the PDBbind training dataset, comprising fewer than 20 000 entries, which may reduce the model's search space.
| Model | FragGPT | FragGPT-LoRA | Test/PDBbind |
|---|---|---|---|
| Validity↑ | 92.81% | 77.63% | — |
| Uniqueness↑ | 72.73% | 43.97% | — |
| Novelty↑ | 99.99% | 99.98% | — |
| Recovery↑ | 3.90% | 11.80% | — |
| SA↓ | 3.04 | 3.00 | 3.30 |
| p log P↑ | −0.20 | −0.89 | −2.23 |
| QED↑ | 0.54 | 0.62 | 0.56 |
FragGPT demonstrated superior performance in the SA, QED, and p log P evaluations in comparison to the molecules from the PDBbind dataset. FragGPT achieved a better (lower) SA score than the test set by a margin of approximately 0.26 and significantly outperformed it in p log P, by a margin of 2.03. However, regarding the QED score, FragGPT's performance was comparable to that of the test set. After fine-tuning on the training set, FragGPT-LoRA exhibited substantial enhancement in both SA and QED scores, significantly surpassing the scores of the molecules in the PDBbind test set. In summary, FragGPT and FragGPT-LoRA generated molecules of notably higher quality than those in the PDBbind test set.
log P, stress response–antioxidant response element (SR-ARE), and cytochrome P450 2C9 (CYP2C9) inhibition, under single- and multi-constraint conditions.
As depicted in Fig. 2, our findings revealed that FragGPT-ADMET outperformed FragGPT in generating molecules with improved properties across all three dimensions, in both the linker design and de novo design tasks. Subsequently, we tested the proportion of molecules that met the specified criteria under multiple constraints: a log P value between 1 and 3, together with meeting the criteria for SR-ARE and CYP2C9 inhibition. As summarized in Table 6, FragGPT-ADMET achieved success rates 2.6 times higher in linker design and 1.2 times higher in de novo design compared to FragGPT. Integrating these results with Fig. 2, we observed a more pronounced enhancement of FragGPT-ADMET in the de novo design task, likely due to the limited chemical space in linker design, which restricts molecule generation to linker structures defined by terminal groups. In essence, FragGPT-ADMET exhibited superior conditional generation capabilities.
| Task | De novo design | Linker design |
|---|---|---|
| FragGPT | 24.55% | 39.89% |
| FragGPT-ADMET | 63.09% | 46.27% |
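The multi-constraint success rates in Table 6 amount to the fraction of generated molecules passing all three filters. A minimal sketch follows; the dict keys and the convention that 0 denotes the desired (safe / non-inhibitor) predicted class are illustrative assumptions, not the paper's implementation:

```python
def passes_constraints(mol):
    """Check the multi-constraint criteria described above:
    log P in [1, 3], and favorable predicted SR-ARE and CYP2C9 outcomes.
    `mol` is a dict of (hypothetical) predicted property values."""
    return (1.0 <= mol["logp"] <= 3.0
            and mol["sr_are"] == 0
            and mol["cyp2c9"] == 0)

def success_rate(mols):
    """Fraction of generated molecules satisfying all constraints."""
    return sum(passes_constraints(m) for m in mols) / len(mols)
```

In this scheme, the reported success rate is simply `success_rate` evaluated over each model's sample of generated molecules.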
Molecular design tasks were conducted by utilizing RL tailored to the specific objectives across diverse scenarios. Fig. 3 depicts the progression of the RL optimization steps and the mean docking scores, QED, and SA values for the molecules generated at each step. Throughout all tasks, a consistent improvement in these three molecular properties was observed as the RL iterations progressed under the guidance of FragGPT. In each subplot, the red dashed line represents the values of the reference molecules.
In the linker design task, RL was employed to refine the linker structure of the reference molecule, aiming to optimize its three key properties. As depicted in Fig. 3(a), a continual improvement in the average molecular properties was observed through RL, ultimately reaching a level comparable to the reference molecule. Initially, the docking score increased, followed by a slight decrease, converging closely to the docking score of the reference molecule. Concurrently, QED demonstrated steady enhancement, while SA gradually declined, eventually surpassing the reference value. The initial rise and subsequent minor decline in docking score were attributed to model adjustments for balancing the three properties. The R-group exploration task involved RL optimization of the R-group structure of the reference molecule to enhance its three properties. As illustrated in Fig. 3(b), continuous enhancement was observed across all three molecular properties via RL, ultimately exceeding the reference molecule values. In the scaffold hopping and PROTAC design tasks, FragGPT demonstrated exceptional performance. As depicted in Fig. 3(c and d), the model-generated molecules exhibited a more significant enhancement in docking and QED scores compared to the reference molecules. Given the significantly greater length of PROTACs, the average SA scores for both generated and reference PROTACs were higher than those of the molecules in the other three cases. Consequently, the shift range of the SA score was limited and occupied a relatively small part of the optimization process.
According to the foregoing results, we hypothesize that the variance in performance may stem from the relevance of each task to the corresponding objectives. As shown in Fig. 4, the fragments to be modified in linker design and R-group exploration were quite small, and thus the molecules generated by FragGPT retained docking conformations highly similar to that of the reference molecule, with minimal changes to the unaltered fragments. This behavior not only stayed consistent with our initial design but also reduced the complexity of optimizing molecular properties, as confirmed by the RL performance of FragGPT in these two cases (Fig. 3(a and b)). In contrast, for scaffold hopping, while the generated molecules occupy the same protein pocket as the reference, the non-fixed orientations of the excised fragments lead to more significant variation in the generated structures than in the other three cases. This may be the reason why the molecular properties in this case were much more difficult to predict or optimize.
Besides, FragGPT exhibited a remarkable capability for producing fragments that were highly analogous yet superior to the reference fragment. For example, the linker in the second row of Fig. 4(a) differed from the reference linker by only one atom, while the R-group in the second row of Fig. 4(b) closely resembled the reference, comprising a five-atom heterocyclic ring. Remarkably, even in the PROTAC design case, FragGPT generated PROTACs with conformations quite similar to the reference PROTAC, leveraging the flexibility of its linkers to outperform the latter in docking and QED scores. The visualization of these docking conformations, together with the RL optimization steps above, demonstrates the ability of FragGPT to generate molecules with comprehensive chemical properties and controllable conformational transformation.
743 265 training molecules and 19 367 validation molecules.
922 training molecules and 400 validation molecules.
393 small molecules and the validation set included 1933 molecules.
| Xfrag = {x1,x2,…,xn}. | (1) |
| [equation rendered as image] | (2) |
At the same time, we found that the pairing of the two ends of a disconnection site in the assembled molecular fragments is independent of the value of i and of the order of the fragments within a molecular fragment sequence. The value of i can be regarded as unordered categorical information used to distinguish different pairs of break points. Based on this, we applied a first data augmentation method: randomly permuting the i values within (1 ∼ n), so that the break-point identifier of each fragment can take any value in (1 ∼ n) (ensuring that each serial number i ≤ n appears only once). Through this transformation, a molecular fragment group obtained through n broken bonds yields a total of n! possible combinations. By permuting the break-point numbering, the model can more effectively learn the relationships within the data. Furthermore, the assembly of fragments into valid molecules relies solely on their identification information. Unlike ordinary language sequences, where different orders can yield different meanings, different fragment orders may represent the same molecule. Therefore, we applied a second data augmentation method: randomly shuffling the molecular fragments, allowing each fragment to occupy any position in the sequence. Assuming a molecule consists of three fragments A, B, and C, this augmentation yields six possible orderings, {ABC, ACB, BCA, BAC, CBA, CAB}, one of which is randomly selected as input during the training phase. Both data augmentations are applied simultaneously to further enrich the diversity of the data and enhance the robustness of the model.
| [equation rendered as image] | (3) |
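The two augmentations above can be sketched in plain Python. The `[i*]` attachment-marker syntax and the helper below are illustrative assumptions, not the paper's exact implementation:

```python
import random
import re

def augment_fu_smiles(fragments, n_breaks, rng=None):
    """Sketch of the two FU-SMILES data augmentations: (1) randomly
    permute break-point labels 1..n, (2) randomly shuffle fragment order.
    Assumes break points are written as numbered dummy atoms [1*], [2*],
    ... (a hypothetical marker syntax); each label appears on exactly two
    fragment ends, one for each side of a broken bond."""
    rng = rng or random.Random()
    # Augmentation 1: draw a random bijection old label -> new label.
    new_labels = list(range(1, n_breaks + 1))
    rng.shuffle(new_labels)
    relabel = dict(zip(range(1, n_breaks + 1), new_labels))
    relabeled = [
        re.sub(r"\[(\d+)\*\]",
               lambda m: "[%d*]" % relabel[int(m.group(1))], frag)
        for frag in fragments
    ]
    # Augmentation 2: shuffle the fragment order within the sequence.
    rng.shuffle(relabeled)
    return relabeled

# Example: a molecule split into three fragments by two broken bonds.
frags = ["c1ccccc1[1*]", "[1*]C(=O)[2*]", "[2*]N"]
augmented = augment_fu_smiles(frags, 2, random.Random(0))
```

Because the label permutation is a bijection and shuffling only reorders fragments, the augmented sequence still assembles into the same molecule.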
For a SMILES data set, the language modeling task is used as the training target.23 The standard goal of language modeling is to maximize the likelihood:
| F(D) = ∑i log P(di|di−k,…,di−1;θ). | (4) |
GPT-2 builds a neural network θ with transformer decoder blocks to model the conditional probability P, where k is the size of the context window, i.e., the prefix generated so far. Here, Xseq is the input of the model, which is first encoded through TokenEncode. TokenEncode utilizes a tokenizer to obtain the token encoding of Xseq, followed by position encoding and word embedding, to derive H0. Subsequently, H0 is fed into l stacked GPTBlock layers, and the predicted molecular fragment sequence is decoded with TokenDecoder. TokenDecoder passes Hl through an MLP layer to obtain, at each output position, a score for every possible fragment in the vocabulary. The final fragment sequence is determined through softmax activation, and these fragments are then assembled to form the final molecule generated by the model.
| H0 = TokenEncode(Xseq), | (5) |
| Hl = GPTBlock(H0), | (6) |
| P(Xi) = TokenDecode(Hl). | (7) |
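Eqns (5)–(7) can be illustrated with a toy NumPy sketch. Replacing the transformer decoder block with a single tanh layer is a deliberate simplification (the real GPTBlock uses masked self-attention), and all sizes and weight names are assumptions:

```python
import numpy as np

VOCAB, D, SEQ = 10, 8, 5                 # toy vocabulary / feature / length
rng = np.random.default_rng(0)

E = rng.normal(size=(VOCAB, D))          # fragment-token embedding table
P = rng.normal(size=(SEQ, D))            # learned positional encoding

def token_encode(x_seq):                 # eqn (5): H0 = TokenEncode(Xseq)
    return E[x_seq] + P[:len(x_seq)]

def gpt_block(h, W):                     # stand-in for eqn (6); NOT attention
    return np.tanh(h @ W)

def token_decode(h, W_out):              # eqn (7): softmax scores over vocab
    logits = h @ W_out
    z = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

W_blocks = [rng.normal(size=(D, D)) for _ in range(2)]   # l = 2 layers
W_out = rng.normal(size=(D, VOCAB))

h = token_encode(np.array([1, 4, 2, 7, 3]))
for W in W_blocks:                       # pass H0 through the l blocks
    h = gpt_block(h, W)
probs = token_decode(h, W_out)           # one distribution per position
```

Sampling the highest-probability fragment at each position, then assembling fragments via their break-point labels, yields the generated molecule.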
| Sadmet = {s1,s2,…,sk}, | (8) |
| S0 = Admet Encoder(Sadmet). | (9) |
Then we concatenate the ADMET feature S0 and the molecular fragment sequence feature H0, yielding Hs as the input of GPTBlock, which is expressed as:
| Hs = concatenation(S0,H0). | (10) |
| Hsl = GPTBlock(Hsl−1), | (11) |
| Hl = Slicing(Hsl), | (12) |
| P(Xi) = TokenDecode(Hl). | (13) |
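A minimal NumPy sketch of the conditioning step in eqns (10)–(12), assuming the encoded ADMET feature is prepended along the sequence axis and sliced off again before decoding (the axis choice and shapes are assumptions):

```python
import numpy as np

D = 8                                     # toy feature dimension

def condition_on_admet(S0, H0):
    # eqn (10): prepend the encoded ADMET features to fragment features
    return np.concatenate([S0, H0], axis=0)

def slice_off_condition(Hsl, n_cond):
    # eqn (12): drop the conditioning positions before token decoding
    return Hsl[n_cond:]

S0 = np.ones((1, D))      # one encoded ADMET vector (assumed layout)
H0 = np.zeros((5, D))     # five fragment-token features
Hs = condition_on_admet(S0, H0)           # input to GPTBlock, eqn (11)
H_dec = slice_off_condition(Hs, len(S0))  # features passed to TokenDecode
```

This keeps the decoder vocabulary untouched: the ADMET vector only steers the hidden states and is never decoded itself.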
| W = W0 + ΔW = W0 + L1L2, L1 ∈ Rd×r, L2 ∈ Rr×d | (14) |
During the training process, the parameters W0 are frozen, and only the parameters in L1 and L2 are updated. To preserve the original output of the network at the start of training while ensuring good convergence during learning, we follow the LoRA parameter initialization method: the parameters in L1 are initialized from a Gaussian distribution, and those in L2 are initialized to 0. If both matrices were initialized to 0, all neurons would initially be equivalent, which can easily cause the gradients to vanish. If both were initialized with Gaussian distributions, an excessively large offset would be introduced at the initial stage of training, along with too much noise, which could hinder convergence. Using these methods, we fine-tuned our model on datasets such as MOSES and ZINC, enhancing its adaptability to specific data.
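A small NumPy sketch of the LoRA parameterization and initialization described above (matrix sizes are illustrative):

```python
import numpy as np

d, r = 16, 4                               # full and low-rank dimensions
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d, d))               # frozen pretrained weight
L1 = rng.normal(scale=0.01, size=(d, r))   # Gaussian initialization
L2 = np.zeros((r, d))                      # zero initialization

def effective_weight(W0, L1, L2):
    # eqn (14): W = W0 + L1 @ L2, with only L1 and L2 trainable
    return W0 + L1 @ L2

W_start = effective_weight(W0, L1, L2)     # identical to W0 at step 0

# After a (mock) gradient step on L2, the low-rank update becomes
# non-zero while W0 itself is never modified.
L2_updated = L2 + 0.1
W_after = effective_weight(W0, L1, L2_updated)
```

The zero-initialized L2 guarantees that the fine-tuned network starts out computing exactly what the pretrained network computed.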
| S(m) = S(comment) + S(docking) + S(drug). | (15) |
The process of establishing the reward function involves inputting a specific task and calculating a penalty term for the difference between generated and reference molecules. The penalty term is used to punish or reward RL for any deviation from the defined optimization goals within each training batch. The aim is to ensure that the model generates molecules that align with the set optimization goals. Finally, the model is optimized based on the reward index of the current batch of data to guide the training of LoRA accordingly.
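A hypothetical sketch of such a reward in Python; the weighting, the exact term definitions, and the penalty form are assumptions for illustration, not the paper's formulation in eqn (15):

```python
def reward(props, ref_props, weights=(1.0, 1.0, 1.0)):
    """Hypothetical instance of a three-term reward like eqn (15):
    a docking term, a drug-likeness term, and a penalty on deviation
    from the reference molecule's properties."""
    w_dock, w_drug, w_pen = weights
    s_dock = -props["docking"]                  # lower docking score = better
    s_drug = props["qed"] - props["sa"] / 10.0  # reward QED, penalize SA
    penalty = abs(props["qed"] - ref_props["qed"])
    return w_dock * s_dock + w_drug * s_drug - w_pen * penalty
```

In an RL loop, each batch of generated molecules would be scored this way and the averaged reward used to update the LoRA parameters via PPO.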
000 molecules for testing on MOSES; the other tasks generate 250 molecules for each test pair.
000 molecules for evaluation. For the other tasks, we sampled 250 molecules per test pair. The evaluation metrics, including validity, uniqueness and novelty, were aligned with the findings reported in the original papers.2,41 Additionally, for traceability, SA, p log P and QED, we relied on the results reported in the FFLOM paper.21
log P score. This score, penalized by ring size and synthetic accessibility, is predicted using the model reported by You et al.55 A higher p log P score indicates better overall properties. For the de novo task, we used the MOSES evaluation metric.
| [equation rendered as image] | (16) |
| [equation rendered as image] | (17) |
| [equation rendered as image] | (18) |
| [equation rendered as image] | (19) |
| Frag(Gm,Tm) = 1 − cos(F(Gm),F(Tm)) | (20) |
| [equation rendered as image] | (21) |
For the linker, R-group, and side chain tasks, in addition to the validity, uniqueness, novelty, and drug-likeness evaluations, traceability evaluations were also performed. Recovery refers to the proportion of cases in the test set, set(test), for which the ground-truth molecule is regenerated within the generated set, set(G).
| Recovery = #(set(G) ∩ set(test))/#(set(test)) | (22) |
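The recovery metric described above can be sketched as follows; representing molecules as canonical strings and the per-case set layout are assumptions:

```python
def recovery(generated_per_case, ground_truth):
    """Fraction of test cases whose ground-truth molecule appears among
    the molecules generated for that case. `generated_per_case` is a
    list of sets of (assumed canonical) molecule strings, aligned with
    the list of ground-truth molecules."""
    hits = sum(1 for gt, gen in zip(ground_truth, generated_per_case)
               if gt in gen)
    return hits / len(ground_truth)
```

In practice, both generated and reference molecules would be canonicalized with the same toolkit before comparison so that string equality implies structural identity.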
Footnotes |
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4sc03744h |
| ‡ Equivalent authors. |
| This journal is © The Royal Society of Chemistry 2024 |