Jerret Ross,* Brian Belgodere, Samuel C. Hoffman, Vijil Chenthamarakshan, Jiri Navratil, Youssef Mroueh and Payel Das*
IBM Research, Yorktown Heights, NY 10598, USA. E-mail: rossja@us.ibm.com; daspa@us.ibm.com
First published on 18th August 2025
Transformer-based models trained on large, general-purpose datasets of molecular strings have recently emerged as a powerful tool for modeling various structure–property relations. Inspired by this success, we extend the paradigm of training chemical language transformers on large-scale chemical datasets to generative tasks in this work. Specifically, we propose GP-MoLFormer, an autoregressive molecular string generator trained on more than 1.1b (billion) chemical SMILES. GP-MoLFormer uses a 46.8m parameter transformer decoder with linear attention and rotary positional encodings as the base architecture. GP-MoLFormer's utility is evaluated and compared with that of existing baselines on three different tasks: de novo generation, scaffold-constrained molecular decoration, and unconstrained property-guided optimization. While the first two are handled with no additional training, we propose a parameter-efficient fine-tuning method for the last task, which uses property-ordered molecular pairs as input. We call this new approach pair-tuning. Our results show GP-MoLFormer performs better than or comparably to baselines across all three tasks, while producing molecules with higher diversity, demonstrating its general utility for a variety of molecular generation tasks. We further report strong memorization of training data in GP-MoLFormer generations, which has so far remained unexplored for chemical language models. Our analyses reveal that training data memorization and novelty in generations are impacted by the quality and scale of the training data; duplication bias in training data can enhance memorization at the cost of lowering novelty. We further establish a scaling law relating inference compute and novelty in generations, and show that the proposed model excels at yielding molecules containing unique scaffolds while generating at the ≈10^6 to 10^9 scale.
Interestingly, much of the recent performance gain for natural language models has come from training at scale, in terms of both the number of parameters and the number of training samples.1–3 It is reported that larger language models that can memorize training data show improved generalization.4–6 Furthermore, data seen many times during training is memorized more strongly, and de-duplication of the training data plays a large role in preventing such memorization.4,5,7,8
However, less work has been done on understanding the impact of training data scale and its memorization on the performance of generative models of molecules. Specifically, it remains under-explored to what extent a causal large language model of molecules, trained on large-scale (>100m) training data, memorizes its training data and reproduces it in its generations. In chemical language modeling tasks, molecules in training data originate from publicly available databases such as ZINC9 and PubChem.10 It is known that certain molecules, as well as certain molecular features, are over-represented in those databases,11 but how such training bias is perpetuated by generative chemical language models remains relatively unknown.
An additional dimension of scaling that has been investigated recently in traditional large language models is inference-time compute scaling.12 It has been shown that, with increasing inference compute, performance across multiple tasks can be improved for the same model, as it allows better coverage of the search space. On the other hand, the effect of scaling inference by increasing the number of generations is under-explored for molecular generative models.
To bridge these gaps, in this work, we present a family of generative pre-trained molecular foundation models for the unconstrained and targeted generation of novel molecules. These decoder-only models are based on the recently published Molecular Language transFormer (MoLFormer) architecture.13 We refer to these Generative Pre-trained models as GP-MoLFormer. The base transformer architecture of our GP-MoLFormer consists of ≈47m parameters and uses an efficient linear attention mechanism together with rotary positional encodings—analogous to MoLFormer13 but using decoder instead of encoder blocks (Fig. 1A). The model is then trained with a causal language modeling objective on a large corpus of 0.65–1.1 billion canonicalized SMILES strings of small molecules from publicly available chemical databases.
We evaluate GP-MoLFormer on an unconditional de novo generation task as well as on two targeted molecular design tasks: scaffold-constrained molecular decoration and unconstrained property-guided optimization. For scaffold decoration, we exploit GP-MoLFormer's causal language modeling ability and establish its capacity to handle the task without any task-specific tuning. For the optimization task, we provide a prompt-tuning or soft prompt-learning algorithm that learns from partial orderings of molecules. We name this method pair-tuning (Fig. 1B). Results show that pair-tuning on GP-MoLFormer provides on-par or better performance in three different property optimization tasks, namely (i) drug-likeness optimization, (ii) penalized log P optimization, and (iii) optimization of dopamine type 2 receptor binding activity.
We further extensively evaluate the quality of GP-MoLFormer-generated molecules in light of the scale of, and the bias present in, the training data. Experiments reveal significant memorization in de novo generations, affecting the novelty therein. We further analyze how representational bias encoded in public chemical databases is perpetuated by a generative chemical language model and is reflected in its generation quality. To our knowledge, this is the first report on the effect of training data memorization in a generative pre-trained chemical language model. Further, we investigate the effect of inference compute as another scaling dimension by increasing the number of generated samples and establish an inference scaling law relating the number of generations to the novelty among them. Experiments demonstrate that novelty in de novo generations by GP-MoLFormer drops when the number of generated samples reaches a scale of ≈1b. Nevertheless, GP-MoLFormer is able to generate novel, unique, diverse, and valid molecules even when the generation pool reaches a size of 10b, while showing consistent memorization of training data.
Our main contributions are:
• We provide a pre-trained, autoregressive, transformer-based SMILES decoder, GP-MoLFormer-Uniq.
• We report the beneficial effects of training this class of models on up to 1.1 billion SMILES, compared to models trained on smaller datasets, by demonstrating higher scaffold-level uniqueness and diversity in GP-MoLFormer generations even when generation is performed at scale; we attribute this to the scale and diversity of the training data.
• We provide a parameter-efficient finetuning method, which utilizes property-ranked molecule pairs as input, for property-guided molecule generation and show its effectiveness on three different tasks.
• We further study how training data duplication bias (and therefore training size) affects de novo generation and reveal that more duplication significantly reduces novelty in generations.
• We also report a scaling behavior relating inference compute and novelty that follows an exponential decay, while showing that GP-MoLFormer can generate a notable fraction of novel SMILES even when the number of generations reaches 10b.
We compare GP-MoLFormer-Uniq with different baseline models, namely the character-level recurrent neural network (CharRNN),14 the SMILES variational autoencoder (VAE),14 the junction tree VAE (JT-VAE),15 latent inceptionism on molecules (LIMO),16 and MolGen-7b.17 Except for MolGen-7b, all baselines were trained and tested on datasets from MOSES,14 which originate from the ZINC Clean Leads dataset;9 that training set contains 1.6m molecules. MolGen-7b was trained on 100m filtered molecules from ZINC-15 (ref. 9) as detailed in Irwin et al. (2022).18 LIMO and MolGen-7b are trained using an alternative molecular string representation, SELFIES,19 which guarantees 100% validity of generated molecules. We, in contrast, train GP-MoLFormer-Uniq on SMILES, as recent work shows that training a generative language model on SELFIES may hurt the model's exploratory ability.20 All baseline performances are reported on their corresponding test sets of 175k molecules (if the original test set was larger, a randomly selected subset is used).
First, we note that GP-MoLFormer-Uniq (and GP-MoLFormer) exhibits excellent validity and uniqueness at the standard generation size (30/10k). See SI Table S1 for a comparison with the baseline models. At the same time, we argue that these metrics are insufficient for measuring generation at scale. Furthermore, as we show later, novelty depends on training set size in addition to generation size, so models trained on datasets of different sizes are not directly comparable. See the Scaling results section below for further discussion.
Standard metrics for evaluating model-generated molecules are reported in Table 1 for a generation set of 30k molecules. When compared to baselines, GP-MoLFormer-Uniq is equally performant in generating molecules that share high cosine similarity with the corresponding reference molecules at the fragment (Frag) level, consistent with low Fréchet ChemNet Distance (FCD).21 The scaffold cosine similarity (Scaf) and similarity to the nearest neighbor in the test set (SNN) of GP-MoLFormer-Uniq is comparable to that of baselines for 30k generations. At the same time, GP-MoLFormer-Uniq generates molecules with high internal diversity (IntDiv), i.e., average pairwise dissimilarity. All these metrics are computed using the MOSES14 framework (we limit our scope to MOSES in this study, although we note that myriad other benchmarks are available for evaluating generative molecular models22–24).
(Frag, Scaf, SNN, IntDiv, and FCD are MOSES metrics; DNN, IntDiv2, and FMD are MoLFormer-based metrics.)

| Model | Frag↑ | Scaf↑ | SNN↑ | IntDiv↑ | FCD↓ | DNN↓ | IntDiv2↑ | FMD↓ |
|---|---|---|---|---|---|---|---|---|
| CharRNN14 | 0.9998 | 0.9242 | 0.6015 | 0.8562 | 0.0732 | 5.735 | 13.03 | 0.1515 |
| VAE14 | 0.9984 | 0.9386 | 0.6257 | 0.8558 | 0.0990 | 5.549 | 13.09 | 0.2531 |
| JT-VAE15 | 0.9965 | 0.8964 | 0.5477 | 0.8551 | 0.3954 | 6.312 | 12.97 | 1.700 |
| LIMO16 | 0.6989 | 0.0079 | 0.2464 | 0.9039 | 26.78 | 11.41 | 13.08 | 162.0 |
| MolGen-7b17 | 0.9999 | 0.6538 | 0.5138 | 0.8617 | 0.0435 | 6.788 | 12.58 | 0.1237 |
| GP-MoLFormer-Uniq | 0.9998 | 0.7383 | 0.5045 | 0.8655 | 0.0591 | 6.970 | 13.10 | 0.1844 |

a https://huggingface.co/zjunlp/MolGen-7b.
We further report analogous metrics computed using MoLFormer13 embeddings as the chemical features, estimating distances in that embedding space as a measure of similarity (the MoLFormer-based metrics columns; see the Table 1 caption for details). The trends in these metrics further support the conclusion that GP-MoLFormer-Uniq generates a molecular distribution that is close to the training distribution in terms of fragment and scaffold composition as well as projections into MoLFormer space, while exhibiting high diversity compared to baselines.
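As a practical aside, the sketch below shows how such distribution-level metrics can be computed with the open-source MOSES package; the file path is a placeholder, and passing explicit test/train sets (as done in this work) is optional in the API.

```python
# Minimal sketch: computing MOSES distribution metrics (validity, uniqueness,
# novelty, Frag, Scaf, SNN, IntDiv, FCD, ...) for a list of generated SMILES.
# Assumes the `molsets` package (imported as `moses`) is installed; the file
# path is a placeholder. Custom test/train sets can be passed via the
# `test=`/`train=` arguments instead of the MOSES defaults.
import moses

with open("generated_smiles.txt") as f:            # placeholder path
    generated = [line.strip() for line in f if line.strip()]

metrics = moses.get_all_metrics(generated)
for name, value in metrics.items():
    print(f"{name}: {value}")
```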
We also calculated the pairwise Tanimoto similarity between the novel and unique generations and the molecules from the corresponding 175k test set, using molecular fingerprints as features. We report both the average similarity per generated molecule and the maximum similarity per generated molecule over the test set. These results are presented in Table 2. GP-MoLFormer-Uniq scores slightly lower than MolGen-7b in both average mean and average maximum similarity, indicating its generations are slightly more dissimilar to its test set. LIMO results are much lower than both of these, though, as seen in Table 1, the outputs of this model do not match its test set well, so this is to be expected. Also, LIMO is more suited for property-optimized generation, and therefore we do not include it in further comparisons of de novo generation.
| Model | Mean Tanimoto similarity | Max Tanimoto similarity |
|---|---|---|
| LIMO | 0.0905 | 0.2474 |
| MolGen-7b | 0.1385 | 0.5138 |
| GP-MoLFormer-Uniq | 0.1354 | 0.4533 |
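The per-molecule mean/max similarities above can be reproduced along the following lines with RDKit; this is a sketch assuming Morgan fingerprints with an illustrative radius and bit size, not necessarily the exact featurization used here.

```python
# Sketch: mean and max Tanimoto similarity of each generated molecule to a
# reference (test) set, using Morgan fingerprints. The fingerprint settings
# and the tiny SMILES lists are illustrative placeholders.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
import numpy as np

generated_smiles = ["CCO", "c1ccccc1O"]   # placeholder generated SMILES
test_smiles = ["CCN", "c1ccccc1N"]        # placeholder reference SMILES

def fingerprints(smiles_list, radius=2, n_bits=2048):
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            fps.append(AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits))
    return fps

gen_fps, test_fps = fingerprints(generated_smiles), fingerprints(test_smiles)

per_mol_mean, per_mol_max = [], []
for fp in gen_fps:
    sims = DataStructs.BulkTanimotoSimilarity(fp, test_fps)   # one vs. all test molecules
    per_mol_mean.append(np.mean(sims))
    per_mol_max.append(np.max(sims))

print("average mean similarity:", np.mean(per_mol_mean))
print("average max similarity:", np.mean(per_mol_max))
```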
Interestingly, Table 1 shows higher internal diversity within the molecules generated by GP-MoLFormer-Uniq. We further extend the internal diversity analysis to the scaffolds present in the generated molecules. As shown in Table 3, scaffolds generated by GP-MoLFormer-Uniq show more internal diversity than those from baselines such as CharRNN, VAE, and MolGen-7b, suggesting that training on data at scale promotes diversity within generations. As shown in later sections, GP-MoLFormer-Uniq also excels at yielding a higher number of unique scaffolds when generating at scale. These results reinforce the greater utility of the proposed model, especially when tested at scale. Further analysis of scaffold novelty can be found in SI Fig. S1.
| | GP-MoLFormer-Uniq | MolGen-7b | CharRNN | VAE |
|---|---|---|---|---|
| IntDiv (scaffolds) | 0.855 | 0.842 | 0.840 | 0.847 |
Fig. 2 shows the property distributions of the different test sets, as well as of molecules generated using GP-MoLFormer-Uniq. The generated distribution shows very good reproduction of the corresponding test distribution. Furthermore, while GP-MoLFormer-Uniq's performance is estimated on a held-out test set that is of similar size, we found this test set to be more diverse in terms of number of unique scaffolds present within the set (126k compared to 124k in the ZINC-15 subset and 77k in the MOSES set) and by comparing different property distributions with that of the other baselines. More analyses on how these statistics change with training data variations and generated pool size can be found later (see Discussions).
We also examine the domain adaptation of GP-MoLFormer via downstream fine-tuning on a set of 36.7m drug-like molecules from PubChem.10 In Fig. 2, we show results for this fine-tuned model, referred to as GP-MoLFormer-Druglike. The fine-tuning set contains molecules with QED >0.6.25 Results show that the generated molecules undergo the expected distribution shift in properties. For example, the QED distribution is shifted to the right compared to GP-MoLFormer-Uniq. We also provide examples of de novo generated molecules in SI Fig. S3.
It is important to note that all of this analysis is intended simply to show that our model balances reproducing the training distribution with generating novel, diverse, and unique outputs, both at the SMILES and at the scaffold level. While there are diminishing returns to matching the training distribution ever more closely, we show that increasing the size and diversity of the training data is one way to produce better quality molecules.
O)N(CCN1*)* would score 2 + 0 = 2, while C1N(*)C(=O)N(*)C1 would score 12 + 3 = 15. If multiple representations are equally optimal, we save all of them. During the generation step, we provide the candidates produced in this pre-processing step as input to GP-MoLFormer.
Next, the task is to generate multiple possible candidates for the first attachment point of an input scaffold. We collect all valid candidates from that generation and then generate multiple possible candidates for the next attachment point of the newly extended scaffolds. This process is repeated until all attachment points are decorated, after which we collect all valid generated molecules (see the sketch below).
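A schematic sketch of this iterative decoration loop follows; `sample_decorations` stands in for a hypothetical wrapper around GP-MoLFormer sampling at the next attachment point, and the token-level prompting details are simplified.

```python
# Schematic sketch of the iterative scaffold-decoration loop described above.
# `sample_decorations` is a hypothetical callable that returns candidate
# fragments for the first remaining attachment point ("*") of a partial
# SMILES; the real prompting/token-level mechanics are simplified here.
from rdkit import Chem

def decorate_scaffold(scaffold_smiles, sample_decorations, n_candidates=32):
    partial = [scaffold_smiles]          # partially decorated scaffolds to extend
    finished = []
    while partial:
        current = partial.pop()
        if "*" not in current:
            # all attachment points filled; keep only molecules that parse
            if Chem.MolFromSmiles(current) is not None:
                finished.append(current)
            continue
        # ask the generator for candidate fragments at the next attachment point
        for fragment in sample_decorations(current, n_candidates):
            extended = current.replace("*", fragment, 1)
            partial.append(extended)
    return finished
```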
We compare the performance of GP-MoLFormer in terms of generating DRD2-active molecules that pass the DRD2 binding classifier (p > 0.5). As baselines, we consider our own random generations from GP-MoLFormer, as well as an earlier scaffold-conditioned generation model26 that was specifically trained for scaffold decoration and was then used to decorate the same scaffolds under investigation here with fragments from ChEMBL. In contrast to this baseline model, GP-MoLFormer has not seen a scaffold-constrained generation task during pre-training, nor is it specifically fine-tuned for this purpose. Table 4 shows that GP-MoLFormer generates more DRD2-active hits than a random de novo generation baseline, as well as more than a generative model trained on this specific task. Examples of scaffold-decorated molecules generated using GP-MoLFormer are shown in SI Fig. S4.
| Model | Predicted active hits (%) |
|---|---|
| Scaffold decorator26 | 3.64 |
| De novo GP-MoLFormer | 0.83 |
| Scaffold-conditioned GP-MoLFormer | 4.58 |
We exploit prompt-tuning to introduce a novel means of enabling GP-MoLFormer to tackle property-specific molecular optimization tasks, where the goal is to generate molecules with a specific property value above a pre-defined threshold. Below, we describe the pair-tuning framework and then show that pair-tuning performs well on a set of three tasks. We evaluate pair-tuning using GP-MoLFormer on three property optimization benchmarks, namely drug-likeness (QED) maximization, penalized log P maximization, and activity maximization for DRD2. The first two properties, QED and penalized log P, are important considerations in drug discovery, and these tasks show the ability of a model to optimize salient aspects of a molecule, even if maximization of these properties by themselves is of low utility.28 The goal of the third task is to increase the binding affinity of a compound to a protein target, in this case the dopamine D2 receptor.
In this formulation, we do not need absolute property values of the molecules; only ordered pairs of molecules are required. This mimics the scenario of many drug and material development tasks, in which two molecules are compared with each other to guide molecular optimization and prioritization, especially when available data is limited. For example, Matched Molecular Pair (MMP) analysis allows the rapid estimation of property differences.29,30 However, MMP analysis is limited to comparing close molecular derivatives and common molecular derivations, and it can fail to model important chemical contexts. The present formulation is free from such constraints and only aims to learn task-specific soft prompts that generate more optimal molecules given a seed molecule.
Penalized log P optimization.
Table 5 shows results of pair-tuning on GP-MoLFormer, as well as of the baselines, in terms of generating molecules with high penalized log P. Penalized log P is calculated as log P − SA − max(largest ring size − 6, 0), i.e., log P penalized by the synthetic accessibility (SA) score and by the largest ring size if it exceeds 6. We report pair-tuning performance as a function of two different values of k, where k is the number of targeted generation attempts per molecule. For k = 125, a test set containing 800 molecules gives a total of 100k generated molecules, the same number used for the baselines. The baselines under consideration are JT-VAE,15 GCPN,31 MolDQN,28 MARS,32 GraphDF,33 and LIMO.16 Penalized log P can be artificially inflated simply by generating longer molecules, specifically by adding alkyl carbons.16,32 Many works, e.g., GCPN, MolDQN, and LIMO, avoid this by reporting top property scores under length constraints, e.g., limiting the length to the maximum molecule size of the ZINC250k dataset.34 MARS, on the other hand, does not apply such a length constraint. We also report the top 3 scores for pair-tuning with a length constraint (length < 38), applied post generation, as the values within parentheses in Table 5. Compared to the strongest length-constrained baselines, pair-tuning generates molecules with comparably high values. When the length constraint is not applied, pair-tuning still generates molecules with higher but reasonable penalized log P values. Note that pair-tuning does not require feedback or evaluation of generations from an additional reward model or property predictor, nor is the generative model updated during the tuning. We also report the top 3 scores for 1m generations (k = 1000), which require less than an hour to generate. Although all the baselines produce molecules with 100% validity because they use SELFIES or graph representations, our method's validity is still very high (around 95%), a negligible gap given the ease of generating additional molecules. Altogether, Table 5 shows that the proposed method can generate molecules with even higher penalized log P values, both with and without a length constraint.
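A sketch of this objective in RDKit is given below; it follows the raw formula stated above (some works additionally standardize each term by ZINC250k statistics), and it assumes the SA scorer shipped in RDKit's Contrib directory.

```python
# Sketch of the penalized log P objective as defined above:
# logP minus the synthetic accessibility (SA) score minus a penalty for the
# largest ring if it exceeds 6 atoms. Uses RDKit; the SA scorer ships in
# RDKit's Contrib directory as `sascorer`.
import os
import sys
from rdkit import Chem
from rdkit.Chem import Crippen, RDConfig

sys.path.append(os.path.join(RDConfig.RDContribDir, "SA_Score"))
import sascorer  # SA score implementation bundled with RDKit Contrib

def penalized_logp(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    log_p = Crippen.MolLogP(mol)
    sa = sascorer.calculateScore(mol)
    ring_sizes = [len(r) for r in mol.GetRingInfo().AtomRings()]
    ring_penalty = max(max(ring_sizes) - 6, 0) if ring_sizes else 0
    return log_p - sa - ring_penalty

print(penalized_logp("CCCCCCCCCC"))  # simple alkane example
```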
Penalized log P and QED optimization. Pair-tuning is performed using frozen GP-MoLFormer. Baseline performances are taken from Zhou et al. (2019)28 and Eckmann et al. (2022)16 and are reported on 100k generations, as per LIMO.16 For GP-MoLFormer, we set k, the number of targeted generation attempts per molecule, to 125; given a test set of size 800, this results in 100k total generations. Values in parentheses are after post hoc length filtering. Bold values indicate the highest property values found (both length-constrained and unconstrained).
| Model | Pen. log P 1st | Pen. log P 2nd | Pen. log P 3rd | Validity | QED 1st | QED 2nd | QED 3rd | Validity |
|---|---|---|---|---|---|---|---|---|
| JT-VAE | 5.30 | 4.93 | 4.49 | 100% | 0.925 | 0.911 | 0.910 | 100% |
| MARS | 45.0 | 44.3 | 43.8 | 100% | 0.948 | 0.948 | 0.948 | 100% |
| GraphDF | 13.7 | 13.2 | 13.2 | 100% | 0.948 | 0.948 | 0.948 | 100% |
| LIMO on z | 6.52 | 6.38 | 5.59 | 100% | 0.910 | 0.909 | 0.892 | 100% |
| LIMO | 10.5 | 9.69 | 9.60 | 100% | 0.947 | 0.946 | 0.945 | 100% |
| GCPN | 7.98 | 7.85 | 7.80 | 100% | 0.948 | 0.947 | 0.946 | 100% |
| MolDQN-bootstrap | 11.84 | 11.84 | 11.82 | 100% | 0.948 | 0.944 | 0.943 | 100% |
| Pair-tuning (k = 125) | 13.18 (7.12) | 12.24 (6.61) | 11.51 (6.40) | 94.7% | 0.948 | 0.947 | 0.947 | 94.7% |
| Pair-tuning (k = 1000) | 19.59 (9.35) | 15.51 (8.93) | 15.27 (8.64) | 94.5% | 0.948 | 0.948 | 0.948 | 94.5% |
In addition to penalized log P, we show results for QED optimization in Table 5 (see also SI Fig. S5 for generated molecules), compared with the same baselines. Again, pair-tuning performances are reported for two different values of k, showing comparable performance with respect to the baselines. SI Tables S2 and S3 further demonstrate that pair-tuning with GP-MoLFormer produces higher-scoring molecules that also show high diversity as well as high closeness to the training distribution, compared to baselines, which is consistent with the results in Table 3. To further establish the usefulness of pair-tuning, we also compare it with full fine-tuning of GP-MoLFormer on the high-QED molecules from the same training set. Results show that full fine-tuning of the base model triggers a collapse in terms of unique generations. Details are available in SI Table S4.
| Model | Predicted activity score | Average seed score |
|---|---|---|
| Mol-CycleGAN | 0.381 | 0.179 |
| Gargoyles | 0.782 | 0.122 |
| Pair-tuning | 0.844 | 0.007 |
To summarize, GP-MoLFormer is trained on a dataset of 650m–1.1b SMILES, which captures the relative abundance of molecules, as well as the presence of the same molecule in different contexts, as found in chemical databases, and it is evaluated on generations up to the scale of billions. This is in contrast to existing molecular generation benchmarks, which report performance metrics for relatively small sets of 10–30k generations, and to current generative molecular models, which are designed to target a specific distribution of molecules, e.g., synthetic molecules with biological activity or natural products, and are trained on 1–100m samples.14
We report in Table 7 the percentage of novel (unseen in training), valid (syntactically correct), and unique (not previously generated) molecules for both GP-MoLFormer and GP-MoLFormer-Uniq, for generation sizes of 30k to 10b. The results show that the fraction of novel generations stays at a consistent ≈32% for GP-MoLFormer when the number of total generated molecules is below 1b. Novelty in GP-MoLFormer-Uniq is ≈5–8% higher than that of GP-MoLFormer for all generation pool sizes. At or beyond 1b generations, the fraction of novel and unique generations drops but remains significant. Even at 10b generations, GP-MoLFormer generates 16.7% novel molecules and GP-MoLFormer-Uniq 21.4%. GP-MoLFormer, irrespective of training data, outputs chemically valid SMILES almost all the time; while the percentage of valid molecules drops slightly with increasing generation pool size, it remains over 99% at 10b generations.
(Training size = 650m corresponds to the de-duplicated GP-MoLFormer-Uniq; 1.1b corresponds to GP-MoLFormer.)

| Generation size | Novel (650m) | Unique (650m) | Valid (650m) | Novel (1.1b) | Unique (1.1b) | Valid (1.1b) |
|---|---|---|---|---|---|---|
| 30k | 0.390 | 0.997 | 1.000 | 0.323 | 0.997 | 0.997 |
| 100k | 0.393 | 0.996 | 0.999 | 0.326 | 0.998 | 0.998 |
| 1m | 0.395 | 0.996 | 0.999 | 0.323 | 0.996 | 0.997 |
| 10m | 0.400 | 0.991 | 0.996 | 0.322 | 0.989 | 0.997 |
| 100m | 0.385 | 0.947 | 0.996 | 0.327 | 0.989 | 0.997 |
| 1b | 0.340 | 0.675 | 0.996 | 0.278 | 0.611 | 0.997 |
| 10b | 0.214 | 0.270 | 0.996 | 0.167 | 0.223 | 0.997 |
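For reference, the three fractions can be computed along the following lines; the exact canonicalization and denominator conventions used in the paper's bookkeeping are assumptions here.

```python
# Sketch of the three fractions reported in Table 7. Novelty counts valid,
# canonicalized generations not present in the (canonicalized) training set;
# uniqueness counts generations not produced before; validity counts SMILES
# that RDKit can parse. Denominators here are over all generations and may
# differ slightly from the paper's bookkeeping.
from rdkit import Chem

def generation_stats(generated_smiles, train_canonical_set):
    n_valid, seen, n_unique, n_novel = 0, set(), 0, 0
    for smi in generated_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        n_valid += 1
        can = Chem.MolToSmiles(mol)        # canonical form for comparison
        if can in seen:
            continue
        seen.add(can)
        n_unique += 1
        if can not in train_canonical_set:
            n_novel += 1
    n = len(generated_smiles)
    return {"valid": n_valid / n, "unique": n_unique / n, "novel": n_novel / n}
```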
Additionally, when comparing the 10b molecules generated by the GP-MoLFormer and by the GP-MoLFormer-Uniq model, 67 to 74% of the novel molecules generated by a model are unique to that model (i.e., not in the other model's generated set). This implies that the two models learned separate but overlapping manifolds. This aspect of different coverage of the molecular manifold with different model variants will be investigated further in future work.
This result confirms that (i) GP-MoLFormer trained on over a billion SMILES memorizes training samples, as seen from the high number of exact matches with training molecules (1 − novelty, which can be up to 60%); (ii) memorization of the training data is reduced when that data is de-duplicated, enabling more novel generation; and (iii) with scaling of inference compute, novelty in generations decreases but remains significant, even when evaluated over ≈10b generations. In summary, in all cases studied here, GP-MoLFormer is capable of generating novel, diverse, and valid molecules.
Data de-duplication is a first step toward removing such bias, as it reduces the concentration of the high-density regions of the data manifold. In this case, de-duplication removes isomeric population information as well as repeated molecules across databases. The de-duplicated data is closer to a data manifold with a more homogenized density throughout. Training on such a data manifold results in higher novelty in generations, as found for GP-MoLFormer-Uniq when compared to GP-MoLFormer (Table 7).
Although many existing molecular generative models trained on much smaller and more focused datasets have demonstrated near-perfect (100%) novelty in generations, they are for the most part not suitable for studying the trade-off between training data memorization and generation novelty. Investigating this phenomenon requires a generative chemical (language) model that has been trained on a broader-purpose and much larger dataset. Our experiments address this under-explored aspect. As shown in Table 7, novelty in GP-MoLFormer generations is lower than the ≈100% reported by baselines,14 but still sufficiently high for practical use. When compared with recent baselines, GP-MoLFormer generations are more dissimilar to test molecules (see the earlier sections and Table 1), though GP-MoLFormer's test set is more diverse. Finally, the lower novelty in GP-MoLFormer's generations reflects modeling of its vast training set, which represents the relative usage of molecules in the real world.
Similarly, the present study highlights the importance of studying generated sets of different sizes to obtain a comprehensive view of the quality of generations, particularly when the generative model is trained on data at scale. As GP-MoLFormer-Uniq aims to capture a training data manifold of more uniform density, which is enabled by de-duplicating the training SMILES, we see a 1% rise in novelty as we increase the number of generated samples from 30k to 10m. A similar observation has been reported in image generation38 and language generation.7,39 To summarize, novelty in generations is influenced by the support provided by both the training distribution and the generated distribution, and should therefore be assessed relative to the sizes and diversity of those two sets.
These results in Table 7 complement and support earlier efforts focusing on studying scaling behaviors of chemical language models. One such noteworthy effort along this line is Frey et al. (2023),40 where neural-scaling behavior in large chemical models was investigated by studying models with over 1b parameters and a scaling relation following a power law was established between training loss and model parameters. However, the models tested in that work were only pre-trained on datasets of size up to ≈10m data points, which is very small compared to the size of the chemical universe. In Ross et al. (2022),13 the scaling behavior of MoLFormer, which is a transformer-based molecular encoder built using a masked language modeling objective, was studied. That work clearly established the scaling behavior underlying adaptation of a pre-trained model across downstream tasks, in which the number of model parameters was up to 47m while the number of training points considered was >1b. It was shown that a MoLFormer trained on 100m SMILES consistently underperformed across a wide variety of property prediction tasks, including quantum mechanical and physiological, when compared to the model trained on >1.1b SMILES, indicating predictive ability may benefit from such bias in training data. In contrast, the results in Table 7 show that a generative chemical language model trained on cleaner de-duplicated training data produces more novel generations.
We next investigate how these metrics change with a varying number of generated and test molecules. Table 8 shows that, with increasing generated pool size, scaffold similarity with respect to the test molecules becomes >0.9 while SNN reaches >0.5 when compared against the 175k held-out test samples. When a larger test set of 1m molecules is used, further increases in both scaffold similarity and SNN are observed. These results imply that, with increasing size and diversity of the training data, the typical metrics used in assessing molecular generative models, such as various similarity measures with respect to a test set, should be carefully analyzed with generation and test sets that are larger than what is typically used in the field. Note that, even for 1m generations, GP-MoLFormer produces highly diverse molecules.
(Frag, Scaf, SNN, IntDiv, and FCD are MOSES metrics; DNN, IntDiv2, and FMD are MoLFormer-based metrics.)

| Test size | Gen. size | Frag↑ | Scaf↑ | SNN↑ | IntDiv↑ | FCD↓ | DNN↓ | IntDiv2↑ | FMD↓ |
|---|---|---|---|---|---|---|---|---|---|
| 175k | 30k | 0.9998 | 0.7383 | 0.5045 | 0.8655 | 0.0591 | 6.970 | 13.10 | 0.1844 |
| 175k | 100k | 0.9998 | 0.8653 | 0.5045 | 0.8657 | 0.0279 | 6.967 | 13.10 | 0.1025 |
| 175k | 1m | 0.9998 | 0.9375 | 0.5040 | 0.8658 | 0.0178 | 6.970 | 13.11 | 0.0741 |
| 1m | 30k | 0.9998 | 0.7702 | 0.5738 | 0.8655 | 0.0646 | 6.180 | 13.10 | 0.1684 |
| 1m | 100k | 0.9998 | 0.9026 | 0.5740 | 0.8657 | 0.0331 | 6.179 | 13.10 | 0.0874 |
| 1m | 1m | 0.9998 | 0.9786 | 0.5739 | 0.8658 | 0.0227 | 6.183 | 13.11 | 0.0600 |
(Fraction of unique scaffolds as a function of the number of generated molecules.)

| Gen. mol. | GP-MoLFormer-Uniq | MolGen-7b | CharRNN | VAE |
|---|---|---|---|---|
| 10k | 0.839 | 0.840 | 0.714 | 0.724 |
| 100k | 0.742 | 0.723 | 0.525 | 0.533 |
| 1m | 0.581 | 0.550 | 0.326 | 0.326 |
| 10m | 0.388 | 0.343 | 0.163 | 0.160 |
y = a·e^(−bx) (1)

y = a·e^(−10^c·x) (2)
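As an illustration, such exponential-decay forms can be fit with SciPy as sketched below; treating x as log10 of the generation size and using the y values from the 1.1b-training column of Table 7 are assumptions made here for the example, and the paper's exact fitting protocol may differ.

```python
# Illustrative sketch of fitting the exponential-decay form of eqn (1),
# y = a*exp(-b*x), to novelty-versus-generation-size data with SciPy.
# x = log10(number of generations) and the y values (from the 1.1b-training
# column of Table 7) are illustrative assumptions, not the paper's protocol.
import numpy as np
from scipy.optimize import curve_fit

def decay(x, a, b):
    return a * np.exp(-b * x)

x = np.log10([3e4, 1e5, 1e6, 1e7, 1e8, 1e9, 1e10])
y = np.array([0.323, 0.326, 0.323, 0.322, 0.327, 0.278, 0.167])

(a, b), _ = curve_fit(decay, x, y, p0=(0.35, 0.01))
print(f"fitted a = {a:.3f}, b = {b:.4f}")
```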
To better model the positional dependence of tokens within a SMILES string, MoLFormer deviates from the default absolute position embeddings and instead uses rotary position embeddings.44 Given the size of the transformer model and the efficient linear attention, GP-MoLFormer takes only around 3 milliseconds for a single forward pass during generation on a single A100 GPU.
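For readers unfamiliar with rotary embeddings, the sketch below shows the standard RoPE operation (base frequency 10000, GPT-NeoX-style half-split) applied to a query or key tensor; GP-MoLFormer's actual implementation details may differ.

```python
# Compact sketch of rotary position embeddings (RoPE) applied to a query or
# key tensor of shape (batch, seq_len, dim). Follows the standard formulation;
# the base frequency and half-split layout are conventional choices.
import torch

def rotary_embed(x, base=10000.0):
    b, seq_len, dim = x.shape
    half = dim // 2
    # per-dimension rotation frequencies and per-position angles
    inv_freq = 1.0 / (base ** (torch.arange(0, half, dtype=torch.float32) / half))
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos()[None, :, :], angles.sin()[None, :, :]
    x1, x2 = x[..., :half], x[..., half:]
    # rotate each (x1, x2) pair by its position-dependent angle
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(2, 16, 64)
q_rot = rotary_embed(q)   # same shape, positions encoded by rotation
```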
In order to scale our training to large datasets (>1b data points), we relied on adaptive bucketing of mini-batches by sequence length, as well as parallelization via distributed data-parallel training. The combination of linear attention, bucketing, and data parallelism allowed us to reduce the number of GPUs needed from roughly 1000 for quadratic attention with no bucketing to 16.
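A much-simplified illustration of length bucketing is shown below; the actual adaptive bucketing and distributed data-parallel setup are more involved, so treat this only as a sketch of the idea.

```python
# Simplified sketch of length bucketing: sort tokenized SMILES by length and
# slice consecutive chunks into mini-batches so that sequences of similar
# length are padded together, reducing wasted computation on padding.
import random

def bucketed_batches(tokenized, batch_size, shuffle=True):
    order = sorted(range(len(tokenized)), key=lambda i: len(tokenized[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if shuffle:
        random.shuffle(batches)   # shuffle batch order; lengths stay similar within a batch
    for batch in batches:
        yield [tokenized[i] for i in batch]
```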
644 molecule pairs where the first/seed molecule has a QED value in the range of 0.7–0.8 while the second/target molecule has a QED of 0.9–1.0. The penalized log P paired data consists of 60 227 molecule pairs. It should be noted that, while the paired datasets were collected such that molecular similarity within a pair is 0.4 and 0.6 for QED and log P, respectively, we demonstrate pair-tuning only on unconstrained property optimization tasks; we do not account for similarity preservation. The test set size for both QED and penalized log P optimization was 800. For the DRD2 binding optimization task, we used 34 404 molecule pairs from ZINC and Olivecrona et al. (2017)36 for training and a test set of 1000 molecules.49 For scoring the generated molecules, the bioactivity prediction model from Olivecrona et al. (2017)36 is used; inactive compounds were defined as those with p < 0.05 and actives as those with p > 0.5.
The vocabulary includes 20 randomly initialized prompt embeddings as well as the <unk> embedding from GP-MoLFormer training. For training, we prepend all 20 prompt embeddings to the <bos> embedding, followed by the embeddings of the first/seed molecule in a given pair. We then add the <unk> embedding at the end of the first/seed molecule, followed by the embeddings of the target molecule and, finally, the <eos> embedding.
For evaluation, we do a forward pass using the following sequence: the 20 prompt embeddings + the <bos> embedding + the input molecule embeddings + the <unk> embedding. We then sample from the token distribution generated by GP-MoLFormer until <eos> is encountered. For all pair-tuning experiments, the batch size was set to 35, the learning rate was fixed at 3 × 10^−2, and the number of epochs was 1000. Each epoch took 6 minutes to complete on a single GPU.
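A schematic of this input layout in PyTorch is given below; the embedding width, token-id names, and the `embed` lookup are placeholders standing in for the frozen GP-MoLFormer components.

```python
# Schematic sketch of the pair-tuning input layout described above.
# `embed` stands in for the frozen GP-MoLFormer token-embedding lookup and
# `soft_prompt` holds the 20 trainable prompt vectors; the embedding width
# and token-id names are assumed placeholders.
import torch

n_prompts, d_model = 20, 768                     # d_model is an assumed width
soft_prompt = torch.nn.Parameter(torch.randn(n_prompts, d_model) * 0.02)

def build_pair_input(seed_ids, target_ids, embed, bos_id, unk_id, eos_id):
    """Training sequence: [soft prompts] <bos> seed <unk> target <eos>."""
    tokens = torch.tensor([bos_id] + seed_ids + [unk_id] + target_ids + [eos_id])
    token_emb = embed(tokens)                          # (len, d_model)
    return torch.cat([soft_prompt, token_emb], dim=0)  # prepend trainable prompts

def build_eval_prefix(seed_ids, embed, bos_id, unk_id):
    """Generation prefix: [soft prompts] <bos> seed <unk>, then sample until <eos>."""
    tokens = torch.tensor([bos_id] + seed_ids + [unk_id])
    return torch.cat([soft_prompt, embed(tokens)], dim=0)
```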
Supplementary information contains additional context, descriptions, experiments, and output samples from the model. See DOI: https://doi.org/10.1039/d5dd00122f.