Mauricio Cafiero
Department of Chemistry, University of Reading, Reading, RG1, UK. E-mail: m.cafiero@reading.ac.uk
First published on 19th June 2025
Token generation in generative pretrained transformers (GPTs) that produce text, code, or molecules often uses conventional approaches such as greedy decoding, temperature-based sampling, or top-k or top-p techniques. This work shows that for a model trained to generate inhibitors of the enzyme HMG-coenzyme-A reductase, a variable temperature approach using a temperature ramp during the inference process produces larger sets of molecules (screening libraries) than those produced by either greedy decoding or single-temperature-based sampling. These libraries also have lower predicted IC50 values, lower docking scores, and lower synthetic accessibility scores than libraries produced by the other sampling techniques, especially when used with very short prompt-lengths. This work explores several variable-temperature schemes when generating molecules with a GPT and recommends a sigmoidal temperature ramp early in the generation process.
In this work and in other work cited below, GPTs are discussed extensively. A GPT is simply a neural network model (often an LM) that has been trained on some body of text—or, in this work, SMILES strings—in order to learn the rules, or grammar, of the text. This is known as pre-training. The GPT is then fine-tuned on a specific, usually smaller, dataset to complete a task; in this case, generating novel SMILES strings. These neural networks include a component called a transformer, which uses a technique called self-attention to determine which tokens typically occur before or after a given token,2 such as putting “bike” after “my” in the example above. A transformer decoder specifically is trained to predict which tokens come next in a sequence, while an encoder takes context from both before and after the token in question.
Unlike the natural language generation (NLG) needed for LMs, SMILES strings have a more rigid set of “grammar” rules: a string with c1ccccc in the sequence has to have a “1” either next or at some point soon so as to close the aromatic ring; if there is no 1 after this point the SMILES string is not viable. This grammar rigidity is similar to coding, wherein certain constructs must follow a particular syntax, such as
for (int i = 0; i < n; i++) {...}
These variable temperature approaches show that challenging and confident tokens can and should be generated at different temperatures. In molecular generation, confident tokens lead to predictable structures, while challenging tokens lead to more variability. For example, the beginning of a SMILES string can be considered more challenging, as there are many options for how the molecule's structure can develop, while the middle and end of SMILES strings have more confident tokens, as they must follow the grammar rules in order to complete the structure correctly. When trying to generate novel molecules, having a lower temperature at the beginning, where a challenging token can lead to a nonsensical structure, can help produce viable SMILES strings, while higher temperatures near the end, where there are more confident tokens, can lead to greater variability.
Other molecule generation models typically use standard token sampling techniques. Bagal et al. trained a GPT to generate molecules with tuned properties.6 Their model uses only T = 1.0 token sampling, which returns the native probability distributions, i.e. no scaling of probabilities is performed, so a narrow distribution will remain narrow. Two other recent transformer-based molecule generation models, by Tysinger et al.7 and Yang et al.,8 and an RNN-based generation model by Urbina et al.9 make no mention of temperature in the generation process, suggesting the use of greedy decoding. The transformer- and RNN-based “Reinvent” model of Loeffler et al.10 makes use of constant-temperature sampling as well as beam search, wherein a set of generated SMILES strings is kept during generation and the best are selected using log-probabilities. Tibo et al. have published a transformer model that does not emphasize sampling a wide chemical space, but rather searches for similar molecules.11 This model also uses beam search, and no mention of temperature is made. The RNN-based bidirectional generative model of Grisoni et al. can build a molecular SMILES string in both the forward and backward directions, and uses temperature-based sampling at T = 0.7.12 Chang and Ye use a transformer encoder model and bimodal inputs to generate novel molecules using both greedy decoding and stochastic token selection.13 The transformer decoder of Ross et al. generates molecules using temperature-based sampling at T = 1.0.14 Sob et al. have trained a variational autoencoder within a transformer encoder-decoder framework to generate new molecules using reinforcement learning based on docking scores to specific targets.15 This type of model, since it generates from a latent-space representation, cannot implement a temperature-like parameter equivalent to those discussed here, though the generation of latent-space representations can be based on such a variable. Another recent non-transformer-based generative model (a pixel-CNN) by Noguchi and Inoue likewise makes no mention of generation temperature.16 The current work thus appears to be unique in its use of temperature during the generation process, as no other molecular generator reports using dynamic variable-temperature sampling for token generation.
In this work, a previously trained and calibrated GPT is used to generate libraries of molecules using greedy decoding, temperature-based sampling, and dynamic variable temperature sampling. The molecule libraries generated are evaluated using a previously published deep neural network (DNN) trained to predict HMGCR IC50 scores with a training score of 0.92 and a validation score of 0.84. The libraries of molecules are also evaluated by docking calculations, synthetic accessibility scores, quantitative estimates of druglikeness, and various similarity measures. A sigmoidal temperature ramp with a high final temperature is shown to be the most effective generation technique when used with very short prompt-lengths.
To generate a molecule from a prompt, at each step k of the generation process, the existing set of tokens (or just the prompt, if it is the first step) is passed through the model, and the probability Pk(i) that each of the 85 tokens i in the vocabulary is the next token is calculated. At T = 0.0, or greedy decoding, the token with the highest probability is chosen each time. In temperature-based sampling, at each step, the probabilities are scaled according to the temperature:
$$P_k^{T}(i) = \frac{P_k(i)^{1/T}}{\sum_{j=1}^{85} P_k(j)^{1/T}} \qquad (1)$$
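As an illustration only (not the code used in this work), eqn (1) and greedy decoding can be implemented in a few lines of Python, operating on the raw logits that underlie Pk(i):

```python
import numpy as np

_rng = np.random.default_rng()

def sample_token(logits: np.ndarray, T: float) -> int:
    """Select the next token from the model's next-token logits.

    T = 0.0 reduces to greedy decoding; T > 0 rescales the
    distribution as in eqn (1) before stochastic sampling.
    """
    if T <= 0.0:
        return int(np.argmax(logits))   # greedy decoding
    z = logits / T                      # temperature scaling
    z -= z.max()                        # subtract max for numerical stability
    probs = np.exp(z)
    probs /= probs.sum()
    return int(_rng.choice(probs.size, p=probs))
```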
In the variable-temperature token generation used in this work, the token selection process switches between greedy decoding and temperature-based sampling while the molecule is being generated, and the temperature increases or decreases during the process as well. In the ramps below, χ = k/kmax is the fraction of the maximum generation length completed at step k. Three increasing temperature schemes were tested. First, a slowly increasing exponential:
$$T = T_0\,\chi e^{\chi - 1} \qquad (2)$$
second, a rapidly increasing exponential:

$$T = T_0\left[1 - e^{-\chi} + \chi e^{-\chi}\right] \qquad (3)$$
and third, an increasing sigmoid:

$$T = T_0\left[1 + e^{-s(\chi - c)}\right]^{-1} \qquad (4)$$

where s sets the steepness of the sigmoid and c is the fraction of the generation at which it is centred (c = 0.5 unless otherwise noted).
Three decreasing temperature schemes were also tested: two decreasing exponentials,

$$T = T_0(1 - \chi)e^{-\chi} \qquad (5)$$

$$T = T_0(1 - \chi)e^{\chi} \qquad (6)$$

and a decreasing sigmoid:

$$T = T_0\left\{1 - \left[1 + e^{-s(\chi - c)}\right]^{-1}\right\} \qquad (7)$$
Fig. 2 Variable temperature sampling schemes with an increasing ramp beginning at zero and ending at 0.5 (eqn (2)–(4), left) and a decreasing ramp beginning at 0.5 and ending at zero (eqn (5)–(7), right). Traces are: slowly increasing/decreasing exponential (blue), rapidly increasing/decreasing exponential (orange), sigmoid (green).
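For illustration, eqn (2)–(7) can be written as short Python functions of χ; the sigmoid steepness s is an assumed free parameter, since the equations as reconstructed fix only the centre c:

```python
import numpy as np

def ramp_temperatures(chi, T0=0.5, s=20.0, c=0.5):
    """Temperature ramps of eqn (2)-(7) as functions of the generation
    fraction chi in [0, 1]; T0 is the maximum temperature of the ramp."""
    return {
        "eqn2": T0 * chi * np.exp(chi - 1.0),                       # slowly increasing
        "eqn3": T0 * (1.0 - np.exp(-chi) + chi * np.exp(-chi)),     # rapidly increasing
        "eqn4": T0 / (1.0 + np.exp(-s * (chi - c))),                # increasing sigmoid
        "eqn5": T0 * (1.0 - chi) * np.exp(-chi),                    # decreasing exponential
        "eqn6": T0 * (1.0 - chi) * np.exp(chi),                     # decreasing exponential
        "eqn7": T0 * (1.0 - 1.0 / (1.0 + np.exp(-s * (chi - c)))),  # decreasing sigmoid
    }
```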
The final temperature ramp used in this work, based on the results obtained with eqn (2)–(7), was an increasing sigmoid activated at 10% of the total number of generation steps. This ramp is shown in Fig. 3. In all temperature ramps used in this work, T = 0.0 (greedy decoding) was used for any temperature less than 0.015, in order to improve numerical stability in the generation process.
Fig. 3 Sigmoidal variable temperature sampling scheme with an increasing ramp beginning at zero and ending at 0.5. Sigmoid centered at 10% of the maximum token length. |
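A minimal sketch of this S10 schedule, combining the 10%-centred sigmoid with the 0.015 greedy-decoding cutoff (the steepness s is again an assumed parameter):

```python
import math

def s10_temperature(k: int, k_max: int, T_final: float = 0.5,
                    s: float = 20.0) -> float:
    """S10 ramp: increasing sigmoid activated at 10% of the generation
    steps; temperatures below 0.015 fall back to greedy decoding (0.0)."""
    chi = k / k_max                                    # fraction of steps completed
    T = T_final / (1.0 + math.exp(-s * (chi - 0.10)))  # sigmoid centred at 10%
    return 0.0 if T < 0.015 else T
```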
In this work, libraries of up to 5000 molecules were generated for each of the four prompt lengths at four set temperatures: 0.0, 0.5, 1.0, and 2.0. Libraries were then also generated using eqn (2)–(7) to vary the temperature during the generation process, all beginning or ending at T = 0.5. This temperature was chosen as it was found to produce robust, potent libraries in the previous work.1 Finally, the increasing sigmoid of eqn (4), activated at 10% of the number of generation steps, was used, ending at four temperatures: 0.5, 1.0, 1.5 and 2.0. Overall, this produced fourteen temperature variations for each of the four prompt lengths, or fifty-six libraries in total.
The libraries were analysed for the presence of the HMG coenzyme-A pharmacophore (Fig. 1c) as well as several moieties typical of type I and II statins: a fluorophenyl ring and a methane sulfonamide group (both found in type II statins), and a butyryl group and decalin ring (both found in type I statins). The presence of the decalin ring and butyryl group (Fig. 4) are the defining characteristics of a type I statin; type II statins are fully synthetic compounds that often (but not always) have a fluorophenyl ring replacing the butyryl group and are in general larger and bulkier than type I statins (Fig. 4). The counts for these moieties in each library are presented in the supporting data (ESI†).
Tanimoto similarities27 between every pair of molecules in each library and between each molecule in each library and a set of known statin molecules were calculated by using Morgan fingerprints28 of radius 2, which is roughly equivalent to extended connectivity fingerprints of diameter 4. The percentages of each library that showed a greater than 0.25 similarity to Atorvastatin and Simvastatin (representative type II and type I statins, respectively) are shown in Table 1 (%A and %S). Percent similarities to other statins are shown in the supporting data (ESI†), and largely follow the patterns for the representative type I and II molecules. Also in the supporting data (ESI†) are the percentages of pairs in each library that have a similarity of more than 0.25. This characteristic can serve as a measure of the diversity of each library, as a higher percentage of similar pairs means that the library covers a smaller chemical space.
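A sketch of this similarity calculation with RDKit; the radius of 2 follows the text, while the 2048-bit fingerprint length is an assumption:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str) -> float:
    """Tanimoto similarity between two molecules using Morgan
    fingerprints of radius 2 (roughly ECFP4-equivalent)."""
    mols = [Chem.MolFromSmiles(s) for s in (smiles_a, smiles_b)]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    return DataStructs.TanimotoSimilarity(fps[0], fps[1])

# e.g. the fraction of a library more than 0.25 similar to a reference:
# pct = 100 * sum(tanimoto(s, ref_smiles) > 0.25 for s in library) / len(library)
```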
Table 1 Libraries generated with greedy decoding (T = 0.0) and constant-temperature sampling for each prompt length

Prompt | T | Valid | Usable | <μM | Mean IC50 (nM) | Mean score (kcal mol−1) | Mean SAS | %A | %S | p (ln-IC50 vs. score)
---|---|---|---|---|---|---|---|---|---|---
Scaffold (SC) | 0.0 | 1429 | 2 | 0 | — | — | — | — | — | —
 | 0.5 | 1849 | 114 | 27 | 278 | −7.39 | 3.77 | 44 | 0 | 0.42
 | 1.0 | 1756 | 485 | 79 | 309 | −7.42 | 3.92 | 41 | 1 | 0.39
 | 2.0 | 1233 | 786 | 71 | 326 | −7.1 | 4.37 | 35 | 0 | 0.17
6 tokens (6S) | 0.0 | 4590 | 367 | 44 | 260 | −7.35 | 4.16 | 25 | 16 | 0.57
 | 0.5 | 4463 | 867 | 121 | 216 | −7.58 | 4.06 | 40 | 10 | 0.55
 | 1.0 | 4042 | 1500 | 220 | 204 | −7.74 | 4 | 46 | 8 | 0.53
 | 2.0 | 1204 | 1096 | 95 | 274 | −7.51 | 4.13 | 37 | 1 | —
3 tokens (3S) | 0.0 | 4604 | 46 | 5 | 283 | −7.7 | 4.29 | 20 | 0 | 0.72
 | 0.5 | 4589 | 472 | 107 | 171 | −7.89 | 3.89 | 56 | 10 | 0.53
 | 1.0 | 4193 | 1328 | 255 | 177 | −7.84 | 4.01 | 48 | 12 | 0.37
 | 2.0 | 1123 | 998 | 81 | 231 | −7.53 | 4.28 | 41 | 4 | 0.32
1 token (1S) | 0.0 | 5000 | 1 | 1 | 3 | −8.4 | 3.58 | 100 | 0 | —
 | 0.5 | 4482 | 352 | 108 | 129 | −7.96 | 3.82 | 62 | 10 | 0.48
 | 1.0 | 4303 | 1406 | 285 | 164 | −7.86 | 3.92 | 55 | 12 | 0.41
 | 2.0 | 2818 | 1808 | 224 | 217 | −7.63 | 4.02 | 42 | 11 | 0.39
Finally, Pearson correlations between several of the properties presented here were calculated, including correlations between ln-IC50 and docking score, ln-IC50 and SAS, ln-IC50 and alogP, and docking score and SAS. The correlation between ln-IC50 and docking score is important for the following reason: while a docking score does not directly correlate to inhibitory power, a ligand with a strong docking score is more likely to linger in the binding site and have an inhibitory effect. This relationship follows from the thermodynamic link between the binding free energy (which the docking score estimates) and the inhibition constant:

$$\Delta G_{\mathrm{bind}} = RT\ln K_i \approx RT\ln \mathrm{IC}_{50}$$
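The correlations reported in the tables that follow can be computed as below (a sketch with hypothetical values; ln is the natural logarithm):

```python
import numpy as np
from scipy.stats import pearsonr

ic50_nm = np.array([129.0, 164.0, 217.0, 57.0])   # hypothetical library values
docking = np.array([-7.96, -7.86, -7.63, -8.00])  # hypothetical docking scores
r, _ = pearsonr(np.log(ic50_nm), docking)         # the ln-IC50/score correlation
```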
As was seen in the previous work,1 shorter prompt lengths result in lower IC50 values, with the one-token models producing the lowest IC50 values of all of the constant-temperature models: 170 nM on average, compared to 304 nM, 239 nM and 215 nM for the scaffold, six-token and three-token models. A temperature of 0.5 produces the lowest IC50 value for each prompt length (for the six-token models, T = 1.0 is marginally lower). The same pattern can be seen for the docking scores: from longest prompt to shortest, the average score goes from −7.30 kcal mol−1 to −7.55 kcal mol−1 to −7.74 kcal mol−1 to −7.96 kcal mol−1, though the temperature with the lowest score is not predictable. This correlation of trends for the IC50 values and docking scores reinforces the reliability of both methods, as discussed above. The SAS does not follow this pattern, as the value increases from the scaffold to the six-token and then to the three-token models (with higher values indicating more difficult syntheses), but the one-token models again show the lowest SAS, indicating they are on average less difficult to synthesize. The percentages of the libraries that are similar to Atorvastatin are comparable (∼40%) for all libraries except the one-token libraries, which average 65% similarity to Atorvastatin. This correlates with IC50 and docking score, as Atorvastatin is known to be a powerful inhibitor of HMGCR. The similarity to Simvastatin is less meaningful for this dataset. Finally, the ln-IC50/docking score correlations are of medium strength (scaffold) or on the medium/strong border (all other prompt lengths).
Table 2 shows the number of sub-micromolar molecules generated with each prompt length using the increasing temperature ramps described in eqn (2)–(4). The scaffold-based models did not generate any sub-micromolar molecules; since these ramps start at T = 0.0, and the T = 0.0 greedy decoding model failed to produce any sub-micromolar molecules, it follows that these temperature ramp models could not produce them either. The number of sub-micromolar molecules produced by the other models decreased with decreasing prompt length. The temperature ramp model using eqn (3) produced molecules with significantly lower average IC50 values and, for the 3-token models, a lower average docking score than the other models, including greedy decoding and constant temperature-based sampling. This behaviour is explored further below.
Table 3 shows the number of sub-micromolar molecules generated with each prompt length using the decreasing temperature ramps described in eqn (5)–(7). The numbers of molecules are significantly higher than those generated by the increasing temperature-ramp models: at least twice as many and, in two cases, about an order of magnitude more. However, in almost all cases the average IC50 values are nearly identical to those for the greedy decoding and constant-temperature sampling models, and for the one-token-based models, the average IC50 values are higher. In two specific cases (SC eqn (5) and 6S eqn (5)) the IC50 values are marginally lower. In all cases, the docking scores were similar to or slightly higher than those of the other models.
Table 3 Libraries generated with the decreasing temperature ramps of eqn (5)–(7)

Prompt | Ramp | <μM | Mean IC50 (nM) | Mean score (kcal mol−1)
---|---|---|---|---
Scaffold | Eqn (5) | 12 | 200 | −7.33
 | Eqn (6) | 8 | 285 | −7.35
 | Eqn (7) | 13 | 413 | −7.26
6 tokens | Eqn (5) | 123 | 213 | −7.49
 | Eqn (6) | 98 | 227 | −7.41
 | Eqn (7) | 117 | 256 | −7.36
3 tokens | Eqn (5) | 92 | 195 | −7.73
 | Eqn (6) | 67 | 187 | −7.73
 | Eqn (7) | 84 | 230 | −7.7
1 token | Eqn (5) | 101 | 151 | −7.89
 | Eqn (6) | 55 | 175 | −7.83
 | Eqn (7) | 84 | 173 | −7.79
The only temperature ramp model out of eqn (2)–(7) that produced a significant improvement in IC50 values was the increasing temperature ramp of eqn (3). Eqn (3) is a rapidly increasing exponential, and so other temperature ramps that exaggerated that rapid increase were tested. An increasing sigmoid (eqn (4)) was again used, but rather than having the sigmoid ramp up in the middle of the generation process (50% of the maximum tokens, 90 in this case), it was tested with the ramp occurring at 5%, 10%, and 20% of the maximum tokens. This was done by replacing the centre of 0.5 in eqn (4) with either 0.05, 0.10 or 0.20. In all cases, these models showed improvement over eqn (2)–(7) as well as over greedy decoding and constant-temperature sampling, but the 10% model was chosen as it produced molecules with a slightly lower average IC50 value than the other two models. This model is referred to as the Sigmoid 10% (S10) model in the remainder of this work. Table 4 shows the numbers of sub-micromolar molecules generated with the S10 model for each prompt length, with final temperatures of 0.5, 1.0, 1.5 and 2.0. While these numbers are slightly lower than those produced by all other temperature models for the scaffold-based generation, they are significantly larger for 6S and especially for 3S and 1S, for which the numbers of molecules produced are more than three times larger than those of the next best temperature model. The number of sub-micromolar molecules increases as the final temperature of the S10 ramp increases, with the exception of the 1S model at T = 2.0, which decreases compared to T = 1.5.
Table 4 Libraries generated with the S10 temperature ramp at four final temperatures

Prompt | Final T | <μM | Mean IC50 (nM) | Mean score (kcal mol−1) | Mean SAS | %A | %S | p (ln-IC50 vs. score)
---|---|---|---|---|---|---|---|---
Scaffold | T = 0.5 | 0 | — | — | — | — | — | —
 | T = 1.0 | 16 | 455 | −7.42 | 3.86 | 56 | 0 | 0.27
 | T = 1.5 | 38 | 362 | −7.14 | 4.17 | 42 | 0 | 0.5
 | T = 2.0 | 39 | 378 | −7.16 | 4.17 | 46 | 0 | 0.32
6 tokens | T = 0.5 | 82 | 229 | −7.5 | 4.15 | 30 | 15 | 0.56
 | T = 1.0 | 130 | 218 | −7.53 | 4.07 | 35 | 13 | 0.53
 | T = 1.5 | 182 | 223 | −7.44 | 3.95 | 40 | 12 | 0.68
 | T = 2.0 | — | — | — | — | — | — | —
3 tokens | T = 0.5 | 27 | 98 | −7.94 | 3.8 | 74 | 0 | 0.41
 | T = 1.0 | 112 | 122 | −8.09 | 3.66 | 74 | 1 | −0.03
 | T = 1.5 | 329 | 109 | −8 | 3.73 | 77 | 1 | 0.21
 | T = 2.0 | 373 | 120 | −7.95 | 3.75 | 74 | 0 | 0.24
1 token | T = 0.5 | 61 | 57 | −8 | 3.61 | 90 | 0 | 0.19
 | T = 1.0 | 263 | 71 | −8.08 | 3.61 | 85 | 0 | 0.39
 | T = 1.5 | 695 | 69 | −8.05 | 3.67 | 81 | 0 | 0.41
 | T = 2.0 | 597 | 81 | −7.97 | 3.74 | 76 | 0 | 0.43
Table 4 also shows the average IC50 values for each S10 library, and the S10 temperature ramps produce molecules with significantly lower average IC50 values for the 3S and 1S models (slightly lower for 6S and slightly higher for scaffolds). For the 1S and 3S models, the average IC50 values generally increase with increasing final temperature, with some small fluctuations. The 6S and SC models have no predictable pattern. The docking scores for each library (also Table 4) are lower than other temperature models for the 1S and 3S libraries, and slightly higher than other temperature models for the 6S and SC libraries. Average SAS for each library follows the same pattern, with the 1S and 3S libraries having lower average SAS than other temperature models, while 6S and SC show little change. The percentage of molecules in each library that have a greater than 0.25 Tanimoto similarity to Atorvastatin (type II statin) follow the same pattern: significantly increased similarity for 1S and 3S, and little change for 6S and SC. The percentage of molecules with similarity to Simvastatin (type I statin) follow the opposite trend: the percentage increases for the 6S library and decreases for the 1S and 3S libraries (the SC S10 library has zero molecules with similarity to Simvastatin). Finally, Table 4 shows the Pearson correlation between ln-IC50 and docking score for the S10 libraries. With one exception (the three-token libraries), all libraries show medium or strong correlation between these two variables, with values between ∼0.3 and ∼0.7. While the three-token libraries do have weaker correlation overall, only the T = 1.0 library has truly poor correlation (−0.03, which is not only weak, but also shows an inverse relationship). The correlations for the S10 libraries are on average slightly weaker than was found for the greedy decoding and constant-temperature based sampling libraries (Table 1), which had values between ∼0.4 and ∼0.7. Still, the amount of correlation present does reinforce the fact that IC50 and docking scores show considerable agreement and thus are likely good indicators of the molecules’ inhibitory power.
Sampling | | Scaffold | 6 tokens | 3 tokens | 1 token
---|---|---|---|---|---
Temperature | T = 0.0 | — | 45 | 80 | 100
 | T = 0.5 | 0 | 39 | 51 | 48
 | T = 1.0 | 0 | 29 | 33 | 34
 | T = 2.0 | 0 | 14 | 15 | 31
Increasing | Eqn (2) | — | 45 | 83 | 100
 | Eqn (3) | — | 45 | 64 | 69
 | Eqn (4) | — | 43 | 67 | 33
Decreasing | Eqn (5) | 0 | 42 | 60 | 51
 | Eqn (6) | 0 | 46 | 69 | 65
 | Eqn (7) | 0 | 43 | 58 | 60
S10 | T = 0.5 | — | 42 | 41 | 31
 | T = 1.0 | 0 | 29 | 15 | 12
 | T = 1.5 | 0 | 18 | 7 | 5
 | T = 2.0 | 0 | — | 5 | 5
Overall, ∼2900 molecules were generated across all of the models presented in this work. Table 6 shows the overlap between all greedy decoding, constant-temperature sampling and S10 libraries except those generated with scaffold prompts (diagonal elements are the size of each library). The SC libraries did not have any overlap with the six-, three- and one-token-based models, and so the overlaps between those libraries are presented separately in Table 7. While the maximum overlap for any library is usually with its nearest neighbours (3S T = 1.0 overlaps strongly with 3S T = 0.5 and 3S T = 2.0), there is often a large overlap with a different prompt length at the same temperature (3S T = 1.0 overlaps with 1S T = 1.0 and 6S T = 1.0). The tables show that for any given prompt length and sampling type, the amount of overlap decreases with increasing temperature, which is expected given the greater stochasticity of the higher-temperature models. Overall, each library contains a considerable number of unique molecules not present in any other library. For the scaffold-based libraries in Table 7, there is very limited overlap between any of the libraries.
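One plausible way to compute such overlaps is set intersection on canonical SMILES (the matching criterion actually used for Tables 6 and 7 is not restated here, so this is an assumption):

```python
from rdkit import Chem

def library_overlap(lib_a, lib_b):
    """Count molecules common to two libraries of SMILES strings,
    compared on canonical SMILES so equivalent encodings match."""
    def canon(smiles):
        mol = Chem.MolFromSmiles(smiles)
        return Chem.MolToSmiles(mol) if mol is not None else None
    set_a = {c for c in map(canon, lib_a) if c is not None}
    set_b = {c for c in map(canon, lib_b) if c is not None}
    return len(set_a & set_b)
```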
Table 8 K-means groups of the generated molecules: population, IC50 statistics and representative structural features

Group | Count | Mean IC50 (nM) | Median IC50 (nM) | σ (nM) | Representative
---|---|---|---|---|---
Group 0 | 12 | 121 | 86 | 158 | Sulfonamides, halogens |
Group 1 | 366 | 424 | 388 | 303 | Halogens |
Group 2 | 979 | 60 | 9 | 139 | Fluvastatin |
Group 3 | 83 | 311 | 220 | 277 | Peptide-like |
Group 4 | 399 | 117 | 18 | 207 | Simvastatin, Lovastatin |
Group 5 | 133 | 327 | 256 | 251 | Steroid-like |
Group 6 | 98 | 339 | 328 | 264 | Large rings |
Group 7 | 459 | 51 | 5 | 148 | Atorvastatin |
Group 8 | 291 | 231 | 81 | 288 | Rosuvastatin, Pravastatin |
Group 9 | 33 | 287 | 144 | 324 | Multiple halogens |
Table 9 shows how the molecules in each library are distributed into the K-means groups. An even distribution across all the groups implies that a library samples a wide swath of chemical space, while a large fraction of molecules in one group implies the model has focused in on one type of structure. The group with the highest fraction for each library is shown in bold. The S10 libraries for all prompt lengths except scaffolds have the highest fraction of their molecules in group 2 (type II statin-like, second lowest IC50). The second most common group to have the highest fraction in any given library is group 8 (type II statin, fourth lowest IC50), though this group is only common for the scaffold-based libraries. The third most common group to have the highest fraction in any given library is group 7 (type II statin, lowest IC50), with group 4 (type I statin, third lowest IC50) being the fourth most likely to have the highest fraction in any given library. If the scaffold-based libraries are removed, the fractional populations correlate exactly with the inverse of IC50. The S10 libraries are more likely to have their most common group be group 2, suggesting that the temperature ramp favours this type of structure.
Table 9 Fraction of each library assigned to each K-means group

Library | Group 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
---|---|---|---|---|---|---|---|---|---|---
1S S10 T0.5 | 0.00 | 0.02 | 0.64 | 0.00 | 0.18 | 0.00 | 0.00 | 0.10 | 0.07 | 0.00 |
1S S10 T1.0 | 0.00 | 0.06 | 0.51 | 0.00 | 0.19 | 0.00 | 0.00 | 0.14 | 0.10 | 0.00 |
1S S10 T1.5 | 0.00 | 0.08 | 0.48 | 0.00 | 0.17 | 0.00 | 0.00 | 0.20 | 0.07 | 0.00 |
1S S10 T2.0 | 0.01 | 0.13 | 0.45 | 0.00 | 0.16 | 0.00 | 0.00 | 0.17 | 0.08 | 0.00 |
1S T0.0 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1S T0.5 | 0.00 | 0.05 | 0.28 | 0.05 | 0.19 | 0.04 | 0.05 | 0.30 | 0.06 | 0.01 |
1S T1.0 | 0.00 | 0.06 | 0.25 | 0.05 | 0.19 | 0.08 | 0.05 | 0.25 | 0.06 | 0.01 |
1S T2.0 | 0.00 | 0.11 | 0.20 | 0.03 | 0.17 | 0.11 | 0.09 | 0.16 | 0.09 | 0.04 |
3S S10 T0.5 | 0.00 | 0.00 | 0.63 | 0.00 | 0.11 | 0.07 | 0.11 | 0.04 | 0.00 | 0.04 |
3S S10 T1.0 | 0.00 | 0.05 | 0.63 | 0.00 | 0.13 | 0.07 | 0.04 | 0.05 | 0.01 | 0.03 |
3S S10 T1.5 | 0.00 | 0.07 | 0.54 | 0.02 | 0.15 | 0.03 | 0.04 | 0.12 | 0.02 | 0.02 |
3S S10 T2.0 | 0.00 | 0.10 | 0.55 | 0.01 | 0.13 | 0.03 | 0.04 | 0.09 | 0.03 | 0.01 |
3S T0.0 | 0.00 | 0.00 | 0.20 | 0.00 | 0.00 | 0.20 | 0.40 | 0.00 | 0.00 | 0.20 |
3S T0.5 | 0.00 | 0.03 | 0.25 | 0.01 | 0.19 | 0.07 | 0.08 | 0.27 | 0.06 | 0.04 |
3S T1.0 | 0.00 | 0.04 | 0.22 | 0.04 | 0.21 | 0.09 | 0.08 | 0.22 | 0.08 | 0.03 |
3S T2.0 | 0.03 | 0.13 | 0.25 | 0.03 | 0.10 | 0.09 | 0.09 | 0.08 | 0.16 | 0.05 |
6S S10 T0.5 | 0.00 | 0.04 | 0.21 | 0.04 | 0.24 | 0.18 | 0.12 | 0.09 | 0.04 | 0.05 |
6S S10 T1.0 | 0.00 | 0.06 | 0.27 | 0.04 | 0.22 | 0.18 | 0.06 | 0.08 | 0.04 | 0.04 |
6S S10 T1.5 | 0.01 | 0.13 | 0.34 | 0.03 | 0.18 | 0.16 | 0.05 | 0.04 | 0.05 | 0.02 |
6S T0.0 | 0.00 | 0.07 | 0.09 | 0.05 | 0.20 | 0.16 | 0.14 | 0.18 | 0.02 | 0.09 |
6S T0.5 | 0.00 | 0.05 | 0.21 | 0.05 | 0.18 | 0.16 | 0.08 | 0.18 | 0.04 | 0.05 |
6S T1.0 | 0.00 | 0.06 | 0.22 | 0.04 | 0.18 | 0.12 | 0.08 | 0.21 | 0.07 | 0.02 |
6S T2.0 | 0.01 | 0.21 | 0.26 | 0.02 | 0.15 | 0.11 | 0.05 | 0.09 | 0.05 | 0.03 |
SC S10 T1.0 | 0.00 | 0.13 | 0.40 | 0.07 | 0.00 | 0.00 | 0.00 | 0.00 | 0.40 | 0.00 |
SC S10 T1.5 | 0.00 | 0.11 | 0.30 | 0.08 | 0.00 | 0.00 | 0.00 | 0.03 | 0.49 | 0.00 |
SC S10 T2.0 | 0.03 | 0.14 | 0.11 | 0.08 | 0.00 | 0.00 | 0.00 | 0.05 | 0.59 | 0.00 |
SC T0.5 | 0.00 | 0.19 | 0.30 | 0.04 | 0.00 | 0.00 | 0.04 | 0.04 | 0.41 | 0.00 |
SC T1.0 | 0.00 | 0.22 | 0.23 | 0.01 | 0.01 | 0.01 | 0.03 | 0.08 | 0.42 | 0.00 |
SC T2.0 | 0.00 | 0.40 | 0.06 | 0.02 | 0.03 | 0.00 | 0.03 | 0.08 | 0.38 | 0.00 |
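A sketch of how such per-group fractions can be tabulated, assuming the molecules are featurised as Morgan fingerprint bit vectors (the actual clustering features are not restated here):

```python
import numpy as np
from sklearn.cluster import KMeans

# fps: (n_molecules, n_bits) fingerprint matrix; random bits stand in here
fps = np.random.default_rng(0).integers(0, 2, size=(500, 2048))
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(fps)
fractions = np.bincount(labels, minlength=10) / labels.size  # one row of Table 9
```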
To further understand this mechanism, the Shannon entropy of the generation process was studied for a sample set of generated molecules. The three-token-prompt models were chosen for this exercise as they create more robust libraries than the six-token-prompt models when using the variable-temperature approach, and are intermediate in performance between the six-token and one-token-prompt models. One hundred prompts were fed into the three-token-prompt model with a controlled generation method (greedy decoding, or constant T = 0.0 generation) and six other methods (constant temperatures of T = 0.5, 1.0 and 2.0, and S10 with final temperatures of 0.5, 1.0 and 2.0). The Shannon entropy for each viable molecule (one having an interpretable SMILES string) for each model was calculated at each of the inference steps according to

$$H_k = -\frac{1}{\ln N}\sum_{i=1}^{N} P_k(i)\ln P_k(i),$$

where N = 85 is the vocabulary size; the 1/ln N factor normalises H_k to the range [0, 1].
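A minimal implementation of this per-step entropy, with the normalisation assumed above:

```python
import numpy as np

def normalized_entropy(probs: np.ndarray) -> float:
    """Shannon entropy of one inference step's next-token distribution,
    normalised to [0, 1] by its maximum value ln(N)."""
    p = probs[probs > 0.0]                 # avoid log(0)
    return float(-(p * np.log(p)).sum() / np.log(probs.size))
```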
Fig. 6 shows that at T = 0.0, there are challenging tokens (high entropy, ∼0.8) in the first ∼10 steps, followed by greatly decreased entropy for the T = 0.0, 0.5 and 1.0 models, while the T = 2.0 model maintains high entropy (challenging tokens) across all steps. This can be interpreted as the greedy and low-temperature decoding models choosing “safe” tokens with higher probabilities early in the SMILES string, when entropy is high, leading to a more predictable SMILES string with more confident tokens as the generation process progresses. The T = 2.0 model, however, is always able to choose tokens with lower probability, and so the SMILES string never becomes predictable and tokens remain challenging across the generation process.
Fig. 7 shows that the use of the S10 ramp decreases the number of steps with high entropy (∼0.8) from ∼10 with greedy and constant-temperature decoding to ∼5 steps with S10. For T = 0.5 and 1.0, the number of high-entropy spikes beyond about 40 steps also decreases dramatically. For T = 2.0, constant-temperature decoding has high entropy across the generation process, while S10 has several lulls in entropy around 20 and 45 steps, and fewer high-entropy spikes overall. These results support the interpretation that the S10 ramp creates a more stable SMILES string overall, while still allowing for some variability and novelty.
This variable temperature approach is easily implementable in any GPT, recurrent neural network, or other autoregressive molecular generation model. These models all produce a set of probabilities for the next token at each step of inference; these probabilities need only be scaled according to eqn (1), using the temperature at each inference step calculated by eqn (4). The only non-learned variables needed are the total number of inference steps (kmax), and the centre of activation for the sigmoid function (0.1 × kmax is used here). As the dynamic temperature scaling acts only on the final product of inference (the probabilities), the approach is generalizable to any of the models mentioned above.
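A minimal end-to-end sketch of such a generation loop, reusing the s10_temperature helper sketched earlier; the model callable and the end-of-sequence token id are hypothetical stand-ins for whatever the host model provides:

```python
import numpy as np

def generate(model, prompt_ids, k_max=180, T_final=1.0, eos_id=0):
    """Autoregressive generation with S10 dynamic-temperature sampling.

    `model(ids)` is assumed to return next-token logits for the current
    sequence; `eos_id` is an assumed end-of-sequence token id.
    """
    rng = np.random.default_rng()
    ids = list(prompt_ids)
    for k in range(k_max):
        logits = np.asarray(model(ids), dtype=float)
        T = s10_temperature(k, k_max, T_final)   # sigmoid ramp centred at 10%
        if T == 0.0:
            nxt = int(np.argmax(logits))         # greedy decoding below the cutoff
        else:
            z = logits / T
            z -= z.max()                         # numerical stability
            p = np.exp(z) / np.exp(z).sum()      # eqn (1) rescaled probabilities
            nxt = int(rng.choice(p.size, p=p))
        ids.append(nxt)
        if nxt == eos_id:                        # stop at end-of-sequence
            break
    return ids
```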
Overall, in transformer-decoder GPT based molecule library inference, single token prompts and an S10 temperature ramp ending at T = 1.0 are suggested.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5cp00692a