Tianfan Jin a, Veerupaksh Singla a, Hsuan-Hao Hsu a and Brett M. Savoie *b
aDavidson School of Chemical Engineering, Purdue University, West Lafayette, Indiana, USA
bDepartment of Chemical and Biomolecular Engineering, The University of Notre Dame, Notre Dame, Indiana, USA. E-mail: bsavoie2@nd.edu
First published on 27th September 2024
Generative models for the inverse design of molecules with particular properties have been heavily hyped, but have yet to demonstrate significant gains over machine-learning-augmented expert intuition. A major challenge of such models is their limited accuracy in predicting molecules with targeted properties in the data-scarce regime, which is the regime typical of the prized outliers that inverse models are hoped to discover. For example, activity data for a drug target or stability data for a material may only number in the tens to hundreds of samples, which is insufficient to learn an accurate and reasonably general property-to-structure inverse mapping from scratch. We have hypothesized that the property-to-structure mapping becomes unique when a sufficient number of properties are supplied to the models during training. This hypothesis has several important corollaries if true. It would imply that data-scarce properties can be completely determined using a set of more accessible molecular properties. It would also imply that a generative model trained on multiple properties would exhibit an accuracy phase transition after achieving a sufficient size, a process analogous to what has been observed in the context of large language models. To interrogate these behaviors, we have built the first transformers trained on the property-to-molecular-graph task, which we dub “large property models” (LPMs). A key ingredient is supplementing these models during training with relatively basic but abundant chemical property data. The motivation for the large-property-model paradigm, the model architectures, and case studies are presented here.
Deep generative models try to directly solve the inverse problem by learning the conditional probability, P(molecule|properties), and then sampling this distribution with respect to targeted properties to yield exemplary structures. The hope is that a model of this distribution, f(properties) = P(molecules), provided sufficient examples of molecules with different property combinations, would be able to generate non-trivial structures for unseen property combinations. The popular examples of language models (where the corresponding task is to learn P(next token|context)), image generators (P(image|caption)), and music generators (P(waveform|description)) have become ubiquitous over the past several years.12,14,16–20 As anyone who has experimented with these models can attest, they also demonstrate that non-trivial interpolations can emerge as such models are scaled up. Despite ample forerunners to analogous generative models for molecule generation, none has yet significantly outperformed expert intuition or forward-prediction workflows (e.g., screening molecular libraries and their derivatives using ML-augmented filters).21
Several problems have been identified with deep generative chemical models, including the high frequency of invalid structures, false positives, and high data intensity, which rules out applications to prized but data-scarce properties.16,21–36 The structures generated by generative chemical models can also fail in more subtle ways: they may match the targeted properties but are not stable, cannot be synthesized, are not soluble, are too expensive, or fall short in any number of other ways that experts subconsciously normalize over when trying to design a molecule.21,22,26,37,38 We hypothesize that this poor performance originates fundamentally from the paucity of general chemical information utilized during the training of contemporary generative chemical models. For instance, although large numbers of chemical structures are typically utilized (>100k), only a small number of properties are supplied, which leaves these models with the unrealistic task of trying to learn chemistry from scratch while simultaneously generating application-relevant molecules. The motivating idea for LPMs may be glibly expressed as teaching generative models general chemistry before teaching them to predict PhD-level properties.
In conventional formulations, generative chemical models are trained to learn a conditional distribution P(G|p0), where p0 is some property of interest (e.g., bandgap, toxicity, binding affinity, etc.), and G is the molecular graph, typically expressed using a grammar like SMILES or SELFIES.39,40 However, every molecule has many more properties than just the sought-after p0. For example, every molecule has a heat of formation, an electric dipole moment, a vibrational spectrum, and so forth. So in practice, when one samples P(G|p0), one is also necessarily sampling the larger conditional distribution P(G|p0, p1, p2, …, pN), where {p0, p1, p2, …, pN} constitutes some “complete” set of properties that represents a basis set for uniquely specifying G. Thus, a user who is querying P(G|p0) is asking for a set of molecules conditioned on a host of implicit properties. In common terms, the user querying P(G|p0) is asking “give me a molecule with p0, but sample the rest of the unspecified properties from a reasonable physical distribution.” The limited exposure to these implicit properties helps explain why generative models often generate what seem to be unphysical structures when sampling the edge of the observed property distribution.3,4,41 In light of this, it should be advantageous to train the model to explicitly learn the full conditional distribution P(G|p0, p1, p2, …, pN) from examples with a complete set of properties supplied, rather than to indirectly learn the conditional distribution by only viewing examples of P(G|p0), with the other properties implicit in the graph but not directly represented.
The conditional distribution P(G|p0) is typically learned indirectly, using architectures based on autoencoders with auxiliary prediction tasks or adversarial architectures.24,42–44 In contrast, the most straightforward formulation would be to learn the property-to-molecule mapping, f(p) = P(G), directly:
f(p) = f(p0, p1, p2, …, pN) = P(G|p0, p1, p2, …, pN) | (1)
(1) The reconstruction accuracy of the model should monotonically increase with the number of independent properties supplied during training (i.e., the length of p).
(2) Including off-target properties in training may still improve the performance of sampling useful molecules with on-target property values.
(3) A finite number of properties are necessary to uniquely specify a molecule of a given size.
(4) A finite number of properties are necessary to uniquely predict every additional molecular property.
(5) The complexity of the conditional distribution P(G|p) decreases as the length of p increases, terminating in a delta function about a single molecule.
Other implications can be imagined. Not all of these will be directly explored in the following case studies, but they might serve as a basis for further discussion.
It is beyond the scope of a Faraday Discussions article to fully excavate all the details of how these properties were calculated and their accuracy. For the purposes of training and evaluating the LPMs, we will take these properties as ground-truth labels. However, the training splits (https://www.doi.org/10.6084/m9.figshare.26380666), raw property data (https://www.doi.org/10.6084/m9.figshare.26380918), and model checkpoints (https://www.doi.org/10.6084/m9.figshare.26380837) have been deposited on figshare.
A set of 80 trivial properties, which we refer to as “constraints”, were also calculated for each molecule, because these are properties that the user will often know in advance and may wish to apply as design constraints. For example, setting the number of fluorines to zero or limiting the size of the molecule is easy owing to the explicit inclusion of these constraints during training. The training constraints include the number of atoms of each element and boolean true/false flags for a list of common functional groups. These constraints are concatenated with the property vectors after embedding and prior to the attention layers. For the purpose of the following discussion, when we refer to “property vectors”, we mean the concatenated tensor associated with the separately embedded constraints and properties.
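For concreteness, a minimal PyTorch sketch of this embed-and-concatenate step is given below. The per-property linear projections, the constraint-embedding bucket size, and all variable names are assumptions made for illustration rather than the exact implementation.

```python
import torch
import torch.nn as nn

class PropertyConstraintEmbedding(nn.Module):
    """Embed scalar properties and integer-coded constraints separately, then
    concatenate them into a single sequence of encoder inputs. All sizes are
    illustrative assumptions, not the published hyperparameters."""

    def __init__(self, n_props: int, n_constraints: int, d_emb: int, n_buckets: int = 32):
        super().__init__()
        # one linear projection per scalar property -> one d_emb-dimensional token each
        self.prop_proj = nn.ModuleList([nn.Linear(1, d_emb) for _ in range(n_props)])
        # one embedding table per constraint (element counts, boolean functional-group flags)
        self.con_emb = nn.ModuleList([nn.Embedding(n_buckets, d_emb) for _ in range(n_constraints)])

    def forward(self, props: torch.Tensor, constraints: torch.Tensor) -> torch.Tensor:
        # props: [batch, n_props] normalized scalars; constraints: [batch, n_constraints] integer codes
        prop_tokens = torch.stack(
            [proj(props[:, i:i + 1]) for i, proj in enumerate(self.prop_proj)], dim=1)
        con_tokens = torch.stack(
            [emb(constraints[:, i]) for i, emb in enumerate(self.con_emb)], dim=1)
        # concatenate the separately embedded constraints and properties before the attention layers
        return torch.cat([con_tokens, prop_tokens], dim=1)  # [batch, n_constraints + n_props, d_emb]
```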
The graph decoder is constructed as a next-token SMILES predictor that begins with a “start” token. Decoding occurs recursively until the decoder predicts an “end” token or the decoded string reaches the maximum length. The input to the decoder is tokenized and embedded into a [dwin, demb] tensor based on a SMILES vocabulary with dvocab tokens, where dwin is the maximum length of the context window and is equal to 35 for all of the models discussed here. A sinusoidal positional embedding is added to the decoder embedding to capture positional context (this is not required in the property encoder because we desire it to be positionally invariant). The embedded [dwin, demb] tensor is then transformed through four eight-headed cross-attention cells in which the key and value inputs are supplied by the encoder output. Finally, the output of the decoder is projected to a [dwin, dvocab] tensor during training with a dense layer and a softmax to predict the probability of the next SMILES token. During inference, the final projection is to a [1, dvocab] tensor because decoding is performed in a token-by-token fashion.
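The decoder described above can be sketched as follows. PyTorch's built-in TransformerDecoder (which bundles self-attention, cross-attention, and a feed-forward block into each layer) is used here as an approximation of the cross-attention cells; the dropout rate and all names are illustrative assumptions rather than the published implementation.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(d_win: int, d_emb: int) -> torch.Tensor:
    """Standard sinusoidal positional encodings (decoder side only)."""
    pos = torch.arange(d_win, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_emb, 2, dtype=torch.float) * (-math.log(10000.0) / d_emb))
    pe = torch.zeros(d_win, d_emb)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class SMILESDecoder(nn.Module):
    """Next-token SMILES decoder: token embedding + positional encoding, four
    8-headed attention cells with cross-attention to the property-encoder output,
    and a dense projection to the SMILES vocabulary."""

    def __init__(self, d_vocab: int, d_emb: int, d_win: int = 35,
                 n_layers: int = 4, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(d_vocab, d_emb)
        self.register_buffer("pos_emb", sinusoidal_positions(d_win, d_emb))
        layer = nn.TransformerDecoderLayer(d_model=d_emb, nhead=n_heads,
                                           dropout=dropout, batch_first=True)
        self.cells = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.project = nn.Linear(d_emb, d_vocab)  # softmax applied in the loss / when sampling

    def forward(self, tokens: torch.Tensor, encoder_out: torch.Tensor) -> torch.Tensor:
        # tokens: [batch, seq] SMILES token ids; encoder_out: [batch, n_inputs, d_emb]
        seq = tokens.shape[1]
        x = self.tok_emb(tokens) + self.pos_emb[:seq]
        causal = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        x = self.cells(tgt=x, memory=encoder_out, tgt_mask=causal)  # keys/values from the encoder
        return self.project(x)  # [batch, seq, d_vocab] next-token logits at every position
```

During inference the same module is queried token by token, keeping only the logits for the final position (the [1, dvocab] projection mentioned above).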
The models were trained using 80 : 10 : 10 training : validation : testing splits assigned randomly from the 1.3 M molecule dataset. Training was performed on next-token prediction using masked cross-attention in the decoder, a cross-entropy loss, dropout for the dense layers in each attention cell, the Adam optimizer with a learning rate of 2 000 000^(−0.5) × demb^(−0.5), and a patience of 30 epochs evaluated on the validation set to conclude training. The 1.05 M training samples of property/graph pairs were randomly sampled in batches of 100 for training. All numerical properties were min–max [0, 100] normalized with respect to the training distribution. Training until termination by patience took between 54 and 106 epochs for the models trained here.
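A condensed sketch of this training loop is shown below, assuming that the data loader already applies the min–max [0, 100] normalization and yields batches of 100 property/graph pairs. The function signature, the validation callback, and the default demb are illustrative assumptions; only the learning rate and patience follow the values quoted above.

```python
import torch
import torch.nn as nn

def train_lpm(model: nn.Module, loader, validate, d_emb: int = 512,
              patience: int = 30, max_epochs: int = 200):
    """Teacher-forced next-token training with the learning rate and patience quoted
    in the text. `loader` yields (props, constraints, token_ids) batches with the
    properties already min-max [0, 100] normalized; `validate(model)` returns the
    validation loss. All names and the value of d_emb are illustrative."""
    lr = (2_000_000 ** -0.5) * (d_emb ** -0.5)          # fixed learning rate from the text
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    best, stale = float("inf"), 0
    for _ in range(max_epochs):
        for props, constraints, tokens in loader:
            logits = model(props, constraints, tokens[:, :-1])   # predict token t+1 from tokens <= t
            loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tokens[:, 1:].reshape(-1))
            opt.zero_grad(); loss.backward(); opt.step()
        val = validate(model)
        best, stale = (val, 0) if val < best else (best, stale + 1)
        if stale >= patience:                                    # 30 epochs without improvement
            break
```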
A subset of the models were trained under conditions where a fraction of the inputted properties were masked. Masking was incorporated using a special token for class-based inputs and the mean value across the training set for scalar inputs. Both constrained properties and real properties were masked.
During structure inference, a beam search was implemented to decode the top-n structures predicted by each model to be consistent with the supplied property vector.47–49 For a beam size of 1, the beam search is simply a greedy decoding. For a beam size of n, next-token prediction occurs for the n most probable decodings that occur after each cycle.
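A minimal sketch of this beam search is given below, assuming the same hypothetical decoder(tokens, encoder_out) interface as sketched earlier (next-token logits at each position). With beam_size = 1 it reduces to the greedy decoding; no length normalization is applied in this sketch.

```python
import torch

@torch.no_grad()
def beam_decode(decoder, encoder_out, start_id: int, end_id: int,
                beam_size: int = 5, d_win: int = 35):
    """Return the top-n token sequences ranked by cumulative log-probability.
    With beam_size = 1 this reduces to greedy decoding."""
    beams = [([start_id], 0.0)]                        # (token sequence, summed log-prob)
    finished = []
    for _ in range(d_win - 1):
        candidates = []
        for seq, score in beams:
            logits = decoder(torch.tensor([seq]), encoder_out)[0, -1]  # next-token logits
            logp = torch.log_softmax(logits, dim=-1)
            top_lp, top_id = logp.topk(beam_size)
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                if tok == end_id:
                    finished.append((seq[1:], score + lp))             # drop the start token
                else:
                    candidates.append((seq + [tok], score + lp))
        if not candidates:
            break
        # keep only the beam_size most probable partial decodings after each cycle
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend((seq[1:], score) for seq, score in beams)          # beams that hit d_win
    return sorted(finished, key=lambda c: c[1], reverse=True)[:beam_size]
```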
How many properties does it take to uniquely specify a chemical structure? To sketch an answer to this question, 22 properties from the dataset (all xtb-calculated properties except the quadrupole moment, plus the numbers of H-bond donors and acceptors from PubChem) were used to specify a position in property space for all 1.3 M molecules and to calculate the nearest-neighbor separations in various scenarios (Fig. 4). The theoretical maximum separation between a pair of molecules in property space grows as √(Σi pi²), where the summation runs over all properties and pi is the range of property i. All properties were percent normalized to [0, 100], and the natural log of the percent-normalized separation was used for the y-axes. Under these conditions, the maximum log(rNN) values are ∼4.6 and 6.2 for 1-dimensional and 22-dimensional property spaces, respectively.
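These maxima follow directly from the formula above; a quick check, assuming only that every property spans the same [0, 100] range:

```python
import math

def max_log_separation(n_props: int, prop_range: float = 100.0) -> float:
    """Natural log of the maximum possible separation, sqrt(sum_i p_i^2), when
    every property spans the same normalized range."""
    return math.log(prop_range * math.sqrt(n_props))

print(round(max_log_separation(1), 1), round(max_log_separation(22), 1))  # 4.6 6.2
```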
Unsurprisingly, the mean separation between molecules, 〈rNN〉, decreases as molecules are added to the property space (Fig. 4a). For example, if the molecular scope were limited to diatomics, then a single property (say, the electric dipole moment) would probably be sufficient to uniquely identify the species. But as the number of heavy atoms (HAs) in the molecules grows (and correspondingly the number of molecules in the space), the number of properties required to uniquely specify the chemical graph also grows. Nevertheless, the molecules remain unusually clustered in property space. For example, a 22-dimensional volume with sides of 100 containing 1.3 M molecules has a number density of 1.3 × 10−38. This corresponds to an average nearest-neighbor separation of 66 (∼4.2 on a natural log scale) for an ideal gas of 22-dimensional spheres occupying the same volume. The fact that the ideal gas separation is ∼5× larger than 〈rNN〉 for the full dataset (i.e., the 14-HA case) is evidence of significant clustering of these molecules in property space. It is not clear whether this clustering is intrinsic to the physically relevant space of chemistry or merely reflects the limits of PubChem curation and synthetic biases. Regardless, the existence of this relatively low-dimensional manifold is consistent with the hypothesis that a relatively small set of physical properties may usefully span molecular space.
Although they are clustered, the molecules are still distinguishable from one another when a sufficient number of properties is provided (Fig. 4b). If we use a 1% difference in at least one property as a measure of distinctiveness, then the molecules are on average distinguishable in the full 22-dimensional property space. But as the dimensionality of the property space shrinks, many of the molecules become indistinguishable by this measure. Approximately 10 properties are required before the molecules become distinct on average (i.e., exhibit an effective separation of 1% from the nearest other molecule in property space). We hypothesized that the choice of properties would play a major role in distinguishing molecules, with more orthogonal properties producing property spaces with larger effective separations. To test this, we estimated the standard deviations in the separations upon resampling subsets of the properties at random and recalculating the molecular separations in the resulting property spaces. Somewhat surprisingly, the uncertainty with respect to property selection becomes effectively zero for property spaces of 10 dimensions or larger. This is indirect support for the motivating hypothesis that property redundancy emerges from a sufficient basis set of physical properties.
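A minimal sketch of how such nearest-neighbor statistics might be computed from the percent-normalized property matrix is given below. SciPy's KD-tree is used here for convenience, and the 1% distinctiveness criterion is applied as a Chebyshev (maximum-coordinate) nearest-neighbor distance; the names and the exact aggregation are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_log_nn_separation(props: np.ndarray) -> float:
    """Natural log of the mean Euclidean nearest-neighbor separation, <r_NN>, for a
    [n_molecules, n_properties] matrix with each column scaled to [0, 100]."""
    d_nn, _ = cKDTree(props).query(props, k=2)       # k=2: the nearest hit of a point is itself
    return float(np.log(d_nn[:, 1].mean()))

def distinguishable_fraction(props: np.ndarray, threshold: float = 1.0) -> float:
    """Fraction of molecules whose nearest neighbor differs by >= threshold (%) in at
    least one property, i.e. whose Chebyshev nearest-neighbor distance is >= threshold."""
    d_nn, _ = cKDTree(props).query(props, k=2, p=np.inf)
    return float(np.mean(d_nn[:, 1] >= threshold))
```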
Less than 100% accuracy in the top-1 reconstruction task is not necessarily a bad thing, given that the most direct application of the LPM is to generate new molecules. Additionally, the clustering of molecules in property space (Fig. 4) suggests that multiple molecules are likely to exist with property profiles similar to the property vectors used here for inference. Indeed, ∼90% of the top-1 predicted molecules are new (i.e., contained in neither the training nor the validation splits). But how well do the generated molecules actually reproduce the property vectors that were used during inference? To assess this, the properties of the top-1 predicted structures were calculated according to the same protocol as the training data, and the statistics for reproducing individual properties and all properties were compiled (Fig. 5, green). Not all properties are equally easy to reproduce. The most easily satisfied property was the total energy, which was reproduced in ∼99.87% of the top-1 predictions, and the hardest individual property to reproduce was the average electrostatic potential, which was reproduced in only ∼49% of the top-1 predictions. Remarkably, over 40% of the top-1 predicted structures reproduced all 22 properties within 10% of the requested values.
To test this, we implemented a simple masking strategy that consisted of keeping a fixed set of input properties and constraints but randomly masking subsets of the inputs during training (Fig. 6a). Masking was implemented by replacing scalar properties with the mean value from the training dataset and replacing categorical properties with a special masking token. The rationale for this strategy was that it would force the model to rely on a broader set of relationships between the properties, because the available information was not fixed from inference to inference. Moreover, the relationships used by the LPM for inference would have to be dynamic in the masking scenario, because the inputs being masked were randomly selected from sample to sample. Conversely, this training strategy would make conditional inference easy for the user, as any unknown properties could simply be masked during inference.
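A minimal sketch of this masking scheme, with hypothetical tensor names, is shown below: scalar inputs are imputed with their training-set mean and categorical/constraint inputs with a reserved mask token, and the masked subset is redrawn at random for every sample.

```python
import torch

def mask_inputs(props: torch.Tensor, constraints: torch.Tensor,
                prop_means: torch.Tensor, mask_token: int, mask_frac: float):
    """Randomly mask a fraction of the inputs for one batch: scalar properties are
    replaced by their training-set mean and categorical constraints by a reserved
    mask token. The masked subset is redrawn independently for every sample."""
    props, constraints = props.clone(), constraints.clone()
    prop_mask = torch.rand_like(props) < mask_frac
    con_mask = torch.rand(constraints.shape, device=constraints.device) < mask_frac
    props[prop_mask] = prop_means.expand_as(props)[prop_mask]    # mean imputation for scalars
    constraints[con_mask] = mask_token                           # special token for class inputs
    return props, constraints
```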
Four LPMs were trained and tested under conditions with varying levels of property masking (Fig. 6b). The 0% masking LPM is the same as that used in Fig. 5, but the other LPMs were newly trained for this case study. All LPM architectures were held fixed and no attempt was made to fine-tune the architecture to improve performance in the masking scenario. Masking has a monotonic adverse effect on LPM performance in the structure reconstruction tasks (Fig. 6b). Masking a fraction of properties is the same as reducing the property space from the perspective of information, and so it makes sense that the confidence in predicting a specific graph goes down as more properties are masked. Notably, the top-1 accuracy nearly falls to zero for the 50% masking case, which approximately matches the 10-property threshold that we identified in the Fig. 4b discussion as being necessary for practically distinguishing molecules within the training distribution. It is also notable that masking has a negligible effect on the prediction of invalid molecules and new molecules. This is consistent with all of these LPMs being trained in property spaces that are sufficiently informative to learn both the grammar and interpolation of the training distribution of molecules.
Masking was envisioned to help in predicting molecules with targeted properties subject to limited off-target property information. Thus, although masking is expected to hurt reconstruction accuracy, it should help property-prediction accuracy in property-scarce scenarios. To test this, the LPMs trained under the varying masking scenarios were tested for property reproduction in both unmasked and masked scenarios (Fig. 6c). During these tests, the full testing set of property vectors was used with the specified percentage of inputs masked. The properties of the resulting top-1 predictions were then characterized and compared with the unmasked portions of the inputted property vectors. The accuracy is reported as the percentage of the top-1 predictions that exhibit all properties (i.e., excluding constraints) within a specified percentage (either 5% or 20%) of the unmasked inputted values. The 0% masking LPM was used as a baseline and tested under all masking scenarios. Each masked LPM was tested under the same masking conditions as its training and also under the 0% masking scenario. Note that these tests are quite expensive because the properties of all new molecules must be characterized to evaluate the accuracy of the predictions; this is the only reason why all combinations of masked training and masked testing were not performed.
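The accuracy metric can be sketched as follows, assuming arrays of requested and recomputed property values on the [0, 100]-normalized scale; the interpretation of the 5% and 20% thresholds as percentage points on that scale is an assumption, and only unmasked properties are scored.

```python
import numpy as np

def fraction_all_within(requested: np.ndarray, recomputed: np.ndarray,
                        unmasked: np.ndarray, tol_pct: float) -> float:
    """Fraction of top-1 predictions whose recomputed properties all lie within
    tol_pct of the requested values, scored only over the unmasked inputs.
    Arrays are [n_predictions, n_properties] on the [0, 100]-normalized scale."""
    err = np.abs(recomputed - requested)
    ok = np.where(unmasked, err <= tol_pct, True)   # masked properties are never scored
    return float(ok.all(axis=1).mean())
```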
Several notable behaviors emerge from this case study. First, the performance of the LPM trained without masking rapidly deteriorates in circumstances where it only has access to a subset of properties. In contrast, the masked LPMs all outperform the unmasked LPM in masked testing scenarios. This largely validates the hypothesis that masking forces the LPMs to learn a more dynamic set of property relationships, whereas the unmasked LPM relies on a fixed set of relationships that produce very poor results when supplied with incomplete information. Second, the LPMs trained with masking can still perform useful inference in the unmasked scenario. In particular, the LPM trained with 30% masking shows a small reduction in property accuracy in the unmasked scenario, while the LPM trained with 10% masking actually performs better in the unmasked scenario. Because the accuracy is only evaluated on the unmasked properties, this latter result unequivocally shows that some of the properties possess mutual information, such that their joint specification increases their individual accuracy. Finally, the difference between the “all properties within 20% of target” and “all properties within 5% of target” accuracy measures increases with the masking level of the evaluation, regardless of the masking level during training. We interpret this as additional evidence of the mutual information amongst the properties. As the number of properties available for inference shrinks, so do the accuracy and confidence of the properties associated with the predicted molecules.
Several things have also been intentionally left out of this study: we have not tested the LPMs in extrapolative scenarios; we have not tested the scaling behavior of the LPMs with respect to training data; we have not tested the scaling behavior of the LPMs beyond a small set of possible properties; we have not tested the transferability of LPMs to data-scarce or otherwise unseen properties; we have not explored self-supervised training tasks beyond masking; and we have not fine-tuned the architecture for performance. These and many other things are extensions of the ideas described here and will have to wait for future communications.