Open Access Article
David E. Graff ab, Edward O. Pyzer-Knapp c, Kirk E. Jordan d, Eugene I. Shakhnovich a and Connor W. Coley *be
aDepartment of Chemistry and Chemical Biology, Harvard University, Cambridge, MA 02138, USA
bDepartment of Chemical Engineering, MIT, Cambridge, MA 02139, USA. E-mail: ccoley@mit.edu
cIBM Research Europe, Warrington WA4 4AD, UK
dIBM Thomas J. Watson Research Center, Cambridge, MA 02142, USA
eDepartment of Electrical Engineering and Computer Science, MIT, Cambridge, MA 02139, USA
First published on 3rd August 2023
Quantitative structure–property relationships (QSPRs) aid in understanding molecular properties as a function of molecular structure. When the correlation between structure and property weakens, a dataset is described as “rough,” but this characteristic is partly a function of the chosen representation. Among possible molecular representations are those from recently developed “foundation models” for chemistry, which learn molecular representations from unlabeled samples via self-supervision. However, the performance of these pretrained representations on property prediction benchmarks is mixed when compared to baseline approaches. We sought to understand these trends in terms of the roughness of the underlying QSPR surfaces. We introduce a reformulation of the roughness index (ROGI), ROGI-XD, that enables the comparison of ROGI values across representations, and use it to evaluate various pretrained representations alongside those constructed from simple fingerprints and descriptors. We show that pretrained representations do not produce smoother QSPR surfaces, in agreement with previous empirical reports of model accuracy. Our findings suggest that imposing stronger assumptions of smoothness with respect to molecular structure during model pretraining could aid in the downstream generation of smoother QSPR surfaces.
Dataset roughness is typically assessed qualitatively, although there are metrics that attempt to quantify this, such as the structure–activity relationship index (SARI)6 and the modelability index (MODI).7 However, these metrics are primarily intended for application to bioactivity datasets (SARI) or to classification datasets (MODI), and extending these metrics to arbitrary regression tasks remains a challenge. To address this, we have recently proposed the ROuGhness Index (ROGI),8 a scalar metric that captures global surface roughness by measuring the loss in the dispersion of molecular properties as a dataset is progressively coarse-grained. Briefly, we are given an input representation for each molecule x ∈ ℝ^d and a distance metric d : ℝ^d × ℝ^d → ℝ≥0; then (1) the dataset is clustered using complete-linkage clustering at a given distance threshold t, (2) the dataset is coarse-grained by replacing the property label yi of each point with the mean of its respective cluster ȳj, (3) the standard deviation σt of the coarse-grained dataset is calculated, (4) steps (1)–(3) are repeated for t ∈ [0, max_{i,j} d(xi, xj)], and (5) the area under the curve of 2(σ0 − σt) vs. t is measured to yield the ROGI. Datasets with larger ROGI values result in larger cross-validated model errors, consistent with intuition. Across a variety of datasets from GuacaMol,9 TDC,10 and ChEMBL11 and machine learning (ML) model architectures, the ROGI correlates strongly with cross-validated model root-mean-square error (RMSE) and generally outperforms alternative metrics.8
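To make this procedure concrete, the following is a minimal sketch (illustrative code, not the implementation used in this work) using SciPy's hierarchical clustering, assuming Euclidean distances normalized to [0, 1]:

```python
# A minimal sketch of the ROGI; names and defaults are illustrative.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def rogi(X: np.ndarray, y: np.ndarray, n_thresholds: int = 100) -> float:
    dists = pdist(X, metric="euclidean")
    dists /= dists.max()                   # normalize so t spans [0, 1]
    Z = linkage(dists, method="complete")  # complete-linkage dendrogram

    sigma_0 = y.std()
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    losses = []
    for t in thresholds:
        labels = fcluster(Z, t=t, criterion="distance")
        # coarse-grain: replace each label with its cluster's mean property
        means = {c: y[labels == c].mean() for c in np.unique(labels)}
        y_cg = np.array([means[c] for c in labels])
        losses.append(2 * (sigma_0 - y_cg.std()))

    # area under the curve of 2(sigma_0 - sigma_t) vs. t
    return np.trapz(losses, thresholds)
```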
Given these strong correlations, we sought to broadly examine recent claims about the superiority of molecular representations learned by “foundation models” for chemistry12–17 through the lens of QSPR surface roughness (Fig. 1). Foundation models are a class of ML models that are trained on large, unlabeled datasets via self-supervised learning (sometimes supervised learning) and are in principle capable of adapting rapidly to downstream tasks with very few labeled data points.18 Pretrained foundation models are now standard practice in several domains, such as natural language processing,19–21 computer vision,22,23 and protein modeling.24,25 Given the abundance of unlabeled chemical data and the limited amount of data encountered in many property prediction tasks, foundation models may benefit chemistry by learning meaningful molecular representations suitable for property prediction tasks in the low data regime.
Despite this interest, empirical evaluations of proposed chemical foundation models have shown mixed results. Recent work from Deng et al.26 assessed the performance of both SMILES- and graph-based pretrained chemical models (PCMs), MolBERT14 and GROVER,15 respectively, on a variety of benchmark tasks from MoleculeNet27 and opioid bioactivity datasets from ChEMBL.11 For each task, they compared the performance of these proposed chemical foundation models to a random forest model trained on radius-2, 2048-bit Morgan fingerprints. The authors found that this baseline was competitive on many benchmark tasks and even superior on several of the opioid tasks. This finding is consistent with results reported in the PCM literature, where learned representations offer inconsistent improvement over baseline approaches.
In this work, we complement this analysis by characterizing the roughness of the QSPR surfaces generated by PCMs on both toy and experimental modeling tasks. To do so, we reformulate ROGI as ROGI-XD to enable cross-representation comparison. While the original ROGI correlates strongly with cross-validated RMSE across datasets when holding the representation constant, it does not necessarily provide a meaningful basis for comparison among representations due to the relationship between distances and the dimensionality of a given representation. We show that for a variety of PCMs (VAE,28 GIN,29,30 ChemBERTa,12 and ChemGPT31) and a variety of molecular tasks, learned molecular representations do not provide smoother structure–property relationships than simple descriptor and fingerprint representations. The failure of PCMs to learn a continuous embedding of molecular structures that smoothly correlates with various properties of interest both (a) explains their poor empirical performance in property prediction tasks without fine-tuning and (b) motivates the use of ROGI-XD to evaluate smoothness when new pretraining strategies are proposed.
To illustrate, consider N points sampled uniformly at random from the d-dimensional unit hypercube. As d increases, the normalized distance distribution of these points becomes more tightly peaked and centered closer to 1 (Fig. 2B), which results in the delayed coarse-graining phenomenon mentioned earlier. This delayed coarse-graining depresses the curve of the loss of dispersion 2(σ0 − σt) vs. normalized distance threshold t at lower values of t, producing lower ROGI values for higher-dimensional representations (Fig. 2C). It could be argued that a higher-dimensional representation is “smoother” due to the larger distances between points, but for large differences in d, the ROGI essentially becomes a proxy for the inverse of representation size rather than for differences in the underlying SPR surface. The datasets sampled from the unit hypercube abstractly represent the “same” dataset in hyperspace as N → ∞, so a roughness metric that controls for d should assign them roughly equal values.
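This concentration of distances can be verified with a few lines of NumPy (an illustrative check, not drawn from our experiments):

```python
# Illustrative check that pairwise distances of points sampled uniformly
# from the unit hypercube concentrate as d grows.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 16, 128, 1024):
    X = rng.random((1000, d))  # N = 1000 points in [0, 1]^d
    dists = pdist(X)
    dists /= dists.max()       # normalize, as in the ROGI procedure
    print(f"d = {d:4d}: mean = {dists.mean():.2f}, std = {dists.std():.3f}")
# as d grows, the mean normalized distance approaches 1 and the spread
# shrinks, delaying coarse-graining until large thresholds t
```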
To minimize the impact of dimensionality on the ROGI, we change its integration variable to capture the degree of coarse-graining independent of representation size. Procedurally, “coarse-graining” entails taking a step up in the dendrogram produced during the clustering routine. Whereas originally we scan along the distance required to take such a step, we now opt to use 1 − log Nclusters/log N, where Nclusters is the number of clusters at the given step in the dendrogram and N is the dataset size. This new formulation, which we refer to as ROGI-XD, produces similar values for each toy dataset regardless of its dimensionality (Fig. 2D). We note that while there are other formulations that reflect a similar concept, they must possess a constant integration domain. For example, using 1 − Nclusters/N as the x-axis produces a similar trend (Fig. S1†), but it is defined on the domain [0, 1 − 1/N], making the score dependent on N and confounding comparisons across datasets with large differences in size.
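In code, the change of variable amounts to stepping through the dendrogram by cluster count rather than by merge distance. A sketch, assuming a complete-linkage matrix Z (built as in the ROGI sketch above) and labels y, again illustrative rather than our exact implementation:

```python
# Sketch of ROGI-XD; `Z` is a complete-linkage matrix and `y` the labels.
import numpy as np
from scipy.cluster.hierarchy import fcluster


def rogi_xd(Z: np.ndarray, y: np.ndarray) -> float:
    N = len(y)
    sigma_0 = y.std()
    xs, losses = [], []
    # step up the dendrogram from N singleton clusters to a single cluster
    for n_clusters in range(N, 0, -1):
        labels = fcluster(Z, t=n_clusters, criterion="maxclust")
        means = {c: y[labels == c].mean() for c in np.unique(labels)}
        y_cg = np.array([means[c] for c in labels])
        xs.append(1 - np.log(n_clusters) / np.log(N))  # x-axis is always [0, 1]
        losses.append(2 * (sigma_0 - y_cg.std()))
    # integrate the loss of dispersion against the new coordinate
    return np.trapz(losses, xs)
```

For large datasets, the cluster assignments at each merge can be obtained in a single pass over Z rather than via repeated fcluster calls.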
The ROGI-XD produces strong correlations with model error across molecular representations for the majority of tasks and ML models tested (Fig. 3). The median correlation across all combinations of model and task ranges between 0.72 and 0.88, with the best correlations observed for the random forest (RF) and k-nearest neighbors (KNN) models. This is in contrast to the original ROGI, which generally produces weak correlations (median r ∈ [−0.32, 0.28]) when subjected to the same analysis (Fig. S2†). As shown in the toy example above, the original ROGI is affected by representation size, so the range of dimensionalities in the representations tested (14 to 2048, Table S1†) negatively impacts correlation strength.
Of note are the generally strong correlations between ROGI-XD and the RMSE of a KNN model. This is perhaps unsurprising given the thematic similarities between the two algorithms. However, the RMSE of a KNN model on the full dataset correlates with the cross-validated RMSE of other models more weakly than ROGI-XD does, for all values of k tested (Fig. S3†). For a more detailed discussion of the fundamental differences between the two, we refer the reader to the Differences between ROGI-XD and k-nearest neighbors section in the ESI† text.
When we measure the correlation between ROGI-XD and RMSE across tasks for a given model using molecular descriptors (as in our original study), we see similarly strong correlations (Fig. S5†). These correlations remain strong when we measure correlation over both representations and tasks, whereas they decrease significantly with the original ROGI (Table 1). In turn, this allows for the direct comparison of ROGI values measured for two datasets with differing representations.
Table 1 Correlation between each roughness metric and cross-validated model RMSE, measured over both representations and tasks

| Metric | KNN | MLP | PLS | RF | SVR |
|---|---|---|---|---|---|
| ROGI | 0.800 | 0.675 | 0.835 | 0.809 | 0.771 |
| ROGI-XD | 0.990 | 0.913 | 0.983 | 0.985 | 0.958 |
It is also possible to measure the correlation between ROGI or ROGI-XD and the minimum model error for a given task and representation. In other words, rather than treating each ML model separately as above, we now (1) measure the model error and roughness metric for all combinations of task, representation, and ML model; (2) take the minimum model error for each combination of task and representation; and (3) measure the correlation between the roughness metric and this “best-case” model error for each task. The ROGI-XD again produces strong correlations across all datasets (median r = 0.82) compared to the original ROGI (median r = 0.16) (Fig. 4). This discrepancy in correlation strength is expected because this analysis still relies on comparisons across representations.
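A hypothetical sketch of this three-step analysis, assuming the results have been collected into a long-format pandas DataFrame df with columns task, representation, model, rmse, and rogi_xd (the roughness value being shared by all models of a given task/representation pair; the use of Pearson's r is likewise an assumption for illustration):

```python
import pandas as pd
from scipy.stats import pearsonr

# (2) minimum model error for each combination of task and representation
best = (
    df.groupby(["task", "representation"], as_index=False)
      .agg(best_rmse=("rmse", "min"), rogi_xd=("rogi_xd", "first"))
)
# (3) correlation between roughness and best-case error for each task
for task, grp in best.groupby("task"):
    r, _ = pearsonr(grp["rogi_xd"], grp["best_rmse"])
    print(f"{task}: r = {r:.2f}")
```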
Performing a similar analysis using the RMSE of a KNN model, as above, produces correlations competitive with ROGI-XD (Fig. S4†). This is expected: when one model's RMSE decreases, the RMSEs of other models are likely to decrease as well. Despite this, the distribution of correlations for KNN RMSE is much broader than that for ROGI-XD, with more frequent worst-case performance.
The ROGI-XD's strong correlation with best-case model error across representations thus allows a user to quickly estimate best-case model performance for a variety of representations without resorting to empirical testing. This extends to comparing best-case modelability among datasets given a set of possible representations: calculate the ROGI-XD for each representation and select the lowest one for the task. For example, by selecting the representation with the lowest ROGI-XD and then optimizing over model architecture in each of our 17 tasks, the average relative increase in best-case model error would be only 6.8%. In 8 of 17 tasks, selecting the lowest ROGI-XD identifies the optimal representation with respect to best-case model error.
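Continuing the hypothetical sketch above, this representation-selection procedure reduces to an argmin over ROGI-XD values per task:

```python
# for each task, pick the representation with the lowest ROGI-XD and
# compare its best-case RMSE against the optimum over all representations
picked = best.loc[best.groupby("task")["rogi_xd"].idxmin()].set_index("task")
optimum = best.groupby("task")["best_rmse"].min()
relative_increase = (picked["best_rmse"] - optimum) / optimum
print(relative_increase.mean())  # the text above reports an average of 6.8%
```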
We find that across all tasks tested above, PCMs do not generate quantitatively smoother QSPR surfaces than those generated from molecular descriptors or fingerprints (Fig. 5). In more than 50% of the tasks evaluated, both descriptors and fingerprints generated smoother QSPR surfaces. The median relative ROGI-XD values of each pretrained representation compared to descriptors and fingerprints range between 9.1–21.3% and 2.3–10.1%, respectively. These ROGI-XD values are consistent with the cross-validation results, in which descriptors and fingerprints generally achieve lower RMSEs than the pretrained representations (Fig. S6 and S7†). An extreme case is the Scaffold Hop task, where the GIN, ChemGPT, and ChemBERTa representations produce ROGI-XD values of 0.150, 0.174, and 0.172, respectively, compared to 0.085 for descriptors. We emphasize, however, that PCMs do not generate poor representations; rather, these learned representations are simply no smoother than simple, fixed ones.
One potential benefit of learned representations is that they can be finetuned using task-specific data. Given the nature of the learning task, we would naturally expect finetuning to smooth the corresponding QSPR surface. We tested this approach with our VAE model by adding a contrastive loss between the latent representations and target properties, using the Lipophilicity_AstraZeneca task from the TDC.10 Finetuning the VAE on 80% of the dataset (Ntot = 4200) improves the ROGI-XD from 0.254 to 0.107 (±0.02), considerably smoother than descriptors at 0.227. Applying the same strategy to the CACO2_WANG task (Ntot = 910) yields a ROGI-XD of 0.143 (±0.05), no smoother than descriptors (0.132). The impact of finetuning on smoothness therefore varies and is sensitive to both the task and the number of labeled examples.
Studies that introduce new pretraining techniques or model architectures rarely, if ever, analyze the smoothness of the underlying QSPR surfaces. Rather, they benchmark their method on a variety of property prediction tasks and frequently report mixed results; on some tasks, the new technique outperforms the current state-of-the-art, but on others, it fails to compete with simple baselines. In our evaluations, we find that baseline representations outperform learned representations in 10 of the 17 tasks tested. The relative roughness observed for the QSPR surfaces generated by these learned representations is consistent with their generally mixed performance in property prediction tasks. Thus, we believe that this lack of smoothness at least partially explains their inability to consistently outperform established molecular representations on supervised learning benchmarks.
While it is intuitive that worse model performance could stem from a rougher QSPR surface, such an analysis has not previously been conducted. We attribute this to the prior lack of metrics that can (i) quantify QSPR surface roughness and (ii) be compared directly across representations. Our analysis using the ROGI-XD allows us to show quantitatively that learned representations do not, in general, produce smoother surfaces.
Taken together, these observations suggest that more work remains in developing chemical foundation models. Though it is unreasonable to expect that any single pretrained representation will produce a smoother QSPR surface in every task, a reasonable desideratum is that such a representation be of comparable smoothness to simple baseline representations for a majority of useful properties. The ROGI-XD is thematically similar to a contrastive loss, as both scale proportionally with the frequency and severity of activity cliffs in a given dataset. Imposing stronger assumptions of smoothness with respect to molecular structure during model pretraining, for example by weak supervision on simple, calculable properties, could aid in producing smoother QSPR surfaces.
A limitation of our analysis is that we have treated the pretrained representations as static for downstream modeling; an alternative is to fine-tune them by training the model on additional, labeled data, in turn helping to smooth the corresponding QSPR surface. In a sense, the evaluations here have demonstrated the need for fine-tuning in the absence of a universally smooth representation. This introduces many additional design choices, so we leave this evaluation for future work.
The GuacaMol tasks were constructed by randomly sampling 10 000 molecules from the ZINC250k dataset and then calculating the following GuacaMol9 oracle function values for these molecules: Scaffold Hop, Median 1, Aripiprazole_Similarity, Zaleplon_MPO, and Celecoxib_Rediscovery. We exclude many of the original GuacaMol tasks, as their oracle functions use descriptor values in the scoring function that overlap with our descriptor representation. For the hERG_at_1uM and hERG_at_10uM tasks from the TDC, datasets were downsampled to 10 000 molecules; in these instances, reported ROGI values are the mean of five random subsamples.
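For reference, a hypothetical reconstruction of this task setup using TDC's Oracle interface is shown below; the oracle names and the MolGen dataset key are assumptions based on TDC's documented API, not our exact pipeline:

```python
from tdc import Oracle
from tdc.generation import MolGen

# sample molecules from ZINC (TDC's generation split) and score them with
# the five GuacaMol oracles used in this work
smiles = (
    MolGen(name="ZINC").get_data()["smiles"]
    .sample(n=10_000, random_state=0)
    .tolist()
)
oracle_names = ["scaffold_hop", "median1", "aripiprazole_similarity",
                "zaleplon_mpo", "celecoxib_rediscovery"]
labels = {name: Oracle(name=name)(smiles) for name in oracle_names}
```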
The VAE is finetuned with a loss of the form ℒ = ℒ_CE + β·ℒ_KL + γ·ℒ_cont, where ℒ_CE, ℒ_KL, and ℒ_cont are the cross-entropy, KL divergence, and contrastive terms, respectively, and β and γ are loss weights. We set these weights to 0.1 and 50, respectively. For two points i and j, the contrastive term is defined as the squared difference between their distance in the latent space and their distance in the target space:

ℒ_cont(i, j) = (d_z(z_i, z_j) − d_y(y_i, y_j))²

where d : ℝ^D × ℝ^D → ℝ≥0 is a (pseudo)metric. For latent space distances d_z, we use the cosine distance, and for target space distances d_y, we use the absolute value. We minimize the mean of all pairwise differences across an entire batch.
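A minimal PyTorch sketch of this contrastive term, assuming a batch of latent vectors z of shape (batch, D) and scalar targets y of shape (batch,):

```python
import torch
import torch.nn.functional as F


def contrastive_term(z: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    z_unit = F.normalize(z, dim=-1)
    d_z = 1.0 - z_unit @ z_unit.T           # pairwise cosine distances
    d_y = (y[:, None] - y[None, :]).abs()   # pairwise |y_i - y_j|
    return ((d_z - d_y) ** 2).mean()        # mean over all pairs in the batch


# total finetuning loss with the weights given above (beta = 0.1, gamma = 50);
# `ce` and `kl` denote the VAE's cross-entropy and KL terms, defined elsewhere
# loss = ce + 0.1 * kl + 50 * contrastive_term(z, y)
```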
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00088e
This journal is © The Royal Society of Chemistry 2023