Open Access Article
Guanming
Chen
and
Thijs
Stuyver
*
Ecole Nationale Supérieure de Chimie de Paris, Université PSL, CNRS, i-CLeHS, 75 005 Paris, France. E-mail: thijs.stuyver@chimieparistech.psl.eu
First published on 22nd September 2025
Predictive chemistry often faces data scarcity, limiting the performance of machine learning (ML) models. This is particularly the case for specialized tasks such as reaction rate or selectivity prediction. A common solution is to use quantum mechanical (QM) descriptors—physically meaningful features derived from electronic structure calculations—to enhance model robustness in low-data regimes. However, computing these descriptors is costly. Surrogate models address this by predicting QM descriptors directly from molecular structure, enabling fast and scalable input generation for data-efficient downstream ML models. In this study, we compare two strategies for using surrogate models: one that feeds predicted QM descriptors into downstream models, and another that leverages the surrogate's internal hidden representations instead. Across a diverse set of chemical prediction tasks, we find that hidden representations often outperform QM descriptors, particularly when descriptor selection is not tightly aligned with the downstream task. Only for extremely small datasets or when using carefully selected, task-specific descriptors do the predicted values yield better performance. Our findings highlight that the hidden space of surrogate models captures rich, transferable chemical information, offering a robust and efficient alternative to explicit descriptor use. We recommend this strategy for building data-efficient models in predictive chemistry, especially when feature importance analysis is not a primary goal.
One strategy to address the issue of data scarcity consists of representation engineering, i.e., describing the molecules/reactions through a limited set of (carefully selected) informative, physically meaningful descriptors, so that a robust relationship between ML model input and output can be learned.5–7 Quantum Mechanical (QM) descriptors are a particularly popular choice in this regard. Unfortunately, the calculation of such descriptors typically requires resource-intensive density functional theory (DFT) calculations,8–12 which limits the applicability of this approach to big datasets, as well as to use cases where inference is expected at a high-throughput speed.
An alternative strategy that has been pioneered in recent years is to avoid the explicit calculation of QM descriptors, by predicting their values for unseen molecules and/or reactions with the help of a surrogate ML model.1,13–16 Taking this approach, QM descriptors can be inferred on-the-fly, so that the generation of the input representation of the downstream model can be seamlessly integrated into a single end-to-end model. This aggregate model then rivals regular ML models in terms of inference speed and computational resource footprint, while enabling higher accuracy and increased robustness due to the physical information encoded in the intermediately predicted descriptors.
Of course, setting up such a surrogate model still requires an initial training dataset, constructed through high-throughput QM calculations. However, once generated, this data, as well as the resulting surrogate models, can be applied to – and repurposed for – different downstream prediction tasks. Consequently, a range of high-throughput QM datasets have been released in recent years. In addition to the prototypical QM9 dataset,17 one of the earliest large-scale examples was the QMugs18,19 dataset, which contains a wide range of quantum mechanically (QM) computed descriptors and properties for 665k biologically and pharmacologically relevant molecules extracted from the ChEMBL20 database. Other examples include the QM40 (ref. 21) dataset, which contains QM properties for 163k compounds extracted from the ZINC22 database, the QCDGE23 dataset, which contains both ground- and excited-state properties for 450k C, H, N, O, F containing compounds, the BDE-db24 dataset, which contains QM descriptors for more than 200k organic radicals, and the tmQM25 dataset, which contains properties for 86k transition metal complexes.
The growing availability and diversity of public QM descriptor datasets indicate that surrogate modeling will become increasingly accessible—and presumably more widely adopted as a consequence—in the years to come. As such, it is important to establish guidelines on how the QM information, captured in these datasets, can be leveraged optimally for downstream tasks.
Taking a closer look at the typical architecture of the surrogate models used so far,1,14,15 one can conclude that they generally start from a SMILES string from which a molecular graph is deduced. The atomic vectors of this graph are subsequently embedded into a learned (hidden) representation, after which multiple feed-forward neural networks (FFNN), or readout functions, lead to the actual QM descriptors, that is, the surrogate model targets (Fig. 1).
Starting from this realization, a natural question arises: how does directly using the final hidden representation of the embedder in the surrogate model as input for the downstream model compare to the conventional surrogate model approach of introducing the predicted QM descriptors as the input, or as supplementary features, for the downstream ML model?
Note that the alternative hidden representation strategy, introduced above, is conceptually connected to a pre-training strategy: a model is first trained on an extensive descriptor dataset, after which the weights in the trained encoder are frozen, and the prediction heads, leading to the descriptors, are detached and replaced by pristine heads that lead to the targets of the downstream task.26–30
Intuitively, arguments can be devised in favor of a superior performance for either the descriptors or the hidden representation approach. On the one hand, the readout process to go from hidden representation to QM descriptors may result in information loss due to the compression of certain, useful, hidden features, i.e., the learned hidden space may contain some features that are beneficial for downstream tasks, which are not transferred when QM descriptors are used as input for the latter model. On the other hand, the hidden representations themselves are usually high-dimensional (e.g., 1200 dimensions for a single atomic representation), and hence they may be inherently non-linear, and/or may contain a high degree of redundancy (i.e., many dimensions either correlate only very weakly with downstream targets, or are largely co-linear with some of the other hidden dimensions). As such, the high-dimensionality may make it difficult to fully leverage the encoded information under sparse data conditions, which is typically the case for the downstream tasks for which we would want to use a surrogate model.
In this work, incorporating predicted QM descriptors, and directly utilizing the hidden space of the surrogate model, will be compared head-to-head for a selection of representative downstream tasks, mainly within the realm of chemical reactivity.
Our results suggest that only when a carefully engineered – and complete – set of descriptors is selected, the direct use of QM descriptor values may result in a superior performance of the downstream model. In most cases, however, and especially when descriptors are selected without too much attention to the underlying physics/chemistry of the downstream task (which is typically the case when descriptors are added to a predictive model),1,14,31 leveraging the surrogate model's hidden space actually results in a better predictive performance – the discrepancy is often significant. As such, unless gaining qualitative insights through feature importance analysis is the main goal of the study,32–34 we propose as a general recommendation to focus on the hidden space of surrogate models to enhance predictive models for downstream tasks, not the predicted descriptors themselves.
First, we focus on the work by Alfonso-Ramos and co-workers15 on hydrogen atom transfer (HAT) reactivity, where the aim consisted of accurately predicting activation energies (ΔG‡) for a variety of small, downstream datasets. In this study, the previously mentioned BDE-db database was used to construct the surrogate model. The relevant QM descriptors were identified through a careful Valence Bond (VB) analysis for a generic HAT model reaction (see Section S1 in the SI for an in-depth discussion). Specifically, the following descriptors were mined from BDE-db: (1) partial charges and spin densities as atom-level descriptors; (2) buried volume, frozen bond dissociation energy (BDE), and bond dissociation free energy (BDFE) as molecule-level descriptors. The vector, obtained by concatenating these descriptors on both the reactant and product sides, contains 14 elements.
The specific downstream datasets considered are: (1) In-House, a dataset consisting of 1511 DFT computed gas-phase reaction profiles for pairs of organic compounds extracted from BDE-db; (2) Alkoxy,35 consisting of computed reaction profiles for alkoxy radicals abstracting hydrogens from hydrocarbons and heterosubstituted compounds in an acetonitrile solution, where 238 reactions were pre-selected for training and validation and 60 reactions for testing; (3) Exp. alkoxy,36 a small dataset of experimentally reported selectivities for 6 hydrocarbons by CH3O˙, and the corresponding 15 DFT-computed reaction barriers; (4) Photoredox HAT,37 564 photoredox-mediated HAT catalysis reactions with various allylic, propargylic, benzylic, aldehyde and alkyl substrates and O/N-based radical species extracted from a larger data set published by Hong and co-workers;37 (5) Exp. cumyloxyl,38 an experimental dataset containing 45 HAT reactions from C(sp3)–H bonds by cumyloxyl radical; (6) Cytochrome P450,39 consisting of 24 activation energies for HAT by the cytochrome P450 enzyme from organic compounds, where 6 reactions out of 24 are pre-selected for testing; (7) Atmospheric HAT,40 consisting of 73 HAT reactions encountered in atmospheric chemistry and extracted from RMechDB.41 Except for the In-House dataset, these sets were originally extracted from the literature by Alfonso-Ramos et al.,15 with the aim to cover a wide range of HAT settings (i.e., atmospheric, metabolic, and synthetic reactivity), while staying as much as possible within the scope of the surrogate model.
Secondly, we focus on the work by Guan et al.1 on predicting experimentally observed regiochemistry for selective organic reactions extracted from the patent literature. In this case, the surrogate model was trained on an in-house generated dataset of QM descriptors computed at B3LYP/def2-SVP level-of-theory42–44 for 136k organic molecules, containing C, H, O, N, P, S, F, Cl, Br, I, Si, B elements, curated from the ChEMBL and Pistachio45 databases. For the selection of the descriptors, Guan et al. simply opted to compute a series of the most frequently used local reactivity indices, that is, (1) atomic charges, nucleophilic Fukui indices, electrophilic Fukui indices46,47 and NMR48 shielding constants as atomic descriptors; (2) bond lengths and bond orders as bond descriptors.
The specific downstream datasets considered in this work, are: (1) C–H, consisting of 2244 aromatic C–H functionalization reactions, (2) C–X, consisting of 1024 aromatic C–X substitution, and (3) others, consisting of all selective substitution reactions that do not fall in either of the previously mentioned categories (552 entries in total).
Finally, we focus on the work by Li et al.,14 in which the surrogate model approach was applied to a broad range of downstream tasks. Here, the surrogate model was trained on an in-house generated dataset of QM descriptors computed at 6 different levels-of-theory for 65k organic molecules, extracted from a range of public databases.20,49–54 We decided to focus on the descriptors computed at ωB97X-D/def2-SVP//GFN2-xTB level-of-theory.44,55,56 37 descriptors were considered in total, 13 atom-level descriptors (e.g., NPA charges,57 Parr functions,58 NMR shielding constants, and valence orbital occupancies), 4 bond-level (e.g., bond order, bond length, bonding electrons, and bond natural ionicity), and 20 molecule-level descriptors (e.g., HOMO–LUMO gap, ionization potential, electron affinity, and dipole and quadrupole moments). In their work, Li et al.,14 trained 3 separate surrogate models (1 for both atom-level and bond-level, and 2 for molecule-level descriptors). Here, because of the nature of the downstream tasks, we focus on the surrogate models that predict molecule-level descriptors.
In total, 16 downstream applications were considered in the original work by Li et al.14 Here, we only consider a representative subselection of the corresponding datasets: (1) ESOL,59 an experimental regression dataset containing water solubilities for 1127 molecules, (2) FreeSolv,60 an experimental regression dataset, consisting of 642 hydration free energy values, (3) QM9,17 a regression dataset consisting of 12 (quantum chemically computed) energetic, electronic and thermodynamic properties, for 134k organic compounds with up to 9 heavy atoms, (4) HIV,61 an experimental classification dataset indicating the ability to inhibit HIV replication for 41
127 molecules, (5) ClinTox,62 an experimental classification dataset indicating the toxicity of 1477 compounds. The above datasets are also available through MoleculeNet.62,63
Note that the benchmarking datasets selected above for all three case studies are identical to those in the original studies,1,14,15 and even the same data splits are adopted to ensure a fair comparison. Their summary is shown in Table 1.
| Dataset | Labels | Task |
|---|---|---|
| In-House15 | ΔG‡ | HAT reactivity prediction (regression) |
| Alkoxy35 | ΔG‡ | HAT reactivity prediction (regression) |
| Exp. alkoxy36 | ΔG‡ | HAT reactivity prediction (regression) |
| Photoredox HAT37 | ΔG‡ | HAT reactivity prediction (regression) |
| Exp. cumyloxyl38 | ΔG‡ | HAT reactivity prediction (regression) |
| Cytochrome P450 (ref. 39) | ΔG‡ | HAT reactivity prediction (regression) |
| Atmospheric HAT40 | ΔG‡ | HAT reactivity prediction (regression) |
| C–H, C–X, Others, All1 | Primary product formed among enumerated regio-isomers | Regioselectivity prediction (classification) |
| ESOL59 | Log(S) | Molecular property prediction (regression) |
| FreeSolv60 | ΔGhydexp, ΔGhydcalc | Molecular property prediction (two-task regression) |
| QM9 (ref. 17) | DFT-derived μ, α, εHOMO, εLUMO, εgap, 〈R2〉, zpve, Cv, U0, U, H, G | Molecular property prediction (multi-task regression) |
| HIV61 | Ability to inhibit HIV replication | Molecular property prediction (binary classification) |
| ClinTox62 | Toxicity of drug candidates | Molecular property prediction (binary classification) |
An in-depth discussion of the ChemProp architecture can be found in a recent publication by Heid et al.27 In short, the model starts by constructing a molecular graph from a SMILES input, after which the graph is passed through a (multi-layer) directed message passing neural network (D-MPNN) encoder. The D-MPNN iteratively aggregates and updates information from neighboring atoms and bonds and, as a result, encodes a molecule into separate atom- and bond-level embeddings. These representations are then passed on to multiple FFNN readout functions, where each FFNN is trained to predict one individual QM descriptor. For molecule-level descriptors, the atom-level vectors are first sum-pooled (or mean-pooled), before passing the final hidden representation/embedding to the respective FFNN readout functions. For globally constrained atom-level descriptors – e.g., the sum of all the atomic charges within a neutral molecule should always be equal to zero – an attention-based correction is applied to ensure that this constraint is satisfied.
For the second1 and third14 case studies, the surrogate models were not retrained at any point, and hence, the hyperparameters selected in the corresponding original works were adopted here as well consistently. This means that the sizes of the atom- and molecule-level hidden representations are also fixed and pre-set – at 600 and 700/900, respectively.
For the first case study,15 we also opted to use the original trained surrogate model throughout the first part of our analysis. Additionally, we also performed some tests where we retrained the model on only a subset of the QM descriptors (vide infra), but with the same hyperparameters. This means that the hidden atom- and molecule-level representations contain 1200 dimensions each by default. Furthermore, we also performed some tests where we systematically modulated the hidden dimension size h, of the surrogate. As evident from the discussion in the final section below, doing so does not affect the conclusions drawn significantly.
In the standard descriptor-based pipeline, these atom-, bond-, and molecule-level embeddings are passed through corresponding FFNNs to produce the respective (surrogate model-dependent) descriptors. These descriptors are then concatenated to form the input to the downstream model.
In our hidden representation approach, we directly extract and concatenate selected subsets of the learned atom-, bond-, and molecule-level embeddings—bypassing the FFNNs altogether—to construct alternative downstream model inputs. This custom selection approach is necessary to ensure a fair comparison: both approaches should only have access to the same underlying descriptor information. Since descriptor generation varies between case studies, the construction of the hidden representation inputs must be modulated accordingly.
In the original HAT reactivity study,15 both (reactive) atom-level, and molecule-level, descriptors, respectively on the reactant and product side of the reaction, were extracted from the surrogate model as input for the downstream tasks. As such, we decided to concatenate both atom-level embeddings of the reactive atoms, i.e., the sites undergoing a bonding change, and molecule-level embeddings in the hidden representation approach.
The general reaction scheme for HAT reactivity can be written as follows,
| R1–A˙–H + R2–B˙ → R1–A˙ + R2–B˙–H | (1) |
The input representation we constructed for this case study thus consists of 6 concatenated vectors: the hidden representations of the four molecules R1–A˙–H, R2–B˙, R1–A˙, and R2–B˙–H, involved in the reaction, and the atom-level representations of the radical sites on reactant and product side respectively, i.e., B˙ in R2–B˙ and A˙ in R1–A˙.
In their study on regioselectivity prediction, Guan et al.1 focused exclusively on the effect of including (atom-level) QM descriptors from the two main reactive sites of the reactants in the downstream model. As such, we decided here to use a simple concatenation of the corresponding atom-level hidden representations of the reactive sites as the alternative input for the downstream model.
Since all the downstream tasks, considered in Li et al.’s work,14 involved molecule-level properties, we simply extracted the hidden representations of the 2 molecule-level surrogate models they generated, concatenated them, and used them as the downstream model input.
For the first case study,15 an ensemble of 4 FFNNs was consistently set as the model architecture, and the optimal number of layers, hidden size, learning rate and dropout rate of these FFNNs were determined for the In-House dataset through a grid search. The resulting settings were subsequently transferred across the different downstream tasks.
For the second case study,1 the original models for the descriptor-based strategy were directly transferred without any alterations, i.e., a 3-layer FFNN, respectively with 500, 250, and 125 nodes, was selected for each of the downstream datasets.
For the third case study,14 we selected a 3-layer FFNN, each with 1000 nodes, as the downstream model architecture. Note that for the corresponding baseline model, i.e., the one that takes the predicted descriptors as input, a radial basis function (RBF) expansion (n = 50) was performed first, as this turned out to improve the performance of this model compared to just selecting directly the descriptor values themselves (see Section S2 and S3 in the SI for more details and comparisons, respectively).
A more in-depth discussion about the downstream models can be found in S4 of the SI.
| Dataset | Size | Descriptor-based | Hidden representation-based | |||||
|---|---|---|---|---|---|---|---|---|
| Best model | RMSE (kcal mol−1) | MAE (kcal mol−1) | R 2 | RMSE (kcal mol−1) | MAE (kcal mol−1) | R 2 | ||
| In-House | 1511 | FFNNs | 2.74 ±0.01 | 1.96 ±0.00 | 0.85 ±0.00 | 3.35 ± 0.02 | 2.40 ± 0.02 | 0.77 ± 0.00 |
| Alkoxy | 298 | RF | 1.30 ± 0.03 | 1.12 ± 0.03 | 0.78 ± 0.01 | 1.26 ±0.03 | 1.03 ±0.02 | 0.79 ±0.01 |
| Exp. alkoxy | 15 | RF | 1.29 ±0.03 | 0.97 ±0.02 | 0.62 ±0.02 | 1.61 ± 0.02 | 1.25 ± 0.02 | 0.41 ± 0.02 |
| Photoredox HAT | 564 | RF | 1.36 ± 0.00 | 0.90 ± 0.01 | 0.92 ± 0.00 | 0.84 ±0.02 | 0.60 ±0.01 | 0.97 ±0.00 |
| Exp. cumyloxyl | 45 | FFNNs | 0.79 ± 0.04 | 0.66 ± 0.04 | 0.37 ± 0.16 | 0.55 ±0.04 | 0.45 ±0.03 | 0.70 ±0.10 |
| Cytochrome P450 | 24 | FFNNs | 1.09 ±0.08 | 0.95 ±0.11 | 0.47 ±0.08 | 1.28 ± 0.01 | 1.14 ± 0.02 | 0.27 ± 0.02 |
| Atmospheric HAT | 73 | FFNNs | 2.30 ± 0.12 | 1.88 ± 0.08 | 0.62 ± 0.12 | 2.08 ±0.10 | 1.60 ±0.10 | 0.75 ±0.04 |
In line with our intuition outlined in the Introduction, we observe that for the two smallest datasets considered, with only 15 and 24 data points, respectively, the descriptor approach slightly outperforms the hidden representation approach. This suggests that in these extreme situations, the high dimensionality of the downstream model input (which in this case amounts to 6 × 1200 = 7200 dimensions, vide supra) may indeed negatively affect the performance. However, even if this is indeed the root cause of this observation, it is clearly an effect that vanishes rapidly as dataset size increases: already for the datasets with 45 and 73 datapoints respectively, i.e., Exp. cumyloxyl and Atmospheric HAT, the hidden representation takes over as the most performant downstream model input in our analysis.
Next to the extremely small downstream datasets, there is only one other example for which we observe that the descriptors outperform the hidden representation, namely the biggest of them all, i.e., the In-House dataset.
We hypothesize that the reason the descriptors perform so well for this specific case is that the surrogate QM descriptor model was, in fact, specifically set up with this dataset, consisting of organic HAT reactions run in the gas phase, in mind as the main downstream application in the original publication.15 More specifically, the constructed VB model, from which the descriptors to include in the surrogate were selected, neglected solvent/environment effects altogether, and the calculations performed to construct the BDE-db database24 – the training data for the surrogate model – were also performed in the gas-phase (and at the same level-of-theory). Finally, the reactions in the In-House dataset were constructed by combining radicals and molecules from the same distribution as the BDE-db dataset, so one can expect that the QM descriptors predicted by the surrogate model are particularly appropriate and accurate for this dataset.
This stands in contrast to the other downstream datasets, where measurements/calculations were either performed in a solvent (or even in an enzymatic) environment, and/or molecules were involved that are (mostly) out-of-distribution with respect to the BDE-db database.
As a first test to verify this hypothesis, i.e., that descriptors were able to outperform the hidden representation for the In-House dataset only because a (close to) “ideal” representation had been designed, we probed what would happen if we retrain the surrogate model with only part of the QM descriptors, identified as relevant in the VB analysis. Specifically, we kept the other configurations unchanged by re-using the source code and training set from Alfonso-Ramos et al.15 and considered two alternative versions of the surrogate model where only the atom-level/molecule-level QM descriptors were kept as targets (Fig. 2).
As expected, we observe that when re-training the surrogate model with only atom-level descriptors, the performance of the descriptor-based model, evaluated on the In-House dataset, becomes significantly worse than the corresponding hidden representation-based model, due to a sharp drop in the accuracy of the former. Interestingly, also for the only other gas phase HAT reactivity dataset, Atmospheric HAT, we observe a major loss in model performance upon exclusion of the explicitly predicted molecule-level/atom-level descriptors, accentuating the superior performance of the hidden representation strategy for this dataset even further. Similar, though somewhat less pronounced, results are also obtained when re-training exclusively with molecule-level descriptors.
In direct contrast to the results for the descriptor-based strategy, the hidden-space input representations appear remarkably robust across the board, retaining an almost constant performance, regardless of which targets are selected for the surrogate model. As such, using hidden representations from the surrogate appears to be a safer option, especially when the set of QM descriptors has been selected in a suboptimal manner, i.e., when major sources of variation in the downstream target are not covered by the prediction targets of the surrogate, and/or many of the descriptors are not causally linked with the variation of this target. An interpretation for this phenomenon through visualization can be found in S6 of the SI.
It should be emphasized that only in the case of the two very small downstream datasets—Exp. alkoxy and Cytochrome P450—do descriptors consistently yield better performance, regardless of whether all descriptors or only the atom-level/molecule-level subsets are considered. Additional experiments underscore that this effect is indeed exclusively linked to dataset size: when the amount of training data is progressively reduced for slightly larger datasets (where hidden representations normally outperform descriptor-based ones), the performance gap between the two strategies systematically narrows. In the extreme low-data regime (on the order of 50 samples), the compact descriptor representation tends to surpass hidden representations for these datasets as well (see Section S7 of the SI for details).
Finally, we would also like to note that the results presented above, for the full set of descriptors, are independent of the size of the hidden representation in the surrogate model. In Fig. 4, we show that, while the errors for both strategies are modulated somewhat by changing the hidden size, the overall trends, i.e., whether the hidden representation-based or the descriptor-based strategy is the most performant, remain consistent per individual dataset. More results, along with re-training details, can be found in S8 of the SI.
Gratifyingly, this is exactly what we observe (see Table 3). For each of the downstream datasets considered, the hidden representation-based models achieve a 4–5% higher accuracy than the corresponding descriptor-based models. Interestingly, our hidden representation-based models even outperform the elaborate graph neural networks that integrate the surrogate model predicted QM descriptors, ml-QM-GNN, developed specifically for these datasets by Guan et al.1 (except for the C–X dataset, where the two methods perform nearly the same). This observation further strengthens our hypothesis.
| Dataset | Size | Hidden-FFNN | Desc-FFNN | ml-QM-GNN |
|---|---|---|---|---|
| C–H | 2244 | 92.7% ± 0.3% | 88.9% ± 0.1% | 92.4% ± 0.2% |
| C–X | 1024 | 96.5% ± 0.3% | 91.5% ± 0.6% | 96.6% ± 0.4% |
| Others | 552 | 95.4% ± 0.4% | 91.5% ± 0.4% | 94.5% ± 0.8% |
| All | 3820 | 94.0% ± 0.3% | 89.7% ± 0.4% | 93.5% ± 0.2% |
We should note here that it is not entirely inconceivable that, through the selection of a more complete set of descriptors (and taking for example solvent effects into account during their calculation/prediction), the hidden representation-based model could still be outperformed by a descriptor-based one. For the current selection, however, there is clearly no substantial benefit of using predicted descriptors.
| Dataset | Size | Hidden-FFNN | Desc-FFNN | ml-QM-GNN |
|---|---|---|---|---|
| ESOL | 1127 | 0.646 ±0.008 | 1.127 ± 0.008 | 0.539 ± 0.047 |
| FreeSolv | 642 | 0.937 ±0.019 | 2.358 ± 0.049 | 0.89 ± 0.16 |
| QM9 | 133 885 |
0.711 ±0.003 | 1.901 ± 0.004 | 0.111 ± 0.004 |
| HIV | 41 127 |
0.815 ±0.005 | 0.714 ± 0.001 | 0.823 ± 0.029 |
| ClinTox | 1477 | 0.892 ±0.007 | 0.695 ± 0.008 | 0.871 ± 0.058 |
Remarkably, despite the simplicity of the adopted downstream model architecture, the hidden representation strategy is even competitive with the elaborate QM-augmented graph neural network (ml-QM-GNN) approach for all datasets except QM9. In the case of QM9, the discrepancy between the hidden-FFNN and the ml-QM-GNN result is presumably due to a limited number of sub-tasks involving intensive properties. Indeed, the hidden-representation input based on molecular descriptors appears ill-suited to model properties that increase with molecule size (see Section S3 in the SI).
In this final section, we will attempt to analyze the characteristics of the hidden spaces for the downstream tasks, with a particular focus on their respective linearity and roughness, in the hope of gaining some more insights into this observed behavior.
As noted by Tkatchenko and co-workers,64 among others, to learn non-linearities, in particular interactions between the dimensions of a feature space, a sufficient number training points is typically needed. As such, one would naively expect that the smaller the size of the downstream dataset, the smaller the benefit of using an FFNN over a linear model.
To probe this, we compared the performance of our hidden-FFNN to its linear regression analog, hidden-LR, for a range of hidden space sizes (h ∈ [100, 200, 300,…, 2000]), in the surrogate model for the HAT case study15 (Fig. 3). Remarkably, while the performance of hidden-FFNN is relatively stable across h values, the errors achieved by hidden-LR fluctuate significantly; for some values of h, both downstream model architectures reach a similar performance, while for others, the hidden-FFNN outperforms hidden-LR by a significant margin. Note that this observation is valid for all the different downstream datasets. What this suggests is that the extent of linearity is not an inherently fixed property of the learned representation, but that the FFNN is equally good at dealing with either the fairly linear, as well as the non-linear, versions, regardless of the downstream dataset size. This implies that the hidden space inherently carries high-quality, readily exploitable information: the non-linearities that may emerge in the hidden space are generally not overly complex and easily learnable.
Interestingly, we also consistently observe that, among the dimensions of the hidden representation, there tends to be no significant co-linearity, i.e., most of the dimensions are completely independent (cf. Section S9 of the SI), further underscoring the apparent richness of the learned representation in the surrogate model.
Finally, we also took a look at the roughness of the respective input feature spaces of the downstream models. Roughness metrics aim to quantify the “modellability” of structure–activity relationships (SAR) within a dataset, where rougher landscapes contain a greater number of large target property differences between molecules/reactions that are close in feature (or hidden) space. Such large property jumps across adjacent datapoints are commonly known as activity cliffs and increase the challenge of training a performant ML model.65
We selected an advanced, recently proposed roughness index ROGI-XD66 to quantify the relationship between prediction targets and input feature spaces, with higher ROGI-XD values indicating rougher landscapes. Specifically, in the ROGI-XD approach, a dataset is progressively coarse-grained, and the evolution of the standard deviation of the targets throughout this coarse-graining process is tracked. In other words, the dataset is iteratively subdivided into clusters of increasing size, and at every instance, in each cluster, the labels of the individual samples are replaced with the average label for that cluster. Finally, the integral over the standard deviation of the (averaged) labels across cluster sizes is taken, resulting in the final roughness metric. When labels change only gradually across the feature space, the change in the standard deviation throughout the coarse-graining process will be limited, so that the ROGI-XD value will be small. As such, this metric provides an assessment of the feature/hidden space's quality, and it has previously been demonstrated that this quantity correlates well with the performance of ML models.
As evident from Fig. 4, we observe that for the In-House and Exp. alkoxy datasets, the descriptor-based method results in a smoother landscape, and this agrees with the higher performance of the descriptor-based strategy for these downstream datasets. Curiously, for most of the other datasets, the ROGI-XD values are quite similar for both strategies, though the hidden representation-based model tends to outperform the descriptor-based one as indicated in Table 2 and Fig. 2. The reason for this seemingly asymmetric behavior may be that roughness metrics such as ROGI-XD presumably become somewhat misleading when very large hidden spaces are considered, due to the presence of many potentially “unproductive” dimensions (see Section S10 in the SI for a toy model analysis). In other words, for feature spaces with many dimensions, ROGI-XD values can be artificially inflated compared to more compact feature spaces.
Overall, it appears that despite ROGI-XD's limitations, differences in roughness do seem to explain, to a reasonable extent, the observed trends in model performance. Nonetheless, an unequivocal causal explanation for the trends observed throughout this work is still lacking. This will be the focus of future research.
We acknowledge, however, a key trade-off: while hidden representations offer superior performance, they lack the transparency of explicit descriptors, making feature attribution and physical interpretability more difficult. As such, practitioners may still prefer explicit descriptors when mechanistic insight, feature importance analysis, or human-understandable model behavior is of central importance—particularly in hypothesis generation or experimental design.
Nonetheless, when predictive accuracy is the primary concern, our results strongly support the use of hidden representations from well-trained surrogate models. Their consistent outperformance across diverse applications suggests that, in most practical scenarios, they represent the more effective choice for enhancing downstream models in predictive chemistry.
Supplementary information is available. See DOI: https://doi.org/10.1039/d5dd00256g.
| This journal is © The Royal Society of Chemistry 2025 |