Harnessing Surrogate Models for Data-efficient Predictive Chemistry: Descriptors vs. Learned Hidden Representations
Abstract
Predictive chemistry often faces data scarcity, which limits the performance of machine learning (ML) models, particularly for specialized tasks such as reaction rate or selectivity prediction. A common solution is to use quantum mechanical (QM) descriptors—physically meaningful features derived from electronic structure calculations—to enhance model robustness in low-data regimes. However, computing these descriptors is costly. Surrogate models address this by predicting QM descriptors directly from molecular structure, enabling fast and scalable input generation for data-efficient downstream ML models. In this study, we compare two strategies for using surrogate models: one that feeds predicted QM descriptors into downstream models, and another that leverages the surrogate’s internal hidden representations instead. Across a diverse set of chemical prediction tasks, we find that hidden representations often outperform QM descriptors, particularly when descriptor selection is not tightly aligned with the downstream task. Only for extremely small datasets, or when using carefully selected, task-specific descriptors, do the predicted descriptor values yield better performance. Our findings highlight that the hidden space of surrogate models captures rich, transferable chemical information, offering a robust and efficient alternative to explicit descriptor use. We recommend this strategy for building data-efficient models in predictive chemistry, especially when feature importance analysis is not a primary goal.
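To make the comparison concrete, the sketch below illustrates the two featurization strategies. It is a minimal illustration, not the implementation used in this study: the fingerprint-style input, layer sizes, descriptor count, and random data are all assumptions chosen for brevity.

```python
# Minimal sketch contrasting the two strategies: feeding a surrogate's
# predicted QM descriptors vs. its internal hidden representation into a
# downstream model. All shapes and the random inputs are illustrative.
import torch
import torch.nn as nn

class Surrogate(nn.Module):
    """Maps a molecular input vector (e.g. a fingerprint) to QM descriptors."""
    def __init__(self, n_in=2048, n_hidden=256, n_desc=12):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.head = nn.Linear(n_hidden, n_desc)  # predicts descriptor values

    def forward(self, x):
        h = self.encoder(x)           # hidden representation
        return self.head(h), h        # predicted QM descriptors and hidden state

surrogate = Surrogate()
# ... assume the surrogate was pre-trained on abundant QM descriptor data ...

x = torch.randn(8, 2048)              # stand-in for 8 featurized molecules
with torch.no_grad():
    desc_pred, hidden = surrogate(x)

# Strategy 1: downstream model consumes the predicted QM descriptors (12-dim).
# Strategy 2: downstream model consumes the hidden representation (256-dim).
downstream_desc = nn.Linear(desc_pred.shape[1], 1)
downstream_hidden = nn.Linear(hidden.shape[1], 1)
y_from_desc = downstream_desc(desc_pred)
y_from_hidden = downstream_hidden(hidden)
```

The design choice being probed is exactly this fork: Strategy 1 bottlenecks the downstream model through a small set of human-chosen descriptors, while Strategy 2 reuses the surrogate's higher-dimensional hidden state, which may carry chemical information beyond the chosen descriptors.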