Out-of-distribution evaluation of active learning pipelines for molecular property prediction
Abstract
Active learning (AL) has been widely applied as a strategy to reduce the data requirements of training machine learning models. Such a strategy can be especially valuable in fields where data collection is costly or time-consuming, as is the case for molecular property data. In this study, we evaluate AL for molecular property prediction, focusing on performance on out-of-distribution (OOD) data. This OOD evaluation framework mimics the scenario encountered in real-world applications but has been understudied in prior literature. We focus on the prediction of solvation energy from molecular structure and develop an AL framework based on prediction uncertainties derived from Evidential Deep Learning (EDL). We begin by training our model on an in-distribution training dataset and progressively augment it with molecules from an OOD dataset sampled from PubChem, selected either randomly or using the AL strategy. We further examine the generalization capabilities of AL by beginning with a subset of the in-distribution dataset, intentionally chosen to reduce initial diversity. Our results indicate that EDL offers an advantage over random sampling. To further understand the behavior of the AL algorithm, we analyze how the similarity between the training dataset and the held-out dataset affects AL performance, and examine the distributional differences between the types of molecules selected by random sampling and by AL.
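
The abstract describes an acquisition step in which pool molecules are ranked by EDL-derived uncertainty. The sketch below illustrates one plausible form of such a step, assuming a deep evidential regression model that outputs Normal-Inverse-Gamma parameters (nu, alpha, beta) per candidate; the function names and the use of the epistemic variance beta / (nu * (alpha - 1)) as the acquisition score are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of one EDL-based acquisition step (not the authors' code).
import numpy as np


def edl_epistemic_uncertainty(nu: np.ndarray, alpha: np.ndarray, beta: np.ndarray) -> np.ndarray:
    """Epistemic variance Var[mu] = beta / (nu * (alpha - 1)) for each pool molecule."""
    return beta / (nu * (alpha - 1.0))


def select_batch(pool_outputs: dict, batch_size: int) -> np.ndarray:
    """Return indices of the `batch_size` pool molecules with the largest epistemic uncertainty."""
    u = edl_epistemic_uncertainty(pool_outputs["nu"], pool_outputs["alpha"], pool_outputs["beta"])
    return np.argsort(u)[-batch_size:]


# Toy example: a pool of five candidate molecules with made-up evidential parameters.
pool_outputs = {
    "nu": np.array([1.2, 0.8, 2.0, 0.5, 1.0]),
    "alpha": np.array([2.5, 1.8, 3.0, 1.5, 2.0]),
    "beta": np.array([0.4, 0.9, 0.3, 1.2, 0.6]),
}
print(select_batch(pool_outputs, batch_size=2))  # indices of the two most uncertain molecules
```

In a full AL loop, the selected molecules would be labeled (here, with solvation energies), added to the training set, and the model retrained before the next acquisition round; random sampling replaces the uncertainty ranking with a uniform draw from the pool.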
