Predicting aqueous and organic solubilities with machine learning: a workflow for identifying organic cosolvents

Abstract

Predictive models of solubility can accelerate solvent selection in applications ranging from the electrochemical conversion of organics to pharmaceutical drug development. Herein, we report a machine learning (ML) workflow for identifying organic cosolvents that increase the concentration of hydrophobic molecules in aqueous mixtures. This task is of particular interest for the electrocatalytic conversion of biomass and bio-oils into sustainable fuels, which is hindered by the low aqueous solubility of the feedstock. The workflow has two steps. First, we predict the miscibility of each candidate cosolvent in water and retain only the miscible ones. Second, we rank the retained cosolvents by the predicted solubility of the molecule of interest in them. To achieve this, we train two separate ML models: one on the AqSolDB dataset to predict aqueous solubility, and another on the BigSolDB dataset to predict solubility in organic solvents. After comparing different ML models and features, we select the Light Gradient Boosting Machine (LGBM) architecture for both the aqueous solubility predictions (test R² = 0.864, RMSE = 0.851 for log(S), with S in mol dm−3) and the organic solubility predictions (test R² = 0.805, RMSE = 0.511 for log(x)). We examine the generalizability of the organic solubility model to unseen solutes both quantitatively and qualitatively. We evaluate the utility of this ML workflow by identifying cosolvents for benzaldehyde and limonene, two hydrophobic molecules that are relevant for sustainable fuel production, and validate our predictions with experimental solubility measurements.
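
The two-step screening logic described in the abstract (filter candidate cosolvents by predicted water miscibility, then rank the survivors by predicted solubility of the target molecule) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `rank_cosolvents`, the two predictor callables, the labels `solvent_A`, `solvent_B`, `solvent_C` and `solute_X`, and all numeric values are hypothetical placeholders. In practice the callables would presumably wrap the trained aqueous and organic solubility models.

```python
from typing import Callable, List, Tuple


def rank_cosolvents(
    solute: str,
    candidates: List[str],
    predict_miscible: Callable[[str], bool],
    predict_log_x: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Keep water-miscible candidates, then sort by predicted solubility."""
    # Step 1: discard cosolvents predicted to be immiscible with water.
    miscible = [s for s in candidates if predict_miscible(s)]
    # Step 2: score each remaining cosolvent by the predicted log(x)
    # of the solute in it, and rank from highest to lowest.
    scored = [(s, predict_log_x(solute, s)) for s in miscible]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-ins for trained models. The values are arbitrary
# placeholders chosen for illustration, not measured or predicted data.
TOY_MISCIBLE = {"solvent_A": True, "solvent_B": True, "solvent_C": False}
TOY_LOG_X = {"solvent_A": -0.5, "solvent_B": -0.3, "solvent_C": -0.1}


def toy_predict_miscible(solvent: str) -> bool:
    return TOY_MISCIBLE[solvent]


def toy_predict_log_x(solute: str, solvent: str) -> float:
    # The toy lookup ignores the solute; a real model would not.
    return TOY_LOG_X[solvent]


ranking = rank_cosolvents(
    "solute_X", list(TOY_MISCIBLE), toy_predict_miscible, toy_predict_log_x
)
for solvent, log_x in ranking:
    print(f"{solvent}: predicted log(x) = {log_x}")
```

In this toy run, `solvent_C` has the highest placeholder solubility but is excluded in the first step because it is marked immiscible, which is why the miscibility filter precedes the ranking.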

Article information

Article type: Paper
Submitted: 07 May 2025
Accepted: 11 Sep 2025
First published: 15 Sep 2025
This article is Open Access
Creative Commons BY license

Digital Discovery, 2025, Advance Article

M. Krzyżanowski, S. M. Aishee, N. Singh and B. R. Goldsmith, Digital Discovery, 2025, Advance Article, DOI: 10.1039/D5DD00134J

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
