A case study on hybrid machine learning and quantum-informed modelling for solubility prediction of drug compounds in organic solvents

Abstract

Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into machine learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several novel ways. A direct comparison between COSMOtherm and openCOSMO revealed consistent hybrid augmentation across COSMO engines and enhanced reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. Four major descriptor sets (MOE, Mordred, RDKit descriptors, and Morgan fingerprints) were compared, offering the first descriptor-level assessment of how augmentation with COSMO-RS-calculated solubility interacts with diverse chemical feature spaces. Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
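The validation ideas named in the abstract (hybrid augmentation, leave-one-solute-out cross-validation, and Y-scrambling) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic, the "COSMO-RS estimate" is simulated as the target plus noise, ordinary least squares stands in for the XGBoost, Random Forest, and Support Vector Machine models used in the study, and all function and variable names are hypothetical. It is not the authors' pipeline.

```python
import numpy as np


def loso_splits(solute_ids):
    """Yield (train_idx, test_idx), holding out all rows of one solute at a time."""
    solute_ids = np.asarray(solute_ids)
    for solute in np.unique(solute_ids):
        yield np.where(solute_ids != solute)[0], np.where(solute_ids == solute)[0]


def fit_predict_linear(X_train, y_train, X_test):
    """Least squares with intercept; a simple stand-in for the study's ML models."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef


def loso_rmse(X, y, solute_ids):
    """Pooled RMSE over the predictions for every held-out solute."""
    preds = np.empty_like(y, dtype=float)
    for train, test in loso_splits(solute_ids):
        preds[test] = fit_predict_linear(X[train], y[train], X[test])
    return float(np.sqrt(np.mean((preds - y) ** 2)))


def y_scramble_rmses(X, y, solute_ids, n_rounds=20, seed=0):
    """Repeat the validation with randomly permuted targets (Y-scrambling)."""
    rng = np.random.default_rng(seed)
    return [loso_rmse(X, rng.permutation(y), solute_ids) for _ in range(n_rounds)]


# Synthetic data set (assumption): 12 solutes x 5 solvents = 60 rows.
rng = np.random.default_rng(42)
n_solutes, n_solvents = 12, 5
solute_ids = np.repeat(np.arange(n_solutes), n_solvents)
solvent_ids = np.tile(np.arange(n_solvents), n_solutes)

# Placeholder descriptors: three per solute, one per solvent.
solute_desc = rng.normal(size=(n_solutes, 3))[solute_ids]
solvent_desc = rng.normal(size=(n_solvents, 1))[solvent_ids]
descriptors = np.column_stack([solute_desc, solvent_desc])

# Target with a component the descriptors cannot explain.
hidden = rng.normal(size=len(solute_ids))
y = 0.5 * descriptors[:, 0] - 0.8 * descriptors[:, 3] + hidden

# Simulated physics-based estimate: correlated with the target but noisy.
cosmo_estimate = y + rng.normal(scale=0.3, size=len(y))

# Hybrid augmentation: append the physics-based estimate as one extra feature.
hybrid = np.column_stack([descriptors, cosmo_estimate])

baseline_rmse = loso_rmse(descriptors, y, solute_ids)
hybrid_rmse = loso_rmse(hybrid, y, solute_ids)
scrambled_rmses = y_scramble_rmses(hybrid, y, solute_ids)

print(f"descriptor-only RMSE: {baseline_rmse:.3f}")
print(f"hybrid RMSE:          {hybrid_rmse:.3f}")
print(f"best scrambled RMSE:  {min(scrambled_rmses):.3f}")
```

In this toy setting the hybrid feature set gives a lower error than the descriptors alone, and the scrambled-target runs perform clearly worse than the real hybrid model, which is the pattern Y-scrambling is meant to check for. Grouping the split by solute keeps every row of the held-out compound out of training, so the score reflects prediction for unseen solutes rather than unseen solvent pairings.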

Article information

Article type: Paper
Submitted: 10 Oct 2025
Accepted: 19 Dec 2025
First published: 07 Jan 2026
This article is Open Access
Creative Commons BY license

Digital Discovery, 2026, Advance Article

W. Wang, I. Cooley, M. R. Alexander, R. D. Wildman, A. K. Croft and B. F. Johnston, Digital Discovery, 2026, Advance Article, DOI: 10.1039/D5DD00456J

This article is licensed under a Creative Commons Attribution 3.0 Unported Licence. You can use material from this article in other publications without requesting further permissions from the RSC, provided that the correct acknowledgement is given.
