Weiling Wang,a Isabel Cooley,b Morgan R. Alexander,c Ricky D. Wildman,d Anna K. Croftb and Blair F. Johnston*a
aCMAC, University of Strathclyde, Glasgow, G1 1RD, UK. E-mail: weiling.wang@strath.ac.uk; blair.johnston@strath.ac.uk
bDepartment of Chemical Engineering, Loughborough University, Loughborough, LE11 3TU, UK
cSchool of Pharmacy, University of Nottingham, University Park, Nottingham, NG7 2RD, UK
dCentre for Additive Manufacturing, University of Nottingham, Nottingham, NG7 2RD, UK
First published on 7th January 2026
Solubility is a physicochemical property that plays a critical role in pharmaceutical formulation and processing. While COSMO-RS offers physics-based solubility estimates, its computational cost limits large-scale application. Building on earlier attempts to incorporate COSMO-RS-derived solubilities into Machine Learning (ML) models, we present a substantially expanded and systematic hybrid QSAR framework that advances the field in several ways. A direct comparison between COSMOtherm and openCOSMO showed that hybrid augmentation is consistent across COSMO engines and enhances reproducibility. Three widely used ML algorithms, eXtreme Gradient Boosting, Random Forest, and Support Vector Machine, were benchmarked under both 10-fold and leave-one-solute-out cross-validation. A comparison of four major descriptor sets, MOE, Mordred, and RDKit descriptors and Morgan Fingerprints, offers the first descriptor-level assessment of how COSMO-RS-calculated solubility augmentation interacts with diverse chemical feature spaces. Y-scrambling was conducted to confirm that the hybrid improvements are genuine and not artefacts of dimensionality. SHAP-based feature analysis further revealed substructural patterns linked to solubility, providing interpretability and mechanistic insight. This study demonstrates that combining physics-informed features with robust, interpretable ML algorithms enables scalable and generalisable solubility prediction, supporting data-driven pharmaceutical design.
Machine Learning (ML) offers a complementary route by learning structure–property patterns from data. Once trained on an appropriate feature set, ML models can deliver predictions at lower computational cost. When the feature set is augmented with COSMO-RS-derived descriptors, the resulting models not only improve predictive accuracy but also preserve a degree of physics-based interpretability. More broadly, ML has emerged as a powerful tool for property prediction in chemistry and materials science.8,9 It can effectively capture complex links between structure, composition, and properties, allowing the design of targeted compounds and even the generation of novel materials.10–13 Within this domain, Quantitative Structure–Property Relationship (QSPR) and Quantitative Structure–Activity Relationship (QSAR) approaches serve as powerful frameworks that correlate molecular structure with experimentally measured physicochemical or material properties through linear or nonlinear modelling based on various descriptors.14–18 Applications span the prediction of solubility,19–21 boiling point,16,22,23 polarisability,24,25 and viscosity,26–28 while offering insights into the contributions of specific functional groups and structural motifs.29–31 These approaches thereby support the rational design and optimisation of molecules and materials across chemical, pharmaceutical, and materials sciences.
Given the high dimensionality introduced by molecular descriptors and fingerprints, modelling efforts often face the “curse of dimensionality”, where available data sparsely samples the chemical space.32 To address this challenge, we benchmarked three widely used, non-linear algorithms well suited to sparse, high-dimensional problems: eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Machine (SVM).
Although the aqueous solubility of drug compounds has been widely investigated, studies on solubility in organic solvents remain comparatively limited.33–37 In this study, we present a QSAR-based solubility prediction framework tailored for drug-like compounds in organic solvents. Extending previous work that used RF with Molecular Operating Environment (MOE)33 descriptors, we expanded the modelling framework to include XGBoost and SVM. Multiple descriptor types are systematically compared: MOE,38 Mordred descriptors,39 RDKit descriptors,40 Morgan Fingerprints,41 and COSMO-RS predictions from COSMOtherm and openCOSMO-RS. Mordred descriptors were computed with the Mordred cheminformatics package, while RDKit descriptors and Morgan Fingerprints were generated with the RDKit toolkit. COSMO-RS simulated solubility is incorporated as an auxiliary feature rather than used standalone, following evidence that hybrid models outperform both descriptor-only ML and COSMO-RS alone.
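To make the descriptor-generation step concrete, the sketch below shows how the open-source descriptor sets can be computed from SMILES strings. The example molecules, the Morgan radius of 2, and the 2048-bit length are illustrative assumptions rather than the exact settings used in this work, and MOE descriptors are omitted because they require commercial software.

```python
# Minimal sketch (illustrative settings): generating three of the descriptor
# sets from SMILES with open-source tools.
from rdkit import Chem
from rdkit.Chem import Descriptors, rdFingerprintGenerator
from mordred import Calculator, descriptors
import pandas as pd

smiles_list = ["CCO", "CC(=O)Oc1ccccc1C(=O)O"]          # e.g. ethanol, aspirin
mols = [Chem.MolFromSmiles(s) for s in smiles_list]

# RDKit physicochemical descriptors (one column per descriptor)
rdkit_desc = pd.DataFrame(
    [{name: fn(m) for name, fn in Descriptors.descList} for m in mols]
)

# Mordred descriptors (2D only here; ignore_3D avoids conformer generation)
mordred_calc = Calculator(descriptors, ignore_3D=True)
mordred_desc = mordred_calc.pandas(mols)

# Morgan fingerprints (assumed radius 2, 2048 bits)
fp_gen = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
morgan_bits = pd.DataFrame([list(fp_gen.GetFingerprint(m)) for m in mols])
```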
Beyond accuracy, interpretability is emphasised through SHapley Additive exPlanations (SHAP),42 which decompose predictions into feature contributions and highlight substructural motifs influencing solubility. This not only elucidates the contribution of individual input features to model predictions but also highlights the factors most critical in determining solubilities under the QSAR framework. Among the selected algorithms, XGBoost,43 a regularised boosting algorithm, sequentially optimises decision trees to correct residual errors and has demonstrated strong performance in various molecular prediction tasks,44–46 including feature reconstruction tasks such as Raman spectra,47 Near-Infrared (NIR) spectra,48 and Infrared (IR) spectra prediction.49 In contrast, RF aggregates independently built decision trees using bootstrap resampling, providing robustness and stability but lacking the iterative refinement of boosting. SVM uses kernel functions to project input data into higher dimensions, offering strong performance for high-dimensional and small-sample problems; however, its performance can be sensitive to parameter selection, and its scalability may be limited for extensive datasets. These three algorithms are frequently selected and compared in cheminformatics studies for their robust predictive capabilities.
Several recent works highlight their utility. Kim et al.50 predicted the antioxidant activity of 2,2-diphenyl-1-picrylhydrazyl (DPPH) using XGBoost, RF, and SVM with RDKit descriptors. Qu et al.51 compared XGBoost, RF, SVM, and K-Nearest Neighbour (KNN) for retention time prediction of proteolysis-targeting chimeras, using fingerprints, physicochemical descriptors, along with chromatographic-condition features. Ghuriani et al.52 developed an XGBoost-driven biomarker identification pipeline that fed selected features into RF, SVM, and logistic regression for cancer prediction from gene sequencing. Danishuddin et al.53 benchmarked XGBoost, RF, SVM, and Multi-Layer Perceptron (MLP) to model HIV-1 integrase inhibitor activity, comparing PaDEL and RDKit descriptors with ECFP4 fingerprints (Morgan Fingerprints). Collectively, these studies underscore the prominence of XGBoost, RF, and SVM as standard benchmarks in cheminformatics, where they are routinely employed in parallel to assess predictive performance under varying descriptor types and molecular contexts, supporting their selection here for solubility modelling.
Recent advances have emphasised hybrid strategies that integrate physics-based knowledge into ML frameworks, aiming to preserve mechanistic interpretability while enhancing predictive scalability. Beyond descriptors and fingerprints, the output of physics-based models is increasingly incorporated as auxiliary features to enrich the input space. For example, Vassileiou et al.33 showed that including COSMOtherm solubility predictions as features improved the RF drug solubility models. Xiong et al.54 combined first-principles descriptors generated via Multiwfn with conventional descriptors to predict flotation behaviour, while Lu et al.55 embedded quantum-derived electronic features into deep learning for the prediction of drug–drug interaction.
These studies collectively demonstrate that augmenting data-driven QSAR/QSPR with physics-based descriptors yields models that are both more predictive and mechanistically interpretable. Although the trends identified align with established chemical intuition, the strength of this framework lies in its ability to validate and generalise these relationships systematically across hundreds of solute–solvent pairs. Building on this foundation, we designed a workflow that systematically integrates diverse descriptor types with COSMO-RS features to benchmark ML algorithms for solubility prediction, as shown in Fig. 1.
The dataset used in this study is based on the solubility collection compiled by Vassileiou et al.,33 which combines experimental measurements extracted from the literature with their in-house data. The dataset contains 714 solubility measurements at room temperature, covering 75 organic solutes and 49 solvents. The solutes span a chemically diverse set of drug-like small molecules, making the dataset representative of pharmaceutical formulation challenges.
For comparison, Sodaei et al.56 integrated MD-derived properties with ML and reported a gradient boosting model that achieved an R2 of 0.87 and an RMSE of 0.537 for aqueous drug solubility prediction under 10-fold CV. Alqarni et al.57 employed one-hot encoded solvents, temperature, and mass fraction as inputs to predict rivaroxaban solubility, where the light gradient boosting model achieved the best performance with an R2 of 0.988 and an RMSE of 9.13 × 10−5 under 5-fold CV. Jiang et al.58 utilised temperature and pressure to model the solubility of nonsteroidal anti-inflammatory drugs in green solvents, achieving an R2 of 0.987 and an RMSE of 13.7 under 10-fold CV using AdaBoost with Gaussian Process Regression (ADA–GPR).
It is important to note that these studies were trained on different datasets, often restricted to a single solvent system or a narrow chemical domain. As a result, direct numerical comparisons to our results can be misleading, since dataset composition and chemical diversity strongly influence apparent model accuracy. Moreover, most of these works rely on n-fold CV, which typically yields higher apparent performance than LOSO while providing a weaker measure of generalisability.
By contrast, the focus of our work lies in integrating diverse cheminformatics descriptors with multiple ML frameworks and providing physicochemical interpretation across a broader range of drug-like molecules and organic solvents, rather than solely pursuing the highest apparent predictive accuracy.
Our integrated framework benchmarked descriptor sets, algorithms, and hybrid strategies, while linking substructure-level contributions to solubility behaviour. The remainder of this paper presents the dataset and descriptors (Section 2.1), COSMO-RS method (Section 2.2), ML modelling and evaluation procedures (Section 2.3), predictive performance and interpretability outcomes (Section 3.1), and concludes with broader implications for data-driven pharmaceutical design (Section 3.2).
Solubility arises from the interactions between solute and solvent, and cannot be explained solely by the properties of either in isolation. A solute that dissolves readily in a polar solvent can be expected to display very low solubility in a non-polar organic solvent with contrasting chemical characteristics. It is therefore important, when examining the dataset, to define the chemical space spanned by the solvents, so that any conclusions about solute features contributing to solubility are interpreted within the context of solvent type. The dataset used in this work was originally designed to include organic solvents,33 which generally display high lipophilicity and low hydrophilicity, participating in favourable interactions with non-polar organic solutes. The octanol/water partition coefficient, log P(octanol/water), is a widely used measure of lipophilicity/hydrophilicity. In this work, it was calculated using the MOE log P(o/w) descriptor, a fragment-based method trained on experimental data. Higher log P(o/w) values indicate greater lipophilicity and lower polarity, whereas lower values reflect greater hydrophilicity. A compound with a log P(o/w) of 0 would partition equally between an octanol and a water phase, with positive log P(o/w) values favouring octanol (hydrophobic) and negative log P(o/w) values favouring water (hydrophilic). In our dataset, solvent log P(o/w) values range from −1.1 to 7.8. In total, 8 solvents have log P(o/w) below 0, while the other 41 have log P(o/w) above 0. The solvents are predominantly lipophilic, although the presence of several low-log P solvents introduces chemical diversity that complicates simple trends in solute–solvent solubility relationships.

As introduced at the end of Section 1, the 714 solute–solvent pairs consist of 75 organic solutes and 49 solvents. The 12 most commonly occurring solvent molecules in this dataset each appear in at least 20 solvent/solute mixtures and together account for 381 total mixtures, over half of the dataset. These are ethanol, methanol, ethyl acetate, 2-propanol, acetone, 1-butanol, acetonitrile, chloroform, water, 1-propanol, 1-octanol and tetrahydrofuran. Thus, the range includes water and a number of relatively small organic molecules, all containing some functionality that can contribute to dipolar or hydrogen bonds. Some of the longer-chain molecules among this group have lipophilic characteristics, particularly 1-octanol, in which polar compounds are only sparingly soluble, which is used to define log P(o/w), and which itself has a log P(o/w) of 2.8. However, there is enough polar functionality among the set of solvents that solvent polarity must also be considered when drawing conclusions about trends in solute structure.
After applying all filtering steps described above, the final dimensionalities reported here refer to the combined solute–solvent descriptor space used as input to the ML models. In total, the retained descriptors comprised 357 MOE features, 322 RDKit features, 2439 Mordred features, and 875 informative Morgan Fingerprint bits. These dimensionalities therefore reflect the sum of all filtered solute and solvent descriptors entering the final models. Although the Mordred representation remains larger than the other descriptor families, its effective dimensionality was substantially reduced from the initial raw feature set, and all retained descriptors passed the missingness, variance, and stability criteria. Moreover, all models were trained exclusively under rigorous CV schemes, which provide a strong safeguard against overfitting. The differences in predictive performance between descriptor families therefore cannot be attributed solely to the number of retained features.
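As a minimal sketch of the missingness and variance criteria described above, the function below drops unusable columns from a descriptor table. The stability filter and the exact thresholds used in the study are not reproduced here; the cut-off values are assumed for illustration.

```python
# Sketch (assumed thresholds) of descriptor filtering by missingness and
# variance; the study's stability criterion is not shown.
import pandas as pd

def filter_descriptors(X: pd.DataFrame,
                       max_missing_frac: float = 0.05,
                       min_variance: float = 1e-8) -> pd.DataFrame:
    """Drop columns with too many missing values or (near-)zero variance."""
    X = X.apply(pd.to_numeric, errors="coerce")        # non-numeric -> NaN
    X = X.loc[:, X.isna().mean() <= max_missing_frac]  # missingness criterion
    X = X.loc[:, X.var(skipna=True) > min_variance]    # variance criterion
    return X
```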
It is also worth noting that all selected modelling approaches (XGBoost, RF, and SVM) are widely recognised for their robustness in high-dimensional settings owing to intrinsic regularisation, non-linear feature compression, and resistance to overfitting. Together with the descriptor filtering steps and the use of strict CV, this ensures that the observed differences in predictive performance cannot be attributed solely to the size of the feature space. To further validate that the models learn genuine chemical signal rather than spurious correlations arising from high dimensionality, we performed Y-scrambling with B = 200 permutations for every descriptor family and modelling approach. As shown in S2-SI, in all cases the scrambled models collapsed to chance level, and the probability that the original model performance could be reproduced under randomly permuted targets was p = 0.005. These results confirm that the input features contain statistically significant information, and that the hybrid ML-physics models do not rely on spurious correlations introduced by descriptor dimensionality.
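A minimal sketch of such a Y-scrambling test is shown below; the model and CV routine are placeholders for those used in the study. Note that the +1-corrected permutation p-value has a floor of 1/(B + 1) ≈ 0.005 for B = 200, consistent with the value reported above.

```python
# Sketch of Y-scrambling (target permutation): the target is permuted B times,
# the model is re-evaluated each time, and the permutation p-value counts how
# often a scrambled model matches or exceeds the true-model R2.
import numpy as np
from sklearn.model_selection import cross_val_score

def y_scrambling_pvalue(model, X, y, true_r2, B=200, seed=0):
    rng = np.random.default_rng(seed)
    scrambled_r2 = np.empty(B)
    for b in range(B):
        y_perm = rng.permutation(y)                 # break the X-y association
        scrambled_r2[b] = cross_val_score(
            model, X, y_perm, cv=10, scoring="r2"
        ).mean()
    # permutation p-value with +1 correction (minimum 1/(B+1))
    p = (np.sum(scrambled_r2 >= true_r2) + 1) / (B + 1)
    return scrambled_r2, p
```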
In the interest of reproducibility, open science and the FAIR (Findable, Accessible, Interoperable and Reusable) data principles, we investigated the use of the open source openCOSMO-RS software61,62 to generate the physics-based features. Our openCOSMO-RS workflow involved accounting for the influence of multiple conformers as described by Klamt63 and has been previously used by Schindl et al.64 We calculated solubility iteratively from COSMO-RS activity coefficients using infinite dilution as an initial guess, as described elsewhere.64,65 Solubility calculations were performed at 298.15 K, and for molecules whose melting point was above working temperature, the free energy of fusion was calculated from literature experimental melting point and enthalpy of fusion values taken from the original dataset of Vassileiou et al.4,33 Newly generated openCOSMO-RS results were benchmarked against the original COSMOtherm data33 before being used as inputs for each QSAR model.
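The iterative solid–liquid equilibrium (SLE) calculation can be sketched as below. The activity_coefficient function is a placeholder standing in for the openCOSMO-RS call; this is an illustration of the fixed-point scheme described above, not the exact implementation.

```python
# Schematic of the iterative SLE solubility calculation: solve
# ln(x * gamma(x)) = -dG_fus / (R T) by fixed-point iteration, starting from
# the activity coefficient at infinite dilution.
import numpy as np

R = 8.314  # gas constant, J mol^-1 K^-1

def sle_solubility(activity_coefficient, dg_fus, T=298.15,
                   max_iter=100, tol=1e-8):
    """Return the saturation mole fraction of the solute at temperature T."""
    rhs = np.exp(-dg_fus / (R * T))
    x = rhs / activity_coefficient(0.0)       # infinite-dilution initial guess
    for _ in range(max_iter):
        x_new = min(rhs / activity_coefficient(x), 1.0)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x
```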
Given the nature of this work, it is important to note that our current solute–solvent dataset is already highly challenging, containing substantial chemical diversity and many intrinsically difficult cases. Expanding to external datasets such as AqSolDB66 or BigSolDB67 would require COSMO-RS calculations for every associated solute–solvent pair, which is computationally prohibitive at present. As our models rely on COSMO-RS-derived solubility inputs, such benchmarking is deferred to future work once large-scale COSMO-RS data become available.
Instead, a more efficient strategy is to perform targeted validation on a representative subset of compounds drawn from these larger datasets. This allows us to benchmark model accuracy externally without undertaking full COSMO-RS enumeration. The results will be discussed in detail in Section 3.1.5.
To minimise hardware-induced variability, XGBoost was run in CPU-only mode (CUDA_VISIBLE_DEVICES = −1, device = ‘cpu’) using the histogram tree method (tree_method = ‘hist’). OpenMP parallelism was restricted to eight threads (OMP_NUM_THREADS = 8), and the regressor was configured with eight worker threads (n_jobs = 8) while CV fits were run sequentially (RandomizedSearchCV(n_jobs = 1)). These settings control the degree of parallelism and help stabilise floating-point round-off behaviour, so that repeated runs on different machines produce numerically comparable CV metrics.
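A sketch of these reproducibility settings is given below; the hyperparameter search space and seed values are illustrative placeholders, while the threading and device options mirror those listed above.

```python
# Sketch of the CPU-only, thread-limited XGBoost configuration described above.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"   # force CPU-only execution
os.environ["OMP_NUM_THREADS"] = "8"         # cap OpenMP parallelism

from xgboost import XGBRegressor
from sklearn.model_selection import RandomizedSearchCV

xgb = XGBRegressor(
    tree_method="hist",    # histogram-based tree construction
    device="cpu",
    n_jobs=8,              # eight worker threads for the regressor itself
    random_state=42,
)

search = RandomizedSearchCV(
    xgb,
    param_distributions={"max_depth": [3, 5, 7],        # illustrative grid
                         "learning_rate": [0.01, 0.05, 0.1]},
    n_iter=10,
    cv=10,
    n_jobs=1,              # CV fits run sequentially, as in the text
    random_state=42,
)
```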
Model inputs included molecular descriptors derived from one of four descriptor sets: MOE descriptors, RDKit descriptors, Mordred descriptors, or Morgan Fingerprints. COSMO-RS-predicted solubility was incorporated as an additional input feature in hybrid models. Model performance was evaluated using 10-fold CV and Leave-One-Solute-Out (LOSO) CV, with results reported using R2, RMSE, and MAE, averaged across folds. LOSO ensures that no solute appears in both training and test folds, preventing solute memorisation and reducing leakage from solute recurrence.
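A minimal sketch of the LOSO scheme using scikit-learn's LeaveOneGroupOut is shown below. Here predictions are pooled across folds for brevity, whereas the reported metrics are averaged across folds; X, y, and solute_ids are placeholders for the assembled feature matrix, targets, and solute labels.

```python
# Sketch of Leave-One-Solute-Out CV: each fold holds out every measurement of
# a single solute, so no solute appears in both training and test folds.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def loso_evaluate(model, X, y, solute_ids):
    logo = LeaveOneGroupOut()
    y_true, y_pred = [], []
    for train_idx, test_idx in logo.split(X, y, groups=solute_ids):
        model.fit(X[train_idx], y[train_idx])
        y_true.append(y[test_idx])
        y_pred.append(model.predict(X[test_idx]))
    y_true, y_pred = np.concatenate(y_true), np.concatenate(y_pred)
    return {"R2": r2_score(y_true, y_pred),
            "RMSE": np.sqrt(mean_squared_error(y_true, y_pred)),
            "MAE": mean_absolute_error(y_true, y_pred)}
```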
To quantify descriptor importance and generate heatmaps, SHAP values were computed. For XGBoost regressors, we used TreeExplainer in interventional mode with a fixed background of 100 samples. For SVM regressors, SHAP values were obtained with KernelExplainer using a K-means background (≤ 10 clusters), with perturbation sampling limited for efficiency and l1_reg (num_features(10)) applied to stabilise attributions. Random Forest models were not analysed with SHAP due to weaker predictive performance and limited interpretability within this framework.
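The SHAP set-up can be sketched as follows; the fitted models (xgb_model, svm_model) and data splits are placeholders, while the background sizes and l1_reg setting mirror those described above.

```python
# Sketch of the SHAP explainers used for the tree-based and kernel models.
import shap

# XGBoost: interventional TreeExplainer with a fixed 100-sample background
background = shap.utils.sample(X_train, 100, random_state=0)
tree_explainer = shap.TreeExplainer(
    xgb_model, data=background, feature_perturbation="interventional"
)
shap_values_xgb = tree_explainer.shap_values(X_test)

# SVM: KernelExplainer with a K-means-summarised background (<= 10 clusters),
# limited perturbation sampling, and l1 regularisation of the attributions
kmeans_background = shap.kmeans(X_train, 10)
kernel_explainer = shap.KernelExplainer(svm_model.predict, kmeans_background)
shap_values_svm = kernel_explainer.shap_values(
    X_test, nsamples=200, l1_reg="num_features(10)"
)
```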
The top 20 most influential descriptors were visualised as a heatmap and exported for further analysis, highlighting the physicochemical properties that predominantly impact predicted solubility. For fingerprint-based models using Morgan Fingerprints, we extended the analysis by mapping top-ranked fingerprint bits to substructures using RDKit.40 Substructures corresponding to high-ranking bits were extracted and visualised across all activating solutes and solvents, enabling interpretation of fragment-level contributions. Additionally, bit-wise SHAP heatmaps were generated, offering a chemically intuitive view of substructure importance.
| Version | CV | R2 | RMSE | MAE |
|---|---|---|---|---|
| Original R | 10-Fold | 0.783 | 0.553 | 0.363 |
| Rewritten Python | 10-Fold | 0.785 | 0.551 | 0.364 |
| Original R | LOSO | 0.562 | 0.786 | 0.590 |
| Rewritten Python | LOSO | 0.556 | 0.791 | 0.581 |
| COSMO model | R2 | RMSE | MAE |
|---|---|---|---|
| OpenCOSMO | −0.098 | 1.243 | 0.940 |
| COSMOtherm | 0.314 | 0.983 | 0.707 |
The standalone performance of openCOSMO was poor (R2 = −0.098, RMSE > 1), indicating large deviations from the experimental solubility values. In contrast, COSMOtherm showed stronger predictive accuracy on the same dataset (R2 = 0.314), although this still reflects limited standalone reliability. The openCOSMO-RS software is newer than the established COSMOtherm and is still under development; although its latest parameterisation72 is less accurate than COSMOtherm, its performance is improving. The solute molecules in this dataset for which openCOSMO-RS displays the poorest agreement with experiment tend to be larger molecules featuring more complex ring systems than those for which agreement is best. They also contain a wider variety of polar and hydrogen-bonding groups, and are more likely to contain less common atoms, such as S or P, which featured less heavily than C, O, and N in the latest openCOSMO-RS parameterisation set.

When combined with machine learning models, both COSMO sources contributed to improved hybrid predictions; in particular, models using openCOSMO features still benefit from error correction via ML, narrowing the performance gap with COSMOtherm-based hybrids. Default settings were used for the RF model. The COSMO-RS-predicted log solubility was incorporated as an auxiliary numerical feature, providing a physically grounded, quantum-chemistry-derived summary of the underlying solvation thermodynamics for each solute–solvent pair. By combining this physics-based estimate with detailed structure-encoded descriptors, the hybrid models learn the remaining structure–property relationships that COSMO-RS does not capture on its own, thereby systematically improving upon the purely physics-based baseline.

These findings highlight the potential of openCOSMO as a lower-cost, open-access alternative for hybrid solubility prediction, especially when high-fidelity quantum chemical tools are not available. While its standalone accuracy remains limited, integration with ML enables correction of systematic errors, particularly for polar or strongly hydrogen-bonding systems, or other regions of chemical space, where COSMO-based models are less reliable. Accordingly, the relative improvement obtained by including openCOSMO-RS features is more important than its absolute standalone performance. Given its stronger performance and broader data coverage, COSMOtherm is selected as the hybrid COSMO source for subsequent analyses and discussion.
For RF, tuning produced only minor and mixed effects. Under LOSO, performance shifted marginally in the favourable direction (e.g., R2 increased from 0.5578 to 0.5582, although MAE rose from 0.581 to 0.593), whereas under 10-fold CV the tuned model performed slightly worse (e.g., R2 decreased from 0.777 to 0.769). These changes are numerically small and consistent with the well-known stability of RF, which typically operates close to its default optimum. Because tuning alters tree depth and feature-sampling behaviour, modest fluctuations in performance across validation schemes are expected. For consistency with the tuned XGB and SVM baselines, we therefore retain the tuned RF configuration in subsequent comparisons.
In contrast, both XGB and SVM exhibited clear improvements following tuning. For XGB, all metrics improved under both CV strategies (10-fold R2 increased from 0.781 to 0.808; LOSO R2 from 0.506 to 0.555). SVM showed the largest gains overall, particularly under LOSO, where R2 increased from 0.492 to 0.564 and RMSE decreased from 0.846 to 0.784, yielding the strongest out-of-distribution performance among all tuned models. A notable observation is that SVM outperforms XGB under the LOSO protocol, even though their 10-fold CV performance is nearly identical: both models achieve comparable R2 and RMSE values, with XGB showing only a slightly lower MAE. This behaviour is well understood in high-dimensional chemical datasets where the number of available samples is limited relative to the descriptor space. XGB relies on iterative, boosted tree expansions that exploit correlations present within the training folds; when the held-out solute is structurally dissimilar to the training set, these correlations can become unreliable, leading to reduced stability and higher variance in the predictions. In contrast, the SVM with an RBF kernel imposes a stronger inductive bias and heavier regularisation, and the model capacity is controlled primarily by the kernel bandwidth and the regularisation parameter. As a result, SVMs tend to produce smoother decision boundaries and are less sensitive to small fold-to-fold fluctuations in descriptor–property correlations. Under LOSO, where the task explicitly tests out-of-distribution generalisation across different solutes, this regularisation leads to improved robustness, yielding lower RMSE and higher R2 than XGB. Thus, the superior LOSO performance of the tuned SVM reflects its more conservative extrapolative behaviour in hybrid descriptor spaces.
Overall, these trends demonstrate that XGB and SVM benefit substantially from hyperparameter optimisation due to their sensitivity to regularisation strength and margin or learning-rate parameters, whereas RF, which is already strongly regularised by design, remains close to its default optimum. Full results are provided in Table 4.
As mentioned in Section 3.1.2, four solutes were removed, eliminating 27 associated solute–solvent pairs from the openCOSMO-RS calculations. For a fair comparison, both openCOSMO and COSMOtherm predictions were restricted to the subset of entries for which openCOSMO results were available. The hybrid RF results without tuning in Fig. 2 and Table 4 therefore differ from those shown in Table 3.
| COSMO feature | CV | R2 | RMSE | MAE |
|---|---|---|---|---|
| None | 10-Fold | 0.653 | 0.699 | 0.450 |
| None | LOSO | 0.349 | 0.958 | 0.699 |
| OpenCOSMO | 10-Fold | 0.747 | 0.598 | 0.387 |
| OpenCOSMO | LOSO | 0.510 | 0.831 | 0.600 |
| COSMOtherm | 10-Fold | 0.777 | 0.561 | 0.364 |
| COSMOtherm | LOSO | 0.549 | 0.797 | 0.581 |
| Model | Hyperparameters | R2 (LOSO) | R2 (10-Fold) | RMSE (LOSO) | RMSE (10-Fold) | MAE (LOSO) | MAE (10-Fold) |
|---|---|---|---|---|---|---|---|
| RF | Default | 0.558 | 0.777 | 0.789 | 0.561 | 0.581 | 0.369 |
| RF | Tuned | 0.558 | 0.769 | 0.789 | 0.571 | 0.593 | 0.378 |
| XGB | Default | 0.506 | 0.781 | 0.835 | 0.556 | 0.627 | 0.358 |
| XGB | Tuned | 0.555 | 0.808 | 0.792 | 0.520 | 0.575 | 0.332 |
| SVM | Default | 0.492 | 0.707 | 0.846 | 0.643 | 0.615 | 0.432 |
| SVM | Tuned | 0.564 | 0.808 | 0.784 | 0.520 | 0.550 | 0.339 |
Performance values reported in Table 5 correspond to the mean ± standard deviation obtained from N = 20 replicated 10-fold CV runs, each performed using a distinct random seed. This replicated protocol was applied to every model–descriptor combination to provide a fair and statistically robust comparison. In contrast, LOSO-CV is deterministic with respect to data partitioning and therefore yields a single, seed-independent result. The method details are illustrated in S3-SI, and the observed variability across the 20 seeds was consistently small (RMSE, MAE, and R2 differing only in the second decimal place), confirming that hybrid improvements are stable and not driven by stochastic variation in training or fold assignment.
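A minimal sketch of this replicated protocol is shown below; the model, X, and y are placeholders for each tuned model–descriptor combination.

```python
# Sketch of the replicated 10-fold CV protocol: 20 repeats with distinct
# seeds, reporting mean and standard deviation of R2 across repeats.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score

def replicated_cv_r2(model, X, y, n_repeats=20):
    scores = []
    for seed in range(n_repeats):
        cv = KFold(n_splits=10, shuffle=True, random_state=seed)
        scores.append(cross_val_score(model, X, y, cv=cv, scoring="r2").mean())
    return np.mean(scores), np.std(scores)   # reported as mean +/- std
```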
| Model | Descriptors | R2 (LOSO) | R2 (10-Fold) | RMSE (LOSO) | RMSE (10-Fold) | MAE (LOSO) | MAE (10-Fold) |
|---|---|---|---|---|---|---|---|
| RF | MOE | 0.558 | 0.765 ± 0.006 | 0.789 | 0.565 ± 0.007 | 0.593 | 0.381 ± 0.003 |
| RF | RDKit | 0.568 | 0.764 ± 0.006 | 0.780 | 0.567 ± 0.007 | 0.577 | 0.384 ± 0.002 |
| RF | Mordred | 0.567 | 0.756 ± 0.006 | 0.781 | 0.576 ± 0.007 | 0.584 | 0.388 ± 0.003 |
| RF | Morgan | 0.543 | 0.757 ± 0.004 | 0.803 | 0.574 ± 0.005 | 0.576 | 0.394 ± 0.003 |
| XGBoost | MOE | 0.555 | 0.801 ± 0.010 | 0.792 | 0.517 ± 0.012 | 0.575 | 0.337 ± 0.005 |
| XGBoost | RDKit | 0.567 | 0.794 ± 0.011 | 0.781 | 0.527 ± 0.014 | 0.567 | 0.343 ± 0.006 |
| XGBoost | Mordred | 0.613 | 0.791 ± 0.007 | 0.738 | 0.529 ± 0.009 | 0.538 | 0.350 ± 0.005 |
| XGBoost | Morgan | 0.511 | 0.775 ± 0.009 | 0.831 | 0.547 ± 0.010 | 0.595 | 0.365 ± 0.005 |
| SVM | MOE | 0.564 | 0.801 ± 0.007 | 0.784 | 0.517 ± 0.010 | 0.550 | 0.339 ± 0.005 |
| SVM | RDKit | 0.466 | 0.782 ± 0.011 | 0.868 | 0.541 ± 0.014 | 0.628 | 0.348 ± 0.007 |
| SVM | Mordred | 0.415 | 0.726 ± 0.012 | 0.908 | 0.607 ± 0.014 | 0.661 | 0.406 ± 0.007 |
| SVM | Morgan | 0.504 | 0.605 ± 0.017 | 0.836 | 0.729 ± 0.015 | 0.595 | 0.481 ± 0.007 |
Notably, the XGBoost model using MOE descriptors achieved the highest overall predictive performance under 10-fold CV (R2 = 0.801 ± 0.010, RMSE = 0.517 ± 0.012, MAE = 0.337 ± 0.005; Table 5), with the SVM–MOE model performing comparably (R2 = 0.801 ± 0.007, RMSE = 0.517 ± 0.010, MAE = 0.339 ± 0.005). These results highlight the strong representational power of curated MOE descriptors, particularly when combined with regularised algorithms such as XGBoost and SVM.
Across all models and descriptor types, performance declined under LOSO compared to 10-fold CV, consistent with the increased difficulty of predicting solubility for unseen solutes. The largest reductions in R2 were observed for SVM using RDKit descriptors (from 0.782 ± 0.011 to 0.466; ΔR2 = 0.316) and SVM using Mordred descriptors (from 0.726 ± 0.012 to 0.415; ΔR2 = 0.311), indicating that kernel methods are particularly sensitive to descriptor redundancy and high-dimensional noise when forced to extrapolate to new solute structures. RF performed most robustly under LOSO, achieving R2 = 0.568 with RDKit and R2 = 0.543 with Morgan Fingerprints. This suggests that RF's ensemble averaging helps stabilise predictions when solute diversity increases. By contrast, RF was less effective with lower-dimensional inputs such as MOE and RDKit under 10-fold CV, where XGBoost and SVM achieved markedly higher accuracy.
XGBoost showed the strongest resilience with Mordred descriptors (R2 = 0.613 under LOSO; 0.791 ± 0.007 under 10-fold), demonstrating that gradient boosting can effectively extract signal from large, heterogeneous descriptor sets. Under 10-fold CV, XGBoost also achieved the highest overall accuracy with all types of descriptors and fingerprints, with particularly strong performance for MOE (R2 = 0.801 ± 0.010).
The higher performance of XGBoost is also consistent with the structure of the dataset. The descriptor sets used here are high-dimensional and contain correlated features, and XGBoost's sequential boosting, residual fitting, and built-in regularisation allow it to exploit such feature spaces more effectively than RF. This is particularly advantageous in the hybrid setting, where the COSMO-RS solubility provides a physics-based baseline and the ML component must learn the remaining structure-dependent deviations. RF, which does not explicitly model residuals, is less able to capture these fine-grained corrections.
Among descriptor sets, MOE provided consistently strong performance for both XGBoost (R2 = 0.801 ± 0.010 in 10-fold; 0.555 in LOSO) and SVM (R2 = 0.801 ± 0.007 in 10-fold; 0.564 in LOSO), reflecting the curated, property-focused nature of MOE features. It is noteworthy that, even with the same descriptor set, the tuned RF–MOE model reached a slightly lower 10-fold performance (R2 = 0.765 ± 0.006), whereas both XGB–MOE and SVM–MOE achieved the highest accuracies observed in this study. This improvement does not arise from differences in descriptor quality, which is held fixed across models, but from differences in model capacity and how each algorithm extracts structure–property relationships from MOE features. RF relies on axis-aligned, shallow tree partitions, which stabilise predictions but limit its ability to model smooth nonlinear interactions among correlated MOE descriptors. In contrast, XGBoost leverages sequential residual fitting to capture higher-order nonlinearities, while the RBF–SVM constructs a continuous similarity landscape that can more fully exploit the physicochemical signal encoded in the MOE feature space. As a result, both XGB and SVM are able to extract more predictive information from MOE features than RF, explaining the systematic improvement from R2 = 0.765 ± 0.006 (RF–MOE, 10-fold CV) to ≈ 0.801 (XGB–MOE and SVM–MOE) under 10-fold CV.
Under LOSO, SVM–MOE performed markedly better than SVM trained on RDKit descriptors (R2 = 0.466), Mordred descriptors (R2 = 0.415), or Morgan Fingerprints (R2 = 0.504). This pattern reflects differences in descriptor design. MOE descriptors form a compact, property-focused feature set with limited redundancy and well-defined chemical meaning, enabling the RBF kernel to construct smooth similarity functions without overfitting to spurious dimensions. In contrast, the high-dimensional, highly correlated Mordred and Morgan representations create a much more irregular kernel landscape, making the SVM sensitive to descriptor noise and leading to pronounced performance degradation under LOSO. RDKit descriptors lie between these extremes but still lack the curated, physicochemical structure encoded in MOE. Consequently, the SVM benefits most from the balanced dimensionality and targeted chemical relevance of MOE features, which support stable extrapolation to unseen solutes. RDKit descriptors occupied an intermediate position, performing competitively with RF under LOSO and outperforming Morgan and Mordred descriptors under 10-fold CV. Morgan Fingerprints performed best with XGBoost (R2 = 0.775 ± 0.009 in 10-fold) but lagged behind MOE and RDKit overall, while Mordred descriptors exhibited pronounced LOSO degradation across all models due to their high redundancy.
Fig. 4 compares prediction performance (R2) across all descriptor–model–CV combinations, using mean R2 over replicated 10-fold CV runs and single-run R2 for deterministic LOSO. For both RDKit and MOE descriptors, their relatively low redundancy and high interpretability make them well suited to both tree-based and kernel-based models.
A portion of the residual unexplained variance may arise from experimental heterogeneity in the solid-state form of solutes, including differences in polymorphic form, amorphous content, sample history (e.g. cooling or quenching rates), or impurities. Such factors influence the Gibbs free energy of fusion but are seldom reported in the literature sources from which the dataset is constructed. As these effects cannot be represented by structural descriptors or COSMO-RS features, they set an intrinsic upper bound on achievable predictive accuracy.
Under LOSO, RF outperforms XGB and SVM for most descriptor sets, whereas under 10-fold CV it is often surpassed by XGB and SVM. This divergence reflects the validation objective: LOSO enforces generalisation to “unseen solutes” (grouped CV), while 10-fold permits the same solute to appear in both training and test folds. RF's bagging and feature subsampling reduce variance, yielding stable predictions that transfer better across solute identity, whereas the higher capacity of XGB and SVM captures solute-specific interactions that improve 10-fold scores but do not necessarily translate to LOSO. Therefore LOSO should be prioritised for model selection when the deployment target is new solutes.
The ML models, trained exclusively on the original dataset, were frozen and directly applied to the external dataset. For external prediction, we used the ensemble of models obtained from the 10-fold CV: each of the ten fold-specific models was applied to the external systems, and their predictions were averaged to obtain a single estimate for each solute–solvent pair. External predictions were generated using models trained with a fixed random seed to ensure a deterministic and reproducible evaluation. This approach makes full use of the available training data while providing a statistically stable prediction for unseen compounds. LOSO was used solely as an internal validation scheme and was not considered for external prediction. Descriptor calculation and preprocessing steps were defined entirely from the training data and subsequently fixed, so that external compounds were projected into the same descriptor space. COSMOtherm-predicted solubilities for the external systems were generated independently and used solely as input features in the hybrid models.
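A sketch of this fold-ensemble prediction protocol is given below; the model, data, and seed are placeholders, and the averaging step mirrors the description above.

```python
# Sketch of external prediction: the ten fold-specific models from the
# internal 10-fold CV are frozen and their predictions on the external set
# are averaged to give one estimate per solute-solvent pair.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def fit_fold_models(model, X, y, seed=42):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    return [clone(model).fit(X[tr], y[tr]) for tr, _ in cv.split(X)]

def predict_external(fold_models, X_external):
    preds = np.column_stack([m.predict(X_external) for m in fold_models])
    return preds.mean(axis=1)      # single averaged estimate per pair
```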
For the external validation, we focused on descriptor sets that are openly implementable or broadly accessible (Morgan Fingerprints, Mordred, and RDKit descriptors). Although MOE descriptors demonstrated competitive performance under internal CV, their use relies on commercial software that cannot be readily redistributed, and they were therefore omitted from the external benchmark in favour of more reproducible options.
Fig. 5 summarises the external performance of tuned RF, XGB, and SVM models with Morgan Fingerprints, Mordred, and RDKit descriptors. Across all descriptor spaces, SVM models show the most robust generalisation (highest R2 and lowest RMSE), followed by RF. In contrast, XGBoost exhibits a pronounced degradation in external performance, yielding negative R2. This behaviour is consistent with the higher sensitivity of boosted tree ensembles to covariate shift, arising from changes in the distribution of input descriptors between the training and external datasets, as well as to sparse descriptor activation, whereas SVM and RF models show improved robustness when extrapolating beyond the training domain.
PCA projections coloured by lipophilicity (log P-like), molecular weight (log10(MW)), and polarity (TPSA) are provided in the SI (Section 4). These confirm that PC1 consistently encodes a multivariate gradient of increasing molecular weight, lipophilicity, and polarity. In all cases, solubility increases toward the negative end of PC1, aligning with smaller, less lipophilic, and less polar solutes. While the relatively high solubility of smaller solutes is expected, the apparent favourability of both less lipophilic and less polar solutes may at first appear somewhat counterintuitive. Typically, a more polar solute is less lipophilic and therefore soluble in different solvents than a lipophilic solute. However, as discussed in Section 1, this set of solvents contains a range of different chemical features. In line with this, the PCA observations suggest that solutes with intermediate properties are most likely to display high solubility across the chemically diverse solvent set.

Importantly, the apparent dominance of solute features compared to solvent features in PCA cannot be fully understood without considering the corresponding solvent space. If the solvents are averaged or biased toward one end of the polarity or lipophilicity spectrum, the apparent dominance of solute features in PCA becomes expected, consistent with the classical principle of "like-dissolves-like". To assess this directly, we compared solute and solvent rankings for polarity (TPSA) and lipophilicity (log P) generated from MOE descriptors. Spearman's rank correlation coefficient (ρ) was calculated to assess the correspondence between solute and solvent properties. Unlike Pearson's linear correlation coefficient (r), which is sensitive to absolute scales and assumes linearity, Spearman's ρ evaluates the degree of monotonic alignment between ranked variables. This rank-based approach mitigates the impact of differing property ranges and provides a more appropriate measure of whether the solute and solvent spaces are systematically aligned.
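For reference, in the absence of ties Spearman's ρ for n ranked solute–solvent pairs can be written as

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)},$$

where $d_i$ is the difference between the solute-property rank and the solvent-property rank for pair $i$; when ties are present, ρ is instead computed as the Pearson correlation of the rank-transformed variables.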
The resulting rank–rank plots (Fig. 6 and 7) showed only weak correlations (Spearman ρ = 0.10, 0.14) with horizontal banding that reflects a narrower property range for solvents than for solutes. To further explore the solvent contribution, PCA projections coloured by solvent properties (log P, MW, and TPSA generated from MOE) are shown in Fig. 8. These reveal that the solvent space occupies a narrower range of lipophilicity and polarity compared with the corresponding solute properties. This compression explains the horizontal banding observed in rank–rank comparisons and why solute descriptors dominate the PCA trends. In other words, while solvents introduce some modulation, the dataset is primarily defined by solute chemical diversity.
As illustrated in SI-Section 4, although PCA revealed broadly similar chemical organisation across RDKit, MOE, and Mordred descriptor spaces, predictive performance varied between models. XGBoost performed best with RDKit (under 10-fold) and Mordred, while SVM performed better with MOE (under LOSO). These differences likely arise from how each algorithm interacts with descriptor characteristics such as dimensionality, collinearity, and feature scaling, rather than differences in the underlying chemical information. For instance, the higher dimensionality and redundancy of Mordred may favour ensemble methods like XGBoost, which can down-weight irrelevant features, whereas the more compact and curated MOE descriptors align better with kernel-based methods such as SVM. This highlights that comparable PCA structures do not necessarily translate into uniform model performance, underscoring the importance of jointly optimising both descriptor representation and modelling approach.
As shown in SI-S10, Fig. S6 and S7, descriptor names are colour-coded by molecular role: solute-derived descriptors (blue) and solvent-derived descriptors (green). Each row represents a solute, ordered by decreasing averaged experimental solubility across available solvents (top to bottom). Columns indicate individual MOE descriptors, selected based on their global SHAP importance. Cell colours represent signed SHAP values, with red indicating a positive influence on predicted solubility and blue indicating a negative influence. Grey cells correspond to near-zero contributions (|SHAP| < 1 × 10−30). While the top 20 includes both solute and solvent descriptors along with the COSMOtherm feature, the majority originate from the solute. This is not unexpected, since solubility depends on the interactions between the solvent and the solute, and the dataset contains more distinct solute molecules than solvent molecules. Although there is variation within the solvent set, including in hydrophilicity and lipophilicity, the solvents may be considered to broadly fall within the same category of organic solvents with heteroatom functionality. Meanwhile, the solutes in the dataset cover a wider range of chemical space, as seen in Section 3.2.1. In particular, they include more molecules that are more hydrophilic, with the lowest log P(o/w) value for a solute being −3.5, although the highest is 7.0, close to the highest value among the solvents. Variations in solvent chemical environment in the dataset are likely to have a smaller impact on solubility than the larger variations between solutes. Descriptors are ranked by their mean absolute SHAP values across all solutes, reflecting their overall contribution magnitude to model predictions regardless of direction. The details of each top-ranked MOE descriptor are listed in SI Sections S6 and S7.
SHAP analysis further identified classic drug-likeness descriptors among the top contributors: log P (solute_h_log_pbo), H-bond donor and acceptor counts (solute_a_donscc), H-bond acceptor counts (Lipinski-style acceptor count: solute_lip_acc and general acceptor count: solute_a_acc), and topological polar surface area (TPSA, solute_TPSA). This convergence demonstrates that the XGBoost model effectively rediscovered the same physicochemical drivers underlying Lipinski's Rule of 5 (ref. 73) and its related Veber criteria.74 While these rules were originally formulated in the context of aqueous environments, their recurrence here reflects the general importance of polarity, hydrogen bonding, and size-related features across solvent systems. The specific impact of each descriptor is modulated by solvent polarity and lipophilicity, with our analysis focusing on general organic solvent systems.

Beyond these primary descriptors, several related features also appeared among the top-ranked contributors, including additional lipophilicity measures (solute_SlogP_VSA1, solute_GCUT_SLOGP_1), estimated molecular aqueous solubility (solute_h_ema), polar surface metrics (vsa_pol, solute_PEOE_RPC-, solute_GCUT_PEOE_1), and alternative formulations of hydrogen-bonding capacity (total count of donor and acceptor atoms together: solute_a_donacc). The recurrence of multiple log P-, TPSA-, and H-bond-related descriptors underscores the robustness of hydrophobicity, polarity, and hydrogen bonding as central solubility determinants across descriptor classes.
Finally, although solute descriptors dominate overall, XGBoost also incorporates more solvent-specific features than SVM. This suggests a greater sensitivity of the tree-based model to subtle solvent differences, reinforcing that solubility is governed not only by intrinsic solute properties but also by the solute–solvent match within the studied chemical space.
Lipophilicity-related descriptors featured prominently among the top-ranked features for the SVM model, including solute_SlogP_VSA6 and solute_SlogP_VSA1 (surface area contributions partitioned by SlogP values), together with the classical octanol–water partition coefficient, solute_logP(o/w), highlighting the contribution of both global hydrophobicity and localised lipophilic surface exposure. Charge-based descriptors such as solute_PEOE_RPC- and solute_PEOE_RPC+ (relative negative and positive partial charges), solute_PEOE_VSA+4 and solute_PEOE_VSA-4, solute_PEOE_VSA_FPPOS (fractional positive polar surface area), and solute_GCUT_PEOE_2 (GCUT descriptor derived from PEOE charges) reflect detailed electron distribution patterns. Similarly, solute_SMR_VSA4 (van der Waals surface area weighted by molar refractivity) integrates both surface area and polarisability effects.

Topological and geometric measures also featured, including solute_a_ICM (atom information content), solute_balabanJ (Balaban connectivity index), solute_radius (molecular size), and solute_chiral_u (number of unconstrained chiral centres), together indicating that SVM is sensitive to molecular complexity, stereochemistry, and global shape. Beyond these physicochemical determinants, the model selected descriptors linked to reactivity and synthetic tractability, including solute_reactive, solute_rsynth, and solvent_rsynth. While not mechanistic solubility drivers per se, these properties correlate with size, functional group composition, and overall polarity, indirectly shaping solubility across solvents.
The broader distribution of influential descriptors suggests that SVM captures a more diffuse representation of solubility, spanning lipophilicity, electronic structure, geometry, and accessibility. This complements the sharper physicochemical emphasis of XGBoost and underscores the multifactorial nature of solubility in organic solvents.
Several descriptors recurred in both models, among them SlogP_VSA1, indicating consistent relevance of partial charges, 3D shape, and surface lipophilicity across models.

Taken together, the results and patterns in Fig. S6 and S7 suggest that XGBoost concentrated on a tighter set of polarity and H-bonding metrics, while SVM captured a broader, more diffuse spectrum of structural, electronic, and accessibility-related features. SHAP heatmaps revealed clear differences in how the two models attributed importance across solutes, consistent with their distinct learning paradigms. XGBoost, which leverages decision tree ensembles, produced smooth, continuous SHAP gradients, particularly for global descriptors such as solute_TPSA, solute_PEOE_RPC-, and solute/solvent_a_ICM. This reflects its ability to recursively split the feature space using threshold-based decisions and to aggregate weak learners across multiple feature partitions. The resulting heatmaps suggest that XGBoost captures multiscale structure–property relationships, where continuous features contribute to solubility predictions across a broad chemical space.
In contrast, SVM, trained with a Radial Basis Function (RBF) kernel, yielded discrete and clustered SHAP patterns, with contributions sharply concentrated on specific solutes for descriptors such as rsynth and reactive. These descriptors often encode binary or threshold-like properties, which align with the SVM's reliance on support vectors to define decision boundaries in a projected feature space. Because SHAP values for kernel models are computed via background sampling (KernelExplainer), they tend to reflect abrupt shifts in predicted output due to localised descriptor influence. The SVM heatmaps, therefore, highlight class-like behaviour, solutes with distinct physicochemical filters or synthetic characteristics, rather than broad, continuous trends.
Together, these findings suggest that XGBoost identifies solubility determinants driven by continuous physicochemical gradients, particularly polarity, charge distribution, and hydrogen-bonding capacity, while SVM highlights discrete structural or accessibility filters that partition solutes into distinct subgroups. This divergence reflects their underlying learning biases: XGBoost leverages recursive thresholding to capture fine-grained physicochemical variation, whereas SVM relies on support vectors to enforce boundary-driven classification in descriptor space. The convergence on common features such as lipophilicity and 3D shape, coupled with their complementary sensitivities, reinforces the robustness of our conclusions and underscores the value of combining tree-based and kernel methods to map a richer landscape of solubility-relevant features.
Due to hashing and bit folding, the same Morgan Fingerprint bit can correspond to multiple distinct substructures across different molecules. In cases where RDKit could not recover a mapped atom environment for an active bit, we defaulted to highlighting the single central atom. Figs. S8 and S9 in the SI illustrate the top 20 Morgan Fingerprint bits that are visualisable from the solutes and solvents individually.
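A sketch of this bit-to-substructure mapping with RDKit is shown below; the example molecule and the radius/bit-length settings are illustrative, and the fallback branch mirrors the single-atom highlighting described above.

```python
# Sketch of mapping active Morgan fingerprint bits back to atom environments
# with RDKit; bits whose environment has radius 0 fall back to a single atom.
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")       # e.g. aspirin
bit_info = {}
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048,
                                           bitInfo=bit_info)

for bit in list(bit_info)[:3]:                           # a few active bits
    atom_idx, radius = bit_info[bit][0]                  # first environment
    if radius == 0:
        # no bond environment recoverable: fall back to the single atom
        print(bit, "-> atom", mol.GetAtomWithIdx(atom_idx).GetSymbol())
    else:
        env = Chem.FindAtomEnvironmentOfRadiusN(mol, radius, atom_idx)
        submol = Chem.PathToSubmol(mol, env)
        print(bit, "->", Chem.MolToSmiles(submol))
```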
Fig. 9 highlights recurring solute substructures among the top 20 SHAP-ranked bits, restricted to those appearing in more than two different solutes, thereby emphasising motifs with consistent solubility effects across the dataset. Among the top 20 SHAP-ranked Morgan Fingerprint bits for solutes, several showed consistent directional contributions across multiple molecules. Bits such as 1876, 114, 145, 191, 310, and 1199 are predominantly associated with increased predicted solubility, while the rest of the bits contribute negatively. Both the bits with positive contributions and those with negative contributions include a combination of organic sections and heteroatom-based functional groups. The positive bits feature more contribution from larger parts of aromatic rings than the negative bits. Aromatic groups without heteroatoms can contribute to solubility in solvents with significant organic character, suggesting the extended aromatic sections contribute to solubility in the less polar solvents among this dataset. Meanwhile, aromatic groups containing heteroatoms can contribute to solubility in polar solvents or solvents with polar groups. This environment-dependent behaviour may explain why aromatic groups feature heavily in general over the whole dataset.
For a subset of high-ranking bits (e.g. 378, 935, 114, 656, 1683), a valid atom substructure could not be recovered, likely due to bit collisions or unresolved hashing. In these cases, visualisation defaulted to the central atom. This highlights a key limitation of hashed fingerprints: while they capture predictive patterns, interpretability is constrained by the one-to-many mapping between bits and substructures. Despite this, SHAP analysis at the bit level offers a useful route for prioritising solubility-relevant features, particularly when combined with substructure visualisation.
Among the top SHAP-ranked solvent fingerprint bits, directional contributions varied across solutes, highlighting the context-dependent nature of solvation. Although both solute and solvent substructures are fixed for a given molecule, the influence of solvent fingerprints on solubility is inherently relational; their effect emerges from compatibility with specific solute features, rather than from intrinsic properties alone. For instance, the same solvent bit encoding an ether or halogen may enhance solubility for a polar solute by enabling favourable dipolar or hydrogen bonding interactions, but reduce solubility for a non-polar solute that cannot experience these interactions. Meanwhile, a substantial carbon chain or non-decorated aromatic group may enhance solubility for a non-polar solvent, but reduce it for a solvent which relies on the presence of polar bonds. This variability reflects the principle that solubility emerges from solute–solvent complementarity, rather than intrinsic solvent features in isolation. Although this limits direct attribution of solvent bits to fixed solubility effects, the overall SHAP patterns reveal solvent environments that are more or less compatible with the range of diverse solute classes present in the current dataset.
In this work, we develop an integrated pipeline that unifies descriptor generation, flexible model selection (RF, SVM, XGBoost), and hybridisation with COSMO-RS outputs, together with systematic CV strategies. By incorporating interpretable ML methods, notably SHAP-based feature attribution, the framework not only achieves competitive predictive performance but also reveals the molecular features most strongly governing solubility. The pipeline further supports multiple descriptor types (Morgan Fingerprints, Mordred, MOE, and RDKit descriptors), enabling a comprehensive and modular evaluation across data representations.
In particular, SHAP analysis revealed that the most influential features across the models corresponded to well-established physicochemical determinants of solubility. This alignment with Lipinski's Rule of Five demonstrates that the ML models effectively rediscover classical medicinal chemistry heuristics, while extending them into a broader solubility context. Crucially, the contribution of the present work is in showing that such relationships can be captured, quantified, and generalised at scale across hundreds of solute–solvent combinations. This systematic and data-driven validation provides confidence that intuitive chemical principles hold in diverse contexts, while highlighting where deviations may occur. Such convergence across descriptor types underscores the reliability of our framework and highlights the value of interpretable ML in providing chemically meaningful insights.
Looking ahead, this pipeline demonstrates how mechanistic insight and ML can be synergistically combined into a reproducible and extensible workflow. By coupling COSMO-RS physical descriptors with modern ML architectures and interpretable feature analysis, the framework establishes a robust template for the prediction of solubility with improved accuracy and transparency. Most importantly, the workflow enables scalable, reproducible screening of the solubility across large chemical spaces, supporting practical applications where solubility is a critical determinant of molecular performance. Such advances can accelerate applications in drug discovery, sustainable chemistry, and pharmaceutical design.