Antony D.
Vassileiou
*a,
Murray N.
Robertson
b,
Bruce G.
Wareham
c,
Mithushan
Soundaranathan
c,
Sara
Ottoboni
b,
Alastair J.
Florence
b,
Thoralf
Hartwig
d and
Blair F.
Johnston
*abe
aEPSRC ARTICULAR, University of Strathclyde, Glasgow, G1 1RD, UK. E-mail: antony.vassileiou@strath.ac.uk; blair.johnston@strath.ac.uk
bEPSRC CMAC Future Manufacturing Hub, University of Strathclyde, Glasgow, G1 1RD, UK
cDoctoral Training Centre in Continuous Manufacturing and Crystallisation, University of Strathclyde, Glasgow, G1 1RD, UK
dGlaxoSmithKline, GSK Medicines Research Centre, Gunnels Wood Road, Stevenage, SG1 2NY, UK
eNational Physical Laboratory, Teddington, TW11 0LW, UK
First published on 19th January 2023
We report a single machine learning (ML)-based model to predict the solubility of drug/drug-like compounds across 49 organic solvents, extensible to more. By adopting a cross-solvent data structure, we enable the exploitation of valuable relational information between systems. The effect is major, with even a single experimental measurement of a solute in a different solvent being enough to significantly improve predictions on it, and successive ones improving them further. Working with a sparse dataset of only 714 experimental data points spanning 75 solutes and 49 solvents (81% sparsity), a ML-based model with a prediction RMSE of 0.75log
S (g/100 g) for unseen solutes was produced. This compares favourably with conductor-like screening model for real solvents (COSMO-RS), an industry-standard model based on thermodynamic laws, which yielded a prediction RMSE of 0.97 for the same dataset. The error for our method reduced to a mean RMSE of 0.65 when one instance of the solute (in a different solvent) was included in the training data; this iteratively reduced further to 0.60, 0.57 and 0.56 when two, three and four instances were available, respectively. This standard of performance not only meets or exceeds those of alternative ML-based solubility models insofar as they can be compared but reaches the perceived ceiling for solubility prediction models of this type. In parallel, we assess the performance of the model with and without the addition of COSMO-RS output as an additional descriptor. We find that a significant benefit is gained from its addition, indicating that mechanistic methods can bring insight that simple molecular descriptors cannot and should be incorporated into a data-driven prediction of molecular properties where possible.
This work refers to mechanistic, data-driven and hybrid modelling – it is worth noting the meaning of this terminology. Mechanistic modelling refers to descriptions of phenomena in a system based on theory and first principles:5 in the context of solubility, mechanistic models are generally derived from the laws of thermodynamics. In contrast, data-driven approaches do not innately carry the context of their applied domain, instead fitting a general form to a set of input data to create a model that can be re-applied: linear regression is one of the simplest examples, but this also captures ML methods. Hybrid modelling refers to methods that combine the two approaches. While there is no hard line between these archetypes, framing modelling in these general terms helps to retain clarity over the origins of a model's behaviour.
Much of the recent work in ML-based solubility prediction, at least for drug-like molecules, has focused on aqueous solubility.6–12 Comparatively less has captured other common organic solvents,13,14 in which APIs are typically manufactured.
Several mechanistic models for solubility prediction have been well adopted by the pharmaceutical community.15–17 However, this study examined the role of such models in the overarching strategy for solubility prediction, rather than on their specific features, and so the scope was limited to one: Conductor-like screening model for real solvents (COSMO-RS),16 as implemented in the software package COSMOtherm.18 This was chosen because it is well known to the pharmaceutical industry, and is relatively accessible to non-specialists. It has shown good performance working across solvents and can model non-ideal behaviour (with some limitations due to the application of the Born approximation),19 which is essential for the high concentrations involved in many pharmaceutically relevant solid–liquid equilibria. On a practical note, COSMOtherm performs its own built-in molecule parameterization given only a molecular structure. For solubility prediction, any further parameters required of the operator are experimental values. It produces a final prediction for the vast majority of systems, irrespective of accuracy; this is helpful for constructing complete datasets.
It has previously been suggested that the standard deviation of experimental aqueous solubility values (logS mol L−1) reported in the literature is approximately 0.5–0.7.6,20,21 Palmer and Mitchell reduced this standard deviation for a subset of compounds using a highly controlled experimental method, the purpose being to observe the impact this had on their predictive models.12 Perhaps surprisingly, when switching to these ostensibly “better” experimental values, their models showed little change in performance. This result suggested that experimental data quality was not the limiting factor in prediction accuracy in this case, leaving data quantity, the descriptor set or the ML algorithm as potential barriers. A similar phenomenon was observed later in the results of the second Solubility Challenge,6 where a “tight” set of aqueous solubility data was made available (average interlaboratory reproducibility estimated to be ∼0.17
log units) but the submitted models failed to conclusively improve vs. the “loose” dataset. The authors also conclude that a larger dataset of at least several thousand points would be needed to detect significant improvements in the prediction methodology and feature selection. While these points are based on studies of aqueous data only, the conclusions speak to the limitations of data measurement and subsequent modelling in general; any study based on diverse experimental data curated from the literature (such as the present one) is likely to encounter similar phenomena.
Although no equivalent standard deviation is established for published experimental values across other organic solvents, it is sensible to assume both that such a value (a) exists and (b) is no lower than that of purely aqueous solubility data; indeed, it is likely higher due the greater diversity of experimental methods used to collect the data (dataset fully cited in the ESI†). Regardless of the precise cause, and given that Palmer and Mitchell used a similar descriptor set and the same ML algorithm (described in the Materials and methods), it is likely that an equivalent ceiling to model performance also exists for the work presented here.
An earlier study by Palmer et al. provides a relevant performance benchmark for the present work to be considered against.22 Using a 658-compound training set of aqueous solubility measurements (logS mol L−1) and molecular descriptors, the authors used random forest (RF) to predict the corresponding solubility measures of a 330-compound external test set with a RMSE of 0.69. Both their study and the present one used RF trained on molecular descriptors calculated with molecular operating environment (MOE). In a similar spirit, Boobier et al. recently described ML-based solubility prediction that, notably, did extend beyond water to three common organic solvents.13 The authors screened a range of ML methods for performance against an aqueous solubility dataset of comparable size to Palmer et al., as well as ethanol, acetone and benzene sets. They focused primarily on varying ML algorithms, as well as making rational modifications to their descriptor sets, though they worked with each of their solvent-specific datasets in turn, rather than merge them into one.
Qiu et al. demonstrated the impressive capabilities of even simple heuristic methods for solubility prediction when based upon the vast BMS solubility dataset.23 While this dataset is not available to the public, the study is a rare example that strove to predict solubility across a range of solvents. It also demonstrated the power of the relational information that is available in a cross-solvent data structure. Another recent cross-solvent study by Ye and Ouyang14 used extended-connectivity fingerprints24 as alternative features to molecular descriptors. They produced a RMSE in solubility prediction of 0.77 for unseen solutes – this is an important benchmark for the present study, which begins by exploring this idea. Further studies have exploited cross-solvent data in the generation of hybrid models of other solvent-dependent phenomena.25,26
The foundation for the work presented here was the use of a single, unified model for ML-driven solubility prediction across solvents. The challenges and opportunities of working with this framework were thoroughly explored. The performance of RF trained on this dataset to predict solubility on a cross-solvent basis was evaluated against COSMO-RS, a relatively common mechanistic model for solubility prediction. Significantly, the RF-based method was trialled both excluding and including COSMO-RS in the training data, switching the model from purely data-driven to a hybrid one. Hybrid modelling in this manner by ML has seen success across domains.27–29
RF was selected as the algorithm for the ML portions of this work due to its relative simplicity, high performance and quick execution. Compared to many other ML algorithms, its hyperparameters are few in number and set with robust defaults. This limits the effectiveness of hyperparameter tuning,30,31 and this step is generally not required, at least in exploratory contexts such as this work. RF also handles extraneous features by performing on-the-fly feature selection and has a strong track record of working with datasets based on molecular descriptors.13,32
Second, all collected experimental solubility measurements were reported at 25 ± 1 °C. The ±1 °C tolerance was applied primarily to ease the collection of data whose temperatures had been converted to/from kelvin, e.g. 298 K = 24.85 °C, but also to slightly boost the size of the dataset. When considered against other sources of potential noise in experimental data, it was assumed that this modest temperature tolerance would not meaningfully exacerbate the issue.
Finally, the use of COSMO-RS to perform solubility predictions required input values for solute melting point (Tmelt) and enthalpy of fusion (ΔHfus). These values are fundamentally necessary for COSMO-RS in order to estimate the free energy of fusion. While a last-resort heuristic approximation exists that will bypass this data requirement, it is not recommended.33 Hence, solubility measurements were only included if the corresponding calorimetry data were also available for the solute. The quality of these data is a further source of error for COSMO-RS predictions, though likely a lower one than the use of the approximation.
The final dataset amounted to 714 experimental solubility measures in total, capturing 75 solutes and 49 solvents. Of these, 20 were obtained from historical in-house experiments and 81 were provided by GlaxoSmithKline (details on collection methods given in the ESI†). In terms of solute/solvent combinations, a fully populated grid would amount to 3675 data points; the sparseness of the dataset was therefore 1 − 714/3675 = 81%. All solutes and solvents were numbered on a grid mapping the systems present in the dataset; this is visualised in Fig. 1. Solute/solvent IDs match those in the dataset presented in the ESI.† The most prevalent solvent in the dataset was ethanol, appearing with 51 solutes, and the most prevalent solute was naproxen, appearing with 31 solvents.
![]() | ||
Fig. 1 Grid marking the presence of solute/solvent systems in the dataset (black = present). Solute/solvent IDs match those in the dataset (see the ESI†). |
For each solute/solvent system present, solubility calculations were carried out with COSMOthermX (release 17) using parameterization BP_TZVPD_FINE_17. Molecular descriptors (2D only) were calculated for each solute and solvent using MOE.34 COSMOthermX and MOE each require licences; the licence type does not affect the output. Apart from removing descriptors with zero variance, no feature selection was performed (RF handles this on-the-fly). All RF models were implemented in R35 using the randomForest36 package. Non-default settings: ntree = 1000. This was set manually, rather than as a result of a parameter tuning step. It has been suggested that a minor performance gain could be made by increasing ntree from the default for some datasets37,38 – while this is by no means a generalisation to all datasets, the only downside to this modest raise of ntree was computation time, which was manageable for this work. Further external R packages were utilized to carry out this work: dplyr,39 doParallel,40 and foreach.41 All plots were produced with plotly42 for R.
For comparison to the DrugBank,43 the 2022 dataset was downloaded and the same MOE descriptors were calculated for all 9694 drug molecules that were marked as approved, experimental, illicit, investigational, nutraceutical, vet-approved and withdrawn. Features not in the main solubility set were dropped. Prior to performing principal component analysis (PCA), all features were scaled to unit variance and zero-centred. For projecting the solubility dataset into the same space, these scaling and centering factors, as well as the subsequent matrix rotation, were extracted and applied to the solubility dataset.
This study examined three different approaches to predicting solubility. The first was the use of standalone COSMO-RS, with no supplementary ML elements, serving as a baseline performance to measure the others against. The second used RF to predict solubility using molecular descriptors from MOE alone (termed RF-pure throughout). This represented a purely data-driven approach, most similar to the efforts of Mitchell and Palmer.22 This method did not require output from COSMO-RS, and so did not require calorimetric data. For this reason, while not equivalent to COSMO-RS, it was the closest to a “competitive” method within the context of this study. The final approach merged the previous two, utilizing COSMO-RS solubility predictions as a descriptor for RF, as well as the set of molecular descriptors from MOE (termed RF-hybrid throughout). In practice, this was a trivial modification to the model, though it meant a significant shift from a purely data-driven alternative to COSMO-RS to a hybrid model. To simplify comparisons between approaches, calorimetry data were not included as descriptors in any datasets used for ML; any differences in performance between RF-pure and RF-hybrid could thus be ascribed to the presence/absence of COSMO-RS output. Fig. 2 summarises the data inputs/outputs of each approach.
Four metrics were used to characterize model performance: R2, root mean square error (RMSE), and mean absolute error (MAE), as well as the fraction of cases whose prediction had improved over COSMO-RS, regardless of the magnitude of the improvement (FI). The latter metric was devised as a simple and useful measure of model consistency. By disregarding error magnitude, it contrasted with RMSE, which is by nature particularly sensitive to larger errors. It was intended to provide an indication of risk to the end user for deploying this strategy over simply using COSMO-RS standalone. For example, an FI of 0.5 would mean that as many COSMO-RS predictions were worsened as improved by the applied method (and thus may not be worth applying on an unknown solute), while an FI of 1 would indicate that the method improved all predictions and so could be applied to unknowns with little risk.
As a first step, 10-fold cross validation (CV) of each of the RF-based approaches was performed to compare against COSMO-RS. The predictions of each fold were merged to produce a set of predictions for the full set and plotted against their associated experimental values (Fig. 4). From the results, both RF-based methods showed a significant improvement over COSMO-RS by all metrics. RF-hybrid outperformed RF-pure, as anticipated, indicating that COSMO-RS predictions were a powerful feature in the dataset. However, due to the composition of the dataset spanning multiple solutes and solvents, it must be highlighted that these results in isolation can create a misleading portrayal of model performance, as discussed below. We retain this conventional train/test splitting methodology as a reference for the rest of this study and comparison with other studies.
In randomly partitioning this (or any) dataset, an assumption is made that all data points within are independent of each other, and it is therefore insignificant into which partition they are placed; at least, that is the hypothesis being tested. With this dataset's structure, where each case is a system composed of two distinct molecular property sets (solute and solvent), the assumption is false. If a model has “seen” a solute and/or solvent many times across different systems in its training set, it is likely to make a more accurate prediction for a further system containing it. Equally, for a system of a previously unseen solute and/or solvent, it is likely to make a less accurate prediction. With a validation method based on random partitioning of the dataset, it is overwhelmingly likely that most solvents and solutes spread across both training and testing sets, even with the sparse representation of any individual solute or solvent in this dataset. The 10-fold CV remains a legitimate test, though in this case it models only one scenario that is relevant to an industrial context: where a number of solubility points for a given compound is already known, and the prediction of further points for the same compound is desired. It certainly does not model another, perhaps more likely scenario: where a new compound is being investigated, with zero available solubility data for it.
In order to model the latter scenario, a variant of CV was performed where, rather than randomly splitting the data into k equally sized folds, they were split into folds containing all data points for a single solute only. This could be termed “leave-one-solute-out” cross-validation (LOSO-CV). By this approach, all tested data points were treated as previously unseen solutes, with predictions made based only on the remaining solutes in the dataset. Of course, this did nothing to control the distribution of solvents across partitions, but this was considered acceptable given the relatively less likely occurrence of a new solvent being utilized in an industrial context. The results of LOSO-CV are presented in Fig. 5.
By LOSO-CV, RF-pure achieved barely better performance metrics than the original COSMO-RS predictions. With FI < 0.5, it was actually more likely to return a prediction with higher error than COSMO-RS. Overall, its performance was not a reliable improvement on COSMO-RS. However, a slight improvement over COSMO-RS was seen with RF-hybrid by all metrics, suggesting that it is theoretically worth applying RF-hybrid to enhance COSMO-RS predictions, even for “new” solutes, leaving aside the overhead of setting up this method.
The strong RF performance by 10-fold CV and the subsequent drop by LOSO-CV indicate that, for a given solute, at least some prior knowledge of its solubility in other systems is significantly helpful for enhancing COSMO-RS predictions. While this is not in itself surprising, it is noteworthy that the relatively simple technique of RF-hybrid was indeed sufficient to exploit this relational information.
Feature importance analysis was performed using the built-in method for RF. This operates after model training is complete by permuting each feature in turn and re-predicting all out-of-bag validation data. The % increase in MSE is tracked: the more the feature was utilised by the model, the larger this value should be. A dedicated discussion of this analysis, including its limitations, is provided by Gregorutti et al.44 and references therein. The results for RF-pure and RF-hybrid (both retrained using the full dataset) are shown in Fig. 6. While the results should only be interpreted coarsely and qualitatively, the analysis yields some useful insights. First, COSMO-RS appears to be the most important feature by far in RF-hybrid. To properly quantify the impact of a single feature on model performance, one would have to retrain an equivalent model having excluded it from the beginning and compare errors. While this has not been carried out for all features, this is precisely the difference between RF-pure and RF-hybrid.
COSMO-RS aside, most of the highly ranked features describe the solute rather than the solvent in both cases. This is likely due to there being a greater variety of solutes than solvents (75 vs. 49), and so more variance in solute features. The features themselves describe a range of molecular properties one might expect to be important for characterising solubility, including hydrogen bonding, surface area, partial charges, molecular size and rotatable bonds. The full feature set is described in the MOE manual.34 A deeper analysis of the features' contributions to the models was not entered into here, since the models were far from perfect and likely to significantly change in future as more experimental data are obtained. It was more relevant to real-world applications to focus on the impact of additional experimental data. For interested readers, the highest-error predictions of both COSMO-RS and RF-hybrid are listed in the ESI.†
Given the observed drop in model performance for unseen solutes, it was relevant to ask how much prior knowledge of a solute was needed to bring performance back in line with that achieved by 10-fold CV, since this appeared to be a significant improvement over COSMO-RS.
As a test case, one of the solutes provided by GlaxoSmithKline was selected, termed gsk-B. This choice was due to there being experimental solubility data available for a large number of solvents (20 in total). Four of these – ethanol, acetone, 2-propanol and heptane – were held aside and selected as arbitrary common solvents. The solubility of gsk-B in the remaining 16 solvents was predicted by RF-hybrid: (i) given no prior gsk-B solubility data, (ii) given gsk-B solubility in ethanol, (iii) given gsk-B solubility in ethanol and acetone, (iv) given gsk-B solubility in ethanol, acetone and 2-propanol, and (v) given gsk-B solubility in ethanol, acetone, 2-propanol and heptane. This sequence simulated a scenario where single solubility measurements would be obtained experimentally and fed back to the model in order to refine further predictions. The result (Fig. 7) was a consistent improvement with each successive point, with the RMSE matching that of the original 10-fold CV by the third solvent. This was significant: at least in this one case, only a small number of experiments were needed to significantly improve solubility prediction for the same solute.
Of course, this example investigated (a) only one solute, (b) only one series of four successive solvents, and (c) only one specific sequence of addition to the training set, so was insufficient for drawing general conclusions. In fact, the moderate improvement seen even with no prior solute data suggested that gsk-B was perhaps a soft choice as an example. However, no single example could ultimately suffice: the logical progression was to investigate all possible combinations – every solute, every set of solvents, and every sequence of addition of that set – and aggregate the results.
This was achieved by deploying a comprehensive, brute-force cross-validation, drip-feeding all possible combinations of data points of a given solute to training sets, retraining and retesting each time. The procedure, hereafter referred to as “drip-feeding” CV, is described in further detail in the ESI.† Through manual consideration of both the dataset composition and the computational cost required to process exponentially increasing numbers of models, it was decided to perform the drip-feeding CV only on solutes of 5 of more instances in the dataset (57 out of 75). Solutes with fewer instances were omitted, though were retained in training sets. This totalled 169830 individual RF models.
As a result of this sweep of training/testing combinations, it was possible to aggregate the predictions. For a given value r, where r = the number of times the model had seen solute i in its training set, the results were averaged for each solute/solvent system. This allowed for the production of analogous results to those shown for gsk-B in Fig. 7 for all solutes, all solvent choices and all sequences: the aggregated results are presented in Fig. 8, with a clear trend. When r = 0, the train/test data split resulted in an equivalent test to LOSO-CV, and indeed proved near-identical by all metrics. On increasing to r = 1, where a single solubility measure of the test solute in a different solvent was added to the training set, a significant jump in performance was observed by all metrics. Further increases of r, corresponding to further solubility measures of the test solute present in the training set, resulted in a steady performance enhancement. When r = 4 the model closely matched increases of r, corresponding to further solubility measures of the test solute present in the training set, resulting in a steady performance enhancement. When r = 4 the model closely matched the initial strong performance seen by the 10-fold CV. This finding revealed the merit of an iterative approach to solubility prediction: coordinating with experimentalists and carrying out a small number of targeted measurements could vastly improve modelling capabilities on the remainder.
A small number of outlier points are visible in Fig. 8 that did not improve as more solute data are made available. These were checked for transcription errors, but have not been further investigated here. However, this approach could possibly be exploited in other applications as an outlier detection technique by tracking prediction errors against the number of training data points.
For reference, drip-feeding CV was also performed for RF-pure, with the expected outcome: a similar overall trend to the above, but with a greater overall error by all metrics. This is presented in the ESI.†
True like-for-like performance comparisons with other ML-based solubility models are challenging, given the ability of this approach to exploit data relationships across solvents, including further ones of a tested solute, and differing input requirements (e.g. utilizing COSMO-RS requires calorimetry data). However, the “headline” error rates can at least be broadly compared with those of some recent alternative ML-based approaches (Table 1). The cross-solvent study by Ye and Ouyang14 ought to be considered here as well, which reports a RMSE of 0.77 for unseen solutes; however, their dataset could not be retrieved for comparison at the time of writing. Certainly, it is clear that the approach described here extends the level of solubility prediction typical of exclusively aqueous (or other single solvent) models to a vast range of organic solvents while requiring only a small fraction of the number of data points per solvent.
Ref. | Model domain | RMSE | R 2 | No. same-solvent data points in set | No. same-solute data points in set | Total data points in set |
---|---|---|---|---|---|---|
11 | Water | 0.67 | 0.81 | 829 | 0 | 829 |
12 | Water | 0.88 | 0.45 | 85 | 0 | 85 |
13 | Water (narrow solubility range) | 0.71 | 0.76 | 560 | 0 | 560 |
13 | Water (wide solubility range) | 0.71 | 0.93 | 900 | 0 | 900 |
13 | Ethanol | 0.79 | 0.53 | 695 | 0 | 695 |
13 | Benzene | 0.54 | 0.75 | 464 | 0 | 464 |
13 | Acetone | 0.83 | 0.42 | 452 | 0 | 452 |
This study | 49 solvents | 0.75–0.56 | 0.58–0.77 | Max 51, median 10 | 0–4 | 714 |
The cross-solvent approach also introduced complexity regarding validation techniques, since cases were no longer fully independent of one another. A standard 10-fold CV boasted a significant improvement over COSMO-RS for both RF-pure and RF-hybrid. While legitimate, this test did not control the harnessing of same-solute data points in the training set. The effect of controlling for this was demonstrated by LOSO-CV, where train/test data partitions were chosen to isolate each solute in turn. This forced the RF-based methods to treat all test data as unseen solutes, mimicking a more relevant use case in the pharmaceutical industry. Unsurprisingly, performance was hindered. RF-pure did not convincingly outperform COSMO-RS by LOSO-CV. RF-hybrid retained a minor overall improvement in the same test, indicating some innate capacity for RF to enhance COSMO-RS even with no available solubility data for a given solute. This is significant, since the enhancement could be applied with little risk to any COSMO-RS prediction, even with no prior knowledge of the solute, and will likely only improve with a larger dataset.
The disparity between 10-fold CV and LOSO-CV was interrogated further. 10-fold CV disregarded the effect of possessing training data on the same solute being tested, which was not satisfactory, while LOSO-CV only controlled this by forbidding it entirely. However, it was possible to retain control of this extra information while drip-feeding access to it, allowing the effect to be quantified. It required a novel (to the best of the authors' knowledge), brute force method that, for each solute in turn, tested all combinations of training sets containing it on all further instances of the same solute, up to 4 instances in training sets. The final results after pooling the correct subsets were stark: they showed the stepwise gain in predictive power that could be expected for any given solute as experimental measurements of it in other solvents became available. Even a single experimental solubility measurement in a different solvent to the one of interest was enough to significantly improve RF-hybrid over its “blind” LOSO-CV performance. Crucially, this gain is exclusively available to data-driven approaches working on a cross-solvent basis.
Overall, the use of a sparsely populated, cross-solvent dataset to enhance COSMO-RS solubility predictions proved successful, achieving a RMSE of 0.75log
S for previously unseen solutes but, significantly, reducing to 0.65 when one instance of prior knowledge was available, and continuing down to 0.56 when four were available. For reference, COSMO-RS achieved a RMSE of 0.97 across the same dataset. This finding indicates that, where possible, an iterative approach to solubility screening would be optimal for model accuracy, feeding back results as they are generated in order to refine all remaining predictions. For the dataset used in this study, results saw continuing (though diminishing) improvements for up to four instances of a solute in the training set, predicting a correction for a fifth.
This work focused on truly getting the most out of the “hard currency” of experimental data in terms of predictive capability. Alternative models to COSMO-RS and RF were not investigated, and neither were alternative descriptor sets to MOE. The strategy used was intentionally agnostic with respect to these components: operators could freely swap any/all of them for their preferred techniques and the analyses performed here would remain valid, revealing equivalent parts of the story for the new setup.
Finally, it is worth adding that the very fact that RF-hybrid emerged significantly superior to RF-pure at every turn demonstrated the impact of the only difference between them: the incorporation of COSMO-RS output. It hints at the enduring value of mechanistic models in general; COSMO-RS was selected for this study for practical reasons rather than anything fundamental and is just one example. Models like this encapsulate vast quantities of expert human knowledge, often with years and decades of refinement contained within their outputs. Even when inaccurate for a particular system, the information they contribute to a dataset is quantifiably valuable, and, as demonstrated here, certainly not replaceable by simplistic, ML-friendly molecular descriptors.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2dd00024e |
This journal is © The Royal Society of Chemistry 2023 |