Mohammad Amin
Ghanavati
a,
Soroush
Ahmadi
ab and
Sohrab
Rohani
*a
aChemical and Biochemical Engineering, Western University, London, Ontario N6A 5B9, Canada. E-mail: srohani@uwo.ca
bDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
First published on 9th September 2024
The effectiveness of drug treatments depends significantly on the water solubility of compounds, influencing bioavailability and therapeutic outcomes. A reliable predictive solubility tool enables drug developers to swiftly identify drugs with low solubility and implement proactive solubility enhancement techniques. The current research proposes three predictive models based on four solubility datasets (ESOL, AQUA, PHYS, OCHEM), encompassing 3942 unique molecules. Three different molecular representations were obtained, including electrostatic potential (ESP) maps, molecular graph, and tabular features (extracted from ESP maps and tabular Mordred descriptors). We conducted 3942 DFT calculations to acquire ESP maps and extract features from them. Subsequently, we applied two deep learning models, EdgeConv and Graph Convolutional Network (GCN), to the point cloud (ESP) and graph modalities of molecules. In addition, we utilized a random forest-based feature selection on tabular features, followed by mapping with XGBoost. A t-SNE analysis visualized chemical space across datasets and unique molecules, providing valuable insights for model evaluation. The proposed machine learning (ML)-based models, trained on 80% of each dataset and evaluated on the remaining 20%, showcased superior performance, particularly with XGBoost utilizing the extracted and selected tabular features. This yielded average test data Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R2) values of 0.458, 0.613, and 0.918, respectively. Furthermore, an ensemble of the three models showed improvement in error metrics across all datasets, consistently outperforming each individual model. This Ensemble model was also tested on the Solubility Challenge 2019, achieving an RMSE of 0.865 and outperforming 37 models with an average RMSE of 1.62. Transferability analysis of our work further indicated robust performance across different datasets. Additionally, SHAP explainability for the feature-based XGBoost model provided transparency in solubility predictions, enhancing the interpretability of the results.
Recent studies reveal that a significant percentage, approximately 70%, of newly developed drugs suffer from poor aqueous solubility, limiting their utility for patients.3 Recognizing this challenge, the integration of an accurate solubility prediction model serves as a valuable tool for drug developers. This model enables the swift identification of compounds with low water solubility, allowing proactive intervention through various enhancement techniques such as salt and cocrystal formation, micronization, etc.4–6 In the drug development pipeline, various stages encompass optimizing drug properties, selecting lead-to-candidate compounds, formulation and dosage selection; which traditionally necessitate time-consuming and resource-intensive solubility measurements.7 An alternative and expedited route involves utilizing reliable predictive models, to significantly accelerate these steps and the screening of candidate molecules with intended solubility values.
Traditional solubility prediction methods include fragment-based semi-empirical approaches (e.g., general solubility equation,8 UNIFAC,9 UNIQUAC,10 PC-SAFT11). The limitations of these methods stem from their simplifications and assumptions which pose restrictions, especially when confronted with diverse chemical compounds. Additionally, reliance on empirical parameters in some methods hinders their applicability to novel compounds beyond the scope of the experimental data, impacting their generalizability.12–14 On the other hand, methodologies based on molecular dynamics,15,16 and first principle ab initio calculations (such as COSMO-RS12,17,18) offer insights into molecular behavior through detailed simulations and electronic structure calculations. Additionally, other physics-based techniques, such as direct coexistence simulation19 (for studying phase equilibria), chemical potential20 (by analyzing changes in free energy), and density of states21 (density of electronic states at various energy levels), rely on detailed complex molecular simulations and electronic structure calculations that can face challenges related to computational complexity and the need for parameterization for tuning the required simulations.
Machine Learning (ML) has significantly advanced the prediction of molecular properties, particularly for the prediction of solubility, establishing itself as a powerful data-driven method. However, the accuracy of ML prediction results depends on three crucial conditions: having sufficiently large and high-quality data, employing a suitable data representation, and selecting an appropriate learner (model) capable of capturing relevant features for accurate outputs. Achieving a thorough comprehension of the underlying patterns in solubility prediction remains a challenging task, requiring effective handling of these key conditions.22,23
The physicochemical aspect of a solute's dissolution behavior is intricately influenced by the balance between overcoming attractive forces within the compound and disrupting hydrogen bonds between the solid phase and the solvent. Molecular representation must be carefully chosen to effectively capture this concept, considering factors such as polarity, molecular size, and intermolecular interactions of polar and nonpolar molecules.24–26
Regarding molecule representation, molecular tabular descriptors, which encompass a range of physicochemical and cheminformatics-based features, have been utilized in the literature for ML models. Various ML approaches have been employed, ranging from small-scale models such as Ensemble,27 LightGBM,28,29 Random Forest,30–32 to more complex techniques including ANN,26 ResNet,33 as well as incorporation of transfer learning34 and attention transformer.35
More recently, there has been growing interest in utilizing advanced deep learning techniques, benefiting from graph representation of molecules and the utilization of different versions of Graph Convolutional Network (GCN) algorithm for predicting solubility including attention-based GCN,36–38 composite GCN,39 and residual-gated GCN.40
We evaluate an alternative molecular representation derived from Density Functional Theory (DFT) calculations, known as electrostatic potential (ESP) maps. This molecular representation provides a comprehensive input data structure for deep learning algorithms, capturing essential features like the 3D molecular shape and charge distribution. These elements are highly essential to understanding the solubility of a compound. Moreover, we developed two other ML models based on molecular graph and a combined tabular extracted features to evaluate different molecular representations for solubility prediction.41–43
In this study, we employed three distinct approaches to predict intrinsic aqueous solubility of small organic molecules. Firstly, EdgeConv algorithms, originally designed for point cloud 3D data of objects, was implemented as a deep learning technique to simultaneously extract features from ESP maps and predict solubility. Secondly, GCN, the most widely used deep learning algorithm for predicting molecular properties was implemented. Thirdly, a synergistic representation of the molecule was created by utilizing a combination of features extracted from ESP maps and tabular molecular descriptors as the input for an ML regressor. Additionally, a random forest-based feature selection technique was applied to these extracted features and those obtained from a list of molecular descriptors, enhancing prediction accuracy by focusing on the most important features. These techniques underwent training on 80% of four sets of high-quality experimental solubility data, and their efficacy was subsequently evaluated on the unseen and reserved 20% of test data. We also developed an ensemble of three models to incorporate diverse molecular representation techniques, enhancing both performance and robustness in solubility prediction. This ensemble approach was benchmarked against 37 models tested on the Solubility Challenge 2019 data to evaluate its generalization ability. A transferability analysis was conducted across various datasets to assess the consistency of the three individual models and the ensemble. Additionally, we implemented an explainability analysis using the feature-based XGBoost model to provide transparency in model decisions and improve interpretability.
In a recent study, Meng et al.44 introduced a curation workflow to refine seven well-established aqueous solubility datasets. This process focused on removing redundant and conflicting records, particularly those exhibiting variations in solubility across datasets. For precise alignment with drug design goals, they carefully controlled experimental conditions, ensuring that solubility measurements were conducted at temperatures of 25 ± 5 °C and pH values of 7 ± 1. In our study, we utilized four curated and high-quality datasets (AQUA, PHYS, ESOL, OCHEM), with each dataset consisting of 1311, 2001, 1116, and 4218 cleaned aqueous solubility records, respectively. These selections adhered to their curation workflow44 and demonstrated high-quality scores compared to AQSOL, CHEMBL, and KINECT datasets.
Training on a diverse solubility dataset broadens the model's exposure, fostering fairness and unbiasedness by mitigating biases from skewed or limited data, and enhancing generalization to new scenarios.45 Hence, to underscore the significance of model robustness and generalization, we assessed the diversity of molecules attributes within the selected aqueous solubility datasets, as detailed in the results section of the present article.
![]() | ||
Fig. 2 Feature extraction and selection for XGBoost model (third methodology). Adapted with permission from ref. 43. Copyright 2024 American Chemical Society. |
On the other hand, we computed all 2D and 3D tabular Mordred descriptors utilizing the Mordred package in Python. These descriptors encompass a comprehensive set of 1826 features, including ring count, bond count, bond types, polar surface area, and more.
Among the three compared molecular representations, ESP maps (and the tabular extracted features from them) and Mordred descriptors can capture the stereoisomerism. However, molecular graph representation, which is based on atoms connectivity cannot capture isomeric structures. More details and examples are included in ESI (S1).†
The EdgeConv deep learning model was originally used for the classification and segmentation of 3D objects (x, y, z). In our work we have adapted it for solubility prediction (regression problem) by processing ESP maps (x, y, z, ESP). The feature extraction process involves four EdgeConv operations applied to the raw ESP map using kernels of sizes 128, 128, 256, and 512 (Fig. 1). The outputs of these operations are concatenated to form a matrix of dimensions 3000 × 1024. Subsequently, a combination of average and maximum pooling is applied, resulting in a 1024 array. This output from the feature extraction is then connected to the Multi-Layer Perceptron (MLP) regression component, and both parts' parameters are simultaneously trained to enhance the accuracy of the solubility prediction.
The EdgeConv operation as the building block of model architecture begins by constructing a local neighborhood graph for each point in each point cloud (ESP map). This is achieved by identifying the top k nearest neighbor points, determined by the minimum distance in the feature space of points. Following this, a 2D convolution with a kernel size of 1, coupled with a Leaky ReLU activation function, is utilized to calculate edge features for each pair of points within the graph. Then, these edge features, originating from all edges connected to a vertex, are aggregated using permutation-invariant operators such as maximum and average. This process updates the representation of the vertex. The dynamic nature of EdgeConv adjusts the k nearest neighbors after each layer, ensuring the relevance of local neighbors to the updated features (Fig. 1).
In more detail, the embedding h(k)u, corresponding to each node , is updated during each iteration of message-passing by aggregating information from the u's graph neighborhood
. The updating node u at kth message passing iteration can be formulated as follows:
![]() | (1) |
AGGREGATE is a permutation invariant differentiable function that is responsible for passing embeddings (referred to as the “message”) from a target node's neighbors to the target node. The UPDATE function, on the other hand, updates the nodes' features by incorporating the received information from neighboring nodes as well as their own features at previous step. In our work, we employed four message-passing or graph convolution layers with an embedding size of 128, using weighted summation as the ‘AGGREGATE’ function and ReLU as the ‘UPDATE’ function. Subsequently, global pooling is applied to all nodes' updated features after the last convolutional layer, utilizing both mean and max pooling and concatenating the results into a single vector. This concatenated vector undergoes further processing through a fully connected layer to predict solubility.
We utilized feature importance analysis from the Random Forest algorithm to identify and select approximately the top 20% of extracted features from the ESP map and Mordred descriptors. These features were chosen based on their significance and relevance to solubility prediction. These features were carefully chosen due to their significance and relevance in predicting solubility. In this process, decision trees within the algorithm evaluate each feature's capability to divide data into more consistent subsets, resulting in a reduction of disorder and uncertainty in data classification. Features that contribute to a more significant reduction in the disorder are assigned higher importance scores, emphasizing their influential role in predicting solubility. We reduced the number of tabular features from 1905 to 525 (consisting of 500 features selected from the original 1826 Mordred features, and 25 from ESP) to improve computational efficiency in the model and mitigate overfitting during training.
The extracted and selected features were then fed into an XGBoost model. XGBoost, short for eXtreme Gradient Boosting, stands out as a robust predictive model in machine learning.57 It works by combining predictions from multiple weak learners in a step-by-step manner. The algorithm improves its predictions by minimizing errors identified by the group of learners. This improvement is achieved through optimization of an objective function that considers both a loss term and regularization terms, striking a balance between model complexity and accuracy to prevent overfitting. We used 300 trees as learners in XGBoost with a maximum depth of 3 and a learning rate of 0.1 as the best hyperparameters (after a random search for tuning).
In the results section, the performance of the proposed models will be evaluated using three metrics: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and R-squared (R2). These metrics are calculated using the formulae defined in ESI (S2).†
![]() | ||
Fig. 3 Solubility distribution of molecules in selected datasets of All Data, ESOL, PHYS and AQUA.44 |
![]() | ||
Fig. 4 t-SNE based analysis of chemical space covered in each dataset of All Data, ESOL, PHYS and AQUA. |
For each evaluation, we randomly selected 20% of the molecules in each dataset as unseen or test data for the model. The chemical space distribution in the training and test data is illustrated in Fig. 5, confirming that the test data for the four considered datasets is diverse and capable of validating the trained model across a wide range of chemical space.
![]() | ||
Fig. 6 (a) Frequency of functional groups in all unique molecules in datasets, (b) overlap of functional groups and aromatic rings within the molecules. |
The Venn diagrams in Fig. 6b delve deeper into the co-occurrence of key functional groups with aromatic rings in datasets molecules. Notably, carbonyl, hydroxyl, and halogen groups exhibit higher overlap with aromatic rings compared to ester. Aromatic rings are significantly more prevalent in conjunction with carbonyl and hydroxyl groups compared to ester and halogen groups (more than twice). Furthermore, the diagram highlights a considerably lower overlap between halogen and ester groups compared to their respective associations with aromatic rings. While both hydroxyl and carbonyl groups frequently coexist with aromatic rings more than with each other. Collectively, this analysis illustrates the diverse chemical space within the dataset, emphasizing the prevalence of specific functional groups and their interactions with aromatic rings.
Fig. 7b presents the distribution of logS values for molecules containing specific functional groups. The presence of sulfonic acid groups is associated with significantly higher solubility compared to other groups. Additionally, the presence of halogen or aromatic rings tends to correlate with a lower solubility range, with a median of −6 (log
S).
Fig. 7c and d delve deeper into the relationship between functional group counts and average logS values. Fig. 7c highlights the strong positive correlation between the number of hydroxyl groups and average log
S, particularly when combined with a lower count of halogen and aromatic rings. Fig. 7d suggests that an increasing number of halogen and aromatic rings generally correlates with a decreasing average solubility.
Fig. 7e employs Principal Component Analysis (PCA) to visualize the relationship between molecular features and solubility. The plot reveals a clear trend where molecules with higher solubility tend to cluster in the upper left quadrant, suggesting that the underlying molecular features in this region are associated with increased hydrophilicity.
In addition to structural analysis, we also evaluated key features from the ESP map and Mordred descriptors. Fig. 8a ranks the top 10 features by their SHAP values, which will be discussed in the Explainability analysis section (below). A comprehensive list of all Mordred descriptors is available in their original source,59 but descriptions of the top 10 features, including those from ESP maps, are provided in ESI (S3).† Among these, SLogP, a Mordred estimate of lipophilicity, stands out as the most influential descriptor. It is followed by FilteritLogS, a Mordred estimate of solubility, and Beta_1, the strongest hydrogen bond acceptor parameter. Additionally, several piPC descriptors (piPC2, piPC6, piPC7), which are related to π-electron count, also exhibit significant impact.
The high correlations observed among piPC descriptors indicate potential redundancy (Fig. 8b), while Beta_1 stands out with minimal correlation to other features, highlighting its unique contribution and its role in reducing multicollinearity. Fig. 8c visualizes the sum of absolute correlation values for each descriptor, reinforcing Beta_1's low correlation with other features and highlighting the interconnectedness of piPC descriptors. Overall, SLogP and Beta_1 can be identified as key features for solubility prediction in terms of correlation with solubility and minimal redundancy.
Fig. 8d explores the distribution of SLogP across different solubility ranges. The violin plot reveals a clear trend of increasing SLogP with higher solubility categories, underscoring its role as a key determinant of solubility. The solubility ranges are categorized as follows: low logS (below the 25th percentile), medium log
S (between the 25th and 75th percentiles), and high log
S (above the 75th percentile). The distribution of the remaining top 9 features can be found in ESI (S4).†
Each of the three proposed models was trained independently on 80% of the training split across the four datasets under consideration. Subsequently, the predictive performance of the models on various datasets was assessed using key performance metrics, including MAE, RMSE, and R2. The comparison of the prediction results for the three models on the test split of each dataset (ESOL, AQUA, PHYS, and All Data) is depicted in Fig. 9. The findings reveal that, for the ESOL dataset, both the EdgeConv and GCN models exhibit nearly identical performance, while the feature-based model outperforms them. Assessing the results across the other three datasets, a consistent pattern emerges, with the feature-based model consistently achieving the best results, followed by GCN and EdgeConv as the second and third in rank, respectively. The three models exhibited their optimal predictive performance on the PHYS dataset, achieving RMSE scores of 0.856, 0.731, and 0.577 for EdgeConv, GCN, and the feature-based model, respectively.
The RMSE values above 0.5 can largely be attributed to unavoidable measurement errors, as evidenced by the significant variability in solubility data from two well-known solubility challenges (SC-1 and SC-2),60 where standard deviations were 0.6 and 0.17 log units, respectively. Since interlaboratory deviations are often unreported in experimental datasets, similar levels of measurement error are likely, making it challenging to achieve RMSE values significantly below 0.5.
Among the datasets, all models demonstrated their superior prediction performance when trained on PHYS dataset. Unlike the other datasets, where records had variable weights due to inconsistent solubility measurements, all records in PHYS were assigned a weight of 1,31 indicating high consistency and low uncertainty. Training models with the PHYS dataset, therefore, led to the best predictive and generalization performance, reflecting its higher data quality.
An alternative evaluation of model performance for each dataset is illustrated in Fig. 10. The scatter plot showcases the experimental and predicted values of solubility for the test data. A bold line represents a perfect match, while two dashed lines, with a distance equal to the RMSE, provide a visual indication of the quality of the matching between predictions and experimental results. Additionally, a histogram illustrating the distribution of errors accompanies each scatter plot, providing complementary insights into the predictive performance. The convergence of results from three distinct representations mutually reinforces their validity. In a comprehensive comparison across all datasets, the feature-based model consistently exhibits superior predictive outcomes, showcasing excellence in performance metrics, prediction versus experimental matching (nearly 0.918 on average), and error distribution. Notably, this model demonstrates minimal fluctuations across the four datasets, with an average MAE and RMSE of 0.458 and 0.613, respectively. This highlights the significance of an efficient molecular representation, as evidenced by the superior performance of the less complex XGBoost model when fed with features extracted from a molecular mix with ESP maps. This emphasizes the effectiveness of a simple, yet efficient data structure (tabular extracted features) combined with a machine learning technique. It outperforms the more complex end-to-end deep learning methods in mapping unstructured point-cloud-based ESP maps and molecular graphs to solubility.
![]() | ||
Fig. 10 Plot of predicted vs. experimental solubility and histogram distribution of prediction errors across (a) EdgeConv, (b) GCN and (c) feature-based. |
In more detail, we believe that the feature-based model outperformed EdgeConv and GCN in predicting solubility due to its comprehensive integration of various molecular features. EdgeConv, which relies solely on ESP maps, excels in capturing charge distribution, polarity, and molecular shape. However, ESP maps often struggle with non-polar and hydrophobic molecules, which show minimal charge separation and result in relatively flat ESP maps. This limitation reduces the effectiveness of ESP maps in extracting meaningful interaction information for these types of molecules.
On the other hand, GCNs analyze atomic connectivity and molecular bonds through atom connectivity graphs. While GCNs effectively map the connectivity and topological structure, they lack direct consideration of electronic properties such as charge distribution and molecular shape. This gap can lead to incomplete predictions, especially when detailed electronic information is crucial for accurate solubility predictions.
The feature-based model combines ESP-derived features with a comprehensive array of Mordred descriptors to overcome the limitations of individual methods. By integrating hydrogen bonding parameters and spatial features from ESP maps with a wide range of Mordred descriptors—covering topological indices, molecular connectivity, complexity, geometric properties, and electronic characteristics—the model offers a detailed and multifaceted representation of molecules. This fusion enhances its capability to accurately predict solubility, particularly for non-polar or hydrophobic molecules where ESP maps alone may be insufficient. As a result, the feature-based model provides a more complete and nuanced view of molecular properties, leading to superior predictive performance.
Table 1 summarizes the performance metrics of the Ensemble model, including R2, RMSE, and MAE, across four datasets. Additionally, Fig. 11a illustrates the comparative performance of the Ensemble model with respect to the individual models. The Ensemble model consistently outperforms each individual model in terms of all three metrics across all datasets. This performance improvement is attributed to the Ensemble method's ability to integrate predictions from multiple models, thereby mitigating individual model biases and leveraging their collective strengths. By incorporating predictions from three distinct models, each based on different molecular representations, the Ensemble model offers increased confidence in its predictions. A complete comparison summary of models, where small scale ML models based on all Mordred features is included in the ESI (S6).† Moreover, the mean error distribution based on specific functional groups is also included in the ESI (S7).†
Model | ESOL | AQUA | PHYS | ALL-data (EOAP) | ||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
RMSE | MAE | R 2 | RMSE | MAE | R 2 | RMSE | MAE | R 2 | RMSE | MAE | R 2 | |
Ensemble | 0.569 | 0.425 | 0.928 | 0.580 | 0.432 | 0.926 | 0.559 | 0.423 | 0.931 | 0.638 | 0.466 | 0.917 |
Models | ESOL | AQUA | All-Data (EOAP) | ||||||
---|---|---|---|---|---|---|---|---|---|
RMSE | MAE | R 2 | RMSE | MAE | R 2 | RMSE | MAE | R 2 | |
EdgeConv | 0.874 | 0.645 | 0.835 | 0.887 | 0.586 | 0.827 | 0.959 | 0.722 | 0.815 |
GCN | 0.683 | 0.492 | 0.899 | 0.752 | 0.464 | 0.876 | 0.802 | 0.567 | 0.871 |
Feature-based | 0.545 | 0.365 | 0.935 | 0.595 | 0.338 | 0.922 | 0.670 | 0.455 | 0.910 |
Ensemble | 0.529 | 0.373 | 0.940 | 0.481 | 0.341 | 0.947 | 0.647 | 0.456 | 0.916 |
The SC-2 dataset comprises intrinsic solubility measurements of 100 druglike compounds, curated from multiple published sources. For our evaluation, we performed DFT calculations and prepared input data for our models. Importantly, we excluded all 100 molecules from our selected training dataset (PHYS) to meet the challenge's requirements. After training our models, we used the Ensemble model to assess its performance. The evaluation metrics of RMSE and MAE for our Ensemble model are 0.865 and 0.670, respectively.
In a comparative study by Llinas et al.,60 37 predictive approaches were compared on the SC-2 dataset, utilizing diverse training data, models (spanning from Multi-Linear Regression to advanced methods like LightGBM, ANN, and GCN), and molecular representations (encompassing RDKit descriptors, Morgan fingerprints, and graph-based representations). Reported RMSE values for these 37 models ranged from 1.06 to 3.00, with an average of 1.62. Notably, our Ensemble model outperformed all these methods, achieving a lower RMSE. Our predictions for SC2 molecules, including compound names, SMILES, and mean experimental solubility, are detailed in the second sheet of the ESIData excel file.†
In the context of feature-based ML models, Shapley values indicate how each input feature contributes to the deviation of a particular prediction from the average prediction. This method is additive, meaning the sum of all features' SHAP values equals the difference between the model's prediction for a specific instance and the average prediction across all instances. This explainability analysis helps build trust in the model's decisions, thereby improving transparency and interpretability.
In Fig. 12a, the beeswarm plot illustrates the overall effect of the top 10 features on the model's predictions using SHAP values. This plot shows how different levels of features influence the model's output. Negative SHAP values on the x-axis indicate a negative impact of the feature on predicted solubility, while positive values indicate a positive effect. The color bar corresponds to low (blue) and high (red) feature values and the range in between.
From Fig. 12a, we can interpret that higher SLogP values lead to a more negative impact on solubility, and vice versa. In contrast, the trends for FilterItLogS and Beta_1 show a consistent effect on predicted solubility, which aligns with chemical intuition. For instance, higher Beta_1 values indicate a stronger hydrogen bond acceptor in molecules, resulting in higher solubility. Additionally, the plot suggests that low sphericity (sph) negatively impacts molecule solubility, whereas high sphericity has a negligible effect.
The SHAP waterfall plots in Fig. 12b and c illustrate the application of our feature-based model to 1,3-dimethyl-naphthalene and 2,4-dimethyl-quinoline, respectively. These molecules were selected to assess the impact of replacing a methine group with a nitrogen atom to form a pyridine ring on the molecular properties predicted by the model. The y-axis in each plot shows the normalized feature values on a scale from 0 to 1. The mean prediction of the model across all data is displayed at the bottom of the figures (E[f(x)]), while the predicted solubility for each molecule is shown at the top (f(x)).
The SHAP values, represented by blue and red bars, indicate the contribution of each feature to the specific solubility prediction relative to the average predicted solubility. SLogP, as the most influential feature, affects solubility by −0.9 and −0.5 for the first and second molecules, respectively. Comparing the contributions of features for the two examples, we observe that with the addition of a pyridine ring, Beta_1 increases significantly from 0 to 0.358 (on a scale of 0 to 1), resulting in a 0.33 increase in average predicted solubility. Similarly, an increase of 0.331 in RNCG, which represents the relative negative charge, contributes to a net increase of 0.28 in solubility. Further explanation analysis of four additional examples of molecule pairs is provided in Fig. S4 and S5 in ESI (S8).†
Through the comprehensive comparison of three methodologies, it becomes evident that the combined utilization of ESP map features and Mordred descriptors, yielding an average RMSE of 0.613, outshines the performance of more intricate deep learning models such as EdgeConv (0.893) and GCN (0.755). This underscores a crucial takeaway: the effectiveness of the input data representation holds greater significance than the complexity of the model. An ensemble of the three models improved the error metrics for all datasets. The Ensemble model achieved an RMSE of 0.865 on the Solubility Challenge 2019, surpassing the average RMSE of 1.62 reported by 37 models. Transferability analysis confirmed the robustness of both the individual models and their ensemble across datasets. The explainability analysis demonstrated the interpretability of key features in solubility predictions.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00065j |
This journal is © The Royal Society of Chemistry 2024 |