Open Access Article
Jiayun
Pang
*,
Alexander W. R.
Pine
and
Abdulai
Sulemana
School of Science, Faculty of Engineering and Science, University of Greenwich, Medway Campus, Central Avenue, Chatham Maritime, ME4 3RL, UK. E-mail: j.pang@gre.ac.uk
First published on 29th November 2023
Hansen solubility parameters (HSPs) have three components, δd, δp and δh, accounting for dispersion forces, polar forces, and hydrogen bonding of a molecule, which were designed to better understand how molecular structure affects miscibility/solubility. HSP is widely used throughout the pipeline of pharmaceutical research and yet has not been as well studied computationally as the aqueous solubility. In the current study, we predicted HSPs using only the SMILES of molecules and utilise the molecular embedding approach inspired by Natural Language Processing (NLP). Two pre-trained deep learning models – Mol2Vec and ChemBERTa have been used to derive the embeddings. A dataset of ∼1200 organic molecules with experimentally determined HSPs was used as the labelled dataset. Upon finetuning, the ChemBERTa model “learned” relevant molecular features and shifted attention to functional groups that give rise to the relevant HSPs. The finetuned ChemBERTa model outperforms both the Mol2Vec model and the baseline Morgan fingerprint method albeit not to a significant extent. Interestingly, the embedding models can predict δd significantly better than δh and δp and overall, the accuracy of predicted HSPs is lower than the well-benchmarked ESOL aqueous solubility. Our study indicates that the extent of transfer learning leveraged from the pre-trained models is related to the labelled molecular properties. It also highlights how δp and δh may have large intrinsic errors in the way they are defined and therefore introduces inherent limitations to their accurate prediction using machine learning models. Our work reveals several interesting findings that will help explore the potential of BERT-based models for molecular property prediction. It may also guide the possible refinement of the Hansen solubility framework, which will generate a wide impact across the pharmaceutical industry and research.
Word embedding in the field of NLP is an important technique in which the meaning of words and sentences can be captured by dense real-valued vectors. A well-established approach to obtain embeddings in NLP is pre-training (Fig. 1). In this approach, such text embeddings are obtained via learning from an extremely large set of unlabelled text sequences, in a fully unsupervised manner, to capture the semantic and syntactic meaning of words. Subsequently, when these pre-trained text embeddings are used in different downstream tasks with or without fine-tuning on smaller sets of labelled data. This concept of reusing a large general pre-trained model for many specific tasks with task-specific data annotations is known as transfer learning in machine learning and NLP. Applied to a large dataset of SMILES (such as ZINC and ChEMBL), the embedding approach could provide a new type of molecular representations that captures the physicochemical properties of molecules.
Molecular embedding approaches have been explored in the past few years as a general concept to predict molecular properties.6–9 There is some evidence that molecular embeddings could surpass molecular descriptors and fingerprints in some tasks, but the improvement may not be significant and still lacks clear interpretability. Hence a better understanding of what the embedding models have learned of the molecular properties will help to better train and finetune them. On the other hand, earlier molecular embedding work usually required extensive coding which makes adaptation difficult for non-experts. In the past couple of years, the development of Hugging Face (https://huggingface.co/), a machine learning and data science platform has lowered the barrier for non-experts to pre-train and finetune deep learning models (transformer models to be more specific). In the present study, we explored the use of molecular embedding approaches to predict Hansen solubility parameters (HSPs) which bridge directly molecular embedding with intrinsic molecular forces. In addition, we have used pre-trained models deposited in Hugging Face for finetuning so that our approaches can be adapted more easily.
Predicting solubility can be a challenging task since it depends on various physicochemical factors. Some of the more important factors to be considered are the interactions between solute and solvent and the nature of intrinsic intermolecular forces of solute. Because of the complexity associated with the term, several types of parameters have been developed to account for different aspects of the solvating ability/miscibility of molecules. For example, the partition coefficient log
P reflects a molecule's hydrophobicity and is widely used to estimate the aqueous solubility of small molecules for drug discovery. On the other hand, Hildebrand and Scott introduced the total solubility parameter δt in 1949, which is the square root of a solvent's cohesive energy density. The total cohesive energy can be measured by evaporating the liquid, i.e., breaking all the “cohesive interactions”. Thus, the total cohesive energy of a compound is considered to be similar to the energy of vaporization. The Hildebrand and Scott total solubility parameter usually is not sufficient to describe molecules with strong polarity and hydrogen bonds. It was further refined by Charles M. Hansen in 1967,12 which became the widely used Hansen solubility parameters (HSPs). Hansen decomposed δt and introduced three variables δh, δp, δd as partial solubility parameters:
| δt = sqrt(δd2 + δh2 + δp2) |
We aim to predict HSPs based on only the SMILES of the molecules and the molecular embedding approaches. In our study, two NLP embedding approaches were employed, namely Word2Vec23 and BERT-based finetuning.24 Word2Vec is a shallow, two-layer neural network to efficiently create high-dimensional vector (usually several hundred dimensions) representations of words and has been widely used since its publication in 2013. Word2Vec takes in a large corpus of text and produces a vector space, with each unique token in the corpus being assigned a corresponding vector. These vectors are positioned in the vector space such that tokens that share similar meanings are located close to one another in the vector space. In the present work, we used Mol2Vec,25 a previously developed unsupervised Word2Vec-inspired chemistry model to assign the vectors. Similar to word embeddings in the Word2Vec approach, the vectors for chemically related substructures occupy the same part of vector space in Mol2Vec. The limit of Word2Vec approach is that it is “context-free” representation, where the embeddings for substructures are static (Fig. S1 in the ESI†). This means the embeddings do not depend on the context of the SMILES, i.e. the same substructure will have the same vector representation even if it is in two completely different SMILES. Static embedding limits the accuracy of the models as it is well known in chemistry that adjacent and neighbouring functional groups may have significant influence over each other's reactivity and chemical properties in molecular structures. In recent years, the power of incorporating context into text embedding learning has been demonstrated by transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers).24 Similarly, applying contextual representation to SMILES could also lead to the improvement of the models. Several BERT-based models have been developed with different training objectives and strategies along with different SMILES datasets.26 In the current study, we used the ChemBERTa models27,28 as these pre-trained models are available on Hugging Face that enables more straightforward finetuning and adaptation by others. We finetuned the ChemBERTa models to make HSPs prediction.
By using and comparing our two embedding models, we aim to address the following questions relating to the prediction of HSPs: (i) Do the embedding models have an advantage over the more commonly used molecular fingerprint and descriptor-based approaches? (ii) For the embedding approach, does the BERT-based model outperform the simpler Word2Vec model? (iii) By comparing two different types of solubility parameters, namely the ESOL aqueous solubility and Hansen solubility, we will examine what the embedding models have learned of the molecular properties and how it may be related to the intrinsic molecular forces defined by HSPs.
S, that comes from log
P and melting point. It has an average molecular weight of 204 g mol−1 and the log
S values were distributed mostly between −10 and 5. For both datasets, the canonical SMILES was used for the molecules. The functional group distribution in the two datasets was analysed in a similar fashion as used by Boobier et al.20 Functional groups were counted by matching their SMARTS codes to the SMILES strings using pybel/OpenBabel. The total number of occurrences of each functional group was then divided by the number of molecules in the dataset to derive the average occurrence per molecule for each functional group (Fig. S4 in the ESI† and the code available in the GitHub deposit). In addition to being lighter in average molecular weight, the Hansen dataset has fewer alkene and aromatic carbons and hydrogen-bond donor functional groups than the ESOL dataset.
| Method | MAE | RMSE | R 2 | |
|---|---|---|---|---|
| δ d | ||||
| Morgan Fps | XGBoost | 0.65 ± 0.04 | 0.91 ± 0.07 | 0.74 ± 0.04 |
| Mol2Vec | XGBoost | 0.60 ± 0.02 | 0.88 ± 0.04 | 0.76 ± 0.01 |
| Mol2Vec | FFNN | 0.87 ± 0.05 | 1.21 ± 0.08 | 0.53 ± 0.07 |
| Zinc-base | 0.59 ± 0.03 | 0.83 ± 0.04 | 0.73 ± 0.03 | |
| 77M-MTR | 0.67 ± 0.03 | 0.97 ± 0.09 | 0.64 ± 0.08 | |
| Descriptors18 | gpHSP | 0.68 | 1.02 | 0.69 |
![]() |
||||
| δ h | ||||
| Morgan Fps | XGBoost | 2.13 ± 0.11 | 3.46 ± 0.27 | 0.55 ± 0.04 |
| Mol2Vec | XGBoost | 2.15 ± 0.11 | 3.31 ± 0.23 | 0.59 ± 0.03 |
| Mol2Vec | FFNN | 1.92 ± 0.13 | 2.84 ± 0.41 | 0.69 ± 0.07 |
| Zinc-base | 2.03 ± 0.25 | 3.17 ± 0.46 | 0.42 ± 0.15 | |
| 77M-MTR | 1.79 ± 0.25 | 2.70 ± 0.48 | 0.70 ± 0.09 | |
| Descriptors18 | gpHSP | 1.57 | 2.41 | 0.83 |
![]() |
||||
| δ p | ||||
| Morgan Fps | XGBoost | 2.23 ± 0.09 | 3.02 ± 0.16 | 0.51 ± 0.05 |
| Mol2Vec | XGBoost | 2.26 ± 0.10 | 3.12 ± 0.12 | 0.48 ± 0.02 |
| Mol2Vec | FFNN | 2.15 ± 0.14 | 2.92 ± 0.21 | 0.54 ± 0.04 |
| Zinc-base | 2.01 ± 0.16 | 2.83 ± 0.27 | 0.41 ± 0.14 | |
| 77M-MTR | 2.24 ± 0.15 | 3.13 ± 0.21 | 0.42 ± 0.04 | |
| Descriptors18 | gpHSP | 1.93 | 2.83 | 0.71 |
![]() |
||||
| ESOL | ||||
| Morgan Fps | XGBoost | 0.86 ± 0.04 | 1.15 ± 0.05 | 0.70 ± 0.04 |
| Mol2Vec | XGBoost | 0.68 ± 0.03 | 0.92 ± 0.04 | 0.81 ± 0.02 |
| Mol2Vec | FFNN | 0.75 ± 0.03 | 0.98 ± 0.05 | 0.78 ± 0.02 |
| Zinc-base | 0.58 ± 0.02 | 0.79 ± 0.05 | 0.85 ± 0.02 | |
| 77M-MTR | 0.68 ± 0.04 | 0.79 ± 0.05 | 0.85 ± 0.02 | |
![]() |
||||
| Comparison with published ESOL references | ||||
| Morgan Fps25 | XGBoost | 0.88 | 1.20 | 0.66 |
| Mol2Vec25 | XGBoost | 0.60 | 0.79 | 0.86 |
| Mol2Vec32 | FFNN | 0.66 ± 0.01 | ||
| 77M-MTR28 | 0.89 | |||
| MolBERT7 | 0.53 ± 0.04 | |||
| MFBERT6 | 0.42 ± 0.50 | |||
![]() | ||
| Fig. 2 Plots of the predicted δd, δh and δp of the HSPs (top three rows) and ESOL solubility (bottom row) versus their respective experiment values. The solid red lines indicate ideal agreement between the predicted and experimental values. The dashed red lines indicate two standardised residual deviations (SRD) away from the experimental values. Molecules beyond the dashed lines are the outliers in each model and were further analysed in Fig. 4. | ||
![]() | ||
| Fig. 3 Plot of the mean absolute errors (MAEs) of the five models in Table 1. | ||
The predicted values two standardised residual deviations (SRDs) away from the experimental values are selected as the outliers in each model (see Table S2 in the ESI† for the list of outliers). These poorly predicted molecules were analysed to identify possible systematic limitations. The difference in the average occurrence of the functional groups between the outliers and the full dataset is presented in Fig. 4 and S5 in the ESI.† The positive values indicate the increased presence of the functional groups and the negative values indicate fewer functional groups in the outliers. For δd, the most frequent functional groups in the outliers are Halides (F, Cl, Br and I), and to a lesser extent nitrile, P and S. All three models (fps, Mol2Vec and ChemBERTa) performed poorly against Halides. CHEMBERTa-77M-MTR is the worst-performing one primarily because it handles Cl badly. For δh, consistent among all the models, the outliers tend to have more hydrogen bond donors and acceptors. However, CHEMBERTa-77M-MTR performs better than CHEMBERTa-zinc-base-v1 in general, therefore giving the best result for predicted δh. For δp, the functional groups in the outliers are more diverse and no distinct functional groups stand out. It is also interesting that the larger CHEMBERTa model performed worse than the smaller CHEMBERTa model in terms of Cl, while the smaller CHEMBERTa model tends to perform slightly poorly for carbonyl, alkene and H-bond acceptor. As a comparison, for the ESOL dataset, the outliers' functional groups are more diverse, which is similar to δp. In terms of the size of the molecules, outliers from predicted δd have similar molecular weight to the full Hansen dataset while outliners from predicted δh and δp are smaller than the full Hansen dataset (Fig. S6†). On the other hand, outliers from the ESOL set have higher average molecular weight than the full dataset. It appears that Mol2Vec and CHEMBERTa models don't have a general bias over molecular weight, but they may have some limit towards smaller or bigger molecules depending on the predicted molecular properties.
![]() | ||
| Fig. 4 Functional group analysis of the outliers for (A) δd, (B) δh, (C) δp and (D) ESOL. Functional groups were counted by matching their SMARTS codes to the SMILES strings. The total number of each functional group was then divided by the number of molecules in the outliers to derive the average occurrence per molecule (Fig. S5†). For clarity, the difference of the average occurrence of functional groups between the outliers and the full dataset are presented in the above figures. | ||
It is clear that the attention mechanisms are different in the finetuned models compared to the original ChemBERTa model. The attention focuses more on the atoms that give rise to the HSPs in the finetuned models. For example, pyridazine (SMILES: c1ccnnc1) is an aromatic heterocyclic compound.36 It contains a six-membered ring with two adjacent nitrogen atoms. The molecule has a high dipole moment with π–π stacking interactions and dual hydrogen-bonding capacity (both as hydrogen bond donor and as acceptor). Its simple structure and strong characters relating to the HSPs make it easier to interpret the attention mechanism from ChemBERTa (Fig. 5 and Table S1 in the ESI†). In the δd finetuned model, strong attention is distributed among c[token1], 1[token2] and ccnnc[token3]. [c] represents an aromatic carbon and [1] indicates connectivity to form an aromatic ring. [ccnnc] includes the two nitrogen – the hydrogen acceptor in the molecule. All 3 tokens contribute to δd because the dispersion parameter is based on the atomic forces of all the atoms in the molecule. In the δh finetuned model, the attention focuses more on the hydrogen bond acceptor token [ccnnc] in layer_2/(head_4), layer_3/(head_2, 5, 10 and 11), layer_4/(head_0, 1, 7) and layer_5/(head_0). In the δp finetuned model, there is a strong focus on the [c] token that is not seen in δd and δh. Attention to [c] is dominant in layer_1/(head_11), layer_2/(head_0, 2, 4 and 10), and layer_3/(head_1, 2, 3, 4, 5, 6, 8, and 11). The δd finetuned model gave an excellent prediction of 19.2 compared to the experimental value of 20.2. The δh finetuned model performed worse with predicted δh 6.8 (exp δh 11.7) and the δd finetuned model the poorest with predicted δp 7.7 compared to the exp δp 17.4.
L-(−)-Ephedrine (SMILES: CN[C@@H](C)[C@H](O)c1ccccc1) has an aromatic ring, a hydroxyl (OH) and an amine (NH) functional group. The two tetrahedral centres are indicated by the chiral specification simple of @ and @@ in the SMILES (Fig. 6 and Table S1 in the ESI†). In the δd finetuned model, some heads (0, 1, 3, 4, 8 and 11) in layer_4 focus strongly on the [CN] token. In the δh finetuned model, some heads (1, 2, 3, 6, 7, 10 and 11) in layer_4 focus on the [O] token. Attention is generally more evenly distributed in the δp finetuned model with the focus on both [CN] and [O]. Both δd and δp are reasonably predicted (predicted 17.8 vs. exp 18.0 for δd and predicted 7 vs. exp 10.7 δp) while δh is significantly underestimated (predicted 11.8 vs. exp 24.1) for this molecule. Based on the visualization of attention mechanisms, it is plausible that the δh finetuned model only learned OH but not NH functional group as a hydrogen bond donor, therefore underestimating δh.
For strictly nonpolar molecules, the δp and δh terms are zeros by definition in HSPs. Cyclododecane (SMILES: C1CCCCCCCCCCC1) is such a macrocycle molecule that contains a twelve-membered ring (molecular formula (CH2)12). δd of cyclododecane is 16.4 while its δp and δh are both zeroes (Fig. 7). The molecule is tokenised into four tokens [C], [1], [CCCCCCCCCCC] and [1]. 11 carbons from the ring structure are grouped into one token while the two [1] and the last [C] which indicates that the connectivity is set as three separate tokens. Attention is distributed among all the tokens and although the attention mechanisms change in the finetuned models, the way the molecule is tokenised makes them difficult to interpret. The δd finetuned model gave a reasonable prediction of 18.1 (exp 16.4), but δh and δd were both overestimated (7.3 and 5.9). The finetuned models failed to grasp the correlation between nonpolar molecules and zero value for HSPs. However, it is important to point out that HSPs use a simplified way to define the polarity of molecules. Cyclododecane can adopt multiple conformations and has a sizable dipole moment,37 therefore assigning its δd as 0 is a significant underestimation and does not reflect the complexity and conformational flexibility of its structure. Cyclododecane occupies a unique chemical space and has been used in drug discovery. There are several challenges for the prediction of HSPs of such unique molecules: (1) better ways to tokenise molecular structure. The 11-carbon token in our approach may be too long and will be scarce in the training set. Shorter tokens of a few carbons may account for local structural flexibility better. (2) The limit of HSPs' theoretical framework, in particular, how δp and δh terms may have large intrinsic errors for complex molecules.
First of all, our results may be explainable by the underlying nature of these parameters and the experimental errors associated with them. The HSPs were derived from three main types of interactions in common organic molecules. The most common is the nonpolar interactions, which are usually referred to as the atomic dispersion forces (δd). All molecules contain this type of force, therefore δd is usually the predominant component of total solubility while δh and δp has relatively small contributions. δd can be predicted quite well with just Morgan fingerprints, indicating that it is dependent more on the molecular structure. Because the dispersion parameter is based on atomic forces, the size of the atom is important. To determine δd experimentally or theoretically, corrections are usually required for atoms significantly larger than carbon, such as Cl and Br but not for oxygen and nitrogen that are of similar size. The impact of the corrections is larger for small molecules.38 This is consistent with our results that for δd, the most frequent functional groups in the outliers are the halides (F, Cl, Br and I). The fact that the models handled halides poorly could be due to a combination of two factors – the intrinsic errors in the experimental data associated with the halide atoms and the scarcity of this type of molecules in the training set.
δ h arises from the hydrogen bonding capacity of the molecule and has been used to collect the cohesive energy component that is not included in the other two parameters. δh was most often found by subtracting the polar and dispersion components from the total solubility. Therefore, δh is less well defined physicochemically, and the experimental errors associated with δh are bigger than those of δd. This may explain why the predicted accuracy of δh is lower than δd.
δ p arises from dipole–dipole interactions between molecules. Most molecules (except few strictly nonpolar molecules) contain these inherent molecular forces to some extent, in a way similar to the ubiquitous atomic dispersion forces in all molecules. Therefore, it is intriguing that there is a significant difference in our models' accuracy of δd over δp, despite both parameters involving all atoms of the molecules. In a previous study using descriptor-based machine learning approach,18 it was found that δd appeared to be more dependent on molecular structure, while δp depended more on the electrostatic descriptors (dipole moment, polarizability, polarity and hydrogen-bonding moments). Electrostatic properties are more complex as they are description of how atoms influence each other within a molecule. To improve model accuracy, an increased number of labelled data may be needed to finetune BERT for it to learn the more complex interactions within molecules. More labelled data can be easily generated for future studies because δp is well-defined and can be derived from dipole moment of molecules calculated using quantum mechanical calculations.
Comparison with the well-benchmarked ESOL aqueous solubility dataset also provided useful insights. Dissolution is a complex process that involves solute–solute, solvent–solvent and solute–solvent interactions. Aqueous solubility is dominated by solvation energy and solvent–solute interactions, due to water's high polarity and its capability for hydrogen bonding.20,39 In contrast, HSPs mainly account for solute–solute interactions i.e. cohesive/sublimation energy, which has always been more challenging to predict.40 Therefore the ESOL ≈ δd > δh > δp trend of prediction accuracy is in agreement with our understanding of the physical aspects of the dissolution process.
Finally, it is important to point out that experimental HSP values exhibit significant uncertainties. δp has the largest uncertainties. For example, the value derived from the dipole moment could be much smaller than values derived from the group contribution methods.38,41 Large inconsistencies between δh values have been observed for compounds with strong hydrogen bonds, such as urea.41 Similarly, it has also been pointed out that experimental data quality could be a limiting factor in predicting the aqueous solubility of molecules.22
Footnote |
| † Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00119a |
| This journal is © The Royal Society of Chemistry 2024 |