Using natural language processing (NLP)-inspired molecular embedding approach to predict Hansen solubility parameters †

Hansen solubility parameters (HSPs) have three components, d d , d p and d h , accounting for dispersion forces, polar forces, and hydrogen bonding of a molecule, which were designed to better understand how molecular structure a ﬀ ects miscibility/solubility. HSP is widely used throughout the pipeline of pharmaceutical research and yet has not been as well studied computationally as the aqueous solubility. In the current study, we predicted HSPs using only the SMILES of molecules and utilise the molecular embedding approach inspired by Natural Language Processing (NLP). Two pre-trained deep learning models – Mol2Vec and ChemBERTa have been used to derive the embeddings. A dataset of ∼ 1200 organic molecules with experimentally determined HSPs was used as the labelled dataset. Upon ﬁ netuning, the ChemBERTa model “ learned ” relevant molecular features and shifted attention to functional groups that give rise to the relevant HSPs. The ﬁ netuned ChemBERTa model outperforms both the Mol2Vec model and the baseline Morgan ﬁ ngerprint method albeit not to a signi ﬁ cant extent. Interestingly, the embedding models can predict d d signi ﬁ cantly better than d h and d p and overall, the accuracy of predicted HSPs is lower than the well-benchmarked ESOL aqueous solubility. Our study indicates that the extent of transfer learning leveraged from the pre-trained models is related to the labelled molecular properties. It also highlights how d p and d h may have large intrinsic errors in the way they are de ﬁ ned and therefore introduces inherent limitations to their accurate prediction using machine learning models. Our work reveals several interesting ﬁ ndings that will help explore the potential of BERT-based models for molecular property prediction. It may also guide the possible re ﬁ nement of the Hansen solubility framework, which will generate a wide impact across the pharmaceutical industry and research.


Molecular embedding
Deep learning techniques have revolutionised various elds in recent years.One of the most successful areas is Natural Language Processing (NLP) where deep learning models are applied to understand huge volumes of raw text to extract meaning and generate new content.The deep learning NLP techniques are increasingly applied to other domains where the domain data has a similarity with text.One example is SMILES (Simplied Molecular Input Line Entry System), a form of line notation to describe molecular structures using a string of chemical elements and symbols.Through SMILES, it is possible to adopt powerful NLP algorithms to process molecular structures to predict their properties and generate new molecular structures.][3][4][5] Word embedding in the eld of NLP is an important technique in which the meaning of words and sentences can be captured by dense real-valued vectors.A well-established approach to obtain embeddings in NLP is pre-training (Fig. 1).In this approach, such text embeddings are obtained via learning from an extremely large set of unlabelled text sequences, in a fully unsupervised manner, to capture the semantic and syntactic meaning of words.Subsequently, when these pre-trained text embeddings are used in different downstream tasks with or without ne-tuning on smaller sets of labelled data.This concept of reusing a large general pretrained model for many specic tasks with task-specic data annotations is known as transfer learning in machine learning and NLP.Applied to a large dataset of SMILES (such as ZINC and ChEMBL), the embedding approach could provide a new type of molecular representations that captures the physicochemical properties of molecules.7][8][9] There is some evidence that molecular embeddings could surpass molecular descriptors and ngerprints in some tasks, but the improvement may not be signicant and still lacks clear interpretability.Hence a better understanding of what the embedding models have learned of the molecular properties will help to better train and netune them.On the other hand, earlier molecular embedding work usually required extensive coding which makes adaptation difficult for nonexperts.In the past couple of years, the development of Hugging Face (https://huggingface.co/), a machine learning and data science platform has lowered the barrier for non-experts to pre-train and netune deep learning models (transformer models to be more specic).In the present study, we explored the use of molecular embedding approaches to predict Hansen solubility parameters (HSPs) which bridge directly molecular embedding with intrinsic molecular forces.In addition, we have used pre-trained models deposited in Hugging Face for netuning so that our approaches can be adapted more easily.

Solubility and Hansen solubility parameters
Solubility can be dened as the maximum quantity of a chemical that can be fully dissolved in a given volume of solution. 10It is applied to numerous applications in the areas of environmental chemistry, chemical process design, and pharmaceutical sciences and informs molecular design and optimisation in a wide range of tasks such as drug design and the development of lithographic materials in the semiconductor industry. 11redicting solubility can be a challenging task since it depends on various physicochemical factors.Some of the more important factors to be considered are the interactions between solute and solvent and the nature of intrinsic intermolecular forces of solute.Because of the complexity associated with the term, several types of parameters have been developed to account for different aspects of the solvating ability/miscibility of molecules.For example, the partition coefficient log P reects a molecule's hydrophobicity and is widely used to estimate the aqueous solubility of small molecules for drug discovery.On the other hand, Hildebrand and Scott introduced the total solubility parameter d t in 1949, which is the square root of a solvent's cohesive energy density.The total cohesive energy can be measured by evaporating the liquid, i.e., breaking all the "cohesive interactions".Thus, the total cohesive energy of a compound is considered to be similar to the energy of vaporization.The Hildebrand and Scott total solubility parameter usually is not sufficient to describe molecules with strong polarity and hydrogen bonds.It was further rened by Charles M. Hansen in 1967, 12 which became the widely used Hansen solubility parameters (HSPs).Hansen decomposed d t and introduced three variables d h , d p , d d as partial solubility parameters: where d d , d p , d h account for dispersion forces, polar forces, and hydrogen bonding of a molecule, respectively.HSPs were designed to better understand how the nature of intermolecular forces affect solubility, thus have vast applications in the pharmaceutical, paint and material science-related industries. 13hile experimental methods can be used to determine HSPs, it is oen not feasible when the quantity of the chemicals available is limited and costly and impossible for the vast number of hypothetical molecules that are routinely used for virtual screening.Several theoretical approaches have been developed to determine HSPs, notably the group contribution methods (GCMs) [14][15][16] and methods based on Quantitative Structure-Property Relationship (QSPR) and machine learning models. 17,18n GCMs, molecules are divided into basic functional groups (UNIFAC) and polyfunctional and polycyclic groups and then linear regression models are used to determine the group contribution to the partial solubility of the molecules.GCMs are usually less accurate for large molecules with multi-functional groups that make signicant positive or negative contributions to the HSPs. 16QSPR methods use molecular ngerprints and physicochemical descriptors to build regression models to predict the HSPs.These approaches are well established and oen give a satisfactory prediction, but usually involve computing of the descriptors which requires expert knowledge of the molecules and can be time-consuming.There is a need to explore new ways to predict HSPs and more broadly to understand solubility from a molecular structure-based and datadriven perspective that would be more rigorous and efficient.We aim to predict HSPs based on only the SMILES of the molecules and the molecular embedding approaches.In our study, two NLP embedding approaches were employed, namely Word2Vec 23 and BERT-based netuning. 24Word2Vec is a shallow, two-layer neural network to efficiently create highdimensional vector (usually several hundred dimensions) representations of words and has been widely used since its publication in 2013.Word2Vec takes in a large corpus of text and produces a vector space, with each unique token in the corpus being assigned a corresponding vector.These vectors are positioned in the vector space such that tokens that share similar meanings are located close to one another in the vector space.In the present work, we used Mol2Vec, 25 a previously developed unsupervised Word2Vec-inspired chemistry model to assign the vectors.Similar to word embeddings in the Word2-Vec approach, the vectors for chemically related substructures occupy the same part of vector space in Mol2Vec.The limit of Word2Vec approach is that it is "context-free" representation, where the embeddings for substructures are static (Fig. S1 in the ESI †).This means the embeddings do not depend on the context of the SMILES, i.e. the same substructure will have the same vector representation even if it is in two completely different SMILES.Static embedding limits the accuracy of the models as it is well known in chemistry that adjacent and neighbouring functional groups may have signicant inuence over each other's reactivity and chemical properties in molecular structures.In recent years, the power of incorporating context into text embedding learning has been demonstrated by transformer-based models, such as BERT (Bidirectional Encoder Representations from Transformers). 24Similarly, applying contextual representation to SMILES could also lead to the improvement of the models.Several BERT-based models have been developed with different training objectives and strategies along with different SMILES datasets. 26In the current study, we used the ChemBERTa models 27,28 as these pre-trained models are available on Hugging Face that enables more straightforward netuning and adaptation by others.We netuned the ChemBERTa models to make HSPs prediction.
By using and comparing our two embedding models, we aim to address the following questions relating to the prediction of HSPs: (i) Do the embedding models have an advantage over the more commonly used molecular ngerprint and descriptor-based approaches?(ii) For the embedding approach, does the BERTbased model outperform the simpler Word2Vec model?(iii) By comparing two different types of solubility parameters, namely the ESOL aqueous solubility and Hansen solubility, we will examine what the embedding models have learned of the molecular properties and how it may be related to the intrinsic molecular forces dened by HSPs.

Datasets
Two labelled datasets were used.The rst is a set of 1183 common organic molecules with experimentally determined HSPs curated by Steven Abbott. 29Abbott's dataset was checked for possible duplications and the chemicals that are a mixture and do not correspond to a clearly dened molecular structure (e.g., pine oil) were removed from the dataset.The chemicals were converted into SMILES using https://cactus.nci.nih.gov/chemical/structure.This will be referred to as the Hansen dataset.The Hansen dataset has an average molecule weight of 131 g mol −1 (Fig. S2 in the ESI †).d d ranges between 10 and 20, d h ranges mainly between 0 and 30 while d p ranges between 0 and 20 (Fig. S3 in the ESI †).As a comparison with other similar studies, we have also applied our models to the ESOL dataset to predict aqueous solubility. 30ESOL is a dataset of 1143 organic molecules (Fig. S2 †) and shares 117 molecules with the Hansen set.ESOL contains the experimentally determined aqueous solubility parameter log S, that comes from log P and melting point.It has an average molecular weight of 204 g mol −1 and the log S values were distributed mostly between −10 and 5.For both datasets, the canonical SMILES was used for the molecules.The functional group distribution in the two datasets was analysed in a similar fashion as used by Boobier et al. 20 Functional groups were counted by matching their SMARTS codes to the SMILES strings using pybel/OpenBabel.The total number of occurrences of each functional group was then divided by the number of molecules in the dataset to derive the average occurrence per molecule for each functional group (Fig. S4 in the ESI † and the code available in the GitHub deposit).In addition to being lighter in average molecular weight, the Hansen dataset has fewer alkene and aromatic carbons and hydrogen-bond donor functional groups than the ESOL dataset.

Molecular ngerprints
Morgan ngerprints are xed-length vectors that encode the presence of specic molecular functional groups.In the present study, they were generated from SMILES using RDkit 31 where the radius was set at 8 and the vector size set to 2048.This means for each atom, molecular patterns up to a connectivity distance of 8 angstroms were identied, indexed, and hashed to a vector of size 2048.

Mol2Vec model
We have adapted the published Mol2Vec model, 25 which was trained using the genism implementation of Word2Vec and based on 19.9 million molecules from the ZINC and ChEMBL databases.Consistent with steps in the Mol2Vec paper, SMILES in our datasets were tokenised by the extended-connectivity ngerprints (ECFP)-based tokenisation process and the embedding size of 300 was used.Embeddings of tokens (substructures) were summed to form the molecular embedding.Subsequently, the data was trained and tested through a 6-fold cross-validation.The feed-forward neural network (FFNN) and XGBoost regression models were applied, respectively.The FFNN models were built using Pytorch.A few sets of hyperparameters were tested based on a previous study. 32,33For the reported results, the number of hidden units used was [300, 200, 100, 10] and the dropout rates were set as 0.25, 0.1 and 0.05 at each layer with ReLU as the activation function.The learning rate was set as 0.0001, and the Adam optimizer and a batch size of 64 were used.The model was trained for between 50 and 100 epochs.The XGBoost model was used alongside Scikit-learn. 33The performance of the models was evaluated by root-mean-squared errors (RMSEs) and Mean absolute errors (MAEs) which inform the error distribution and coefficient of determination (R 2 ), which captures how well the predicted solubility values agree with the experimental values.The nal reported RMSEs, MAEs and R 2 are based on the testing data (∼200 molecules per fold).

Finetuning ChemBERTa
Two ChemBERTa models were used for netuning: the seyonec/ ChemBERTa-zinc-base-v1 model 27 and the DeepChem/Chem-BERTa-77M-MTR, 28 both available on the Hugging Face repository.Hugging Face is a machine learning and data science platform hosting git-based code repositories, models, and datasets under a unied API, which simplies data transformation and coding syntax. 34The Hugging Face hub stores many pretrained transformers/BERT models for inference and netuning in a variety of machine learning tasks.ChemBERTa-zinc-base-v1 was trained on 100k SMILES from the zinc database via the masked language model (MLM) while ChemBERTa-77M-MTR was trained using 77 million SMILES via multi-task regression (MTR).These two models appeared to give more accurate prediction aer initial testing to assess the various CHEMBERTa models for solubility prediction and therefore has been used in the present study.We also aimed to compare the performance between the two models to understand the impact of the size of the BERT training datasets.The Trainer class in Hugging Face provides an API for feature-complete training in PyTorch and was used to netune the ChemBERTa model.As before, the data was trained, and tested with a 6-fold cross-validation to ensure all data was used for testing.The performance of the models was evaluated by RMSEs, MAEs and R 2 from the testing data (∼200 molecules per fold).It is worth noting that tokenisation in CHEMBERTa is different from that of the Mol2Vec model.The default Byte-Pair Encoder (BPE) from the Hugging Face tokenizers library was used which nds the tokens by iteratively merging frequent pairs of characters.In addition, we used BertViz 35 to visualise the attention heads of the ChemBERTa model on the HSPs.

Results and discussion
We assessed the performance of the two NPL-based models against experimental values and the baseline Morgan ngerprint approach.We further compared the quality of the prediction of HSPs against aqueous solubility from the widely benchmarked ESOL dataset.Overall, ve combinations of molecular representation and machine learning methods will be discussed: Morgan ngerprints with XGBoost, Mol2Vec embeddings using XGBoost and FFNN, respectively and the two ChemBERTa netuned models.Each of them has been applied to d d , d h , and d p and the ESOL aqueous solubility parameters (Table 1, Fig. 2 and 3).

Comparison with baseline Morgan ngerprints
The RMSEs, MAEs and R 2 of all the predicted solubility parameters are presented in Table 1.In terms of MAEs and RMSEs, both Mol2Vec and ChemBERTa netuning models outperform the Morgan ngerprint baseline models albeit not by a signicant extent and can be sensitive to ML methods (i.e., the FFNN and XGBoost of Mol2Vec embeddings tend to give varied accuracies).This is consistent with what has been observed in several previous studies 32 and is expected as the embeddings derived from the two NLP models capture a latent representation of the physicochemical properties of molecules, therefore should be more informative than the simple encoding of the presence of specic molecular functional groups in the Morgan ngerprint approach.

Accuracy of the predicted HSPs
To benchmark, we compared our results with the well-studied ESOL dataset.The predicted ESOL is in excellent agreement chemical descriptors and the Gaussian process, a Bayesian machine learning approach to predict the HSPs of 193 small organic molecules. 18There may be intrinsic limits to how machine learning approaches can predict this type of solubility parameters.

Outliers
In general, the models are more likely to overestimate lowervalue HSPs and underestimate higher value HSPs (Fig. 2).For the lower value HSPs, this may be in part because the models failed to learn the d h and d p which are zero in value.For the higher range of HSPs, it may be because these data points are scarce and tend to be under-represented in the training set.In addition, the models were netuned using three separate sets of labelled data (d d , d h and d p ).It is clear that the accuracy of the predicted HSPs of the same molecule is not correlated, i.e. d d of the molecule may be predicted poorly while d h and d p may be predicted with good accuracy.The predicted values two standardised residual deviations (SRDs) away from the experimental values are selected as the outliers in each model (see Table S2 in the ESI † for the list of outliers).These poorly predicted molecules were analysed to identify possible systematic limitations.The difference in the average occurrence of the functional groups between the outliers and the full dataset is presented in Fig. 4   Fig. 3 Plot of the mean absolute errors (MAEs) of the five models in Table 1.
functional groups in the outliers.For d d , the most frequent functional groups in the outliers are Halides (F, Cl, Br and I), and to a lesser extent nitrile, P and S. All three models (fps, Mol2Vec and ChemBERTa) performed poorly against Halides.CHEMBERTa-77M-MTR is the worst-performing one primarily because it handles Cl badly.For d h , consistent among all the models, the outliers tend to have more hydrogen bond donors and acceptors.However, CHEMBERTa-77M-MTR performs better than CHEMBERTa-zinc-base-v1 in general, therefore giving the best result for predicted d h .For d p , the functional groups in the outliers are more diverse and no distinct functional groups stand out.It is also interesting that the larger CHEMBERTa model performed worse than the smaller CHEMBERTa model in terms of Cl, while the smaller CHEM-BERTa model tends to perform slightly poorly for carbonyl, alkene and H-bond acceptor.As a comparison, for the ESOL dataset, the outliers' functional groups are more diverse, which is similar to d p .In terms of the size of the molecules, outliers from predicted d d have similar molecular weight to the full Hansen dataset while outliners from predicted d h and d p are smaller than the full Hansen dataset (Fig. S6 †).On the other hand, outliers from the ESOL set have higher average molecular weight than the full dataset.It appears that Mol2Vec and CHEMBERTa models don't have a general bias over molecular weight, but they may have some limit towards smaller or bigger molecules depending on the predicted molecular properties.

Attention visualisation
We used BertViz to visualise the attention mechanisms of outliers from the original CHEMBERTa_zinc-base-v1 model and the netuned models to understand what they have learned from the molecular structures and their labels.The ChemBERTa model has 6 hidden layers and 12 attention heads in every layer.Since model weights are not shared between layers, the model has 72 different attention mechanisms.Attention is visualised as lines connecting the position being updated (le) with the position being attended to (right).The colours identify the corresponding attention heads while the thickness of the lines reects the attention score.General attention patterns are present including attention to the previous/next token, attention to identical/related tokens, and attention to the delimiter token </s> when the attention head can't nd anything meaningful in the input molecule to focus on (Table S1 in the ESI †).
It is clear that the attention mechanisms are different in the netuned models compared to the original ChemBERTa model.The attention focuses more on the atoms that give rise to the HSPs in the netuned models.For example, pyridazine (SMILES: c1ccnnc1) is an aromatic heterocyclic compound. 36It contains a six-membered ring with two adjacent nitrogen atoms.The molecule has a high dipole moment with p-p stacking interactions and dual hydrogen-bonding capacity (both as hydrogen bond donor and as acceptor).Its simple structure and strong characters relating to the HSPs make it easier to interpret the attention mechanism from ChemBERTa (Fig. 5 and Table S1 in  L-(−)-Ephedrine (SMILES: CN[C@@H](C)[C@H](O)c1ccccc1) has an aromatic ring, a hydroxyl (OH) and an amine (NH) functional group.The two tetrahedral centres are indicated by the chiral specication simple of @ and @@ in the SMILES (Fig. 6 and Table S1 in  (Fig. 7).The molecule is tokenised into four tokens [C], [1], [CCCCCCCCCCC] and [1].11 carbons from the ring structure are grouped into one token while the two [1] and the last [C] which indicates that the connectivity is set as three separate tokens.Attention is distributed among all the tokens and although the attention mechanisms change in the netuned models, the way the molecule is tokenised makes them difficult to interpret.The d d netuned model gave a reasonable prediction of 18.1 (exp 16.4), but d h and d d were both overestimated (7.3 and 5.9).The netuned models failed to grasp the correlation between nonpolar molecules and zero value for HSPs.However, it is important to point out that HSPs use a simplied way to dene the polarity of molecules.Cyclododecane can adopt multiple conformations and has a sizable dipole moment, 37 therefore assigning its d d as 0 is a signicant underestimation and does not reect the complexity and conformational exibility of its structure.Cyclododecane occupies a unique chemical space and has been used in drug discovery.There are several challenges for the prediction of HSPs of such unique molecules: (1) better ways to tokenise molecular structure.The 11-carbon token in our approach may be too long and will be scarce in the training set.Shorter tokens of a few carbons may account for local structural exibility better.(2) The limit of HSPs' theoretical framework, in particular, how d p and d h terms may have large intrinsic errors for complex molecules.

General discussion
In the present study, we have predicted different components of solubility using two molecular embedding approaches.The signicantly varied prediction accuracy for the three parameters of HSPs along with ESOL aqueous solubility parameters is intriguing.It raises interesting questions as to how netuning tasks may benet from the improvement of pretraining, e.g., increased pretraining dataset size.More importantly, it indicates that how much the pre-trained model can be leveraged for transfer learning may relate to the nature of the molecular properties in the labelled dataset, a key area to investigate further to enable more effective use of BERT-based models for molecular property prediction.
First of all, our results may be explainable by the underlying nature of these parameters and the experimental errors associated with them.The HSPs were derived from three main types of interactions in common organic molecules.The most common is the nonpolar interactions, which are usually referred to as the atomic dispersion forces (d d ).All molecules contain this type of force, therefore d d is usually the predominant component of total solubility while d h and d p has relatively small contributions.d d can be predicted quite well with just Morgan ngerprints, indicating that it is dependent more on the molecular structure.Because the dispersion parameter is  based on atomic forces, the size of the atom is important.To determine d d experimentally or theoretically, corrections are usually required for atoms signicantly larger than carbon, such as Cl and Br but not for oxygen and nitrogen that are of similar size.The impact of the corrections is larger for small molecules. 38This is consistent with our results that for d d , the most frequent functional groups in the outliers are the halides (F, Cl, Br and I).The fact that the models handled halides poorly could be due to a combination of two factorsthe intrinsic errors in the experimental data associated with the halide atoms and the scarcity of this type of molecules in the training set.
d h arises from the hydrogen bonding capacity of the molecule and has been used to collect the cohesive energy component that is not included in the other two parameters.d h was most oen found by subtracting the polar and dispersion components from the total solubility.Therefore, d h is less well dened physicochemically, and the experimental errors associated with d h are bigger than those of d d .This may explain why the predicted accuracy of d h is lower than d d .
d p arises from dipole-dipole interactions between molecules.Most molecules (except few strictly nonpolar molecules) contain these inherent molecular forces to some extent, in a way similar to the ubiquitous atomic dispersion forces in all molecules.Therefore, it is intriguing that there is a signicant difference in our models' accuracy of d d over d p , despite both parameters involving all atoms of the molecules.In a previous study using descriptor-based machine learning approach, 18 it was found that d d appeared to be more dependent on molecular structure, while d p depended more on the electrostatic descriptors (dipole moment, polarizability, polarity and hydrogen-bonding moments).Electrostatic properties are more complex as they are description of how atoms inuence each other within a molecule.To improve model accuracy, an increased number of labelled data may be needed to netune BERT for it to learn the more complex interactions within molecules.More labelled data can be easily generated for future studies because d p is well-dened and can be derived from dipole moment of molecules calculated using quantum mechanical calculations.
Comparison with the well-benchmarked ESOL aqueous solubility dataset also provided useful insights.Dissolution is a complex process that involves solute-solute, solvent-solvent and solute-solvent interactions.Aqueous solubility is dominated by solvation energy and solvent-solute interactions, due to water's high polarity and its capability for hydrogen bonding. 20,39In contrast, HSPs mainly account for solute-solute interactions i.e. cohesive/sublimation energy, which has always been more challenging to predict. 40Therefore the ESOL z d d > d h > d p trend of prediction accuracy is in agreement with our understanding of the physical aspects of the dissolution process.
Finally, it is important to point out that experimental HSP values exhibit signicant uncertainties.d p has the largest uncertainties.For example, the value derived from the dipole moment could be much smaller than values derived from the group contribution methods. 38,41Large inconsistencies between d h values have been observed for compounds with strong hydrogen bonds, such as urea. 41Similarly, it has also been pointed out that experimental data quality could be a limiting factor in predicting the aqueous solubility of molecules. 22

Conclusions
In conclusion, we predicted HSPs based on only the SMILES of molecules using the so-called molecular embedding approaches.Upon netuning, the ChemBERTa model "learned" relevant molecular features and shied attention to functional groups that give rise to the relevant HSPs.The netuned ChemBERTa model outperforms both the Mol2Vec model and the baseline Morgan ngerprint method and gives accuracy slightly lower or on a par with the more computing-intensive chemical descriptors-based models.In general, the embedding models can predict d d signicantly better than d h and d p and overall, the accuracy of predicted HSPs is lower than the well-benchmarked ESOL aqueous solubility.The ESOL z d d > d h > d p trend of prediction accuracy is in agreement with our understanding of the physical aspects of the dissolution process, i.e., HSPs mainly account for solute-solute interactions from cohesive/sublimation energy, which has always been more challenging to predict than the aqueous solubility which is dominant by solvent-solute interactions.In addition, our study indicates that the labelled molecular properties in the netuning datasets may determine how much the pre-trained model can be leveraged for transfer learning.This is most likely due to the limit of HSPs' theoretical framework, and in particular, how the d p and d h terms may have large intrinsic errors in the way they are dened and derived, therefore introducing inherent limitation to the accuracy of their prediction from data-driven approaches.It would be worthwhile to consider rening the Hansen solubility framework using a combination of standardised experimental measurement, quantum mechanical calculations and machine learning models.

Fig. 1 (
Fig. 1 (A) Illustrate pretraining of the Word2Vec and BERT models and how they are used for molecular property prediction.(B) Illustration of how a molecule is converted to SMILES, tokenised into tokens (substructures), then derived into real-valued vectors, i.e. embeddings in NLP.The vectors can come from pre-training, chemical descriptors or molecular fingerprints.
and S5 in the ESI.† The positive values indicate the increased presence of the functional groups and the negative values indicate fewer

Fig. 2
Fig. 2 Plots of the predicted d d , d h and d p of the HSPs (top three rows) and ESOL solubility (bottom row) versus their respective experiment values.The solid red lines indicate ideal agreement between the predicted and experimental values.The dashed red lines indicate two standardised residual deviations (SRD) away from the experimental values.Molecules beyond the dashed lines are the outliers in each model and were further analysed in Fig. 4.

Fig. 4
Fig. 4 Functional group analysis of the outliers for (A) d d , (B) d h , (C) d p and (D) ESOL.Functional groups were counted by matching their SMARTS codes to the SMILES strings.The total number of each functional group was then divided by the number of molecules in the outliers to derive the average occurrence per molecule (Fig. S5 †).For clarity, the difference of the average occurrence of functional groups between the outliers and the full dataset are presented in the above figures.
the ESI †).In the d d netuned model, strong attention is distributed among c[token1], 1[token2] and ccnnc[token3].[c] represents an aromatic carbon and [1] indicates connectivity to form an aromatic ring.[ccnnc] includes the two nitrogenthe hydrogen acceptor in the molecule.All 3 tokens contribute to d d because the dispersion parameter is based on the atomic forces of all the atoms in the molecule.In the d h netuned model, the attention focuses more on the hydrogen bond acceptor token [ccnnc] in layer_2/(head_4), layer_3/(head_2, 5, 10 and 11), layer_4/(head_0, 1, 7) and layer_5/(head_0).In the d p netuned model, there is a strong focus on the [c] token that is not seen in d d and d h .Attention to [c] is dominant in layer_1/(head_11), layer_2/(head_0, 2, 4 and 10), and layer_3/(head_1, 2, 3, 4, 5, 6, 8, and 11).The d d netuned model gave an excellent prediction of 19.2 compared to the experimental value of 20.2.The d h netuned model performed worse with predicted d h 6.8 (exp d h 11.7) and the d d netuned model the poorest with predicted d p 7.7 compared to the exp d p 17.4.
Fig. 5 Attention analysis of pyridazine using BertViz based on the original CHEMBERTa_zinc-base-v1 model and the finetuned models.(A) Molecular structure and SMILES of pyridazine and its experimental and predicted HSPs.Selective visualisation of attention is presented for the original model and the d d finetuned model (B), d h finetuned model (C) and d p finetuned model (D).

Fig. 6
Fig. 6 Attention analysis of L-(−)-ephedrine using BertViz based on the original CHEMBERTa_zinc-base-v1 model and the finetuned models.(A) Molecular structure and SMILES of L-(−)-Ephedrine and its experimental and predicted HSPs.(B) Visualisation of attention of layer_4 is presented for the original model, and the d d , d h and d p finetuned models.

Fig. 7
Fig. 7 Attention analysis of cyclododecane using BertViz based on the original CHEMBERTa_zinc-base-v1 model and the finetuned models.(A) Molecular structure and SMILES of cyclododecane and its experimental and predicted HSPs.(B) Tokens used in the model.(C) Visualisation of attention of all layers is presented for the original model, the d d , d h and d p finetuned models.

Table 1
Comparison of the five models in predicting HSPs and ESOL with regards to the RMSEs, MAEs and R 2 .The errors were computed based on 6-fold cross-validation with each testing dataset containing ∼200 molecules.The model highlighted in bold gave the best prediction for that component of solubility (note: the descriptorsbased result is taken from ref. 18 using a different set of 193 small organic molecules) p Morgan Fps XGBoost 2.23 AE 0.09 3.02 AE 0.16 0.51 AE 0.05 Mol2Vec XGBoost 2.26 AE 0.10 3.12 AE 0.12 0.48 AE 0.02 Mol2Vec FFNN 2.15 AE 0.14 2.92 AE 0.21 0.54 AE 0.04 Zinc-base 2.01 AE 0.16 2.83 AE 0.27 0.41 AE 0.14 77M-MTR 2.24 AE 0.15 3.13 AE 0.21 0.42 AE 0.04 with the published values, which validates our approaches (Table 1).There is a considerable variation of the models' prediction power over d d , d h , and d p .The accuracy of predicted d d (MAE 0.59, RMSE 0.83, and R 2 0.73 for the best model) is the highest, signicantly higher than d h , and d p for all models (Table 1).This is followed by predicted d h (MAE 1.79, RMSE 2.70, and R 2 0.7) while predicted d p has the lowest accuracy (MAE 2.01, RMSE 2.83, and R 2 0.4).The d d > d h > d p trend for the predicted accuracy and the range of MAEs and RMSEs are consistent with results from a previous study which used