Hyuntae Lim* and YounJoon Jung*
Department of Chemistry, Seoul National University, Seoul 08826, Korea. E-mail: ht0620@snu.ac.kr; yjjung@snu.ac.kr
First published on 20th August 2019
Prediction of aqueous solubilities or hydration free energies is an extensively studied area of machine learning applications in chemistry, since water is the sole solvent in living systems. For non-aqueous solutions, however, few machine learning studies have been undertaken so far, despite the fact that the solvation mechanism plays an important role in various chemical reactions. Here, we introduce Delfos (deep learning model for solvation free energies in generic organic solvents), a novel machine-learning-based QSPR method that predicts solvation free energies for various organic solute and solvent systems. A novelty of Delfos is its two separate solvent and solute encoder networks, which quantify structural features of given compounds via word embedding and recurrent layers, augmented with an attention mechanism that extracts important substructures from the outputs of the recurrent neural networks. The predictor network then calculates the solvation free energy of a given solvent–solute pair using the features from both encoders. With results obtained from extensive calculations on 2495 solute–solvent pairs, we demonstrate that Delfos not only achieves accuracy comparable to that of state-of-the-art computational chemistry methods, but also offers information about which substructures play a dominant role in the solvation process.
Various molecular descriptors have been proposed to represent the structural features of compounds efficiently. For example, we can featurize a given molecule with simple enumerations of empirical properties such as molecular weight, the number of rotatable bonds, the number of hydrogen bond donors and acceptors, or pre-measured or pre-calculated properties.16 Molecular fingerprints, another option, are commonly used in cheminformatics to estimate the chemical ‘difference’ between two or more compounds.17 They usually consist of a fixed-size binary sequence and are easily obtainable from SMILES with pre-defined criteria. Graph-based representation of molecules, grounded in graph theory, is another major encoding method in QSAR/QSPR that has received great attention recently18,19 and has exhibited outstanding prediction performance for diverse chemical and biophysical properties.20
The mapping function extracts the target properties from the encoded molecular features of a given compound via a classification or regression method. Any suitable machine learning method can serve as the mapping function,15,20 such as random forests (RF), support vector machines (SVM), or neural networks (NNs). Among these options, NNs have shown the most rapid advances in recent years,16,21–25 on the strength of theoretical advances26 and the evolution of computational power. Many studies have already shown that various chemical or biophysical properties of compounds are obtainable from QSAR/QSPR combined with machine learning techniques.16,20–25,27,28
Solvation is one of the most fundamental processes in chemistry, and many theoretical and computational studies have calculated solubilities or solvation free energies using a variety of methodologies.29,30 For example, we can roughly estimate solubilities using solvation parameters, but these only provide a relative ordering, not quantitative values.31 The general solubility equation (GSE) enables us to calculate solubilities from a few empirical parameters, but it only applies to aqueous solutions.32 Ab initio1–7 or MD simulations10–13 provide more concrete, accurate results and more in-depth knowledge about the solvation mechanism, but they have practical limitations due to their high demand for computational resources, as mentioned before.
Recent studies demonstrated that QSPR with ML successfully predicts aqueous solubilities or hydration free energies of diverse solutes.16,20,21,25,33,34 They also showed that ML provides faster calculations than computer simulations and more precise estimates than the GSE; a considerable number of models showed accuracies comparable to those of ab initio solvation models.20 However, the majority of QSPR solubility predictions have been limited to aqueous solutions. For non-aqueous solutions, few studies have attempted to predict the solubility, despite the fact that such predictions play an important role in varied fields of chemistry, e.g., organic synthesis35 and electrochemical reactions in batteries.36
In the present work, we introduce Delfos (deep learning model for solvation free energies in generic organic solvents), which is a QSPR model combined with a recurrent neural network (RNN) model. Delfos is specialized in predicting solvation free energies of organic compounds in various solvents, and the model has three primary sub-neural networks: the solvent and solute encoder networks and the predictor network. For basic featurization of a given molecule, we use the word embedding technique.34,37 We calculate solvation energies of 2495 pairs of 418 solutes and 91 solvents38 and demonstrate that our model shows a performance as good as that of the best available quantum chemical methods2,6,8,11 when the neural network is trained with a sufficient chemical database.
The rest of the present paper is outlined as follows: Section 2 describes the embedding method for the molecular structure and overall architecture of the neural network. In Section 3, we mainly compare the performance of Delfos with both MD and ab initio simulation strategies3,6,8,11 and discuss database sensitivity using the cluster cross-validation method. We also visualize important substructures in solvation via the attention mechanism. In the last section, we conclude our work.
It is worthwhile to note that we can employ the embedding technique for chemical or biophysical processes if we consider an atom or a substructure to be a word and a compound to be a sentence.33,34,43 In this case, positions of molecular substructures in the embedded vector space represent their chemical and physical properties instead of linguistic information. Bio-vector models43 have already been developed to encode sequences of proteins or DNA, and atomic-vector embedding models have recently been introduced to encode structural features of chemical compounds.33,34 Mol2Vec is one such embedding technique; it generates vector representations of a given molecule from the molecular sentence.34 To make molecular sentences, Mol2Vec uses the Morgan algorithm,44 which classifies identical atoms in the molecule. The algorithm is commonly used to generate ECFPs,45 the de facto standard in cheminformatics,17 and it creates identifiers of a given atom from the chemical environment in which the atom is positioned. An atom may have multiple identifiers depending on the pre-set maximum radius rmax, which denotes the maximum topological distance between the given atom and its neighboring atoms. The atom itself is identified by r = 0, and additional substructure identifiers for adjacent atoms are denoted by r = 1 (nearest neighbor), r = 2 (next nearest neighbor), and so on. Since Mol2Vec has demonstrated promising performance in several applications of QSAR/QSPR,34 Delfos uses it as the primary encoding means. A schematic illustration of the embedding procedure for acetonitrile is shown in Fig. 1.
The primary architecture of the encoder is based on two bidirectional recurrent neural networks (BiRNNs).46 The network is designed for handling sequential data, and we consider the molecular sentence [x1, …, xN] to be a sequence of embedded substructures, xi. RNNs may fail when input sequences are lengthy; gradients of the loss function can shrink or grow exponentially as they are propagated back through many recurrence steps.47 The restrained or excessive gradient causes a decline in learning performance, and these two problems are called vanishing and exploding gradients. To overcome these limits stemming from lengthy input sequences, one may use both a forward-directional RNN and a backward-directional RNN within a single layer:
h⃗i = RNN(xi, h⃗i−1) | (1a)
h⃖i = RNN(xi, h⃖i+1) | (1b)
hi = [h⃗i; h⃖i] | (1c)
In eqn (1), xi is the embedded atomic vector of a given molecule, h⃗i and h⃖i are the hidden state outputs of the forward and backward recurrent units, and hi = [h⃗i; h⃖i] represents the concatenation of the two hidden states. More advanced versions of the RNN, such as the long short-term memory48 (LSTM) or gated recurrent unit49 (GRU) networks, are widely used to handle lengthy input sequences. They introduce gates in each RNN cell to retain important information from the previous cell state and thereby mitigate the vanishing and exploding gradient problems.
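As an illustration, the forward, backward, and concatenation steps of eqn (1) can be sketched with a vanilla RNN cell in NumPy. This is only a sketch of the recurrence structure: the actual model uses Keras LSTM/GRU layers, and the tanh cell, toy dimensions, and random parameters below are our own illustrative assumptions.

```python
import numpy as np

def rnn_step(x, h_prev, Wx, Wh, b):
    # Vanilla RNN cell: h_i = tanh(Wx x_i + Wh h_{i-1} + b)
    return np.tanh(Wx @ x + Wh @ h_prev + b)

def birnn(X, fwd, bwd, D):
    """Bidirectional pass over a molecular sentence X = [x1, ..., xN]:
    eqn (1a) forward, eqn (1b) backward, eqn (1c) concatenation."""
    h_f, h_b = [np.zeros(D)], [np.zeros(D)]
    for x in X:                      # forward direction, eqn (1a)
        h_f.append(rnn_step(x, h_f[-1], *fwd))
    for x in reversed(X):            # backward direction, eqn (1b)
        h_b.append(rnn_step(x, h_b[-1], *bwd))
    h_b = h_b[1:][::-1]              # align backward states with positions 1..N
    # eqn (1c): concatenate forward and backward states at each position
    return [np.concatenate([f, b]) for f, b in zip(h_f[1:], h_b)]

# Toy sizes: 300-d substructure vectors, D = 150 hidden units per direction.
rng = np.random.default_rng(0)
E, D, N = 300, 150, 5
X = [rng.standard_normal(E) for _ in range(N)]
make = lambda: (0.01 * rng.standard_normal((D, E)),
                0.01 * rng.standard_normal((D, D)),
                np.zeros(D))
H = birnn(X, make(), make(), D)      # N hidden states, each 2D-dimensional
```

Each output hi is 2D-dimensional (here 300), consistent with the L × 2D context matrices discussed below eqn (2).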
After the RNN layers, the molecular sentences of the solvent, X = [x1, …, xN], and the solute, Y = [y1, …, yM], are converted to hidden states H = [h1, …, hN] and G = [g1, …, gM], respectively. Each hidden state is then fed into the shared attention layer and weighted. The attention mechanism, originally proposed to enhance the performance of machine translators,40 is nowadays an essential technique in diverse NLP applications.41,42 The principle of attention starts from the definition of a score function of the hidden states and its normalization with the softmax function:
eij = score(hi, gj) | (2a)
αij = exp(eij)/∑k exp(eik) | (2b)
score(hi,gj) = hi·gj | (2c) |
Various score functions have been introduced to achieve efficient predictions,40–42 and among them we use Luong's dot-product attention42 in eqn (2c) since it is computationally efficient. The solvent context P = αG denotes the hidden state H emphasized by the attention alignment α; we obtain the solute context Q by the same procedure. The context weighted by the attention layer is an L × 2D matrix, where L is the sequence length and D is the hidden dimension of each directional RNN layer, since we use a bidirectional RNN (BiRNN). Two max-pooling layers, which form the last part of each encoder, reduce the contexts H, G, P, and Q to fixed-length feature vectors u and v:42
u = MaxPooling([h1;p1,…,hN;pN]) | (3a) |
v = MaxPooling([g1;q1,…,gM;qM]) | (3b) |
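A minimal NumPy sketch of eqns (2) and (3) — dot-product scores, softmax alignment, contexts, and max-pooling — might look as follows. The matrix shapes and the symmetric treatment of solvent and solute are our reading of the text, not the authors' code.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_features(H, G):
    """Dot-product attention (eqn 2) between solvent hidden states H (N x 2D)
    and solute hidden states G (M x 2D), followed by max-pooling (eqn 3)."""
    scores = H @ G.T                        # score(h_i, g_j) = h_i . g_j, eqn (2c)
    alpha = softmax(scores, axis=1)         # alignment over solute positions, eqn (2b)
    beta = softmax(scores.T, axis=1)        # alignment over solvent positions
    P = alpha @ G                           # solvent context, P = alpha G
    Q = beta @ H                            # solute context
    u = np.concatenate([H, P], axis=1).max(axis=0)   # eqn (3a)
    v = np.concatenate([G, Q], axis=1).max(axis=0)   # eqn (3b)
    return u, v

rng = np.random.default_rng(1)
H = rng.standard_normal((6, 300))   # solvent: N = 6 positions, 2D = 300
G = rng.standard_normal((4, 300))   # solute:  M = 4 positions
u, v = attentive_features(H, G)     # fixed-length features, independent of N and M
```

Note that pooling over the concatenation [hi; pi] makes u and v independent of the sequence lengths N and M, which is what allows molecules of different sizes to be compared by the predictor.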
The predictor has a single fully connected perceptron layer with a rectified linear unit (ReLU) and an output layer. It uses the concatenated solvent and solute features [u;v] as input. The overall architecture of our model is shown in Fig. 2. We also consider encoders without RNNs and attention layers in order to quantify the impact of these layers on the prediction performance of the network; each such encoding network contains only the embedding layer and is directly connected to the MLP layer. The solvent and solute features are then simple summations of the atomic vectors, u = ∑i xi and v = ∑j yj, respectively. This model was initially used for gradient boosting machine (GBM) regression analysis of aqueous solubilities and toxicities.34
For implementation of the neural networks, we use the Keras 2.2.4 framework51 with the TensorFlow 1.12 backend.52 At the very first stage, the Morgan algorithm for r = 0 and r = 1 generates molecular sentences of the solvent and solute from their SMILES strings. The given molecular sentence is then embedded into a sequence of 300-dimensional substructure vectors using the pre-trained Word2Vec model available at https://github.com/samoturk/mol2vec, which contains information on ∼20,000,000 compounds and ∼20,000 substructures from the ZINC and ChEMBL databases.34 We consider BiLSTM and BiGRU layers in both solvent and solute encoders to compare their performances. Since our task is regression, we use the mean squared error (MSE) as the loss function.
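The embedding lookup itself reduces to a dictionary from Morgan identifiers to 300-dimensional vectors. The sketch below uses a tiny random stand-in vocabulary (the two identifiers are those quoted later for the nitro group of nitromethane); the real model loads the pre-trained Mol2Vec vectors, and the zero "unknown" fallback is our own assumption.

```python
import numpy as np

EMB_DIM = 300
rng = np.random.default_rng(42)
# Stand-in vocabulary: Morgan identifier -> 300-d vector (random here; the
# real model loads ~20,000 substructure vectors from pre-trained Mol2Vec).
vocab = {864942730: rng.standard_normal(EMB_DIM),
         2378779377: rng.standard_normal(EMB_DIM)}
unk = np.zeros(EMB_DIM)              # fallback for out-of-vocabulary identifiers

def embed_sentence(identifiers):
    """Map a molecular sentence (a list of Morgan identifiers) to a
    sequence of substructure vectors, one row per identifier."""
    return np.stack([vocab.get(i, unk) for i in identifiers])

X = embed_sentence([864942730, 2378779377, 12345])   # last id is unseen
```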
We employ 10-fold cross-validation (CV) for a reliable assessment on the test data because our dataset has a limited number of experimental measurements; the total dataset is uniformly and randomly split into 10 subsets, and we iteratively choose one subset as the test set while training on the remaining 9. Consequently, a 10-fold CV task performs 10 independent training and test runs, with training and test sets in a 9 : 1 ratio. We use the Scikit-Learn library53 to implement the CV task and perform an extensive grid search to tune hyperparameters: the learning algorithm, learning rate, and dimensions of hidden layers. We select the stochastic gradient descent (SGD) algorithm with Nesterov momentum, with a learning rate of 0.0002 and momentum of 0.9. The optimized hidden dimensions are 150 for recurrent layers and 2000 for the fully connected layer. To minimize the variance of the test runs, we average all results over 9 independent random CVs split from different random states.
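The random 10-fold split described above can be sketched without any ML framework; the paper uses Scikit-Learn, so the NumPy implementation and the seed here are purely illustrative.

```python
import numpy as np

def kfold_indices(n_samples, k=10, seed=0):
    """Uniform random split into k folds; yields (train_idx, test_idx) pairs
    so that each sample appears in the test set exactly once."""
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

# 2495 solute-solvent pairs split 9 : 1 across 10 independent runs.
splits = list(kfold_indices(2495, k=10))
```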
Solvation free energies calculated from the MNSOL database using the attentive BiRNN encoders are shown in Fig. 3 and 4. Prediction errors for the BiLSTM model are ±0.57 kcal mol−1 in RMSE and ±0.30 kcal mol−1 in MAE, with a squared Pearson correlation coefficient R2 = 0.96; results from the BiGRU model indicate no meaningful difference between the two recurrent models. The encoder without BiRNN and attention layers produces much less accurate results, with errors of ±0.77 kcal mol−1 in RMSE and ±0.43 kcal mol−1 in MAE, and an R2 value of 0.92.
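For reference, the three error metrics reported here can be computed as follows; the toy free-energy values are illustrative only.

```python
import numpy as np

def error_metrics(y_true, y_pred):
    """RMSE, MAE, and the squared Pearson correlation coefficient R^2,
    the three metrics quoted for the MNSOL predictions."""
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return rmse, mae, float(r ** 2)

# Toy solvation free energies in kcal/mol (illustrative values only).
y_true = np.array([-3.2, -5.1, -0.8, -2.4])
y_pred = np.array([-3.0, -5.5, -1.0, -2.2])
rmse, mae, r2 = error_metrics(y_true, y_pred)
```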
We cannot directly compare our results with those of other ML models because Delfos is the first ML-based study using the MNSOL database. Nonetheless, several studies on aqueous systems have previously calculated solubilities or hydration free energies using various ML techniques and molecular descriptors.16,20,21,25,33,34 For comparison, we have tested our neural network model on the hydration free energy. A benchmark study from Wu et al.20 provides hydration energies of 642 small molecules for a group of QSPR/ML models. Their RMSEs were as low as 1.15 kcal mol−1, while our prediction from the BiLSTM encoder attains 1.19 kcal mol−1 for the same dataset and split method (see the ESI†). This result suggests that our neural network model delivers considerably good performance even for the specific solvent of water.
Meanwhile, among studies that are not ML-based, several classical and quantum-mechanical simulation studies use the MNSOL as the reference database.3–6,8,11,13 In Table 1, we choose two DFT studies that employ several widely used QM solvation models3,8 for comparison with our proposed ML model: solvation model 8/12 (SM8/SM12), the solvation model based on density (SMD), and the full/direct conductor-like screening model for realistic solvation (COSMO-RS/D-COSMO-RS). While all of these QM methods exhibit excellent performance against a chemical accuracy of 1.0 kcal mol−1, full COSMO-RS is noteworthy since it is believed to be a state-of-the-art method with the best accuracy.9 This is realized by a statistical thermodynamics treatment of the polarization charge densities, which enables COSMO-RS to make successful predictions even in polar solvents, where the key idea of dielectric continuum solvation collapses.1,7,9 As a result, COSMO-RS calculations with the BP86 functional and TZVP basis set achieved mean absolute errors of 0.52 kcal mol−1 for 274 aqueous systems, 0.41 kcal mol−1 for 2072 organic-solvent systems, and 0.43 kcal mol−1 for the full dataset.8
| Solvent | Method | N data | MAE (kcal mol−1) | Ref. |
|---|---|---|---|---|
| Aqueous | SM12CM5/B3LYP/MG3S | 374 | 0.77 | Marenich et al.3 |
| | SM8/M06-2X/6-31G(d) | 366 | 0.89 | Marenich et al.3 |
| | SMD/M05-2X/6-31G(d) | 366 | 0.88 | Marenich et al.3 |
| | COSMO-RS/BP86/TZVP | 274 | 0.52 | Klamt and Diedenhofen8 |
| | D-COSMO-RS/BP86/TZVP | 274 | 0.94 | Klamt and Diedenhofen8 |
| | Delfos/BiLSTM | 374 | 0.64 | |
| | Delfos/BiGRU | 374 | 0.68 | |
| | Delfos w/o RNNs | 374 | 0.90 | |
| Non-aqueous | SM12CM5/B3LYP/MG3S | 2129 | 0.54 | Marenich et al.3 |
| | SM8/M06-2X/6-31G(d) | 2129 | 0.61 | Marenich et al.3 |
| | SMD/M05-2X/6-31G(d) | 2129 | 0.67 | Marenich et al.3 |
| | COSMO-RS/BP86/TZVP | 2072 | 0.41 | Klamt and Diedenhofen8 |
| | D-COSMO-RS/BP86/TZVP | 2072 | 0.62 | Klamt and Diedenhofen8 |
| | Delfos/BiLSTM | 2121 | 0.24 | |
| | Delfos/BiGRU | 2121 | 0.24 | |
| | Delfos w/o RNNs | 2121 | 0.36 | |
For the proposed ML models, Delfos with BiLSTM shows comparable accuracy in the water solvent, with an MAE of 0.64 kcal mol−1. Delfos makes much better predictions in non-aqueous organic solvents; machine learning for 2121 non-aqueous systems yields 0.24 kcal mol−1, which is 44% of the SM12CM5 error and 59% of the COSMO-RS error. However, one may argue that K-fold CV with a random split does not reflect the true prediction accuracy of the model; the random-CV results only indicate the accuracy for chemical structures the model has already encountered during training. Accordingly, one may ask the following questions. Will the ML model ensure comparable prediction accuracy for "structurally" new compounds? What happens if the ML model cannot learn sufficiently varied chemical structures? We discuss these questions in the next section.
Results from the solvent and solute cluster CV tasks shown in Table 2 provide generalized estimates of the expected error for new solvents or solutes that are not in the dataset. Winter et al.55 reported that a split method based on clustering exhibits an apparent degradation of prediction performance for various properties; we find that our proposed model exhibits a similar tendency. For the BiLSTM encoder model, the increments in MAE are 0.52 kcal mol−1 for solvent clustering and 0.69 kcal mol−1 for solute clustering. The reason why the random K-fold CV exhibits superior performance is obvious: if a solvent–solute pair (A, B) is in the test set and the training set contains the pairs (A, C) and (D, B), then both pairs could enhance the prediction accuracy for (A, B). Clustering, however, limits where a specific compound may appear: all pairs involving a given solvent or solute must lie entirely in either the test set or the training set.
| Solvent | Method | N data | MAE (kcal mol−1) | RMSE (kcal mol−1) | Ref. |
|---|---|---|---|---|---|
| All | COSMO/BP86/TZVP | 2346 | 2.15 | 2.57 | Klamt and Diedenhofen8 |
| | COSMO-RS/BP86/TZVP | 2346 | 0.42 | 0.75 | Klamt and Diedenhofen8 |
| | SMD/PM3 | 2500 | — | 4.8 | Kromann et al.6 |
| | SMD/PM6 | 2500 | — | 3.6 | Kromann et al.6 |
| | Delfos/random CV | 2495 | 0.30 | 0.57 | |
| | Delfos/solvent clustering | 2495 | 0.82 | 1.45 | |
| | Delfos/solute clustering | 2495 | 0.99 | 1.61 | |
| Toluene | MD/GAFF | 21 | 0.48 | 0.63 | Mohamed et al.11 |
| | MD/AMOEBA | 21 | 0.92 | 1.18 | Mohamed et al.11 |
| | COSMO/BP86/TZVP | 21 | 2.17 | 2.71 | Klamt and Diedenhofen8 |
| | COSMO-RS/BP86/TZVP | 21 | 0.27 | 0.34 | Klamt and Diedenhofen8 |
| | Delfos/random CV | 21 | 0.16 | 0.37 | |
| | Delfos/solvent clustering | 21 | 0.66 | 1.10 | |
| | Delfos/solute clustering | 21 | 0.93 | 1.46 | |
| Chloroform | MD/GAFF | 21 | 0.92 | 1.11 | Mohamed et al.11 |
| | MD/AMOEBA | 21 | 1.68 | 1.97 | Mohamed et al.11 |
| | COSMO/BP86/TZVP | 21 | 1.76 | 2.12 | Klamt and Diedenhofen8 |
| | COSMO-RS/BP86/TZVP | 21 | 0.50 | 0.66 | Klamt and Diedenhofen8 |
| | Delfos/random CV | 21 | 0.35 | 0.56 | |
| | Delfos/solvent clustering | 21 | 0.78 | 0.87 | |
| | Delfos/solute clustering | 21 | 1.14 | 1.62 | |
| Acetonitrile | MD/GAFF | 6 | 0.43 | 0.52 | Mohamed et al.11 |
| | MD/AMOEBA | 6 | 0.73 | 0.77 | Mohamed et al.11 |
| | COSMO/BP86/TZVP | 6 | 1.42 | 1.58 | Klamt and Diedenhofen8 |
| | COSMO-RS/BP86/TZVP | 6 | 0.33 | 0.38 | Klamt and Diedenhofen8 |
| | Delfos/random CV | 6 | 0.29 | 0.39 | |
| | Delfos/solvent clustering | 6 | 0.74 | 0.82 | |
| | Delfos/solute clustering | 6 | 0.80 | 0.94 | |
| DMSO | MD/GAFF | 6 | 0.61 | 0.75 | Mohamed et al.11 |
| | MD/AMOEBA | 6 | 1.12 | 1.21 | Mohamed et al.11 |
| | COSMO/BP86/TZVP | 6 | 1.31 | 1.42 | Klamt and Diedenhofen8 |
| | COSMO-RS/BP86/TZVP | 6 | 0.56 | 0.73 | Klamt and Diedenhofen8 |
| | Delfos/random CV | 6 | 0.41 | 0.44 | |
| | Delfos/solvent clustering | 6 | 0.93 | 1.19 | |
| | Delfos/solute clustering | 6 | 0.91 | 1.11 | |
For an additional comparison, Table 2 also contains results taken from SMD with semi-empirical methods,6 pure COSMO, COSMO-RS,8 and classical molecular dynamics11 for four organic solvents: toluene (C6H5CH3), chloroform (CHCl3), acetonitrile (CH3CN), and dimethyl sulfoxide ((CH3)2SO). Although MD is based on classical mechanics, the results for the general AMBER force field (GAFF) show that an explicit solvation model with a suitable force field can make considerably good predictions. The bottom line of the cluster CV is that, if the training dataset contains at least one side of the solvent–solute pair whose solvation free energy we want to estimate, the expected error of Delfos lies within a chemical accuracy of 1.0 kcal mol−1, which is typical of computer simulation schemes. The results for the four organic solvents also demonstrate that predictions from the cluster CV have an accuracy comparable to that of MD simulations using the AMOEBA polarizable force field.11
Results from the cluster CV highlight the importance of database preparation. As described earlier, the cluster CV causes a considerable increase in prediction error, and we suspect that the degradation mainly comes from the decline in diversity of the training set: with cluster CV, the neural network learns fewer substructures during training than with random CV. To test this speculation, we define unique substructures as substructures that exist only in the test cluster. As shown in Fig. 5, in the solute cluster CV, the MAE for the 1226 pairs whose solutes have no unique substructures is 0.54 kcal mol−1, while the prediction error for the remaining 1269 solutions is 1.64 kcal mol−1. The solvent cluster CV shows more extreme results: the MAE for the 374 aqueous systems is 2.48 kcal mol−1, while non-aqueous solvents exhibit 0.52 kcal mol−1. We believe that the outlying behavior of water is due to its distinctive nature: water has only one unique substructure, since its oxygen atom has no heavy-atom neighbors, so solvent clustering makes the network unable to learn the structure of water in any indirect way, resulting in prediction failure. This tells us that the most critical requirement is securing a training dataset that contains as many kinds of solvents and solutes as possible. We believe that computational approaches would be as helpful as experimental measurements for enriching the structural diversity of the training data, given recent advances in QM solvation models2,3,8 such as COSMO-RS. Furthermore, since the dataset used contains 418 solutes and 91 solvents,38 which make up 38,038 possible pairs, we expect Delfos and MNSOL to guarantee similar precision levels to the random CV for numerous systems.
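The unique-substructure analysis above amounts to a set comparison between the training and test vocabularies. A hypothetical helper (the function name and toy identifier lists are ours) could be:

```python
def unique_substructure_split(train_sentences, test_sentences):
    """Partition test molecules by whether they contain a Morgan identifier
    never seen in training (a 'unique substructure')."""
    seen = set()
    for sent in train_sentences:
        seen.update(sent)
    with_unique, without_unique = [], []
    for sent in test_sentences:
        (with_unique if set(sent) - seen else without_unique).append(sent)
    return with_unique, without_unique

train = [[1, 2, 3], [2, 4]]          # toy molecular sentences (identifier lists)
test = [[1, 4], [5, 2]]              # [5, 2] contains the unseen identifier 5
w, wo = unique_substructure_split(train, test)
```

Error statistics reported separately over the two partitions give the 0.54 vs. 1.64 kcal mol−1 comparison described above.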
Overall, the results in Fig. 6 imply that the chemical similarity taken from the attention layer has a significant connection to fundamental chemical knowledge such as polarity and hydrophilicity. Each alcoholic solvent has one hydrophilic –OH group, and the contribution of the nitro group in the solute increases as the hydrocarbon chain of the alcohol shortens. For the acetonitrile–nitromethane solution, the attention mechanism assigns the highest contributions to the –NO2 group, owing to the strong polarity and aprotic nature of the solvent. Although the attention mechanism seems to reproduce molecular interactions faithfully, we find one defective prediction that does not agree with chemical knowledge. The two oxygen atoms (O and –O−) of the nitro group are indistinguishable due to resonance and thus must have equivalent contributions in any solvent, but they show different attention scores in our model. We believe this problem occurs because the SMILES string of nitromethane (C[N+](=O)[O−]) does not encode the resonance in the nitro group. Indeed, the Morgan algorithm generates different identifiers for the two oxygen atoms, [864942730, 2378779377] for O and [864942795, 2378775366] for –O−. The absence of resonance may be a problem worth considering when one intends to use word embedding models with SMILES strings,33,34,55 although the estimated solvation energies for nitromethane from the BiLSTM model are within a moderate error range, as shown in Fig. 6.
We performed extensive calculations on 2495 experimental solvation energies taken from the MNSOL database.38 From the random-CV task, the mean absolute errors in solvation free energy for Delfos with BiLSTM are 0.64 kcal mol−1 for aqueous systems and 0.24 kcal mol−1 for non-aqueous systems. Our results demonstrate that the proposed model exhibits excellent prediction accuracy, comparable to that of several well-known QM solvation models,3,8 when the neural network is trained with sufficiently varied chemical structures, while the MLP model without recurrent or attention layers showed relatively deficient performance. The decline in performance of about 0.5 to 0.7 kcal mol−1 in the cluster CV tasks represents the accuracy for structurally new compounds and underlines the importance of ML database preparation, although even there Delfos remains comparable to some theoretical approaches such as MD with the AMOEBA force field11 or DFT with pure COSMO.8 The score matrix taken from the attention mechanism gives an interaction map between atoms and substructures; our model not only provides an estimate of the target property but also offers important information about which substructures play a dominant role in solvation processes.
One of the most useful advantages of ML is flexibility; a single model can learn and predict various databases.20 Our model may also be applied to predict various chemical, physical, or biological properties, especially those governed by interactions between two or more different chemical species. One possible application is the prediction of chemical affinity and the feasibility of various chemical reactions.56 Room-temperature ionic liquids might be another potential research topic because the interplay between cations and anions dominates their various properties, e.g., toxicity57 or electrochemical properties in supercapacitors.58,59 Thus, we expect that Delfos will be helpful in many further studies, not only in the prediction of solvation energies.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9sc02452b |
This journal is © The Royal Society of Chemistry 2019 |