I-Ting
Ho
^{a},
Milena
Matysik
^{b},
Liliana Montano
Herrera
^{c},
Jiyoung
Yang
^{d},
Ralph Joachim
Guderlei
^{b},
Michael
Laussegger
^{e},
Bernhard
Schrantz
^{e},
Regine
Hammer
^{f},
Ramón Alain
Miranda-Quintana
^{g} and
Jens
Smiatek
*^{hi}
^{a}Boehringer Ingelheim Pharma GmbH & Co. KG, Global Computational Biology and Digital Sciences, D-88397 Biberach (Riss), Germany
^{b}Boehringer Ingelheim GmbH & Co. KG, IT RDM Services, D-88397 Biberach (Riss), Germany
^{c}Boehringer Ingelheim Pharma GmbH & Co. KG, Global Innovation & Alliance Management, D-88397 Biberach (Riss), Germany
^{d}Boehringer Ingelheim Pharma GmbH & Co. KG, Analytical Development Biologicals, D-88397 Biberach (Riss), Germany
^{e}Boehringer Ingelheim RCV GmbH & Co. KG, IT RDM Services, A-1121 Vienna, Austria
^{f}Boehringer Ingelheim GmbH & Co. KG, OneHP Business Unit, D-88397 Biberach (Riss), Germany
^{g}Department of Chemistry and Quantum Theory Project, University of Florida, FL 32611, USA
^{h}Boehringer Ingelheim GmbH & Co. KG, Development NCE, D-88397 Biberach (Riss), Germany
^{i}Institute for Computational Physics, University of Stuttgart, Allmandring 3, D-70569 Stuttgart, Germany. E-mail: smiatek@icp.uni-stuttgart.de
First published on 10th November 2022
We present explainable machine learning approaches for the accurate prediction and understanding of solvation free energies, enthalpies, and entropies for different salts in various protic and aprotic solvents. As key input features, we use fundamental contributions from the conceptual density functional theory (DFT) of solutions. The most accurate models with the highest prediction accuracy for the experimental validation data set are decision tree-based approaches such as extreme gradient boosting and extra trees, which highlight the non-linear influence of feature values on target predictions. The detailed assessment of the importance of features in terms of Gini importance criteria as well as Shapley Additive Explanations (SHAP) and permutation and reduction approaches underlines the prominent role of anion and cation solvation effects in combination with fundamental electronic properties of the solvents. These results are reasonably consistent with previous assumptions and provide a solid rationale for more recent theoretical approaches.
The detailed analysis of data fitting approaches is not new and was already discussed in the context of multifactorial linear regression methods^{4} and the more advanced analysis of complex parametric models in terms of global sensitivity analysis.^{11} Recent examples of analysis algorithms include Shapley additive explanations (SHAP),^{12} local interpretable model-agnostic interpretations (LIME)^{12,13} or the ‘explain like I am 5’ (ELI5) approach which can be regarded as a feature reduction and permutation method.^{14} Thus, as was shown in previous examples, the model-agnostic interpretation of feature importances provide deeper insights into the governing correlations.^{1–10} Notably, these concepts were applied for a broad range of problems,^{1} but their use for chemistry and physics-related topics is rather scarce. This is even more surprising as a lot of non-linear correlations in nature are still not fully understood, such that machine learning was discussed as a promising tool for the detection of hidden patterns and interactions.^{15–17}
A valuable application of explainable machine learning is in particular the study of interactions in solutions. Despite the easily observable effects, most prominent but still often discussed mechanisms include the solvation of ions, solute–solvent interactions, properties of complex mixtures and co-solute and specific ion effects among others.^{18–28} It was often argued that the complexity of interactions hampers a consistent theoretical description such that only approximate solutions can be introduced.^{19,20,27,29–31} In recent articles, we developed a conceptual density functional theory (DFT) of solutions^{32–34} which focuses on the electronic properties of species in order to rationalize specific ion effects,^{33} ion pair stabilities,^{27,34} calculation of donor numbers,^{35} beneficial combinations of solvent–ion pairs for battery applications^{28} and the influence of co-solutes and ions on the stability of macromolecules.^{26} In addition to electronic properties and expressions for energy changes, further chemical reactivity descriptors including the chemical hardness, the electronegativity and electrophilicity were also introduced. The corresponding descriptors already gained their merits in the explanation, evaluation and understanding of various chemical reaction principles.^{36–44} In a previous article, various descriptors from conceptual DFT in combination with experimental parameters were also used for the prediction of solvation energies by artificial neural networks.^{45} Notably, the previous study mainly focused on high predictive accuracies such that explainable machine learning concepts in accordance with feature analysis were ignored.
In this article, we evaluate various machine learning methods for the prediction of free solvation energies, solvation enthalpies and solvation entropies of distinct ion pairs in various solvents. The corresponding feature contributions for the machine learning models are limited to electronic properties and expressions from the conceptual DFT approach for solutions. The results of the best models achieve a rather high predictive accuracy for the experimental validation set which provides an in-depth analysis of feature importances in terms of SHAP, Gini importance criteria and ELI5 values. As main outcomes, the obtained results for the most important features underline recent theoretical descriptions and highlight the validity of the conceptual DFT approach.
The article is organized as follows. The theoretical background for the conceptual DFT approach in addition to basic remarks on SHAP, ELI5 and Gini importance criteria are presented in the next section. The computational details as well as properties of the training and validation data set are discussed in Section 3. All results will be presented and discussed in Section 4. We conclude and summarize in the last section.
(1) |
As a prerequisite for chemical reactions, the energy change of isolated molecules upon electronic perturbation can be written as a truncated Taylor series according to
(2) |
(3) |
(4) |
η ≃ I − A = E_{LUMO} − E_{HOMO}, | (5) |
A_{S} + B_{S} ⇌ (A·B)_{S} | (R1) |
(6) |
(7) |
(8) |
(9) |
The Gini importance value is a statistical measure to estimate the inequality of a distribution. The concept is widely applied for decision-tree based machine learning approaches. Here, the Gini coefficient estimates the weighted amount of a feature by the number of samples used to split a node into different trees.^{58} It typically takes values between 0 and 1, so a value of 1 can be interpreted as only one corresponding feature being important for all splits in the decision tree.
ELI5 can be considered as a feature reduction and permutation method.^{14} Thus, the validated model is iteratively reduced in terms of the individual features and the predictive accuracy is computed. With regard to this approach, one can identify the most important features from a gobal analysis point of view.
Thus, we took all chemical reactivity descriptors and derived values as features from the conceptual density functional theory of solutions into consideration.^{27} In addition, using the differences in the cation- and anion solvation energies as features is motivated by recent experimental studies on the law of matching solvent affinities.^{24} Such an approach allows us to predict the corresponding free energies, enthalpies and entropies for new solvent-salt combinations even without any experimental values and a priori knowledge. All values can be computed from standard DFT calculations, such that our approach is also broadly applicable for the identification of beneficial new solvent–salt cominations. In consequence, we chose these features according to our theoretical models of charge transfer and solvation, since we have provided ample evidence that these quantities can be used to rationalize solvation processes.^{27} The corresponding correlation coefficients between the individual features are shown in the ESI.†
(10) |
Potential reasons for this observation were recently discussed in ref. 93. In more detail, decision-tree based models do not overly smooth the solution in terms of predicted target values. Most other machine learning models such as artificial neural networks (ANNs) transform the training data by smoothing the output with a kernel function like a Gaussian kernel for varying length-scale values. This reduces the impact of irregular patterns for the accurate learning of the predictive function. For small length scales, smoothing the target function on the training set significantly reduces the accuracy of tree-based models but hardly impacts that of ANNs. Moreover, it was shown that uninformative features do not affect the performance metrics of decision-tree based models as much as for other machine learning approaches. In terms of such findings, it becomes clear that decision-tree based models often outperform kernel-based approaches in terms of predictive accuracies.
The detailed predictions in terms of the best model are shown in Fig. 2. For all predictions, one can observe a maximum nRMSE value of 0.22 which demonstrates the high accuracy of all models. Most of the predicted values are located within one σ, such that one can notice only one outlier value related to ΔG_{sol} with an RMSE > 2σ.
In more detail, all larger nRMSE deviations with nRMSE > 1.2 for ΔG_{sol} can be attributed to LiBr in PC (nRMSE = 1.50), LiCl in water (nRMSE = 1.65), LiI in PC (nRMSE = 1.28), CsF in methanol (nRMSE = 1.21) and LiF in methanol (nRMSE = 2.22). Thus, it can be concluded that predictions for highly polar solvents including the ions Li^{+} and F^{−} are more challenging when compared to other combinations of species. Comparable conclusions can also be drawn for some outliers related to ΔH_{sol} predictions. The two outliers with nRMSE > 1.2 can be attributed to LiI in acetonitrile (nRMSE = 1.25) and LiCl in PC (nRMSE = 1.56). With regard to these slight deviations, it was already discussed that fluoride and lithium ions differ from simple charge transfer assumptions which can be rationalized by the small size of the ions and the resulting low polarizability.^{33}
Based on these values, one can conclude that the chosen features from our conceptual DFT approach allow us to predict key thermodynamic values with reasonable accuracy. Noteworthy, the highest accuracy is achieved for the ET model and the corresponding predictions for the solvation entropy. Such findings become even more important with regard to the fact that the conceptual DFT approach for solutions ignores all solution, bulk thermodynamic and multicomponent effects. For a more detailed understanding, we evaluated the individual models in terms of their feature importances.
Corresponding conclusions can also be drawn for the net SHAP values as shown in Fig. 4. The features with the highest SHAP values and thus the highest importance are identical with the Gini importance criteria for all three models. Notably, one can observe slight differences in the ordering of the remaining features. In more detail, the frontier molecular orbitals are located at the second and third rank for the predictions of ΔG_{sol}, while E^{S}_{HOMO} also becomes important with regard to the predictions for ΔH_{sol}. In addition to the net SHAP and the Gini importance values, the corresponding ELI5 importance values are presented in Table 1. As can be seen, the most important features like ΔΔE_{ASCS} for predictions of ΔG_{sol} and ΔH_{sol} and η_{S} for predictions of ΔS_{sol} are identical for all analysis methods. Slight differences can be observed for the remaining features. With regard to the rather low Gini, SHAP and ELI5 values for these features and the dominance of the most important feature, the slight discrepancies in terms of different ranking positions become understandable. Hence, clear conclusions on the importance can only be drawn for the most important features. Whereas the comparison between SHAP, ELI5 and Gini values allows us to estimate the net influence of features, the detailed evaluation of SHAP values as shown in Fig. 5 also provides the identification of the functional correlations.
Feature | XGB ΔG_{sol} | XGB ΔH_{sol} | ET ΔS_{sol} |
---|---|---|---|
ΔΔE_{ASCS} | 0.3852 (1) | 0.4709 (1) | 0.0189 (13) |
E ^{S}_{LUMO} | 0.1629 (2) | 0.0226 (10) | 0.0896 (4) |
E ^{S}_{HOMO} | 0.1320 (3) | 0.0681 (4) | 0.1698 (2) |
ΔE_{CA} | 0.0771 (4) | 0.0183 (12) | 0.0490 (6) |
E ^{A}_{LUMO} | 0.0633 (5) | 0.0599 (6) | 0.0117 (17) |
ΔΔE_{sol} | 0.0553 (6) | 0.0287 (9) | 0.0430 (7) |
ΔE_{AS} | 0.0378 (7) | 0.0353 (8) | 0.0297 (9) |
E ^{C}_{LUMO} | 0.0265 (8) | 0.0669 (5) | 0.0266 (11) |
ΣΔE_{ASCS} | 0.0174 (9) | 0.0794 (3) | 0.0619 (5) |
η _{A} | 0.0166 (10) | 0.0909 (2) | 0.0056 (18) |
E ^{A}_{HOMO} | 0.0152 (11) | 0.0378 (7) | 0.0170 (15) |
ΔE_{CS} | 0.0073 (12) | 0.0198 (11) | 0.0401 (8) |
E ^{C}_{HOMO} | 0.0034 (13) | 0.0013 (13) | 0.0181 (14) |
χ _{A} | 0 (14) | 0 (14) | 0.0154 (16) |
η _{S} | 0 (15) | 0 (15) | 0.1998 (1) |
χ _{C} | 0 (16) | 0 (16) | 0.0271 (10) |
η _{C} | 0 (17) | 0 (17) | 0.0235 (12) |
χ _{S} | 0 (18) | 0 (18) | 0.1532 (3) |
With regard to the predictions of ΔG_{sol} for the XGB model, it can be seen that high values of ΔΔE_{ASCS} are correlated with negative SHAP values and vice versa. Hence, it can be concluded that high ΔΔE_{ASCS} values have a strong positive impact on model predictions. The correlations between high and low values of E^{S}_{HOMO} and E^{S}_{LUMO} with negative and positive SHAP values are less obvious. For high values of these features, one can observe at least a clear separation between positive and negative SHAP values. In addition, the values for the solvation energy show a reasonable positive correlation, such that high values are correlated with positive SHAP values and vice versa. This correlation is not surprising, as the solvation energy was introduced with the main intention to compute thermodynamic solvation energies.
A comparable outcome in terms of antilinear correlations for ΔΔE_{ASCS} can also be observed for the model predictions of ΔH_{sol}. Moreovoer, one can observe that ΣΔE_{ASCS} shows a positive correlation between high values and positive SHAP values. The remaining features show a rather ambiguous correlation. Also the SHAP values for the model predictions of ΔS_{sol} show no clear trends. However, it has to be noted that high values of χ_{S} lead to negative SHAP values and thus lower values of ΔS_{sol}. Corresponding clear correlations can also be seen for E^{S}_{LUMO}, where low values lead to negative SHAP values. Notably, the most important feature η_{S} shows a rather ambiguous distribution of low and high values in terms of corresponding SHAP values. As can be seen, the standard deviations for all features in terms of the corresponding SHAP values are significantly broader when compared to the other models. The individual interpretation of these contributions is discussed in the next subsection.
Furthermore, we performed additional calculations for the same training data and models but with reduced feature sets. In more detail, we ignored the most important feature from the SHAP analysis for all considered models. The corresponding results are shown in the ESI.† As can be seen, the predictions for the reduced models slightly change. For the free energy of solvation it becomes clear that the prediction accuracy decreases slightly. In addition, the most important feature is now the cation–solvent solvation energy. This finding is valid for all three reduced models and accordingly for the predictions for the solvation free energy, enthalpy, and entropy, and can be explained by the large energetic differences between the solvent and the cation. Accordingly, the solvation energies of these species dominate the feature set. Only for the predictions of the free enthalpy a slight increase in the prediction accuracy after reduction of the feature set can be observed, while the results for the free entropy of solvation show a lower prediction accuracy. Therefore, one cannot conclude that by ignoring the most important feature, the second most important feature automatically becomes dominant. In addition, it can be seen that reducing the feature space for the most important contributor most often decreases the prediction accuracy. Since one is usually not aware of the relevant characteristics beforehand, such a manual check is definitely not useful, as it introduces a bias. Accordingly, these feature set results highlight the nonlinear correlations between the input features and the output target values.
Further conclusions can also be drawn for the importance of E^{S}_{HOMO} and E^{S}_{LUMO} in terms of predictions for ΔG_{sol}. Notably, the corresponding frontier molecular orbital energies can be interpreted as vertical electron affinity and ionization potential of solvent molecules which thus point to charge redistribution effects upon interaction.^{37} As was recently discussed,^{32,34} charge transfer is an essential contribution to solvation interactions and the formation of solvation bonds. In consequence, the HOMO and LUMO energies are pointing to the importance of the electronic solvation properties for solvents as also discussed in the context of donor and acceptor numbers.^{20,28,32} Based on these findings, we conclude that the predictions for the free solvation energy and enthalpy are highly dominated by the electronic properties of the solvent. Such assumptions were already discussed in the conceptual DFT approach for solutions^{26} and the corresponding results of our analysis fully validate this concept. Despite the fact that conceptual DFT calculations show some slight deviations to experimental results due to crucial approximations, we showed that the underlying framework reflects a reasonable qualitative agreement in terms of highly accurate model predictions in combination with the corresponding feature analysis.
A slightly different conclusion can be drawn for the model analysis of solvation entropy predictions. The high accuracy of the predictions is remarkable with regard to the fact that only the electronic properties of the molecules in combination with solvation energies are taken into consideration. Hence, further potentially important features reflecting bulk thermodynamic properties, the composition of the solution as well as environmental factors like temperature and pressure are not considered in our models. As can be seen, the most important features η_{S}, E^{S}_{HOMO} and χ_{S} for the model predictions are strongly related to the electronic properties of the solvent molecules. Notably, solvation entropies are crucially dominated by the local arrangement of the species in the solution in combination with kinetic restrictions.^{19} As already discussed, the chemical hardness of the solvent as the most relevant feature is an inverse measure for the polarizability α. It was recently shown that larger molecules usually reveal lower chemical hardnesses and thus higher polarizabilities.^{27} One can thus assume that the importance of this feature for the model predictions points to the impact of the size and strength of molecular interactions between the solute species and the solvent molecules. Hence, it can be assumed that the thermodynamic solvation entropy is crucially affected by the electronic properties of the solvent molecules. As already known, molecules with varying polarizabilities differ in their position and their orientation around solute species.^{26} In addition, it was shown that the polarizabilities also affect the dipole moment which thus influences the orientation of the solvent molecules around the solute species. Furthermore, the influence of the solvent HOMO energy E^{S}_{HOMO} can also be interpreted as an influence of the ionization potential, whereby this feature describes the electron donating properties of the solvent. As already shown in earlier publications,^{32,35} this behavior can also be classified by the donor numbers, which allow the binding energies of the solvent to be estimated accordingly. In consequence, different binding strengths can also be translated into different flexible local orientations of the solvent molecules around the solute species as well as differences in the dynamic binding times. Comparable conclusions can also be drawn for the electronegativity χ_{S}, which estimates the Lewis acidity or basicity of the solvent molecules. With regard to previous discussions,^{34} it can be concluded that these properties of the solvent play an important role for certain SWAB principles and thus provide an estimate for the strength of cation–anion interactions as well as the individual solvation energies between the solvent molecules and the ions. Notably, it was proven^{94} that the HSAB principle can be derived from the “|Δμ| big is good” (DMB) rule. In a recent paper, we were also able to show that the SWAB principles can be derived from the DMB condition.^{95} Hence, one can conclude that the SWAB and the HSAB principles are closely connected to the DMB condition which is known as a cardinal reactivity principle. In summary, it can be concluded that the electronic properties of the solvent molecules as well as their polarizability play an important role for the resulting entropy values in terms of distinct ion pair solvent combinations.
The high accuracy of the machine learning methods allowed us to shed light on the most important features. The corresponding results for the free solvation energy and enthalpy highlighted a prominent influence of the differences in the solvation energies for the cations and the anions. These findings are in reasonable agreement with recent discussions in terms of the ‘law of matching solvent affinities’.^{23,24} This concept can be generalized to the SWAB principle, which states that ions with comparable electronegativity differences to the solvent form the strongest ion pairs with the highest solvation enthalpies.^{32–34} Thus, one can directly conclude from our feature analysis that this principle governs the solvation of ion pairs in solution as the most important parameter. Moreover, we have shown that the ionization potential and the electron affinity of the solvent further dominate the solvation enthalpies and energies. Such results can be brought into relation with recent discussions about solvent influences and the role of donor and acceptor numbers which further provide the identification of reasonable solvent–ion combinations.^{28,32}
Finally, it should be noted that even entropies of solvation can be predicted with high accuracy in terms of the conceptual DFT approach. In more detail, solvation entropies are bulk thermodynamic properties which rely on the composition, orientation and dynamics of the species in solution. Our explainable machine learning approach unraveled the reasons for the high accuracy of the predictions. In more detail, the chemical hardness, the HOMO energy as well as the electronegativity of the solvent play a key role. The chemical hardness can be interpreted as an inverse polarizability which can be attributed to the molecular size of the solvent species. It was discussed that differences in the solvent molecular size and shape crucially affect the solution entropies.^{19} Corresponding conclusions are also valid for all electron-donating or accepting properties of solvents in terms of electronegativity values. The pronounced importance of these effects highlights the crucial role of local intermolecular interactions in terms of charge transfer mechanisms. The outcomes of our explainable machine learning analysis thus validate recent theoretical assumptions.^{26–28,32–35}
Our results show that explainable machine learning approaches in combination with theoretical frameworks provide deeper insights into fundamental principles. In addition to accurate predictions, one can analyze the main features with the highest importance which gives insights into the dominant contributions and governing mechanisms. In consequence, it can be concluded that the viable combination of machine learning methods with experimental data, theoretical considerations and feature importance analysis may lead to the exploration of unknown relations and a deeper understanding of correlations. These conclusions are not unique to our study, so they can be generalized to all aspects of science where a reasonable amount of data, highly predictive models, and some basic knowledge of relationships and feature meanings are available.
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2cp04428e |
This journal is © the Owner Societies 2022 |