Liquid electrolyte informatics using an exhaustive search with linear regression

Exploring new liquid electrolyte materials is a fundamental target for developing new high-performance lithium-ion batteries. In contrast to solid materials, the properties of disordered liquid solutions have been less studied by data-driven information techniques. Here, we examined the estimation accuracy and efficiency of three information techniques, multiple linear regression (MLR), least absolute shrinkage and selection operator (LASSO), and exhaustive search with linear regression (ES-LiR), by using coordination energy and melting point as test liquid properties. We confirmed that ES-LiR gives the most accurate estimation among the techniques. We also found that ES-LiR can provide the relationship between the "prediction accuracy" and "calculation cost" of the properties via a weight diagram of descriptors. This technique makes it possible to choose the balance between "accuracy" and "cost" when a search over a huge number of new materials is carried out.


Introduction
Computational material design with data-driven information techniques has recently become popular in materials research. 1 The materials for next-generation lithium-ion batteries (LIBs) are representative targets. Future LIBs require a higher voltage, a higher capacity, and a longer cycle life and need to be safer. 2,35][6] However, new "electrolyte" materials, typically consisting of liquid solvents and Li-salts, have not appeared for commercial use since 1991. This is because the search for liquid materials is more difficult than that for solid materials owing to the disordered structure of liquids. ][9]
In order to discover new liquid electrolytes with desirable properties, virtual screening with a data-driven information technique is one possible option. In this screening, a database of the features of materials, called descriptors, is first constructed with data from first-principles calculations or molecular dynamics simulations and/or experiments. Next, we determine the estimation rule (fitting equation) that predicts the target properties from the selected descriptors in the database by using the information techniques. Finally, we screen a huge number of candidate materials with the rule. ][16]
To extract the estimation rule for predicting the target properties, we have to select descriptors using data-driven techniques. This is called the variable selection problem. In general, multiple linear regression (MLR), 17 in which all the descriptors are used for the estimation, is the most standard treatment for estimating the properties of materials. However, irrelevant and redundant descriptors do not contribute to the accuracy of a predictive model and may in fact decrease it. Thus, we have to remove these descriptors. Moreover, fewer descriptors are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.
0][21] Although the exhaustive search (ES) method comes at the expense of a computational complexity of at least O(2^N), where N is the number of descriptors, it remains practical up to about N = 30 and can select the best descriptors for predicting the target properties. In this study, we apply the ES method to linear regression and propose a set of descriptor combinations that can produce better estimations. For comparison, we also apply the least absolute shrinkage and selection operator (LASSO), 22 which uses an L1-norm regularization term, as a standard approximate method for sparse variable selection; its computational complexity is O(N^3).
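To make the exponential growth concrete, the short sketch below counts the non-empty descriptor subsets that an exhaustive search has to evaluate. The function name `n_combinations` is introduced here purely for illustration.

```python
def n_combinations(n_descriptors: int) -> int:
    """Number of non-empty descriptor subsets an exhaustive search must evaluate."""
    return 2 ** n_descriptors - 1


if __name__ == "__main__":
    # 10 descriptors are used in this study; 30 is the practical limit quoted in the text.
    for n in (10, 20, 30):
        print(f"N = {n:2d}: {n_combinations(n):,} combinations")
```

For N = 10 this is 1023 regressions, whereas N = 30 already requires about 10^9.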
In the search for LIB liquid electrolytes, the evaluation of ion transport properties and electrochemical stability is indispensable. For the transport, solvation to and desolvation from Li-ions at the electrolyte/electrode interface play a crucial role, and thus the coordination energy of the solvent to Li-ions is an important measure. To keep the electrolyte in the liquid state for fast Li-ion transport, the melting point of the electrolyte is also a fundamental property. For the electrochemical stability, quantities such as the ionization potential and electron affinity are significant. Here, however, we focus on the quantities related to Li-ion transport as the first target.
In this study, we investigated the estimation accuracy of the MLR, LASSO, and ES-LiR techniques in the search for liquid electrolyte materials. We estimated the coordination energies and melting points as the required properties of the LIB liquid electrolytes and discussed the descriptors extracted by LASSO and ES with linear regression (ES-LiR). The strategy of the ES-LiR method will be useful and applicable in the search for liquid electrolytes with other desired properties.

Database
To predict novel LIB liquid electrolytes with desired properties by the information techniques, we constructed a database of known liquid electrolytes. We selected 103 solvent molecules which were commercialized as battery-grade materials by KISHIDA Chemical Co., Ltd. 23 We adopted the values of melting point, boiling point, flash point, density of solvent, and molecular weight from the catalogue data. Representative solvent molecules are shown in Scheme 1 and the complete list is shown in Scheme S1 of the ESI.†

Cluster model calculations
To make the database of the electrolytes more substantial, we added the following values obtained by density functional theory (DFT) calculations of the molecular systems using the Gaussian 09 code: 24 the coordination energy between a Li-ion and a solvent molecule, the Mulliken charge of the atom (typically an oxygen atom) that is coordinated to a Li-ion, the distance between a Li-ion and the coordinated atom (typically the Li-O distance, R(Li-O)), the HOMO energy, the LUMO energy, and the dipole moment of the 103 solvent molecules. The calculated data of the representative solvent molecules are shown in Table 1, and the complete data are listed in Table S1 in the ESI.† The coordination energy (E_coord) is evaluated as the difference between the total energy of a Li-solvent complex and the sum of the total energies of a solvent molecule and a Li-ion (E_coord = E(Li-solvent) − {E(solvent) + E(Li-ion)}). We adopted the B3LYP functional 25 with cc-pVDZ basis sets. 26 The Mulliken charges and the dipole moments are obtained from the DFT calculations of pure solvent molecules without Li-ions. Geometry optimizations of the Li-solvent complexes and the pure solvent molecules were also carried out. In this study, a total of 10 descriptors (explanatory variables) were adopted for the database. Several values are missing from the catalogue, and we omitted them from the prediction. When the catalogue gives a range rather than a single value, we used the average.
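The definition of E_coord and the handling of catalogue entries can be sketched as follows. This is a minimal illustration, not the authors' workflow: it assumes the DFT total energies are given in hartree and converted to kcal mol^−1, and the function names `coordination_energy` and `catalogue_value` are hypothetical.

```python
# Assumption: total energies are in hartree; 1 hartree is about 627.5095 kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.5095


def coordination_energy(e_complex, e_solvent, e_li_ion):
    """E_coord = E(Li-solvent) - {E(solvent) + E(Li-ion)}, returned in kcal/mol."""
    return (e_complex - (e_solvent + e_li_ion)) * HARTREE_TO_KCAL_PER_MOL


def catalogue_value(entry):
    """Reduce a catalogue entry to a single number.

    None          -> None (missing value; the sample is omitted from the fit)
    (low, high)   -> average of the two bounds
    single number -> the number itself
    """
    if entry is None:
        return None
    if isinstance(entry, tuple):
        low, high = entry
        return (low + high) / 2.0
    return float(entry)
```

A negative E_coord means that the Li-solvent complex is more stable than the separated solvent molecule and Li-ion.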

Data-driven information techniques
We applied the data-driven information techniques of MLR, LASSO, and ES-LiR to the electrolyte materials search. MLR is a typical supervised machine learning technique for predicting the values of properties. The method represents the relationship between the given property values, called explanatory variables, and the target value for the prediction, called the dependent variable, by constructing a linear model. We denote the target value and the i-th explanatory variable as z and x_i (i = 1, ..., 10), respectively. We then assume that the relationship between them is linear and derive it by minimizing eqn (1), where w_i (i = 1, ..., 10) is the coefficient of the i-th explanatory variable.
As descriptors x_i for the prediction of the coordination energies, we adopted the following set of features: x_1 = boiling point, x_2 = density, x_3 = dipole moment, x_4 = flash point, x_5 = HOMO, x_6 = LUMO, x_7 = melting point, x_8 = molecular weight, x_9 = Mulliken charge, and x_10 = distance between the Li-ion and the coordinated oxygen atom. For the melting point prediction, x_7 is redefined as the coordination energy and the other descriptors are the same as in the former case.
LASSO is also a supervised machine learning method. The linear fitting equation is the same as that of the MLR method, but LASSO adds a penalty term, expressed as the second term of eqn (2).
In eqn (2), λ is the penalty parameter, and the penalty term is linear in the absolute values of the coefficients (L1 norm). This method is a sparse estimation technique and can minimize the error function with an extracted subset of descriptors. If λ is sufficiently large, some of the coefficients are driven to zero, leading to a sparse model in which the corresponding descriptors play no role. On the other hand, for λ = 0 the results are the same as those of MLR. The penalty term allows complex models to be trained on data sets of limited size without severe over-fitting.
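For orientation, the least-squares objective of MLR and the L1-penalized objective of LASSO are commonly written in the following form. The sample index μ, the number of samples p, and the absence of a normalization prefactor are assumptions of this sketch and need not match the notation of eqn (1) and (2); λ denotes the penalty parameter.

```latex
\begin{align}
E_{\mathrm{MLR}}(w) &= \sum_{\mu=1}^{p} \Bigl( z^{\mu} - \sum_{i=1}^{10} w_i x_i^{\mu} \Bigr)^{2}, \\
E_{\mathrm{LASSO}}(w) &= \sum_{\mu=1}^{p} \Bigl( z^{\mu} - \sum_{i=1}^{10} w_i x_i^{\mu} \Bigr)^{2}
  + \lambda \sum_{i=1}^{10} \lvert w_i \rvert .
\end{align}
```

Setting λ = 0 in the second expression recovers the first, which is the statement that LASSO then reproduces the MLR result.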
To determine a suitable value of the penalty parameter, λ, we use cross validation (CV), which approximately estimates the prediction error from the limited data. For the CV, the data from the database are divided into training data and validation data to evaluate the prediction accuracy. By iterating this training and validation process with different divisions, the CV error is obtained with less variability. We carried out 10-fold cross validation (10 iterations) and chose the value of λ that minimized the CV error. In this study, the CV error of LASSO is derived from the coefficients in eqn (2), which are affected by the optimal penalty parameter.
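The selection of the penalty parameter by cross validation can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the procedure used in the paper: the LASSO problem is solved by coordinate descent for an objective of the form "sum of squared residuals plus λ times the sum of absolute coefficients", the data are assumed to be centred (no intercept), folds are assigned by `i % k`, and the CV error is taken to be the root-mean-square error on held-out samples. All function names are hypothetical.

```python
def soft_threshold(rho, t):
    """Shrink rho towards zero by t; values within [-t, t] become exactly zero."""
    if rho > t:
        return rho - t
    if rho < -t:
        return rho + t
    return 0.0


def lasso_fit(X, z, lam, n_sweeps=200):
    """Coordinate descent for sum((z - Xw)^2) + lam * sum(|w|), no intercept."""
    n_features = len(X[0])
    w = [0.0] * n_features
    residual = list(z)  # equals z - Xw for the initial w = 0
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(n_features)]
    for _ in range(n_sweeps):
        for j in range(n_features):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes feature j.
            rho = sum(row[j] * (r + w[j] * row[j]) for row, r in zip(X, residual))
            new_wj = soft_threshold(rho, lam / 2.0) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [r - delta * row[j] for row, r in zip(X, residual)]
                w[j] = new_wj
    return w


def cv_error(X, z, lam, k=10):
    """Root-mean-square prediction error over k held-out folds (fold = i % k)."""
    sq_err = 0.0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        w = lasso_fit([X[i] for i in train], [z[i] for i in train], lam)
        for i in held_out:
            pred = sum(wj * xj for wj, xj in zip(w, X[i]))
            sq_err += (z[i] - pred) ** 2
    return (sq_err / len(X)) ** 0.5


def choose_lambda(X, z, candidates, k=10):
    """Return the candidate penalty with the smallest CV error."""
    return min(candidates, key=lambda lam: cv_error(X, z, lam, k))
```

In practice the candidate values of λ would be scanned on a logarithmic grid, and a library implementation would normally replace the hand-written solver.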
We then consider the proposed sparse estimation technique, ES-LiR. Assuming that the coefficients are sparse, namely, that only a small number of them are non-zero, we estimate which coefficients of the explanatory variables are non-zero. To be more precise, let N be the number of explanatory variables. In ES-LiR, in contrast to LASSO, whether each coefficient is zero or not is determined by exhaustively evaluating all 2^N − 1 combinations of the N explanatory variables. For each combination, the values of the non-zero coefficients are determined by the least squares method and the cross-validation error (CVE) is calculated. Finally, we obtain the optimal set of non-zero elements. This approach requires a longer calculation time than MLR and LASSO. In this study, the data set is not large, so the ES-LiR method can be applied easily.
We formulate the exhaustive search for the linear regression problem (ES-LiR) by using an indicator variable that represents a combination of non-zero explanatory variables. The indicator is defined as an N-dimensional binary vector, c = (c_1, ..., c_N). Each variable c_i takes 0 or 1: c_i = 1 if the i-th variable belongs to the combination and c_i = 0 if it does not. Using the indicator, c, we can write the linear regression problem as the minimization, over the coefficients w_i, of the squared error of the model in which each term w_i x_i is multiplied by c_i and the error is summed over the p samples. This formulation makes the essence of the problem more explicit, and the best c for modeling and predicting a target variable, z, is searched for by minimizing the CVE in ES-LiR.
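The exhaustive search itself can be sketched as follows. This is a minimal illustration under assumptions that are not taken from the paper: ordinary least squares is solved through the normal equations without an intercept (centred data), folds are assigned by `i % k`, the CVE is the root-mean-square error on held-out samples, and subsets whose normal equations are singular are skipped. A subset of column indices plays the role of the indicator vector c (index j is in the subset when c_j = 1).

```python
from itertools import combinations


def solve_linear(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular normal equations")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * w[c] for c in range(r + 1, n))
        w[r] = (M[r][n] - s) / M[r][r]
    return w


def least_squares(X, z):
    """Ordinary least squares via the normal equations (X^T X) w = X^T z."""
    n = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(n)] for a in range(n)]
    Xtz = [sum(row[a] * t for row, t in zip(X, z)) for a in range(n)]
    return solve_linear(XtX, Xtz)


def cv_error(X, z, subset, k=10):
    """Root-mean-square held-out error using only the columns in `subset`."""
    sq_err = 0.0
    for fold in range(k):
        train = [i for i in range(len(X)) if i % k != fold]
        held_out = [i for i in range(len(X)) if i % k == fold]
        X_train = [[X[i][j] for j in subset] for i in train]
        w = least_squares(X_train, [z[i] for i in train])
        for i in held_out:
            pred = sum(wj * X[i][j] for wj, j in zip(w, subset))
            sq_err += (z[i] - pred) ** 2
    return (sq_err / len(X)) ** 0.5


def es_lir(X, z, k=10):
    """Evaluate every non-empty subset; return (cv_error, subset) pairs, best first."""
    n = len(X[0])
    results = []
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            try:
                results.append((cv_error(X, z, subset, k), subset))
            except ValueError:
                continue  # collinear descriptors in some fold: skip this subset
    return sorted(results)
```

Because the full sorted list is returned rather than only the minimum, the same output can feed a histogram of CV errors or a ranking of the top combinations.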
It is easy to see that the ES method becomes intractable for a large number of descriptors. To reduce the computational load, it is effective to use sampling methods, such as the Markov chain Monte Carlo (MCMC) method and the replica exchange Monte Carlo (REMC) method. In our previous study, 21 to deal with this difficulty, we proposed the approximate exhaustive search (AES) method for linear regression, which uses such sampling methods.
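The idea of sampling indicator vectors instead of enumerating them can be illustrated with a single-temperature Metropolis sampler. This is a simplified sketch, not the AES method of ref. 21 and not a replica exchange scheme: the `cost` callable (for example a function returning the CVE of a combination), the temperature, and the step count are all assumptions chosen for illustration.

```python
import math
import random


def metropolis_search(cost, n_vars, n_steps=2000, temperature=1.0, seed=0):
    """Sample binary indicator vectors with single-bit flips; return the best one seen."""
    rng = random.Random(seed)
    state = [rng.randint(0, 1) for _ in range(n_vars)]
    if not any(state):
        state[0] = 1  # keep at least one descriptor selected
    current = cost(state)
    best_state, best_cost = state[:], current
    for _ in range(n_steps):
        i = rng.randrange(n_vars)
        proposal = state[:]
        proposal[i] = 1 - proposal[i]
        if not any(proposal):
            continue  # the empty combination is not a valid model
        new = cost(proposal)
        # Always accept improvements; accept worse states with Boltzmann probability.
        if new <= current or rng.random() < math.exp(-(new - current) / temperature):
            state, current = proposal, new
            if current < best_cost:
                best_state, best_cost = state[:], current
    return best_state, best_cost
```

A replica exchange variant would run several such chains at different temperatures and periodically swap their states, which helps the sampler escape local minima.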

Coordination energy prediction
The correlation between the calculated coordination energies and those estimated by MLR, LASSO, and ES-LiR is shown in Fig. 1, and the predicted values are shown in Table 2. In these data, the estimated values correlate well with the true values (DFT calculated data). For the samples with the lowest coordination energies of around −100 kcal mol^−1 (true value), the estimation accuracy is not high. These solvents are 12-crown-4 ether and 18-crown-6 ether, as shown in Tables S1 and S2 (ESI †). They coordinate to Li-ions through four or more oxygen atoms of the solvent. Thus, the coordination manner is different from that of the other solvents, which may account for the low estimation accuracy of the coordination energy.
The CV errors of the MLR, LASSO and ES-LiR methods were calculated to be 10.2, 9.18, and 8.78 kcal mol^−1, respectively (Table 3). This suggests that the prediction accuracy of ES-LiR is the best among the three methods. The accuracy is mainly affected by the quality of the descriptor choice and the selection of the data-driven technique. Regarding the choice of descriptors, we can generate descriptors from first-principles calculation results to improve the prediction accuracy, though too many descriptors may cause over-fitting in some information techniques and decrease the accuracy, especially in the MLR case. The ES-LiR method considers all combination patterns of the descriptors, and over-fitting is easily detected as a lower prediction accuracy of the affected combinations. With ES-LiR, we therefore do not have to worry about the selection of the information technique. The remaining route to improving the prediction accuracy is to increase the number of descriptors. Fig. 2 shows the histogram of the CV errors of the descriptor combinations calculated by the ES-LiR method. The histogram contains not only the optimal solution but all the solutions, which enables us to map onto it the solutions of various machine learning and data-driven methods as well as scientists' hypotheses, and thus to evaluate these methods and hypotheses. 21 Fig. 2 also shows the CV errors of MLR and LASSO and the best value of ES-LiR. This suggests that LASSO, which has been widely used in recent studies, is not the best prediction method and that the descriptors it extracts are not the best combination (Table 3) among the combinations with small CVEs.
The ES-LiR method not only minimizes the CVE but also derives the CVE for all combinations, so the whole picture can be seen. Using this whole picture, the ES-LiR method can be used to construct the weight diagram, which shows the 25 best combinations of the descriptors, as shown in Fig. 3. The weight diagram reveals the stability of the important descriptors for the estimation, even if the error is at the same level as that of the other methods. Each colour represents the fitted coefficient of each descriptor, which shows its importance for the coordination energy prediction. The white blocks of the map correspond to the descriptors that are not adopted for the prediction. From these data, the Mulliken charge is the significant descriptor for the coordination energy prediction, and the flash point and R(Li-O) can also contribute to it. The coordination energy is highly affected by the Coulomb interaction between the Li cation and the oxygen atom, which carries a negative charge. Thus, the extraction of the Mulliken charge as a good descriptor fits our chemical intuition, even though Mulliken charge values are sometimes quantitatively sensitive to the choice of basis functions. R(Li-O) is also an obvious descriptor for the estimation of the solvation energy because the distance corresponds to the strength of the interaction between Li and O. On the other hand, the flash point is not an obvious descriptor. There might be a weak relationship between "the oxygen radical reaction for burning" and "the Li cation-solvent interaction", though the number of samples should be increased for such a discussion.
In materials informatics, the proper combination of descriptors changes depending on the purpose of the data analysis. In this paper, our goal is both to predict the coordination energy accurately and to reduce the calculation cost. The weight diagram (Fig. 3) lets us achieve both. As shown in Fig. 3, the 11th most accurate combination does not include the descriptor R(Li-O). To obtain the distance between Li and oxygen, additional Li-solvent complex calculations are required, whereas the other descriptors, density, flash point, and Mulliken charge, are obtained from catalogue data and solvent-only calculations. The difference between the first and 11th CV errors is quite small, 0.126 kcal mol^−1. This is not a significant difference when comparing the coordination energies of various solvents. According to Table 1, the target accuracy for coordination energies is of the order of 10^−1 kcal mol^−1. Then, if we choose the 11th best combination of descriptors ("Flash point" and "Mulliken charge"), we can halve the calculation cost because the extra calculation for obtaining R(Li-O) is omitted. This indicates that we can choose the balance between the "prediction accuracy" and the "calculation cost for obtaining the descriptors" in the combinatorial material search when we employ the ES-LiR method and calculate the histogram and weight diagram.
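The trade-off described above can be expressed as a simple selection rule: among all combinations whose CV error lies within a chosen tolerance of the best one, take the combination whose descriptors are cheapest to obtain. In the sketch below, the function name, the cost units (0 for catalogue data, 1 for each kind of DFT calculation), and the two-entry ranking are illustrative assumptions; only the CV error of 8.78 kcal mol^−1 and the difference of 0.126 kcal mol^−1 come from the text.

```python
def cheapest_within_tolerance(ranked, descriptor_cost, tolerance):
    """Pick the cheapest combination whose CV error is within `tolerance` of the best.

    ranked: list of (cv_error, descriptors) sorted best first.
    descriptor_cost: mapping from descriptor name to the cost of obtaining it.
    Ties in cost are broken by the smaller CV error.
    """
    best_error = ranked[0][0]
    admissible = [(err, desc) for err, desc in ranked if err - best_error <= tolerance]
    return min(
        admissible,
        key=lambda item: (sum(descriptor_cost[d] for d in item[1]), item[0]),
    )


if __name__ == "__main__":
    # Hypothetical cost units: catalogue data = 0, solvent-only DFT = 1, complex DFT = 1.
    cost = {"flash point": 0, "Mulliken charge": 1, "R(Li-O)": 1}
    ranked = [
        (8.78, ("flash point", "Mulliken charge", "R(Li-O)")),
        (8.78 + 0.126, ("flash point", "Mulliken charge")),
    ]
    print(cheapest_within_tolerance(ranked, cost, tolerance=0.2))
```

With a tolerance of 0.2 kcal mol^−1 the rule selects the combination without R(Li-O); tightening the tolerance to 0.1 kcal mol^−1 returns the most accurate combination instead.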

Melting point
Fig. 4 shows the correlation between the melting points from the catalogue data and the values estimated by MLR, LASSO, and ES-LiR. Their CV errors were 30.06, 29.75, and 28.49 °C, respectively (Table 4). Although the CV error of ES-LiR is still large, it is smaller than those of LASSO and MLR. According to the descriptors extracted by LASSO, density is one of the significant descriptors for the melting point. This matches chemical intuition because the density is highly related to the interaction between the solvent molecules in the liquid state, and the melting point is also highly affected by this interaction. Since LASSO is an approximate method, however, a choice of descriptors that matches the scientific background may be just a coincidence. It is possible that a completely different set of descriptors gives a more accurate estimation. In contrast, the ES-LiR method can propose a reliable set of descriptors because it ranks all combinations from the best to the worst estimation. Fig. 5 shows the histogram of all combination patterns of descriptors obtained by ES-LiR. Fig. 6 confirms that, at least within the top 25 combinations, density is one of the most important descriptors, and the flash point, molecular weight and Mulliken charge also contribute substantially to the melting point prediction.

Statistical significance of the differences in the CV errors
Let us consider the statistical significance of the differences in the CV errors of MLR, LASSO, and ES-LiR. For this evaluation, we calculated the CV error of ES-LiR for each data set under the same conditions as for LASSO. When applied to the coordination energy prediction, the CV errors of MLR, LASSO, and ES-LiR are 10.20, 9.18, and 6.34, respectively. We conducted paired-sample t-tests on the 10-fold CV errors of "MLR and ES-LiR" and of "LASSO and ES-LiR"; the p value was less than 0.001, indicating a statistically significant difference.
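The test statistic of a paired-sample t-test on per-fold CV errors can be computed as in the sketch below. The function name is hypothetical, and the per-fold errors of this study are not reproduced here, so the inputs must be supplied by the reader. Converting the statistic to a p value requires the Student t distribution, which the Python standard library does not provide and which is therefore left out.

```python
import math


def paired_t_statistic(errors_a, errors_b):
    """Return (t, degrees of freedom) for paired samples, e.g. per-fold CV errors.

    t = mean(d) / sqrt(var(d) / n), with d the pairwise differences and var the
    unbiased sample variance.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1
```

For 10-fold cross validation the test has 9 degrees of freedom, and a large positive t indicates that the first method has systematically larger fold errors than the second.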

Conclusions
In order to explore new LIB electrolyte materials, we investigated estimation procedures based on data-driven information techniques. We predicted the coordination energies and melting points of solvents by MLR, LASSO, and ES-LiR. Among them, ES-LiR gave the most accurate estimation of the properties. We found that ES-LiR makes it possible to choose the balance between the "prediction accuracy" and the "calculation cost to obtain the descriptors" when a combinatorial material search by virtual screening is carried out. This feature is general to materials exploration studies that use virtual screening. This treatment can be a key technique for future material searches.

Fig. 1 Coordination energies of 103 solvent molecules with true values (calculated by the first-principles method) and estimated values (calculated by data-driven techniques) of MLR, LASSO, and ES-LiR (the least error combination of the descriptors).

Fig. 2 Fig. 3
Fig. 2 Histogram of the CV errors of descriptor combinations obtained by the ES-LiR method for the coordination energy prediction. The smallest CV error of ES-LiR and the CV errors of LASSO and MLR are also shown.

Fig. 4 Melting points of 103 solvent molecules with true values (catalogue data) and estimated values (calculated by data-driven techniques) of MLR, LASSO, and ES-LiR (the least error combination of the descriptors).

Fig. 5 Histogram of the CV errors of descriptor combinations obtained by the ES-LiR method for the melting point prediction. The smallest CV error of ES-LiR and the CV errors of LASSO and MLR are also shown.

Fig. 6 Weight diagram of descriptors based on the 25 most accurate combinations of descriptors for the melting point prediction.

Table 1
Calculated values of the coordination energy (E_coord), the HOMO energy, the LUMO energy, the dipole moment, the Mulliken charge of the oxygen (nitrogen) atom, and the distance between the Li-ion and the oxygen (nitrogen) atom (R(Li-O)) of 25 solvent molecules for the database
This journal is © the Owner Societies 2018

Table 2
Estimated and first-principles calculation values of the coordination energies of solvents (kcal mol^−1)

Table 3
Cross-validation errors of the coordination energies and the extracted combination of descriptors of MLR, LASSO, and ES-LiR
Data-driven technique | Combination of descriptors | CV error (kcal mol^−1)
MLR | x_1 to x_10 | 10.2
LASSO | x_4, x_8, x_9, x_10 | 9.18
ES-LiR | x_4, x_9, x_10 | 8.78

Table 4
Cross-validation errors of the melting points and the extracted combination of descriptors of MLR, LASSO, and ES-LiR
Data-driven technique | Combination of descriptors | CV error (°C)
MLR | x_1 to x_10 | 30.06
LASSO | x_2 to x_10 | 29.75
ES-LiR | x_2, x_3, x_4, x_5, x_8, x_9 | 28.49