Learning the laws of lithium-ion transport in electrolytes using symbolic regression †

High-throughput experiments (HTE) enable fast exploration of advanced battery electrolytes over vast compositional spaces. Among the multiple properties considered for optimal electrolyte performance, the conductivity is critical. An analytical expression for ionic transport in electrolytes, accurate for practical compositions and operating conditions, would accelerate the process of (i) co-optimizing conductivity alongside other desirable electrolyte properties, and (ii) learning fundamental physical laws from data, which is one of the paramount goals of scienti ﬁ c big-data analytics. Here, we used symbolic regression with an HTE-acquired dataset of electrolyte conductivity and discovered a simple, accurate, consistent and generalizable expression. Notably, despite emerging from a purely statistical approach, the expression re ﬂ ects functional aspects from established thermodynamic limiting laws, indicating our model is grounded on the fundamental physical mechanisms underpinning ionic transport. We demonstrate the potential of using machine learning with HTE to ﬁ nd accurate and physically-sound models in complex systems without established physico-chemical theories.


Introduction
Non-aqueous aprotic formulations are state-of-the-art electrolytes for Li-ion batteries (LIBs) as they comply with the strict operation requirements for safety, life, reliability and performance.These electrolytes consist of a Li salt dissolved in a mixture of organic solvents, and complemented with performance-enhancing functional additives.][9][10][11] In this multi-objective optimization scenario, researchers in the eld would greatly benet from a predictive, thermodynamic model for electrolyte conductivity, enabling quick exploration of how a promising formulation would affect the electrolytes ionic conductivity without additional experiments.Such a model would ideally be denoted as a simple and universal closed-form expression; i.e., an equation with few algebraic terms, relating easily measurable variables with fundamental physical constants, and without tting parameters.
Despite signicant progress in the thermodynamic description of ionic transport, 12 such a "utopic" model only exists for highly dilute electrolytes.At innite dilution, the conductivity is simply directly proportional to the ion concentration in solution c. 13,14 However, this model fails at the dilute domain (0 < c < 10 À3 mol L À1 ) since the conductivity depends additionally on a squared root term of the conducting salt concentration. 15ohlrausch formulated these ndings into an empirical law with an adjustable parameter, 15,16 later addressed by Onsager by considering that ions are dragged not only by hydrodynamic effects, but also by electrophoretic and relaxation phenomena as in the Debye-Hückel theory.The Debye-Hückel-Onsager (DHO) theory effectively upgrades Kohlrausch's law into a fully theoretical law, without adjustable parameters: where k 0 is the limiting conductivity, A 1-3 enclose multiple constants, and T, 3 and h represent the solution's temperature, permittivity and viscosity, respectively. 18Despite the success of DHO theory on strong electrolytes, it fails at describing the concentrated (c > 1 mol L À1 ) and weak electrolyte formulations used in Li-ion batteries.In its place, researchers formulate expressions following two main approaches.Semi-empirical approaches extend non-electrolyte thermodynamic theories by including long-ranged ion-ion interactions from DHO theory. 19,20Instead, phenomenological approaches assume the conductivity to depend on electrolyte formulation and temperature via an arbitrarily-chosen functional expression (e.g.2][23] While these models might t the data well, they are ill-posed to generalize and provide little physical insight, given the arbitrary choice of functional expression and all the parameters that need to be adjusted for every new system.Alternatively, a new paradigm of electrolyte engineering employs machine learning run alongside HTE, capable of handling optimization in high dimensional spaces. 24,25However, these methods are usually not transferable, hence optimizing for other formulations requires performing new experiments; in addition, little can be learned from a scientic standpoint due to the black-box nature of the underlying process. In this work we propose an alternative approach -Symbolic Regression (SR)to nd an explainable and accurate model describing the transport of ions in non-aqueous electrolytes.While machine learning is being increasingly applied to battery research, [26][27][28] SR had remained largely unexplored despite promising results in other areas of materials science. 29In essence, SR simultaneously learns both adjustable parameters and the functional form relating electrolyte conductivity with its formulation.We make use of a HTE setup 30 to collect thousands of conductivity measurements of LiPF 6 -based electrolytes with ethylene carbonate (EC), propylene carbonate (PC) and ethyl methyl carbonate (EMC) as solvents at different temperatures.With a simple SR approach, we train multiple candidate expressions and show that a particular expression emerges as a clear candidate, complying with strict and oen competing criteria of accuracy, simplicity and consistency.

Electrolyte formulation
Lithium hexauorophosphate (LiPF 6 ), ethylene carbonate (EC), ethyl methyl carbonate (EMC) and propylene carbonate (PC) were used as received from E-Lyte Innovations (battery grade purity) without further purication.As electrolyte solvents, EC, PC and EMC were used, keeping the ratio of cyclic to linear organic carbonates constant at (EC + PC) : EMC 3 : 7 by weight.The PC (PC À1 + EC) fraction was varied between 9 and 100 mol%.The concentration of the conducting salt LiPF 6 was varied between 0.19 and 2.11 mol kg À1 .The composition of the formulations was varied systematically using a fully automated robotic high-throughput screening (HTS) system, operating in a N 2 -lled glovebox (MBraun, H 2 O and O 2 < 1 ppm).

Conductivity determination
Conductivity cells (ESI Fig. 1 †) were lled with the prepared electrolyte formulations and sealed in the glovebox under N 2 atmosphere (MBraun, H 2 O and O 2 < 1 ppm).Cell constants were determined using a 0.01 M solution of KCl at 20 C (VWR, known conductivity of 1.276 mS cm À1 ) and averaged over ve measurements.Disposable Eppendorf Safe-Lock Tubes (V ¼ 2 mL) were used as sample containers and lled with 750 mL of electrolyte each.Impedance measurements were conducted on a Metrohm Autolab/M204 potentiostat/galvanostat with 12 channels and 8-channel multiplexer for a total of 96 channels in the range of 50 Hz to 20 000 Hz using in-house developed electrodes. 31The conductivity cells were placed in a temperature chamber (Memmert TTC256, 0.1 C temperature setting accuracy) and each temperature was held for 2 h prior to measurement for equilibration.Ionic conductivities were determined in 10 C steps in the temperature range from À30 C to 60 C. The impedance spectra were tted using a model specied with set parameters for resistors R s and R p , as well as for the constant phase element (CPE) with the Metrohm Nova soware.The t was carried out aer each additional measuring point.Electrolyte conductivities were obtained from the quotient of cell constant and determined electrolyte resistance.

Data pre-processing
The initial dataset was parsed into an array of 3626 measurements with 3 predictorsthe electrolyte's temperature [K], salt concentration [mol kg À1 ] and PC ratioand the corresponding ionic conductivity [mS cm À1 ].Repeated measurementsat the same PC ratio, temperature and conducting salt concentrationwere aggregated into mean conductivities and used as target values.The corresponding standard deviations were used as a proxy for measurement uncertainty.Neither imputation nor outlier processing was performed.The 859 data points resulting aer aggregation were split into 60% for training (515), 20% for validation (172) and 20% for testing (172).Data correlations and distributions are presented in ESI Fig. 2 and 3, † respectively.

Feature generation
For the feature generation step we use the algorithm implemented in the Autofeat library. 32Autofeat constructs an initial pool of thousands of candidate features in multiple feature engineering steps, where the initial predictors are combined and transformed using non-linear operators.The pool is then iteratively reduced during successive selection runs by (i) discarding features not complying with valid physical dimensions, (ii) selecting features that best correlate to the target and (iii) sparsifying coefficients via a Lasso LARS regression.The training data was scaled to unit variance without subtracting the mean, to avoid negative predictor values that cannot be discovered by some operations (e.g.log(x)).Additionally, we specied Kelvin units for temperature and mol kg À1 units for conducting salt concentration, in order to leverage Autofeat's Buckingham's Pi Theorem implementation to lter out terms with non-physical dimensions.From all the operators available in the AutoFeat Library, we did not consider abs(x) since we expect the conductivity to be differentiable; likewise we do not consider neither sin(x) nor cos(x) since we do not expect the conductivity to be periodic with respect to any of the predictors.ESI Table 1 † summarizes the hyperparameters used for the feature generation step with Autofeat.

Feature selection
Despite Autofeat carrying a rigorous feature selection, it oen yielded large candidate expressions; hence we further performed a feature selection step using the Lasso estimator as implemented in the Python package Scikit Learn. 33The Lasso estimator nds a L1 norm solution to the linear model, which not only minimizes the prediction error but also promotes model sparsity.In this step each candidate expression is regressed using Cross-Validated Lasso (ESI Table 3 †), choosing the regularization parameter a by the one-standard-deviation rule. 34The chosen a was used to retrain the expression using a simple Lasso estimator and so obtain a sparse solution.Candidate features were discarded when their coefficients b i were statistically insignicant, i.e. when their t-statistic were below 2. The Lasso regression with the chosen a, the thresholding and discarding were all iteratively repeated until arriving to an expression with no discarded terms, which was used as a discovered expression.An additional, physics-based constraint is implemented during the feature selection step: all models are trained constraining the intercept to 0. In this way, the discovered expressions comply with the expected physical behaviour of zero conductivity when all predictors are equal to zero.

Model architecture and training strategy
In our SR approach, we apply non-linear operations to the original predictors to produce more informative candidate features.Formally: where k is the electrolyte conductivity (i.e. the regression target), b k is the k-th regression coefficient and Q k the k-th operation on the predictors: temperature T, conducting salt concentration c and PC : EC molar ratio r.The conductivity is assumed to depend not on all possible candidate features, but on a muchreduced set of these; i.e., the solution of eqn (2) is sparse.Fig. 1 illustrates the methodology, split into feature generation and selection steps.Briey, the training process involves dening a set of operators (e.g.inverse, logarithms, exponentials), then applying these to the initial predictors to generate a library of candidate features, a few of which are then selected to form a candidate expression.The discovered expressions are not unique: candidate features might combine in multiple ways to result in similarly accurate expressions.Consequently, instead of using all training samples, we train on subsamples of 50, 100, 250 and 400 data points, each randomly initialized 5 times to give a total of 20 independent training sessions, in order to evaluate whether a discovered expression is consistent.We use the validation set to evaluate the performance of the discovered expressions and compare them to four benchmark models (ESI, Table 2 †) using the three initial predictors, 3rd-order polynomial expansions as in phenomenological models, 35 exponential operations as in Arrhenius-based models, and exponential operations on 3rd-order polynomial expansion as in the extended Castel-Amis model. 6

Evaluation of models
During the evaluation, we search for an expression being not only (i) accurate, i.e., yielding a low mean squared error (MSE), but also (ii) parsimonious, quantied as the number of terms in an expression, and (iii) consistent, represented by the number of times the expression repeats across training sessions.Fig. 2a presents the accuracy vs. complexity trade-off from the expressions found.Each data point represents an expression, whose colour references its parent operator set.As expected, larger expressions t the data better, however, at the expense of increased model complexity; this is the case of the expressions originating from exponential and logarithmic operations (MSE < 2 but 10+ terms).Interestingly, the expressions populating the Pareto-frontier of the gure originate from sets including square-root operations; i.e., they offer the best compromise between MSE and the number of terms.
Note that most expressions only appear once, highlighting these to be highly sensitive to the training subsample and that there is no unique solution.Fig. 2b shows the most frequent expressions across the training sessions, where expressions with square-root operations are highlighted in green.Unlike most expressions, the model: is by far the most frequent and was discovered 15 times out of 20 training sessions.While there are expressions with higher prediction accuracy, these not only have more terms but also repeat only once throughout the training sessions and thus are not consistent (ESI Table 4 †).We, therefore, select eqn (3) as it clearly stands out from the other competing models, for being not only consistent (discovered 15 times) but also parsimonious (four terms), comparatively accurate in the training set (MSE < 0.75), and generalizable, as evidenced by a good accuracy in the validation set.Table 1 summarizes the coefficients and performance metrics of the selected expression eqn (3).

Model constraints
Enforcing model constraints is an effective way to improve model consistency.To illustrate why, we repeat the 20 training sessions with the same operator this time allowing the intercept to vary freely.The corresponding training errors and stability histograms are shown in ESI Fig. 4 and 5. † Expectedly, removing the intercept constraint results in slightly improved accuracy but signicantly deteriorates model consistency, since no viable model repeats more than twice.Our choice of enforcing y 0 ¼ 0 has effectively ltered out expressions that became inaccurate under the constraint.We obtained as a result not only a smaller pool of more consistent candidate expressions, but also the guarantee they all comply with the imposed boundary condition y 0 ¼ 0. However, constrained models become less expressive, i.e. less capable of capturing the variability of the data.Fig. S6 † shows the learning curves of the discovered expression, retrained on subsamples of different sizes with and without the intercept constraint.The unconstrained expression converges to the optimal accuracy already with 100 samples; in contrast, the constrained model fails at almost all samples sizes and only approaches the optimal accuracy when using all 515 training samples.The enforcement of constraints needs to be balanced with the limitations in model expressiveness, especially when modelling the oensmall datasets available from experiments.

Selected model: accuracy and overtting
Fig. 3a compares the accuracy of the selected constrained expression on the validation set, relative to the measurement dispersion and along with benchmark models.We use the root mean squared error (rMSE) to describe the prediction accuracy in the same units [mS cm À1 ] as the conductivity measurements.
As expected, the simpler benchmarks such as Linear and Simple Arrhenius models are less accurate.Instead, the more complex models (Polynomial and Arrhenius Polynomial) are prone to overtting, as their prediction errors are smaller than a nonnegligible fraction of measurement dispersion values.Notably, the selected model stands in the middle with a validation-set rMSE of 1.1 mS cm À1 , indicating that it is accurate up to the measurement noise and so it does not overt the dataset.At rst glance, eqn (3) seems to yield only a minor improvement (0.3 mS cm À1 ) compared to the basic linear model; however, (i) the square-root dependence in eqn (3) reproduces the curvature and maxima in the data and (ii) by having no intercept, it complies with the physical constraint of no conductivity at c, T, r ¼ 0. Fig. 3b illustrates that the selected model generally ts well the data not used in the training (i.e., validation and testing sets).However, the t generally underestimates the measurements.The same expression trained with an intercept (Fig. S7 †) ts the withheld data without such bias, indicating that the underestimation in Fig. 3b is a result of imposing the physically-  0.90 motivated y 0 ¼ 0 constraint.However, we highlight that in most of the experimental range, the t from eqn (3) reproduces the concentration-and temperature-dependent conductivity maxima observed in the data and in previous studies, which is a key attribute for implementing our discovered model as part of multi-target optimization and/or active learning frameworks. 25lected model: deviations at low temperature and high salt concentration Fig. 3b also shows that the model is not expressive enough to describe the conductivities measured at À30 C and concentrations above 1 mol kg À1 .Further comparisons between the predicted and measured conductivities (ESI Fig. 8 †) show that the model predicts signicantly lower conductivities compared to the measurements (down to 2.5 mS cm À1 ) in concentrated and low temperature regions.Within this regime, the conductivity seems to decay exponentially with concentration (Fig. 3b) instead of following the square-root trends that our expression learned from the rest of the predictors' space.Expectedly, as temperatures drop and salt concentration increases, the electrolyte structure changes signicantly 36 and its viscosity grows exponentially, 37 which overall inuence the functional dependency of conductivity.Notably, the effect is especially pronounced in PC-pure solutions (ESI Fig. 8, † top), indicating that our discovered expression missed certain properties of the solvent mixture 38 that inuence the ionic transport within such viscous regimes.While in this work we assume that a single expression can describe the complete dataset, the observations of several regimes of conduction raises the potential need for either one expression per regime, or for an overarching, more sophisticated symbolic expression that collapses to the right functional behaviour in each regime.Increased accuracy within the viscous regime could be critical for specic applications such as low-temperature electrolyte engineering. 39lected model: interpretations Assigning a physical meaning to the discovered expression is not straightforward.For one, any comparison to the thermodynamically-derived DHO law would require to explicitly account for the solution's viscosity and dielectric constant, measurements that are not available in the dataset.Second, there are no constraints to avoid unphysical values, like the negative conductivities at sub-zero temperatures and high conducting salt concentrations (Fig. 3b).Third, the solution to our symbolic regression approach is generally not unique, i.e., there are multiple expressions similarly accurate to t the dataset.Despite these limitations, we observe that expressions sharing square-root operations achieve the best compromise between simplicity and accuracy.Therefore, we believe that our method is learning square-root trends inherent to the data manifold, which indicates that some functional aspects of the DHO lawi.e. its square-root trends on temperature and concentration (see eqn (1))are still valid to describe electrolyte conductivity in concentrated formulations.Physical insights can be drawn not only from the expression itself but also from its predictions.Fig. 4 illustrates the conductivity trends from our selected model within the space of electrolyte formulations used for the training.As expected, at higher temperatures, the conductivity increases and the conductivity maxima shi towards higher salt concentrations (0.74 mol kg À1 at À30 C to 1.70 mol kg À1 at 60 C).However, the role of the cyclic carbonate is subtler.Note rst that all conductivities peak when the electrolyte formulation is EC-pure (PC : EC ratio ¼ 0).Second, the tails along the salt concentration axis elongate at higher concentrations as the formulations become increasingly EC-pure.From a fundamental standpoint, conductivity depends on a compromise between the ionic mobility, mainly inuenced by viscosity, and the number of charge carriers available for migration, mainly controlled by the electrolyte's dielectric constant (c.f.see Bjerrums criterion 1 for ionic association). 5,6As EC has a higher dielectric constant compared to PC, 40 EC-pure solutions are more effective at preventing ion association and so enhance electrolyte conductivity.This effect should be especially pronounced at high conducting salt concentrations, where ionic association becomes a critical limiting factor for ion transport in the electrolyte. 5,41Such ECdriven improvement of conductivity, which has been observed experimentally, 42 is indeed predicted by our selected model as the extended tails along the salt concentration axis in Fig. 4, even if the effect is not evident neither in the correlation maps in ESI Fig. 2 † nor the pair-plots ESI Fig. 3. † While eqn (3) performs poorly within the viscous regime, it still manages to capture the subtle effects related to ionic association via the r 1/2 term.The predictions in Fig. 4 generally align with our current understanding of the interplay between the solvent's dielectric properties and ionic transport.
At this point, we emphasize we have only implemented two domain-knowledge decisions -(i) exclude non-differentiable and periodic operators and (ii) constrain the intercept to zero on an otherwise purely statistical approach.Yet, we observe the emergence of an expression clearly outstanding from competing models, for being accurate without overtting, parsimonious, consistent, with a square-root functional structure resembling the DHO law, and generally agreeing with our understanding of ionic transport.In other words, our expression is not only an appropriate model from a machine-learning standpoint but also seems grounded on the physical-chemical mechanisms underpinning ion transport in electrolytes.Our work opens multiple avenues to pursue further the data-driven discovery of accurate models capable of bridging the existing gap 20 in the understanding of concentrated electrolyte formulations.To start with, atomistic descriptors can be incorporated in order to generalize to solvent mixtures other than PC/EC/ EMC and conducting salt chemistries beyond conventional Liion technology. 22In addition, using other promising SR algorithms 43 and implementing domain-knowledge constraints in the feature selection step 44 could alleviate the issue with expression consistency and yield physically-sound expressions; i.e. even more rigorous to known boundary conditions (e.g.k(c ¼ 0) ¼ 0) and to asymptotic behavior on key limits (e.g.lim c/0 kfc).These constraints will have to be carefully balanced, given our observations of the inexible nature of constrained models.

Conclusions
In this work we apply symbolic regression as a data-driven method to learn the effects of temperature, conducting salt concentration and solvent composition on the conductivity of a concentrated electrolyte.We use a dataset of 859 experimental measurements on a LiPF 6 in EC, PC and EMC electrolyte at temperatures, conducting salt concentrations and EC-to-PC ratios within the practical ranges of operation of Li-based battery electrolytes.Our approach generates thousands of derived features from the initial predictors using a set of nonlinear operators.Few of the derived features are then selected using cross-validated Lasso regression to discover candidate expressions, which are then compared in terms of accuracy, parsimony, and consistency.We nd that expressions within the accuracy vs. parsimony Pareto-frontier share a square-root functional form, which we believe reects an underlying data manifold resembling the Debye-Hückel-Onsager equation.Out of these expressions, we singled out a 4-term expression for being not only parsimonious and accurate but also consistent.The discovered expression does not overt the data, ts the withheld set well, and reproduces the conductivity behaviour expected from similar theoretical and experimental studies.The discovered expression is a promising model to be used in multitarget electrolyte optimization.More broadly, the presented methodology can be used to nd analytical models of physicochemical systems where no fundamental, closed-form solution exists.Implementing phenomenological constraints in the feature selection step, while appropriately balancing model expressiveness, would signicantly support the search for physically-sound expressions using symbolic regression.

Fig. 1
Fig.1Representation of the symbolic regression method.The conductivity k is represented as a combination of non-linear operations applied on the original predictors: temperature T, conducting salt concentration c, and PC : EC molar ratio r.Based on multiple criteria, only few of the thousands of derived features are selected and used to build a 'discovered' expression.

Fig. 2
Fig. 2 (a) Accuracy vs. parsimony of discovered expressions throughout multiple training sessions.Each data point represents an expression, whose color indicates its parent operator set.(b) The five most consistent expressions found across 20 training sessions; the frequency of expressions with square-root operations are highlighted in green.All expressions were trained with the constrain k 0 ¼ 0.

Fig. 3
Fig. 3 (a) Root mean square error of selected model (green) and benchmarks (red) on the validation set, compared to measurement dispersion (grey).(b) Fit of the selected model on the withheld (validation and test) set at r ¼ 1.0.

Fig. 4
Fig. 4 Contour maps of electrolyte conductivity versus PC : EC molar ratio and conducting salt concentration, as predicted by our selected and trained model (eqn.(3)) at (a) low and (b) high operating temperatures.The insets show the maximum conductivity and where is reached.

Table 1
Coefficients of eqn (3) and associated performance metrics after training on the full training set of 515 samples