Making thermodynamic models of mixtures predictive by machine learning: matrix completion of pair interactions

Predictive models of thermodynamic properties of mixtures are paramount in chemical engineering and chemistry. Classical thermodynamic models are successful in generalizing over (continuous) conditions like temperature and concentration. On the other hand, matrix completion methods (MCMs) from machine learning successfully generalize over (discrete) binary systems; these MCMs can make predictions without any data for a given binary system by implicitly learning commonalities across systems. In the present work, we combine the strengths from both worlds in a hybrid approach. The underlying idea is to predict the pair-interaction energies, as they are used in basically all physical models of liquid mixtures, by an MCM. As an example, we embed an MCM into UNIQUAC, a widely-used physical model for the Gibbs excess energy. We train the resulting hybrid model in a Bayesian machine-learning framework on experimental data for activity coefficients in binary systems of 1146 components from the Dortmund Data Bank. We thereby obtain, for the first time, a complete set of UNIQUAC parameters for all binary systems of these components, which allows us to predict, in principle, activity coefficients at arbitrary temperature and composition for any combination of these components, not only for binary but also for multicomponent systems. The hybrid model even outperforms the best available physical model for predicting activity coefficients, the modified UNIFAC (Dortmund) model.

The combinatorial part ln γ C i is calculated by: where r i and q i are the relative Van der Waals volume and surface area of component i, respectively, x i is the mole fraction of i in the mixture, and z is the coordination number, which is usually set to 10; z = 10 was also used throughout this work. Hence, ln γ C i depends, besides on the composition, only on pure-component parameters of i and the ones of the remaining components j = 1..J that make up the mixture that is considered.
The residual part ln γ R i is calculated by: where ∆U ij and ∆U ji are pair-interaction parameters describing the pairwise interaction between the two components i-j in the mixture, R is the universal gas constant, and T is the thermodynamic temperature in Kelvin. Hence, in contrast to the combinatorial part ln γ C i , the residual part ln γ R i depends on both pure-component parameters (q i ) and pair-interaction parameters between all combinations of components in the mixture (∆U ij , ∆U ji ).
The pure-component parameters (r i , q i ) are reported for a large number of relevant components, and can also be estimated with the approach introduced in combination with the group-contribution method UNIFAC. 3,4 The pair-interaction parameters (∆U ij , ∆U ji ), on the other hand, which are required for all binary combinations i-j of the components that make up the considered mixture, are usually fitted to experimental data on the respective binary subsystem i-j, oftentimes to data on the vapor-liquid equilibrium (VLE). Since experimental mixture data are, however, in general scarce, ∆U ij and ∆U ji have so far only been reported for a small fraction of the practically relevant systems.
In basically all cases when UNIQUAC is used in practice, ∆U ij and ∆U ji are individually fitted to data for the respective binary system i-j. Following the derivation of UNIUQAC 1,2 from the lattice theory, these parameters are, however, not independent across binary systems, but are linked via pair-interaction energies U : where U ii and U jj are like pair-interaction energies of the components i and j, respectively, i.e., they describe the pairwise interaction of molecules of the same type (i-i or j-j), whereas U ij is an unlike pair-interaction energy that describes the pairwise interaction of different molecules (i-j here). U ij is, in contrast to ∆U ij , by definition symmetric, i.e., in general it holds that By fitting ∆U ij and ∆U ji for multiple systems individually to data for each binary systems i-j, the resulting parameters will almost certainly not comply with the physical constraint given in Eq. (S.7). As an example, let us consider three different components, which we simply number consecutively (1,2,3), and all binary combinations of them (1-2, 1-3, 2-3). According to Eq. (S.7) and using Eq. (S.8), the resulting six pair-interaction parameters of UNIQUAC are correlated via:

Data and Data Preprocessing
Data set All data points used in this work were taken from the Dortmund Data Bank (DDB) 2020. 5 For training the model, only data for binary mixtures were used. For evaluating the performance, also mainly data for binary mixtures were used, but, in addition, also some data sets for ternary mixtures were considered as described below. In general, only data for mixtures of molecular components were used, while data for mixtures containing salts, ionic liquids, pure metals, or unspecified components were excluded. Also excluded were all data sets that were labeled as questionable or of poor quality in the DDB. Further, only components i were considered for which the relative van der Waals volume and surface area, r i and q i , respectively, were either reported in the DDB or could be calculated with the approach described in connection with original UNIFAC 3 in its present parameteriziation.
Two types of data for binary mixtures reported in the DDB were used: first, data on activity coefficients at infinite dilution in binary mixtures γ ∞ ij and second, data on the vaporliquid equilibrium (VLE) of binary mixtures up to a pressure of 10 bar. From the VLE data, activity coefficients γ ij (at finite concentration) were calculated using extended Raoult's law assuming an ideal gas phase and neglecting the pressure dependence of the chemical potential in the liquid phase: where p s i is the vapor pressure of the pure component i at the considered temperature T , x i and y i are the mole fraction of i in the liquid phase and in the vapor phase, respectively, γ ij is the activity coefficient of i in the liquid mixture (with j), and p is the pressure. p s i was calculated with the Antoine equation using Antoine parameters reported in the DDB, if available. A VLE data point from the DDB was therefore only used if information on T , x i , y i , and p were reported and if Antoine parameters for the considered component i and in the relevant temperature range were available in the DDB. Also VLE data for mixtures with formic acid and acetic acid were excluded from the data set as these components can usually not be modeled with an ideal vapor phase even at low to moderate pressures. Eq. S.14 and the named pure-component parameters were also used for the prediction of VLE phase diagrams from the predicted γ ij after training the model.
The γ ij calculated from the VLE data according to Eq. S.14 were, together with the data for γ ∞ ij adopted from the DDB, further processed as described in the following. If multiple data points for the same system i-j, the same concentration x i , and same temperature T were available, the median of the activity coefficients was calculated and used. Further, only the data for those systems i-j were used for which at least two distinct data points, i.e., at different temperatures and/or concentrations, were available. The latter constraint is not necessary for the hybrid model MCM-UNIQUAC introduced here, but it is required for the classical system-wise fitting of UNIQUAC, which was used as benchmark here, cf. Figure 2 in the manuscript.
Finally, each component for which data in only a single binary system were available, was rejected, resulting in a data set in which: • for each component i, j data for at least two different binary systems i-j, and • for these systems, at least two data points (at different temperature and/or concentration) were available.
The resulting data set comprises 363181 experimental data points for 12199 binary systems i-j including 1146 components i, j and covers a temperature range from 183 K to 638 K.
This data set was split into a training set containing all data points for 80% (9759) of the binary systems i-j, a validation set containing all data points for 10% (1220) of the binary systems, and a test set containing all data points for the remaining 10% (1220) of the binary systems. This data split was done randomly, but ensuring that for each component i, j, data on at least one binary system i-j were available in the training set. This is necessary as the

Model Details Bayesian Matrix Completion
As in our previous works, 6,7 we use a Bayesian approach to matrix completion for the prediction of logarithmic activity coefficients ln γ ij in unmeasured binary mixtures i-j in the present work. The approach consists of three steps. In the first step, a generative probabilistic model for the variable of interest, i.e., ln γ ij here (the logarithm was used for scaling purposes), is specified as a nonlinear function of the components i, j, the temperature T , and the composition of the mixture given by the mole fraction of component i x i , as well as relative van der Waals volumes q i and q j and surface areas r i and r j of the components i and j (taken from the DDB 5 if available or calculated with the approach introduced in combination with original UNIFAC 3,4 otherwise, cf. above), initially unknown (latent) fea- For inferring the model parameters, the probabilistic model defines a probability distribution over all ln γ ij from the training set by specifying a stochastic process that generates hypothethical data on ln γ ij conditioned on the initially unknown (latent) parameters of the components (θ i , θ j , β i , β j , U ii , U jj ), known descriptors of the pure components (q i , q j , r i , r j ), and specified conditions (T, x i ), which are subsequently "compared" to the true experimental data (at the same conditions). Therefore, the generative process first draws latent parameters from a normal prior distribution with zero mean and a standard deviation of one; mean and standard deviation of this prior distribution are also hyperparameters of the model and were set based on the scores on the validation set. The generative process then models the probability of the experimental activity coefficients ln γ exp ij from the training set as a Cauchy likelihood distribution with scale λ centered around the results of the UNIQUAC equations with the latent parameters drawn from the prior, the given known descriptors of the components, and the specified conditions; the scale λ, which was set to λ = 0.2, as well as the choice of a Cauchy distribution are also hyperparameters of the model and were selected based on the scores on the validation set. The likelihood can be written as follows: where f represents Eqs. S.1-S.7, U ij is calculated according to Eq. 4 in the manuscript, and the random variable ij captures both inaccuracies of the experimental data and the model.
In the second step, the latent parameters inferred for all considered components during the training of the specified generative model on all experimental data for ln γ ij from the training set by inverting the generative model.
For this purpose, we resorted to Gaussian mean-field Variational Inference 8-10 as this has shown good results in our previous works. 6,7 Since the generative model is probabilistic, all inferred latent parameters are random variables, and for each latent parameter, a probability distribution, called posterior, is obtained.
In the third step, we use the means of the approximate posterior distributions over θ i , θ j , β i , and β j to predict the unlike pair-interaction energies U ij according to Eq. 4 in the manuscript, which we then use together with the means of the approximate posterior distributions over U ii and U jj to predict the pair-interaction parameters ∆U ij and ∆U ji of UNIQUAC according to Eq. S.7, which we, finally, use to predict temperature-and concentration-dependent ln γ ij with Eqs. S.1-S.6 by also using the known geometric pure- For all data points in the test set, the predictions for ln γ ij are compared to the respective experimental values to evaluate the predictive performance of the model. For performing the above described steps, we use the Stan framework 11 that allows specifying user-defined generative models and automates the task of Bayesian inference. 10   ; Delta_u12 = (u12 -u22) * Factor; Delta_u21 = (u12 -u11) * Factor; V1 = r1 / (conc1 * r1 + conc2 * r2); F1 = q1 / (conc1 * q1 + conc2 * q2); tau12 = exp(-Delta_u12 / (T * 8.314)); / / universal gas constant R = 8.314 J/(mol*K) tau21 = exp(-Delta_u21 / (T * 8.314)); Data[r, 3] ~ cauchy( 1 -V1 + log(V1) -5*q1*(1-V1/F1+log(V1/F1)) + q1*(1-log((q1*conc1+q2*conc2*tau21)/(q1*conc1+q2*conc2))... -(q2*conc2*tau12)/(q2*conc2+q1*conc1*tau12) -(q1*conc1)/(q1*conc1+q2*conc2*tau21)), lambda ); } } The performance of MCM-UNIQUAC for the prediction of ln γ ij for those systems is slightly worse than that for the systems that can also be modeled with UNIFAC, cf.  Information on the 20 systems from the test set with the largest MAE is additionally summarized in Table S.1. Most of these systems include water, which is the most frequently studied component in our data set, and a second component that is very rarely studied and therefore rather poorly represented in the training set. We can therefore attribute the poor performance of MCM-UNIQUAC for some systems to two reasons: first, there are rather few training data points for at least one of the components, and second, water is one of the components (note that even for the system hexane + water, where for both components many training data were available, relatively large deviations are observed, cf. labeled point in Figure S.6). While the first reason is quite obvious for a model that relies only on the available mixture data, from which it infers, during the training, the characteristics (the features) of the constituent pure components, we can explain the second reason by the 'extreme' nature of water. Water can lead to both extremely small and extremely large activity coefficients, depending on the type of component it is mixed with. To fully capture this behavior, a complex and flexible model would be required, which could, however, easily be overfitting for less studied and less 'extreme' components. Our goal here was to develop a comprehensive model with the greatest possible scope, which apparently comes at the cost of some deficiencies for aqueous systems. It will be interesting to develop more specified models in future work, e.g., concentrating on highly polar components, and compare the results to those of the wide-ranging model we provide here.

MCM-UNIQUAC based on Pair-Interaction Parameters ∆U
In Figure S

Complete UNIQUAC Parameter Set
In a separate file, we provide a complete set of the pair-interaction energies U (including uncertainties in the form of standard deviations) of UNIQUAC for all combinations of the considered 1146 components as predicted by MCM-UNIQUAC after training on all experimental data points, i.e., without withholding data points for validation and testing (no results for this scenario are shown above or in the manuscript). The hyperparameters were thereby set based on the model selection procedure to the values described above. From these pair-interaction energies, the commonly used pair-interaction parameters ∆U of UNI-QUAC can easily be calculated following Eq. (3) in the manuscript. This parameter set can be used for the prediction of activity coefficients in any binary and multicomponent system of the considered components, even if no experimental data points on this system (and the constituent binary subsystems) are available. However, we note that the predictions should be used with caution, in particular if components or systems are considered that have only been sparsely examined in experiments.

Studied Components
In the above mentioned file, we also provide an overview over all studied components, including CAS numbers (if available) and chemical formulae.