Open Access Article
Zeno Romero, Kerstin Münnemann, Hans Hasse and Fabian Jirasek*
Laboratory of Engineering Thermodynamics, RPTU Kaiserslautern-Landau, Erwin-Schrödinger-Str. 44, 67663 Kaiserslautern, Germany. E-mail: fabian.jirasek@rptu.de
First published on 26th June 2026
Predicting diffusion coefficients in mixtures is crucial for many applications, as experimental data remain scarce, and machine learning (ML) offers promising alternatives to established semi-empirical models. Among ML models, matrix completion methods (MCMs) have proven effective in predicting thermophysical properties, including diffusion coefficients in binary mixtures. However, MCMs are restricted to single-temperature predictions, and their accuracy depends strongly on the availability of high-quality experimental data for each temperature of interest. In this work, we address this challenge by presenting a hybrid tensor completion method (TCM) for predicting temperature-dependent diffusion coefficients at infinite dilution in binary mixtures. The TCM employs a Tucker decomposition and is jointly trained on experimental data for diffusion coefficients at infinite dilution in binary systems at 298 K, 313 K, and 333 K. Predictions from the semi-empirical SEGWE model serve as prior knowledge within a Bayesian training framework. The TCM then extrapolates linearly to any temperature between 268 K and 378 K, achieving markedly improved prediction accuracy compared to established models across all studied temperatures. To further enhance predictive performance, the experimental database was expanded using active learning (AL) strategies for targeted acquisition of new diffusion data by pulsed-field gradient (PFG) NMR measurements. Diffusion coefficients at infinite dilution in 19 solute + solvent systems were measured at 298 K, 313 K, and 333 K. Incorporating these results yields a substantial improvement in the TCM's predictive accuracy. These findings highlight the potential of combining data-efficient ML methods with adaptive experimentation to advance predictive modeling of transport properties.
There are two distinct classes of diffusion coefficients: mutual diffusion coefficients, which describe the collective motion of molecules driven by chemical-potential gradients, and self-diffusion coefficients, which characterize the Brownian motion of individual molecules.12 Liquid-phase mutual diffusion is commonly modeled using either the Maxwell–Stefan or Fickian framework. Established measurement methods include diaphragm cells,13 Taylor-dispersion experiments,14 dynamic light scattering,15 and concentration-profile monitoring in quiescent fluids.16
Pulsed-field gradient NMR allows a calibration-free and accurate self-diffusion measurement in both pure liquids and liquid mixtures.7,17–19 It uses a short magnetic-field-gradient pulse to label nuclear spins with position-dependent phases, followed by a second gradient pulse, applied after a defined delay, which rephases the spins. Molecular diffusion during this delay leads to incomplete rephasing, resulting in a decrease in signal intensity that scales with the diffusion coefficient; specifically, the faster the diffusion, the greater the signal reduction.
The diffusion coefficient D∞ij of a solute i at infinite dilution in a solvent j is of particular interest for several reasons: at this limit, self- and mutual diffusion coefficients coincide, and the Maxwell–Stefan and Fickian descriptions become identical. Moreover, if both infinite-dilution coefficients in a binary mixture are known (i in j and j in i), an extrapolation to finite concentrations is possible, e.g., via the empirical Vignes correlation,20 with possible extension to multi-component systems.12 The semi-empirical Stokes–Einstein–Gierer–Wirtz estimation (SEGWE) model2 is currently the most accurate semi-empirical model for the prediction of D∞ij.
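The extrapolation to finite concentrations mentioned above can be illustrated with a short sketch. It assumes the commonly used form of the Vignes correlation, in which the diffusion coefficient of the binary mixture is the mole-fraction-weighted geometric mean of the two infinite-dilution values; the function name and the numerical values are hypothetical and serve only as an illustration.

```python
def vignes(d_inf_i_in_j, d_inf_j_in_i, x_i):
    """Vignes-type interpolation between the two infinite-dilution limits.

    d_inf_i_in_j: diffusion coefficient of i infinitely diluted in j (x_i -> 0)
    d_inf_j_in_i: diffusion coefficient of j infinitely diluted in i (x_i -> 1)
    x_i: mole fraction of component i in the binary mixture
    """
    if not 0.0 <= x_i <= 1.0:
        raise ValueError("mole fraction must lie in [0, 1]")
    x_j = 1.0 - x_i
    return d_inf_i_in_j ** x_j * d_inf_j_in_i ** x_i


# Hypothetical infinite-dilution values in m2 s-1 (not taken from this work).
d_mix = vignes(4.0e-9, 1.0e-9, 0.5)
print(f"{d_mix:.3e}")
```

By construction, the function returns the respective infinite-dilution value at both ends of the composition range.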
Recently, we have introduced matrix completion methods (MCMs) from ML, which are well established in recommender systems,21 for predicting D∞ij at 298 K.9 The key idea is to represent experimental data measured for different binary mixtures as a matrix whose rows and columns correspond to components i and j, with each entry containing the available data for mixture i + j.22,23 Because this matrix is sparsely populated with experimental data, predicting the properties of unstudied mixtures reduces to a matrix completion problem. MCMs have since been developed for various thermodynamic properties, including activity coefficients,22,24–28 Henry's law constants,29,30 and diffusion coefficients,9,31 as well as for pair-interaction parameters in thermodynamic models.32–36
For predicting D∞ij, hybrid approaches that incorporate prior physical knowledge from the SEGWE model2 in the MCM training are especially promising, outperforming all available semi-empirical alternatives in prediction accuracy at 298 K.9 However, because MCMs require a matrix structure in their training data, they are restricted to predicting a single property of binary mixtures under fixed conditions; for D∞ij, this means single-temperature predictions. Industrial practice, however, demands knowledge across a wide temperature range, not only at 298 K, and at other temperatures experimental data are even sparser.
There are two general ways for extending MCMs to higher dimensions, e.g., for predicting temperature-dependent thermodynamic properties. The first route is feasible if the temperature dependence of the property of interest is known. Then, the MCMs can be applied to predict the mixture-specific parameters of the equation describing the temperature dependence. This route was introduced by Damay et al.25 for predicting temperature-dependent activity coefficients at infinite dilution using the Gibbs–Helmholtz relation. The second route is to extend the MCM to a tensor completion method (TCM), whereby a three-dimensional tensor is spanned by the two components that make up the mixtures and the temperature. The TCM concept was transferred to thermodynamics by Damay et al.,37 who again considered the temperature-dependent prediction of activity coefficients at infinite dilution.
The TCM approach, unlike the first route, is also applicable when no general equation describing the temperature dependence is available. Liquid-phase diffusion coefficients are sometimes approximated as having a linear temperature dependence, consistent with the Stokes–Einstein theory38 if the temperature dependence of the solvent viscosity is neglected. However, this is not generally applicable, making TCMs an interesting option in this field.
The predictive capabilities of an ML model can also be enhanced by purposefully incorporating new data into the training set through active learning (AL) methods. Such AL strategies iteratively select the presumed most informative data points for experimental measurement, without prior knowledge of their values, and thereby aim to maximize model improvement with minimal experimental effort. To this end, a query strategy is employed within an AL framework.39 In a previous work, we found that uncertainty sampling, i.e., selecting the data point to be measured based on the current model's largest uncertainty, is an effective query strategy for improving the performance of an MCM in predicting D∞ij.31
In this work, we present a novel hybrid TCM for predicting D∞ij across temperatures. Our method employs a Tucker decomposition,40 analogous to that in the study by Damay et al.,37 and integrates SEGWE priors, following our earlier MCM approach.9 This TCM is trained on D∞ij data at 298 K, 313 K, and 333 K. While D∞ij is obviously of interest well beyond these three discrete temperatures, substantially less experimental information is available at other temperatures, preventing the development of MCMs for predicting D∞ij at these temperatures. However, the developed TCM can also make predictions at temperatures absent from the training set, and we evaluate its predictions at temperatures between 268 K and 378 K, comparing them to experimental diffusion data and SEGWE2 predictions within this extended range. The dependence of liquid-phase diffusion coefficients on the pressure is generally small, especially at low to moderate pressures. Since all experimental D∞ij values used in this work were reported at (or near) atmospheric pressure in the original literature, we neglect the influence of the pressure and note that the developed model should not be used to predict diffusion coefficients at very high pressures.
Furthermore, we extend the available experimental data on D∞ij at 298 K, 313 K, and 333 K by measuring D∞ij using pulsed-field gradient NMR spectroscopy7,17,19 and selecting the systems to be measured by AL31,39 with uncertainty sampling. We systematically evaluate the influence of the new training data on the prediction accuracy.
The resulting database of experimental values of D∞ij used in this work consists of 224 data points at 298 ± 1 K, 75 data points at 313 ± 1 K, and 56 data points at 333 ± 1 K. It covers 45 different solutes i infinitely diluted in 31 different solvents j. The data can be arranged in temperature-specific matrices, where the rows represent the solutes and the columns represent the solvents, cf. Fig. 1. The solutes and solvents included in the matrix are listed in Tables S1 and S2 in the SI. The included compounds consist mostly of water and organic molecules that are liquid under ambient conditions and have low reactivity. The solutes additionally contain 5 substances that are gaseous under ambient conditions. Values of D∞ij range from 10⁻¹¹ to 10⁻⁸ m² s⁻¹.
The dataset can also be represented as a third-order tensor, with the three dimensions being the solutes, solvents, and temperatures. In total, this tensor has 4185 elements, of which, however, only 8.5% are occupied by experimental data for D∞ij. As shown in Fig. 1, this tensor is not only sparsely but also heterogeneously occupied. Most data are available at 298 K, and there are some solvents and solutes for which much more data are available than for the others; there are even several solvents and solutes for which no data are available at 313 K and 333 K at all. Details of the data availability per temperature are provided in Table 1.
| T | 298 K | 313 K | 333 K |
|---|---|---|---|
| Number of data points | 224 | 75 | 56 |
| Matrix occupation rate | 16.1% | 5.4% | 4.0% |
| Number of available solvents | 31 | 24 | 18 |
| Number of available solutes | 45 | 35 | 33 |
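The tensor size and the occupation rates quoted above follow directly from the numbers of solutes, solvents, temperatures, and data points. The following minimal sketch reproduces this arithmetic; the function and variable names are illustrative only.

```python
# Counts taken from the text and Table 1.
n_solutes, n_solvents = 45, 31
points_per_temperature = {298: 224, 313: 75, 333: 56}


def occupation_rates(n_rows, n_cols, counts):
    """Return the tensor size, the per-temperature matrix occupation, and the tensor occupation."""
    matrix_size = n_rows * n_cols
    per_matrix = {T: n / matrix_size for T, n in counts.items()}
    tensor_size = matrix_size * len(counts)
    overall = sum(counts.values()) / tensor_size
    return tensor_size, per_matrix, overall


tensor_size, per_matrix, overall = occupation_rates(
    n_solutes, n_solvents, points_per_temperature
)
print(tensor_size)
for T, rate in per_matrix.items():
    print(f"{T} K: {rate:.1%}")
print(f"tensor: {overall:.1%}")
```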
To facilitate sample handling during the measurements planned by AL in this work, we further filtered the data from Fig. 1 to obtain a reduced dataset by excluding all compounds that are gaseous under ambient conditions. After this exclusion, the data had to be filtered a second time so that only solutes i and solvents j for which experimental data points for D∞ij(T) were available in at least two different mixtures i + j remained. We chose to exclude these substances beforehand to follow the query strategy as closely as possible, rather than intervening during the AL workflow by skipping selected systems. The temperature-specific matrix arrangement of this reduced database is shown in Fig. 2 and is the underlying database for the experimental AL workflow. The solutes and solvents included in this matrix are listed in Tables S3 and S4 in the SI.
While we focus on three temperatures here to compare TCM predictions with temperature-specific MCMs, a continuous-temperature approach, as explained in the following sections, enables generalization to arbitrary temperatures. To assess the performance of the TCM over the continuous temperature range, we use another dataset from the DDB 202541 spanning 268 K to 378 K (but excluding 298 K, 313 K, and 333 K) and containing 98 data points. No data for these temperatures were, however, used for training the TCM.
In addition to the experimental data, henceforth called D∞,expij, a synthetic database, D∞,SEGWEij, was used for pre-training both temperature-specific MCMs and the TCM. This synthetic database consists of predictions of D∞ij at 298 K, 313 K, and 333 K using the SEGWE model2 for the same solutes and solvents as in the experimental database. The solvent viscosities required for the SEGWE model2 were obtained from the DDB 2025,41 and the effective density (a parameter in the SEGWE model) was set to the recommended value ρeff = 627 kg m−3,2 as it was done in our previous works.9,31
All experimental diffusion coefficient data were reported at (or near) ambient pressure in their original publications, as are the SEGWE2 predictions. While pressure effects on liquid diffusion coefficients are generally much weaker than temperature effects, they may become significant at elevated pressures, which were not considered in this work.
| ln D∞ij = ui·vj + εij | (1) |
In the first training step, an MCM is trained on the complete synthetic ln D∞,SEGWEij data matrix according to eqn (1) using uninformed normal prior distributions with μ0 = 0 and σ0 = 1 and a Cauchy likelihood with scale parameter λ = 0.2. The resulting preliminary features ui* and vj*, obtained by minimizing the residuals εij, are described by the posterior probability distributions from this first training step. They were scaled and then used as informed normal prior distributions for the second MCM, which was trained on the sparse ln D∞,expij matrix following eqn (1), again using a Cauchy likelihood with scale parameter λ = 0.2. In this second step, the residuals εij refer to the experimental data, and the training results in the final solute and solvent features, ui and vj. For the scaling, the mean of the posterior distributions of ui* and vj* was adopted, whereas their standard deviation was scaled with a constant factor to obtain an average standard deviation (averaged over all solutes i and solvents j) of 0.5. The resulting distributions were finally multiplied by the uninformed normal prior (μ0 = 0 and σ0 = 1) used in the first training step. This probabilistic hybrid approach allows prior physical information from the SEGWE model to be incorporated into the MCM, while maintaining the flexibility of the model to adapt to experimental data. More details on this hybrid approach can be found in our earlier work.9,24,31
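The construction of the informed priors described above can be sketched in a few lines. The sketch assumes that "multiplied by the uninformed normal prior" refers to the product of two normal densities, which is again proportional to a normal density; the helper names and the numerical values are hypothetical and do not reproduce the Stan implementation used in this work.

```python
def rescale_std(stds, target_mean=0.5):
    """Scale all standard deviations by one constant so that their average equals target_mean."""
    factor = target_mean / (sum(stds) / len(stds))
    return [s * factor for s in stds]


def product_with_standard_normal(mean, std):
    """Mean and standard deviation of the normal density proportional to N(mean, std^2) * N(0, 1)."""
    var = std ** 2
    new_var = var / (1.0 + var)   # precisions add: 1/new_var = 1/var + 1
    new_mean = mean / (1.0 + var)  # precision-weighted mean, second mean is 0
    return new_mean, new_var ** 0.5


# Hypothetical posterior means and standard deviations from the first training step.
posterior_means = [1.8, -0.4, 0.9]
posterior_stds = [0.02, 0.05, 0.03]

scaled_stds = rescale_std(posterior_stds)
informed_priors = [
    product_with_standard_normal(m, s)
    for m, s in zip(posterior_means, scaled_stds)
]
for mean, std in informed_priors:
    print(f"mean = {mean:.3f}, std = {std:.3f}")
```

The product shrinks each prior mean slightly towards zero and can only narrow the prior, which keeps the informed priors within the scale of the uninformed one.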
Since we use a Bayesian approach for this second training step as well, we obtain posterior distributions over the model parameters after training, from which probability distributions for each predicted matrix entry can be calculated using eqn (1). The mean of these distributions was considered as the predicted diffusion coefficient ln D∞,predij. Furthermore, from the obtained probability distributions, the standard deviation σij was calculated as a measure for model uncertainty.
The MCM approach was used to predict isothermal diffusion coefficients. Hence, an individual MCM was trained on data for a single temperature, i.e., a single matrix from Fig. 1, and generated predictions for the missing values at the same temperature, which was done here for 298 K, 313 K, and 333 K. It does not use information on the D∞ij at multiple temperatures and cannot extrapolate from one temperature to another. In the following, we will refer to the hybrid MCM approach simply as the MCM.
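| $\ln D_{ij}^{\infty}(T) = \sum_{\alpha=1}^{r_u} \sum_{\beta=1}^{r_v} \sum_{\gamma=1}^{r_w} \kappa_{\alpha\beta\gamma}\, u_{i\alpha}\, v_{j\beta}\, w_{\gamma}(T) + \varepsilon_{ij}(T)$ | (2) |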
The TCM approach is a priori discrete with respect to temperatures and was trained and evaluated simultaneously at the three temperatures 298 K, 313 K, and 333 K. For each temperature, it learns an independent set of latent temperature features wγ(T), which contain no prior information about temperature and do not incorporate any physically motivated scaling. However, these learned features wγ(T) were subsequently correlated with the temperature T to enable their prediction at any temperature, not just the discrete ones. The correlation of the temperature features and the application of the TCM for predictions across a broad temperature range are discussed in the Results section.
Tucker decomposition, which becomes equivalent to canonical polyadic decomposition if κ is the unit tensor, was chosen because the core tensor κ gives it additional flexibility. Analogous to the MCM, we propose the hybrid TCM approach, which additionally incorporates SEGWE2 predictions into its training. This approach consists of two steps, schematically shown in Fig. 3.
Fig. 3 Schematic representation of the hybrid TCM for predicting temperature-dependent D∞ij developed in this work. The TCM incorporates prior information from the SEGWE model2 and uses the Tucker decomposition for tensor factorization.
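The structure of a Tucker model and its relation to the canonical polyadic decomposition can be illustrated with a minimal sketch. It assumes the standard Tucker form, in which a tensor entry is the sum over all products of core-tensor elements and solute, solvent, and temperature features; all feature values below are hypothetical, and the residual term is omitted.

```python
def tucker_predict(u_i, v_j, w_T, core):
    """Tucker prediction of one tensor entry from latent features and a core tensor.

    u_i, v_j, w_T are the feature vectors of solute, solvent, and temperature;
    core[a][b][c] couples feature a of the solute, b of the solvent, c of the temperature.
    """
    return sum(
        core[a][b][c] * u_i[a] * v_j[b] * w_T[c]
        for a in range(len(u_i))
        for b in range(len(v_j))
        for c in range(len(w_T))
    )


def cp_predict(u_i, v_j, w_T):
    """Canonical polyadic prediction: only matching feature indices interact."""
    return sum(a * b * c for a, b, c in zip(u_i, v_j, w_T))


def unit_core(r):
    """Superdiagonal unit tensor of size r x r x r."""
    return [[[1.0 if a == b == c else 0.0 for c in range(r)]
             for b in range(r)] for a in range(r)]


# Hypothetical features with two latent dimensions each.
u_i, v_j, w_T = [0.8, -1.2], [1.5, 0.3], [-0.65, 0.38]
core = [[[1.0, 0.2], [0.0, -0.5]], [[0.3, 0.0], [0.1, 1.0]]]

print(tucker_predict(u_i, v_j, w_T, core))
print(tucker_predict(u_i, v_j, w_T, unit_core(2)), cp_predict(u_i, v_j, w_T))
```

With a general core tensor, every combination of latent features can contribute to a prediction, whereas the unit core reduces the model to the canonical polyadic case.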
Analogous to the MCM, the TCM is trained in two steps. First, the TCM is fitted to the fully completed synthetic tensor of ln D∞,SEGWEij using uninformed normal prior distributions with μ0 = 0 and σ0 = 1 and a Cauchy likelihood with scale parameter λ = 0.2. The posterior distributions of the latent features u*, v*, w*, and κ* from this run then serve as priors for a second TCM trained on the sparse experimental tensor ln D∞,expij using a Cauchy likelihood with scale parameter λ = 0.2. For each feature, we keep the posterior mean and rescale its standard deviation by a constant so that the overall average standard deviation becomes 0.5 (averaged over all i, j, and T). These informed priors are multiplied by the default uninformative prior (μ0 = 0 and σ0 = 1). Following our earlier work,9,31 this scheme injects prior physical knowledge from the SEGWE2 model into the TCM while retaining flexibility to fit the experimental data.
The predictions ln D∞,predij are calculated analogously to the MCM, from the posterior distributions of the model parameters, according to eqn (2). The mean of the resulting distribution for each tensor entry is taken as ln D∞,predij, whereas its standard deviation σij(T) serves as a measure for model uncertainty.
The use of κ generally allows different latent feature dimensions ru, rv, and rw, which are the hyperparameters of the model. We have carried out hyperparameter optimization in this work, using system-wise leave-one-out cross-validation, cf. below for details. The results of the hyperparameter study are given in Fig. S2 of the SI. We found that the best prediction accuracy was achieved using ru = rv = rw = 2.
Because the training data are restricted to measurements performed at (or near) ambient pressure, the present TCM (and MCM) should only be applied at low to moderate pressures. Extrapolation to elevated pressures would require additional training data and incorporation of pressure as an explicit model dimension, which could be the subject of future work.
In this work, the ML model to be improved is the TCM for predicting D∞ij in binary mixtures at 298 K, 313 K, and 333 K. For the AL, we constrain the newly measured data and evaluations to the three discrete temperatures, i.e., to the possible solute–solvent–temperature tuples (i, j, T). This limits the experimental space, which would otherwise be infinitely large because the temperature is continuous. Consequently, we used a pool-based sampling approach, where all solute–solvent pairs (i, j) for which no D∞,expij exist at any temperature T comprise the sampling pool, i.e., the set of solute–solvent pairs from which the query strategy may choose new mixtures to be measured.
Fig. 4 shows the general AL framework used in this work, which was adopted from our previous work.31
As illustrated in Fig. 4, the AL workflow is an iterative process. We begin with the initial training data set, i.e., the initially available experimental data for D∞ij, cf. Fig. 1. The TCM is trained on this data set and can then be used to generate a complete tensor of predicted diffusion coefficients D∞,predij. Based on the obtained predictions, a query strategy is used to select a solute–solvent pair (i, j)* for which no experimental data are available at any temperature T ∈ Θ, where Θ = {298 K, 313 K, 333 K}. D∞,expij ∀ T ∈ Θ are then measured for this selected system by PFG NMR spectroscopy at all temperatures T, and the new data are subsequently added to the training data set. This procedure is repeated several times, increasing the training data size at each iteration and thus (hopefully) improving the prediction accuracy of the model. The key to this improvement lies in choosing a suitable query strategy.
In our previous study, we found that uncertainty sampling was the most suitable query strategy for improving the prediction of diffusion coefficients with MCMs,31 which is why we again use this strategy in this work for the TCM approach. For this purpose, we average the prediction uncertainty σij(T) resulting from our TCM over the three studied temperatures T. This results in a solute–solvent matrix of temperature-averaged uncertainties σ̄ij, from which the entry (i, j)* with the highest temperature-averaged prediction uncertainty is selected, cf. eqn (3), and the new diffusion coefficients D∞,expij are measured using PFG NMR spectroscopy.
| $(i, j)^{*} = \underset{(i,j)\,\in\,\text{pool}}{\arg\max}\; \bar{\sigma}_{ij}, \qquad \bar{\sigma}_{ij} = \dfrac{1}{\lvert\Theta\rvert} \sum_{T \in \Theta} \sigma_{ij}(T)$ | (3) |
In practice, uncertainty sampling tends to sample outliers that are not representative of the underlying data distribution, which we also observed in our prior work.31 While sampling some outliers can improve the model's prediction accuracy, continuously sampling them yields little new information and leads to redundancy.61,62 Specifically, in the context of the Bayesian MCM (and TCM), after repeated sampling within a single row or column, i.e., repeated sampling of the same solute i or solvent j, the information gain by inclusion of another data point in the same row or column is small. At the same time, the posterior for all other compounds can remain wide.63 We thus introduce the simple rule of removing a compound from the sampling pool after it has been sampled in too many consecutive rounds. This approach encourages exploration of the chemical space and reduces redundancy in a simple way.
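The query strategy with the additional exclusion rule can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used in this work: the threshold of three consecutive rounds, the function names, the compound labels, and the uncertainty values are all hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def select_next_system(sigma, pool, excluded=frozenset()):
    """Pick the pool entry with the largest temperature-averaged uncertainty.

    sigma maps (solute, solvent) to the list of uncertainties at the studied temperatures.
    Pairs containing an excluded compound are skipped.
    """
    candidates = sorted(
        pair for pair in pool
        if pair[0] not in excluded and pair[1] not in excluded
    )
    if not candidates:
        return None
    return max(candidates, key=lambda pair: mean(sigma[pair]))


def update_exclusions(history, excluded, max_consecutive=3):
    """Exclude compounds that appeared in each of the last max_consecutive selections."""
    recent = history[-max_consecutive:]
    if len(recent) < max_consecutive:
        return set(excluded)
    repeated = set(recent[0])
    for pair in recent[1:]:
        repeated &= set(pair)
    return set(excluded) | repeated


# Hypothetical uncertainties at three temperatures for five unmeasured systems.
sigma = {
    ("A", "X"): [0.9, 1.0, 1.1],
    ("B", "X"): [0.8, 0.9, 1.0],
    ("C", "X"): [0.7, 0.8, 0.9],
    ("D", "X"): [0.6, 0.7, 0.8],
    ("A", "Y"): [0.3, 0.4, 0.5],
}
pool, history, excluded = set(sigma), [], set()
for _ in range(4):
    choice = select_next_system(sigma, pool, excluded)
    if choice is None:
        break
    history.append(choice)
    pool.discard(choice)  # the system is now "measured"
    excluded = update_exclusions(history, excluded)
print(history, excluded)
```

In this toy example, solvent X dominates the first three rounds and is then removed from the pool, so the fourth query moves to a different solvent even though a system with X still has a larger uncertainty.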
The predictive performance of the models was evaluated using leave-one-out analysis.67 This procedure is stricter than a random train-test split of individual data points, since the model cannot use information for the excluded solute–solvent pair at any temperature. This evaluation therefore assesses prediction of unseen binary systems rather than interpolation of isolated missing entries in the tensor. Each model was trained on a subset of D∞,expij, which includes all available experimental data except for one binary system to be predicted. In the case of the MCM, this means that the training data included all experimental data for one specific temperature T, except for one solute–solvent pair (i, j), which was then predicted at the same temperature. In the case of the TCM, the training data included all experimental data for all three temperatures T ∈ Θ, except for one solute–solvent pair (i, j), which was excluded at all temperatures and predicted at all temperatures. Thus, in all cases, the predictions were made on diffusion coefficients for truly unseen solute–solvent pairs (i, j).
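The system-wise splitting described above can be sketched as follows; the data layout (a dictionary keyed by solute, solvent, and temperature) and the function name are assumptions made for illustration only.

```python
def leave_one_system_out(data):
    """Yield (pair, train, test) so that all temperatures of one system are held out together.

    data maps (solute, solvent, temperature) to a measured value.
    """
    pairs = sorted({(i, j) for (i, j, _T) in data})
    for pair in pairs:
        test = {k: v for k, v in data.items() if (k[0], k[1]) == pair}
        train = {k: v for k, v in data.items() if (k[0], k[1]) != pair}
        yield pair, train, test


# Hypothetical entries: one system measured at two temperatures, one at a single temperature.
data = {
    ("a", "s1", 298): 1.0,
    ("a", "s1", 313): 2.0,
    ("b", "s1", 298): 3.0,
}
for pair, train, test in leave_one_system_out(data):
    print(pair, len(train), len(test))
```

Because the split is made per solute–solvent pair and not per data point, no measurement of the held-out system at any temperature remains in the training set.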
To evaluate the prediction accuracy at each temperature T ∈ Θ, we computed the absolute relative error (AREij(T)) for each data point. These per-point errors were calculated using
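| $\mathrm{ARE}_{ij}(T) = \dfrac{\lvert D_{ij}^{\infty,\mathrm{pred}}(T) - D_{ij}^{\infty,\mathrm{exp}}(T) \rvert}{D_{ij}^{\infty,\mathrm{exp}}(T)}$ | (4) |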
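These errors were aggregated over the set 𝒮(T), which contains all (i, j) pairs for which experimental data are available at temperature T, and are reported as box plots. The temperature-specific relative mean absolute error (rMAE(T), cf. eqn (5)) and the relative mean squared error (rMSE(T), cf. eqn (6)) are also reported:
| $\mathrm{rMAE}(T) = \dfrac{1}{\lvert \mathcal{S}(T) \rvert} \sum_{(i,j) \in \mathcal{S}(T)} \mathrm{ARE}_{ij}(T)$ | (5) |
| $\mathrm{rMSE}(T) = \dfrac{1}{\lvert \mathcal{S}(T) \rvert} \sum_{(i,j) \in \mathcal{S}(T)} \left[\mathrm{ARE}_{ij}(T)\right]^{2}$ | (6) |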
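The error metrics can be computed with a few lines of code. The sketch assumes that the absolute relative error is the absolute deviation between prediction and experiment divided by the experimental value, that the rMAE is the mean of these errors, and that the rMSE is the mean of their squares; the function names and the example values are hypothetical.

```python
def absolute_relative_error(d_pred, d_exp):
    """Absolute deviation of the prediction, relative to the experimental value."""
    return abs(d_pred - d_exp) / d_exp


def rmae(pairs):
    """Mean of the absolute relative errors over (prediction, experiment) pairs."""
    errors = [absolute_relative_error(p, e) for p, e in pairs]
    return sum(errors) / len(errors)


def rmse(pairs):
    """Mean of the squared relative errors over (prediction, experiment) pairs."""
    errors = [absolute_relative_error(p, e) for p, e in pairs]
    return sum(err ** 2 for err in errors) / len(errors)


# Hypothetical (predicted, experimental) values in 10^-9 m2 s-1.
pairs = [(1.1, 1.0), (0.8, 1.0)]
print(rmae(pairs), rmse(pairs))
```

Because the rMSE squares the individual errors, it weights outliers more strongly than the rMAE.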
Furthermore, to demonstrate the generalization of the TCM to continuous temperature values, we trained a TCM on all data for Θ = {298 K, 313 K, 333 K} and report the errors across the temperature range [268 K, 378 K] (excluding data at T ∈ Θ) using a box plot with aggregated 10 K temperature bins. In all cases, we compare the TCM results to SEGWE predictions and, for T ∈ Θ, also to isothermal MCMs, using the same error metrics.
The pulse sequence stebpgp1s,68 a stimulated echo sequence with bipolar gradients, was used as implemented in TopSpin 3.6.5 (Bruker). The Stejskal–Tanner equation was used to calculate the self-diffusion coefficients Di:69
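| $\dfrac{I}{I_0} = \exp\left[-D_i\, \gamma^{2} g^{2} \delta^{2} \left(\Delta - \dfrac{\delta}{3} - \dfrac{\tau}{2}\right)\right]$ | (7) |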
Here, I is the signal intensity, I0 is the intensity at the lowest gradient strength, γ is the gyromagnetic ratio, δ is the gradient duration, Δ is the diffusion time, τ is the correction for bipolar gradients, and g is the gradient strength. Di was obtained by fitting the equation to the measured I/I0 ratios using a least-squares approach with the Python package lmfit.70 Peak integrals were evaluated manually using MNova (Mestrelab). The experimental uncertainty σexpi was estimated from the root-mean-square error of the fit residuals, reported as a 95% confidence interval assuming a t-distribution. The uncertainty is indicated with the experimental results.
The pulse sequence parameters were Δ = 50 ms and τ = 0.2 ms. The gradient strengths g were varied from 0.023 to 0.431 T m−1 in eight increments with equal squared spacing. 32 scans were conducted at each increment. The gradient duration δ was adjusted (300–5000 µs) to ensure at least 80% signal attenuation from lowest to highest g.
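The evaluation of such a measurement can be illustrated with synthetic data. The sketch below assumes the Stejskal–Tanner form for bipolar gradients, I/I0 = exp[−Di γ² g² δ² (Δ − δ/3 − τ/2)], and 1H detection for the gyromagnetic ratio. It replaces the nonlinear lmfit-based least-squares fit used in this work with a simple log-linear regression, and the diffusion coefficient and gradient duration are hypothetical.

```python
import math

GAMMA_1H = 2.6752218744e8  # rad s^-1 T^-1, proton gyromagnetic ratio (assumes 1H detection)


def gradient_ramp(g_min, g_max, n_steps):
    """Gradient strengths between g_min and g_max, equally spaced in g^2."""
    lo, hi = g_min ** 2, g_max ** 2
    return [math.sqrt(lo + k * (hi - lo) / (n_steps - 1)) for k in range(n_steps)]


def b_value(g, delta, big_delta, tau, gamma=GAMMA_1H):
    """Attenuation factor b in s m^-2, so that I/I0 = exp(-D * b)."""
    return (gamma * g * delta) ** 2 * (big_delta - delta / 3.0 - tau / 2.0)


def fit_diffusion_coefficient(gradients, intensities, delta, big_delta, tau):
    """Log-linear least squares: ln I = ln I0 - D * b; returns D."""
    b = [b_value(g, delta, big_delta, tau) for g in gradients]
    y = [math.log(i) for i in intensities]
    b_mean, y_mean = sum(b) / len(b), sum(y) / len(y)
    slope = (sum((bi - b_mean) * (yi - y_mean) for bi, yi in zip(b, y))
             / sum((bi - b_mean) ** 2 for bi in b))
    return -slope


# Pulse parameters from the text; delta and d_true are hypothetical.
delta, big_delta, tau = 1.0e-3, 50e-3, 0.2e-3  # s
d_true = 2.5e-9  # m2 s-1
gradients = gradient_ramp(0.023, 0.431, 8)  # T m-1
signal = [math.exp(-d_true * b_value(g, delta, big_delta, tau)) for g in gradients]
d_fit = fit_diffusion_coefficient(gradients, signal, delta, big_delta, tau)
print(f"{d_fit:.3e}")
```

The intercept of the regression absorbs the reference intensity, so the result does not depend on how the signal is normalized. With noisy data, a weighted or nonlinear fit is usually preferable because the logarithm amplifies the noise of strongly attenuated signals.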
Solutions with three different solute concentrations (0.005, 0.01, and 0.025 mol mol−1) were gravimetrically prepared for each measured solute–solvent system and measured at three temperatures (298 K, 313 K, and 333 K) and ambient pressure. When multiple peaks were present for the same compound, their respective measured Di values were averaged. The diffusion coefficients measured at the three concentrations were linearly extrapolated to infinite dilution of the solute to obtain D∞,expij(T). The overall uncertainty σ∞,expij(T) was calculated by combining propagated measurement errors and extrapolation uncertainty, reported as a 95% confidence interval assuming a t-distribution.
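The extrapolation to infinite dilution amounts to a straight-line fit of the measured diffusion coefficients over the solute mole fraction, evaluated at zero mole fraction. The following minimal sketch uses ordinary least squares; the diffusion coefficients are hypothetical, and the uncertainty propagation described above is not included.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    slope = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
             / sum((xi - x_mean) ** 2 for xi in x))
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Solute mole fractions from the text; diffusion coefficients in 10^-9 m2 s-1 are hypothetical.
x_solute = [0.005, 0.01, 0.025]
d_measured = [2.58, 2.56, 2.50]

d_infinite_dilution, slope = fit_line(x_solute, d_measured)
print(f"D at infinite dilution: {d_infinite_dilution:.3f} x 10^-9 m2 s-1")
```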
Fig. 5 Boxplot of the AREij of the D∞ij predicted using the SEGWE model,2 the MCM,9 and the developed TCM. MCM and TCM results were obtained using leave-one-out analysis, and the SEGWE model was used as proposed by the original authors.2 Boxes represent interquartile ranges (IQRs), and whiskers represent 1.5 × IQR.
Fig. 5 demonstrates that the MCMs (green) yield substantially lower prediction errors than SEGWE (red),2 confirming previous results at 298 K reported by our group.9 Notably, the MCM approach maintains superior performance over SEGWE, also at elevated temperatures (313 K and 333 K), despite the significantly lower availability of experimental data at these temperatures.
The TCM predictions (blue) exhibit lower error scores, including narrower interquartile ranges (IQRs) and shorter 1.5 × IQR whiskers, than both the SEGWE model and the MCM, indicating that the TCM predictions are more robust and have fewer outliers than the previously available methods. This robustness in predictive performance is further illustrated by the histograms of the relative prediction errors of D∞ij for each method and temperature shown in Fig. S4 in the SI.
The TCM developed in this work further improves the predictive accuracy over both the SEGWE model and the MCM across all three studied temperatures. This result is astonishing, as one could have expected a deterioration going from an individual fit for each temperature to a global fit over all temperatures. It is likely that the inclusion of additional training data across multiple temperatures allows the TCM to compensate for the more limited data sets at the higher temperatures, where substantially fewer measurements are available. This interpretation is supported by the larger performance gains observed at those temperatures. The results demonstrate that incorporating diffusion coefficient data across multiple temperatures into a model's training not only broadens the model's predictive scope but also enhances its accuracy at individual temperatures.
Fig. 6 Temperature features w1(T) and w2(T) of a TCM trained on the full discrete data set as a function of T and linear fits.
Fig. 6 shows a linear dependence of the discrete w1 and w2 (circles) on the temperature T. With the goal of predicting D∞ij across a continuous T range, we thus model the temperature dependence of wγ using eqn (8):
| wγ(T) = Aγ + BγT | (8) |
The linear regression statistics obtained from fitting eqn (8) to the discrete wγ are detailed in Table 2, including the coefficient of determination (R2) and mean squared error (MSE) of the fit.
| γ | Aγ | Bγ | R2 | MSE |
|---|---|---|---|---|
| 1 | 1.195 | −6.179 × 10−3 | 0.9939 | 4.81 × 10−5 |
| 2 | 8.271 | −2.648 × 10−2 | 0.9997 | 3.18 × 10−5 |
Table 2 shows a very strong (R2 > 0.99) linear correlation between the wγ and T. Considering the temperature independence of the solute and solvent features u and v and the core tensor κ, this implies that within the considered temperature range ln D∞ij is well-approximated as being linear in T. This result cannot be directly related to Stokes–Einstein theory38 or SEGWE,2 as they require the solvent viscosity, the temperature dependence of which is not easily described. Rather than implying a fundamentally linear temperature dependence, the result shows that the more complex underlying dependence can be represented adequately by a linear approximation within the limited temperature range studied here.
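The linear correlation of eqn (8) with the coefficients from Table 2 can be evaluated as in the following sketch. The coefficients are taken from Table 2 with T in K; the function names and the range check, which mirrors the temperature range studied in this work, are illustrative choices.

```python
# Coefficients (A_gamma, B_gamma) of eqn (8) from Table 2, with T in K.
COEFFICIENTS = {1: (1.195, -6.179e-3), 2: (8.271, -2.648e-2)}


def temperature_feature(gamma, temperature):
    """Evaluate w_gamma(T) = A_gamma + B_gamma * T."""
    a, b = COEFFICIENTS[gamma]
    return a + b * temperature


def temperature_features(temperature, t_min=268.0, t_max=378.0):
    """Return [w_1(T), w_2(T)], refusing temperatures outside the studied range."""
    if not t_min <= temperature <= t_max:
        raise ValueError("temperature outside the range studied in this work")
    return [temperature_feature(gamma, temperature) for gamma in sorted(COEFFICIENTS)]


for T in (268.0, 298.0, 313.0, 333.0, 378.0):
    w1, w2 = temperature_features(T)
    print(f"{T:.0f} K: w1 = {w1:.4f}, w2 = {w2:.4f}")
```

These features replace the discrete temperature features in the Tucker model when predictions at temperatures other than the three training temperatures are required.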
It is worth noting that the TCM learned this correlation only from the experimental values of D∞ij at 298 K, 313 K, and 333 K. No information on the actual physical temperature was provided, as was also the case for solutes and solvents. It is thus most astonishing that the TCM was able to learn a correlation between its parameters and the temperature purely from experimental data. Using this correlation, we predicted D∞ij for the same solute–solvent matrix of Fig. 1, at temperatures between 268 K and 378 K, the prediction error of which is shown as a function of T in Fig. 7, using 10 K temperature bins.
Fig. 7 Boxplot of the AREij of the D∞ij predicted using SEGWE2 and the TCM as a function of T. The numbers above each box indicate the number of data points per bin, horizontal lines represent the median, boxes represent the IQR, and whiskers represent 1.5 × IQR.
The TCM maintains high performance across the temperature range of 268 K to 378 K, with a total rMAE of 0.118, while the SEGWE model has a total rMAE of 0.263 for the same data set. Additionally, the TCM substantially outperforms SEGWE,2 even at temperatures not present in the TCM's training set. It is most surprising that, despite the training data covering only a very small temperature range (35 K) and only three temperatures, the TCM can extrapolate to any unseen temperature in a much broader range (110 K) with barely any loss in accuracy. As expected, the prediction accuracy gradually worsens with increasing temperature; thus, the linear scaling with T should be used only within the specified 268 K to 378 K range. To improve prediction accuracy at higher temperatures, alternative scaling and the inclusion of experimental data at higher temperatures could be used.
| No. | Solute i | Solvent j | D∞,expij (298 K) / 10−9 m2 s−1 | D∞,expij (313 K) / 10−9 m2 s−1 | D∞,expij (333 K) / 10−9 m2 s−1 |
|---|---|---|---|---|---|
| 1 | Methyl isopropyl ketone | 1,2-Propanediol | 0.052 ± 0.002 | 0.115 ± 0.009 | 0.248 ± 0.044 |
| 2 | Butyl acetate | 1,2-Propanediol | 0.059 ± 0.002 | 0.122 ± 0.003 | 0.267 ± 0.006 |
| 3 | Benzaldehyde | 1,2-Propanediol | 0.059 ± 0.009 | 0.127 ± 0.001 | 0.296 ± 0.022 |
| 4 | Dimethoxymethane | Acetone | 4.064 ± 0.025 | 4.894 ± 0.041 | 6.161 ± 0.102 |
| 5 | Water | Dimethoxymethane | 5.737 ± 0.132 | 7.085 ± 0.131 | 9.270 ± 0.177 |
| 6 | 2-Methyl-2,4-pentanediol | Acetone | 5.019 ± 0.012 | 6.323 ± 0.024 | 8.356 ± 0.030 |
| 7 | m-Cresol | Acetonitrile | 2.675 ± 0.016 | 3.368 ± 0.072 | 4.357 ± 0.042 |
| 8 | Water | 2,4,6-Trioxaheptane | 2.559 ± 0.087 | 3.249 ± 0.126 | 3.965 ± 0.049 |
| 9 | Hexafluorobenzene | 1-Butanol | 0.896 ± 0.005 | 1.241 ± 0.008 | 1.828 ± 0.009 |
| 10 | Chlorobenzene | Methyl isopropyl ketone | 2.575 ± 0.005 | 3.140 ± 0.018 | 4.381 ± 0.200 |
| 11 | Benzene | Butyl chloride | 3.292 ± 0.021 | 3.957 ± 0.035 | 4.994 ± 0.058 |
| 12 | Methyl isopropyl ketone | Acetone | 3.736 ± 0.008 | 4.448 ± 0.020 | 5.495 ± 0.039 |
| 13 | Glycerol | Acetonitrile | 2.668 ± 0.039 | 3.219 ± 0.081 | 4.099 ± 0.042 |
| 14 | Water | 2,4,6,8,10-Pentaoxaundecane | 0.742 ± 0.028 | 0.937 ± 0.046 | 1.276 ± 0.054 |
| 15 | Water | 2,4,6,8-Tetraoxanonane | 1.327 ± 0.005 | 1.730 ± 0.034 | 2.351 ± 0.024 |
| 16 | 2,4,6-Trioxaheptane | Dimethoxymethane | 3.464 ± 0.041 | 4.186 ± 0.026 | 4.869 ± 0.026 |
| 17 | Butyric acid | Dimethoxymethane | 2.381 ± 0.109 | 3.262 ± 0.015 | 4.175 ± 0.026 |
| 18 | Di-tert-butyl sulfide | Dimethoxymethane | 2.757 ± 0.008 | 3.347 ± 0.010 | 4.306 ± 0.025 |
| 19 | Phenol | Dimethoxymethane | 2.768 ± 0.012 | 3.418 ± 0.018 | 4.470 ± 0.045 |
As expected, D∞,expij increases with increasing temperature for all systems studied. The experimental uncertainty varies considerably. In some cases, the relative uncertainty reaches or exceeds 10%, mainly caused by the decreasing sensitivity of the NMR experiment at high temperatures in combination with low solute concentrations.
In the experimental implementation of the AL approach, we observed that some solvents were proposed particularly often by uncertainty sampling. In this study, as shown in Table 3, 1,2-propanediol was initially suggested most often; we therefore excluded it from the sampling pool after the third measurement. The same effect was observed for dimethoxymethane towards the end of our measurements.
Using the entire set of new data for 19 previously unstudied mixtures for the training, the prediction rMSE decreased from 0.18 to 0.15 at 298 K, from 0.10 to 0.08 at 313 K, and from 0.07 to 0.06 at 333 K, while the occupation rate of the matrix increased only by 1.8%. The prediction rMAE and rMSE after each AL iteration are shown in the SI, Fig. S5. These results show that substantial improvements in the prediction of diffusion coefficients can be achieved with only a few additional measurements, consistent with our previous findings.31
In the SI, we report the final TCM parameters obtained after training on the complete data set, comprising all literature data (cf. Fig. 1) and the new data measured in this work (cf. Table 3); these parameters should be used when applying the model to predict diffusion coefficients.
Furthermore, the available data on D∞ij were extended through measurements using PFG NMR spectroscopy, in which the systems were selected using an active learning (AL) approach guided by the model's uncertainty. In total, 19 systems for which no prior data were available were measured at 298 K, 313 K, and 333 K. Even though this only increases the tensor's occupation rate by 1.8%, considerable improvements in prediction quality were observed. However, further improvements could be achieved by developing tailored query strategies in future work.
Additional experimental data for D∞ij found during our comprehensive literature study are available from their original sources.19,31,41–60
In the supplementary information (SI), we report the new diffusion data measured in this work, the complete Stan code used for processing the data sets, and a set of parameters obtained using the TCM after training on all literature data (cf. Fig. 1) and the new data. See DOI: https://doi.org/10.1039/d6cp00732e.
| This journal is © the Owner Societies 2026 |