Open Access Article
Aliakbar Roosta *ab, Nima Rezaei a and Hamid Reza Godini b
aDepartment of Separation Science, School of Engineering Science, LUT University, Lappeenranta, Finland
bDepartment of Energy and Mechanical Engineering, School of Engineering, Aalto University, Espoo, Finland. E-mail: aliakbar.roosta@aalto.fi
First published on 8th June 2026
Reliable estimation of the self-diffusion coefficient is fundamental for characterizing mass transport within fluids; however, accurate prediction remains difficult due to the strong influence of thermodynamic conditions (temperature and pressure) and molecular characteristics (such as size, shape, and intermolecular forces). In this study, a hybrid predictive model is introduced, combining the PCP-SAFT equation of state with an artificial neural network (ANN) to estimate self-diffusion coefficients over a broad range of conditions. The model is developed using a dataset comprising 2263 experimental measurements for 67 compounds, covering temperatures between 93.0 and 973.2 K and pressures up to 3036 bar, with self-diffusion coefficients spanning nearly five orders of magnitude, from 10−12 to 10−7 m2 s−1. To rigorously assess the predictive performance, the dataset was partitioned into 70% for training and 30% reserved for independent validation. The proposed model incorporates thermodynamic inputs, namely density and the dimensionless form of residual entropy obtained from PCP-SAFT, together with molecular descriptors derived from COSMO-SAC sigma profiles. The selected ANN architecture, comprising two hidden layers with 14 and 7 neurons, respectively, provides high predictive performance, achieving R2 values of 0.9937 and 0.9763 and AARD values of 8.89% and 15.89% for the training and testing datasets, respectively. Overall, the proposed framework offers a unified, reliable model for predicting diffusion behavior under diverse thermodynamic conditions.
Experimental measurements of self-diffusion coefficients are often time-consuming, costly, and limited to specific compounds and conditions. As a result, there is a growing need for predictive models capable of estimating diffusion coefficients over wide thermodynamic ranges and for diverse chemical families. Conventional approaches such as theoretical7–9 and semi-empirical models10–12 have been developed to describe the self-diffusion behavior. While such methods can provide reasonable accuracy for specific systems, they are typically restricted to the compounds and the experimental conditions studied, limiting their applicability to new or uncharacterized compounds and their predictability outside the studied conditions.
To address these limitations, recent efforts have focused on data-driven approaches, particularly machine learning techniques such as artificial neural networks (ANNs), to establish relationships between molecular structure, thermodynamic properties, and transport behavior.13–16
As a potential source of molecular-level input for these models, COSMO-based methods have emerged as powerful tools for describing molecular interactions through sigma-profiles, which represent the surface charge density distribution of the molecules.17,18 These profiles can be transformed into compact molecular descriptors that capture key molecular features such as polarity, charge distribution, and hydrogen-bonding capability. When combined with machine learning, these descriptors allow molecular-level information to be incorporated directly into predictive models.19–21
In this work, we propose a hybrid modeling approach that integrates COSMO-SAC-derived molecular descriptors with thermodynamic properties (density and dimensionless form of residual entropy) calculated from perturbed-chain polar statistical associating fluid theory (PCP-SAFT) equation of state within an ANN framework. The developed model is designed to provide accurate predictions of self-diffusion coefficients for different chemicals across a wide range of temperatures and pressures. A notable advantage of the proposed model is its broad applicability across extensive operating conditions, including elevated temperatures (up to 973 K) and pressures (up to 3036 bar), conditions that are seldom covered in the existing studies. This capability is particularly relevant for practical applications where fluids are exposed to extreme conditions, such as high-temperature and high-pressure processes, including supercritical separation.22,23
The remainder of this paper is structured as follows. Section 2 describes the dataset compilation, the extraction of molecular descriptors from COSMO-SAC sigma-profiles, and the implementation of the PCP-SAFT equation of state, along with the design and training of the ANN model. Section 3 presents a detailed evaluation of the model performance using statistical metrics, graphical analyses, and sensitivity analysis to identify the most influential input variables. Finally, Section 4 summarizes the main findings and outlines potential directions for future work.
| No. | Name | CAS no. | No. of data | T/K | P/bar | Ref. |
|---|---|---|---|---|---|---|
| Train data | ||||||
| 1 | Methane | 74-82-8 | 117 | 93–454 | 1–898.3 | 34–41 |
| 2 | Ethane | 74-84-0 | 54 | 136–454 | 43.6–978.5 | 34, 36, 37 and 39 |
| 3 | Propane | 74-98-6 | 21 | 112–453 | 14.7–500 | 34 and 39 |
| 4 | Butane | 106-97-8 | 9 | 150–451 | 50–500 | 34 |
| 5 | Pentane | 109-66-0 | 29 | 174–450 | 1–981 | 34 and 42–44 |
| 6 | Heptane | 142-82-5 | 47 | 186.1–360.6 | 1–981 | 34 and 42–44 |
| 7 | Octane | 111-65-9 | 36 | 248.14–383.7 | 1–998 | 34, 44 and 45 |
| 8 | Decane | 124-18-5 | 39 | 247.86–448 | 1–750 | 34 and 42–45 |
| 9 | Undecane | 1120-21-4 | 17 | 293–353 | 1–981 | 34 and 45 |
| 10 | Dodecane | 112-40-3 | 33 | 268.75–434.7 | 1–510 | 34, 45 and 46 |
| 11 | Tetradecane | 629-59-4 | 23 | 279.36–443 | 1–750 | 34, 45 and 46 |
| 12 | Pentadecane | 629-62-9 | 16 | 288.16–353 | 1–981 | 34 and 45 |
| 13 | Hexadecane | 544-76-3 | 31 | 292.68–472.5 | 1–996 | 34 and 45 |
| 14 | Heptadecane | 629-78-7 | 14 | 303–353 | 1–981 | 34 |
| 15 | Octadecane | 593-45-3 | 11 | 301.86–425.8 | 1 | 34 |
| 16 | Eicosane | 112-95-8 | 6 | 323.16–443.7 | 1 | 34 |
| 17 | Tetracosane | 646-31-1 | 10 | 322.16–423.7 | 1 | 34 |
| 18 | 2-Methylpentane | 107-83-5 | 5 | 200–308.2 | 1 | 34 |
| 19 | 3-Methylpentane | 96-14-0 | 8 | 200–313.2 | 1 | 34 and 47 |
| 20 | 2,3-Dimethylbutane | 79-29-8 | 11 | 175.48–453 | 1–500 | 34 |
| 21 | 2,2-Dimethylbutane | 75-83-2 | 18 | 262.37–450 | 1–600 | 34 and 47 |
| 22 | Cyclopentane | 287-92-3 | 12 | 273.16–328 | 1–750 | 34 |
| 23 | Cyclohexane | 110-82-7 | 84 | 281.7–393.2 | 1–900 | 34, 44–46, 48 and 49 |
| 24 | Cycloheptane | 291-64-5 | 7 | 288.16–348.8 | 1 | 34 |
| 25 | Ethanol | 64-17-5 | 54 | 173–437 | 1–931 | 34, 45 and 50–53 |
| 26 | 1-Propanol | 71-23-8 | 32 | 212–441 | 1–750 | 34, 45 and 53 |
| 27 | 1-Butanol | 71-36-3 | 11 | 268.16–353.2 | 1 | 34, 45, 54 and 55 |
| 28 | 2-Pentanol | 6032-29-7 | 46 | 237.1–483.1 | 50–500 | 34 and 56 |
| 29 | 3-Pentanol | 584-02-1 | 52 | 249.7–474.5 | 50–500 | 34 and 56 |
| 30 | 1-Pentanol | 71-41-0 | 22 | 213–428.6 | 1–500 | 34 and 56 |
| 31 | 1-Hexanol | 111-27-3 | 5 | 278.16–338.2 | 1 | 34 and 55 |
| 32 | 1-Octanol | 111-87-5 | 9 | 288.16–343.2 | 1 | 34 |
| 33 | Glycerol | 56-81-5 | 35 | 296.8574–435.1 | 1 | 57–60 |
| 34 | Benzene | 71-43-2 | 146 | 279.96–373.2 | 1–980.7 | 34, 44, 48–50 and 61–64 |
| 35 | Toluene | 108-88-3 | 61 | 175.4286–729.2 | 1–997 | 34 and 61 |
| 36 | o-Terphenyl | 84-15-1 | 16 | 328.16–438.2 | 1 | 65 |
| 37 | Acetone | 67-64-1 | 20 | 182.86–323.2 | 1 | 34 |
| 38 | Water | 7732-18-5 | 264 | 273–973.2 | 1–976 | 34, 66 and 67 |
| 39 | Tetrahydrofuran | 109-99-9 | 7 | 180.56–308.2 | 1 | 34 |
| 40 | Ethylene | 74-85-1 | 62 | 123.15–348.2 | 20.4–810.6 | 34 and 36 |
| 41 | Carbon disulfide | 75-15-0 | 10 | 268.2–313.2 | 1–811 | 34 |
| 42 | 1,2-Dichloroethane | 107-06-2 | 12 | 278.15–298.2 | 1–2795 | 68 |
| 43 | Acetonitrile | 75-05-8 | 64 | 238.2–343.2 | 1–3036 | 69 |
| Test data | ||||||
| 44 | Ammonia | 7664-41-7 | 18 | 199.2–473 | 1–750 | 34, 70 and 71 |
| 45 | 2-Propanol | 67-63-0 | 10 | 263–360 | 1–500 | 72 |
| 46 | Methanol | 67-56-1 | 43 | 157–453 | 1–981 | 50–52 and 72 |
| 47 | Tridecane | 629-50-5 | 18 | 288.2–353 | 1–981 | 34 and 45 |
| 48 | Nonane | 111-84-2 | 38 | 235.5–403.2 | 1–990 | 34, 44 and 45 |
| 49 | Hexane | 110-54-3 | 72 | 188.5–443 | 1–998 | 34, 44, 47, 61, 72 and 73 |
| 50 | Isopentane | 78-78-4 | 24 | 298–328 | 1–2000 | 74 |
| 51 | Bromoform | 75-25-2 | 8 | 283.2–343.2 | 1 | 75 |
| 52 | N,N-Dimethylacetamide | 127-19-5 | 35 | 255–468 | 1–2000 | 76 |
| 53 | Dimethyl ether | 115-10-6 | 40 | 184.5–458 | 500–2000 | 77 |
| 54 | Diiodomethane | 75-11-6 | 24 | 285.7–351.3 | 1 | 78 |
| 55 | Dichloromethane | 75-09-2 | 36 | 186–406 | 1–2000 | 79 |
| 56 | Chloroform | 67-66-3 | 40 | 217–397 | 1–1500 | 79 |
| 57 | Carbon tetrachloride | 56-23-5 | 3 | 313.2–333.2 | 1 | 80 |
| 58 | Chlorotrifluoromethane | 75-72-9 | 60 | 133–433 | 250–2000 | 81 |
| 59 | Bromotrifluoromethane | 75-63-8 | 59 | 141–432 | 250–2000 | 81 |
| 60 | Fluorobenzene | 462-06-6 | 13 | 240–360 | 1 | 82 |
| 61 | Iodobenzene | 591-50-4 | 15 | 330–440 | 1 | 82 |
| 62 | Bromobenzene | 108-86-1 | 18 | 250–420 | 1 | 82 |
| 63 | Trimethylamine | 75-50-3 | 44 | 174–423 | 100–2000 | 83 |
| 64 | N,N-Dimethylformamide | 68-12-2 | 36 | 243–448 | 1–2000 | 84 |
| 65 | Propylene glycol | 57-55-6 | 4 | 304–318 | 1 | 85 |
| 66 | Dibromomethane | 74-95-3 | 8 | 285–363 | 1 | 86 |
| 67 | 1,2-Dibromoethane | 106-93-4 | 11 | 285–400 | 1 | 86 |
![]() | (1) |
However, entropy-scaling relationships are not universally exact. Deviations have been reported for complex fluids, mixtures, strongly associating systems, and systems exhibiting thermodynamic anomalies, where the coupling between the molecular structure and transport becomes more intricate. In particular, near critical regions, at elevated densities, or in systems with strong directional interactions, local structuring and fluctuations may limit the applicability of simple entropy-based scaling.27
In the present work, the Rosenfeld scaling concept is employed as a physically motivated normalization framework for the self-diffusion coefficient, while the final nonlinear relationship is learned using an ANN that incorporates both thermodynamic variables and molecular descriptors.
Several formulations are suggested in the literature for defining the reference self-diffusion coefficient, including those proposed by Rosenfeld,24 Chapman-Enskog,28,29 and Bretonnet.30 In addition, various empirical relationships are reported to correlate dimensionless self-diffusion (D*) with residual entropy.31–33 However, these correlations are typically parameterized for specific compounds, and their predictive capability is restricted to the systems studied, limiting their applicability to broader chemical systems.
In this work, a generalized predictive model is developed to estimate self-diffusion coefficients for a wide variety of compounds over extended temperature and pressure ranges. First, the Rosenfeld24 scaling approach is adopted to define the reference for the self-diffusion coefficient:
| $D_{\mathrm{ref}} = \rho_N^{-1/3}\left(\dfrac{RT}{M}\right)^{1/2}$ | (2) |
where $\rho_N$ denotes the number density of molecules (m−3), T (K) is the temperature, M (kg kmol−1) stands for the molar mass, and R (8314.46 J K−1 kmol−1) is the universal gas constant. The reference diffusion coefficient has dimensions of diffusivity. The term $\rho_N^{-1/3}$ represents a characteristic molecular length scale, while $(RT/M)^{1/2}$ represents a characteristic molecular thermal velocity. Their product, therefore, gives units of m2 s−1, consistent with a diffusion coefficient. In this work, Dref is not used as an independent model for diffusion. Instead, it is used to nondimensionalize the experimental self-diffusion coefficient. The effects of the molecular structure are subsequently incorporated through PCP-SAFT-derived thermodynamic properties and COSMO-SAC molecular descriptors in the ANN framework.
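The nondimensionalization described above can be sketched as follows. This is a minimal illustration, assuming that the reference value is the product of the length scale (inverse cube root of the number density) and the thermal velocity (square root of RT/M), and that the dimensionless coefficient is simply D* = D/Dref. The function names and the water-like input values are hypothetical and are not taken from the dataset.

```python
import math

R_GAS = 8314.46              # J K^-1 kmol^-1, value quoted in the text
N_AVOGADRO = 6.02214076e26   # molecules per kmol

def reference_diffusivity(molar_density, temperature, molar_mass):
    """Return D_ref in m^2/s (assumed form: length scale x thermal velocity).

    molar_density in kmol/m^3, temperature in K, molar_mass in kg/kmol.
    """
    number_density = molar_density * N_AVOGADRO                      # m^-3
    length_scale = number_density ** (-1.0 / 3.0)                    # m
    thermal_velocity = math.sqrt(R_GAS * temperature / molar_mass)   # m/s
    return length_scale * thermal_velocity

def to_dimensionless(d_exp, d_ref):
    """Assumed definition: D* = D / D_ref."""
    return d_exp / d_ref

def to_dimensional(d_star, d_ref):
    """Recover D from a predicted D* and the reference value."""
    return d_star * d_ref

# Illustrative liquid-water-like state (values are assumptions for the example)
d_ref = reference_diffusivity(molar_density=55.3, temperature=298.15, molar_mass=18.015)
d_star = to_dimensionless(2.3e-9, d_ref)
```

Because Dref depends only on density, temperature, and molar mass, the same conversion can be applied in both directions: experimental values are reduced before training, and network outputs are scaled back afterwards.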
With the reference parameter defined, the subsequent sections present the methodology for developing a generalized model for predicting the dimensionless self-diffusion coefficient (D*), followed by the procedure for obtaining the actual self-diffusion coefficient (D) using the reference value Dref.24
![]() | (3) |
In addition, key molecular descriptors can be derived using sigma-moments defined by eqn (4)–(8):
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() | (8) |
These moments characterize different aspects of the surface charge distribution: M1 gives the net surface charge, M2 reflects the polarity, and M3 describes the asymmetry. However, during model development, M3 was found to have a negligible influence on the prediction of self-diffusion coefficients. The sensitivity analysis on the experimental data showed that M3 contributes only approximately 3% to the prediction of the self-diffusion coefficient, which is significantly lower than the contributions from the other descriptors considered in this work; therefore, M3 was excluded from the input variables.
MHBD1 and MHBA1 quantify hydrogen-bond donor and acceptor strengths, respectively. The threshold value σHB = 0.008 e Å−2 separates nonpolar and polar surface segments.17,18
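As an illustration of how such descriptors can be computed, the sketch below evaluates area-weighted sigma-moments from a discretized sigma-profile. The moment definitions used here (sums of area times σ^k, and threshold-shifted sums for the hydrogen-bond terms, with donors on the negative-σ side) follow the common COSMO-RS convention and are an assumption rather than a reproduction of eqn (4)–(8). The function name, argument layout, and toy profile are hypothetical.

```python
SIGMA_HB = 0.008  # e/Å^2, hydrogen-bonding threshold quoted in the text

def sigma_descriptors(sigma, area):
    """Compute sigma-profile descriptors (assumed COSMO-RS-style definitions).

    sigma: charge density of each profile bin (e/Å^2)
    area:  surface area assigned to each bin (Å^2)
    """
    pairs = list(zip(sigma, area))
    return {
        "A_tot": sum(area),                                          # Å^2
        "M1": sum(a * s for s, a in pairs),                          # e
        "M2": sum(a * s ** 2 for s, a in pairs),                     # e^2 Å^-2
        "M3": sum(a * s ** 3 for s, a in pairs),                     # e^3 Å^-4
        # Acceptor segments: sigma above +SIGMA_HB
        "M_HBA1": sum(a * max(0.0, s - SIGMA_HB) for s, a in pairs),   # e
        # Donor segments: sigma below -SIGMA_HB
        "M_HBD1": sum(a * max(0.0, -s - SIGMA_HB) for s, a in pairs),  # e
    }

# Toy symmetric profile, for illustration only
toy_sigma = [-0.012, -0.004, 0.0, 0.004, 0.012]
toy_area = [2.0, 10.0, 20.0, 10.0, 2.0]
toy_descriptors = sigma_descriptors(toy_sigma, toy_area)
```

For this symmetric toy profile the odd moments cancel, while the second moment and both hydrogen-bond terms remain positive, which matches the interpretation of M1 as net charge and M2 as polarity.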
The residual Helmholtz energy is expressed in eqn (9):88,89
| $a^{\mathrm{res}} = a^{\mathrm{hc}} + a^{\mathrm{d}} + a^{\mathrm{p}} + a^{\mathrm{assoc}}$ | (9) |
The contributions correspond to hard-chain repulsion (hc), dispersion interactions (d), polar interactions (p), and association (assoc), respectively. For non-associating, nonpolar compounds, the model requires three parameters: segment number (m), segment diameter (σ), and dispersion energy (ε). Associating fluids require two additional parameters, the association energy ($\varepsilon^{A_iB_i}$) and association volume ($\kappa^{A_iB_i}$), while polar compounds are characterized using the dipole moment (μD).
The dimensionless form of residual entropy is calculated from the Helmholtz energy as:90
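| $\dfrac{s^{\mathrm{res}}}{R} = -T\left(\dfrac{\partial\,(a^{\mathrm{res}}/RT)}{\partial T}\right)_{\rho} - \dfrac{a^{\mathrm{res}}}{RT}$ | (10) |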
In addition to residual entropy, density (ρ), also calculated from the EoS, is included as an input variable because molecular diffusion depends strongly on fluid density.
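The step from a residual Helmholtz energy to a dimensionless residual entropy can be illustrated numerically. The sketch below assumes the standard identity s_res = −(∂a_res/∂T) at constant density and uses a van der Waals residual Helmholtz energy as a simple stand-in for PCP-SAFT. The parameter values and function names are hypothetical, and the finite-difference derivative is a convenience for the example, not the method used in this work.

```python
import math

R_GAS = 8314.46  # J K^-1 kmol^-1

def a_res_vdw(temperature, rho, a=5.5e5, b=0.03):
    """Toy residual molar Helmholtz energy (J/kmol) of a van der Waals fluid.

    rho in kmol/m^3; a and b are illustrative parameters, not fitted values.
    """
    return -R_GAS * temperature * math.log(1.0 - b * rho) - a * rho

def reduced_residual_entropy(a_res, temperature, rho, dT=1e-3):
    """s_res/R = -(1/R) * (d a_res / dT) at constant density (central difference)."""
    da_dT = (a_res(temperature + dT, rho) - a_res(temperature - dT, rho)) / (2.0 * dT)
    return -da_dT / R_GAS

s_res_over_R = reduced_residual_entropy(a_res_vdw, temperature=300.0, rho=10.0)
```

For the van der Waals stand-in the result reduces analytically to ln(1 − bρ), which is negative, as expected for a residual entropy: the real fluid is more ordered than the ideal gas at the same temperature and density.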
| Category | Input variable | Description |
|---|---|---|
| COSMO-SAC parameters | Atot | Total surface area of each molecule (Å2) |
| | M1 | Index of net surface charge (e) |
| | M2 | Index of polarity (e2 Å−2) |
| | MHBD1 | Index of hydrogen-bond donor strength (e) |
| | MHBA1 | Index of hydrogen-bond acceptor strength (e) |
| Thermodynamic properties | sres/R | Dimensionless form of residual entropy |
| | ρ | Molar density (kmol m−3) |
Prior to training, all inputs and outputs were linearly scaled to the range [−1, 1], which improved numerical stability and facilitated convergence. The same scaling parameters derived from the training data were applied to the testing dataset. To identify an optimal and robust network configuration, we systematically explored different architectures, activation functions, and training algorithms. Both shallow and deep network structures were examined, including single hidden layers with 10–40 neurons and two-layer configurations with varying neuron counts.
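A minimal sketch of this scaling step is given below, assuming a column-wise min-max mapping to [−1, 1]. The helper names are hypothetical. The key point it illustrates is that the minimum and maximum are taken from the training rows only and then reused unchanged for the testing rows.

```python
def fit_minmax(train_rows):
    """Return (min, max) per column, computed from the training data only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]

def apply_minmax(rows, params):
    """Map each column linearly so that the training range becomes [-1, 1]."""
    scaled = []
    for row in rows:
        scaled.append([
            2.0 * (x - lo) / (hi - lo) - 1.0 if hi > lo else 0.0
            for x, (lo, hi) in zip(row, params)
        ])
    return scaled

train = [[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]]
test = [[12.0, 5.0]]

params = fit_minmax(train)
train_scaled = apply_minmax(train, params)
test_scaled = apply_minmax(test, params)  # may fall outside [-1, 1]
```

Testing values that lie beyond the training range map outside [−1, 1], which is a simple flag that the network is being asked to extrapolate.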
Three activation functions (tansig, logsig, and poslin) were evaluated for the hidden layers, while a linear activation was consistently used in the output layer for the regression task. Training was conducted using multiple optimization algorithms, including Levenberg–Marquardt, scaled conjugate gradient, and Bayesian regularization. Model performance on the training and independent testing datasets was assessed through statistical metrics such as the coefficient of determination (R2), mean absolute error (MAE), average absolute relative deviation (AARD%), and maximum absolute relative deviation. In addition, we analyzed parity plots, error distributions, and error trends across diffusion ranges to evaluate the accuracy and robustness of the model.
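The statistical metrics listed above can be computed as in the following sketch. The definitions are the conventional ones (R2 as one minus the ratio of residual to total sum of squares, relative deviations normalized by the experimental value) and are assumed here; the text does not state, for example, whether R2 was evaluated on D or on a transformed quantity. The function name and sample numbers are hypothetical.

```python
def regression_metrics(y_exp, y_pred):
    """Return R2, MAE, AARD% and maximum ARD% for paired values."""
    n = len(y_exp)
    mean_exp = sum(y_exp) / n
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    abs_err = [abs(p - e) for e, p in zip(y_exp, y_pred)]
    # Absolute relative deviation of each point, in percent
    ard = [100.0 * err / abs(e) for err, e in zip(abs_err, y_exp)]
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs_err) / n,
        "AARD%": sum(ard) / n,
        "MaxARD%": max(ard),
    }

metrics = regression_metrics([1.0, 2.0, 4.0], [1.1, 1.8, 4.0])
```

Note that MAE is dominated by the largest diffusion coefficients, whereas AARD weights every point equally on a relative basis, so the two metrics can rank datasets differently.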
To justify the selected ANN topology, several network architectures were evaluated, and some of them are summarized in Table 3. Networks with fewer neurons, such as 8-4 and 10-5, showed higher AARD values for both training and testing datasets, indicating insufficient flexibility to capture the nonlinear relationship between the input descriptors and the dimensionless self-diffusion coefficient. Increasing the number of neurons improved the training performance; however, architectures larger than 14-7 led to reduced testing accuracy despite lower training errors, suggesting the onset of overfitting. The 14-7 architecture provided the best compromise between model complexity and generalization capability, achieving high accuracy for the training data while maintaining the lowest testing AARD among the evaluated structures. Therefore, this topology was selected as the final ANN configuration.
| Hidden-layer neurons | Train AARD% | Test AARD% | Comment |
|---|---|---|---|
| 8-4 | 15.36 | 21.84 | Underfitting |
| 10-5 | 11.92 | 19.96 | Improved but less accurate |
| 12-6 | 9.84 | 17.51 | Good performance |
| 14-7 | 8.89 | 15.89 | Selected model |
| 16-8 | 7.45 | 18.54 | Slight overfitting |
| 20-10 | 6.43 | 22.11 | Overfitting |
In addition, to evaluate the statistical robustness and generalization capability of the 14-7 architecture model, a repeated compound-wise validation procedure was performed. In this approach, the dataset was randomly partitioned multiple times at the compound level, ensuring that each testing set contained entirely unseen chemical species. The ANN model was retrained for each split, and the resulting performance metrics were statistically analyzed.
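A compound-level split of this kind can be sketched as below. The fraction of compounds held out, the number of repetitions, and the function name are assumptions for illustration; the essential feature is that all data points of a given compound go to the same side of the split.

```python
import random

def compound_wise_split(compound_ids, test_fraction=0.3, seed=0):
    """Split row indices so that no compound appears in both sets.

    compound_ids: one identifier per data point (e.g. a CAS number).
    test_fraction applies to the number of compounds (an assumption).
    """
    rng = random.Random(seed)
    compounds = sorted(set(compound_ids))
    rng.shuffle(compounds)
    n_test = max(1, round(test_fraction * len(compounds)))
    test_compounds = set(compounds[:n_test])
    train_idx = [i for i, c in enumerate(compound_ids) if c not in test_compounds]
    test_idx = [i for i, c in enumerate(compound_ids) if c in test_compounds]
    return train_idx, test_idx

# Toy data set: four compounds with different numbers of points
ids = ["methane"] * 3 + ["ethanol"] * 2 + ["water"] * 4 + ["benzene"]

# Repeat the split with different seeds; a model would be retrained on each
splits = [compound_wise_split(ids, seed=s) for s in range(5)]
```

In a full workflow, each split would be followed by retraining and by collecting the metrics, whose mean and standard deviation then summarize the sensitivity to the choice of held-out compounds.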
As shown in Table 4, the variation in model performance across different splits is relatively small. The AARD and R2 values for both training and testing datasets exhibit limited standard deviations, and the corresponding 95% confidence intervals remain narrow. This indicates that the predictive performance of the model is not sensitive to the specific selection of compounds in the training or testing sets. Overall, this repeated validation analysis provides strong evidence that the hybrid PCP-SAFT + ANN model exhibits both robustness and generalizability.
| Dataset | Training | Testing |
|---|---|---|
| R2 | 0.9937 ± 0.0038 | 0.9763 ± 0.0131 |
| AARD (%) | 8.89 ± 1.07 | 15.89 ± 2.08 |
Table 5 presents a statistical evaluation of the 14-7 architecture model using training and independent testing datasets. The results clearly demonstrate the high predictive accuracy of the proposed hybrid framework. For the training dataset (1586 data points), the model achieves a coefficient of determination of R2 = 0.9937 along with an MAE of 9.23 × 10−10 m2 s−1 and an AARD of 8.89%. These low error values, combined with the high R2, indicate that the ANN successfully captures the complex and nonlinear dependence of the self-diffusion coefficient on the selected input variables. The predictive performance remains robust when evaluated against the independent testing dataset (677 data points). In this case, the model yields R2 = 0.9763, MAE = 4.32 × 10−10 m2 s−1, and AARD = 15.89%. Considering the full dataset (2263 data points), the overall performance remains consistently high, with R2 = 0.9839. The model maintains reliable accuracy across a wide range of self-diffusion coefficients, spanning approximately five orders of magnitude (1.89 × 10−12–3.61 × 10−7 m2 s−1). This broad applicability demonstrates the robustness of the hybrid PCP-SAFT + ANN approach for predicting self-diffusion behavior in diverse chemical systems.
| Set | No. of data | R2 | MAE (m2 s−1) | AARD% | Max ARD% |
|---|---|---|---|---|---|
| Train | 1586 | 0.9937 | 9.23 × 10−10 | 8.89 | 54.38 |
| Test | 677 | 0.9763 | 4.32 × 10−10 | 15.89 | 61.14 |
| Total | 2263 | 0.9839 | 7.76 × 10−10 | 10.98 | 61.14 |
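
The "Total" row can be cross-checked from the training and testing rows, since MAE and AARD are simple averages over data points and therefore pool as count-weighted means. The short sketch below does this with the values from Table 5; the helper name is hypothetical. R2 is not additive in this way and is not included.

```python
def pooled_mean(values, counts):
    """Count-weighted mean of per-subset averages."""
    return sum(v * n for v, n in zip(values, counts)) / sum(counts)

counts = [1586, 677]  # training and testing data points (Table 5)

pooled_aard = pooled_mean([8.89, 15.89], counts)        # about 10.98 %
pooled_mae = pooled_mean([9.23e-10, 4.32e-10], counts)  # about 7.76e-10 m^2/s
```

Both pooled values agree with the reported totals to the stated precision.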
Fig. 1 shows the structure of the proposed hybrid framework, which integrates thermodynamic information from the PCP-SAFT EoS with a data-driven ANN model. The input layer consists of seven neurons representing key descriptors, including COSMO-SAC-derived molecular parameters, dimensionless residual entropy, and density obtained from PCP-SAFT calculations. These inputs provide both molecular-level and thermodynamic information, enabling a physically informed prediction. The proposed ANN includes two hidden layers with 14 and 7 neurons, respectively. This configuration was found to be sufficiently flexible to capture the nonlinear interactions between the descriptors without overfitting the data. The output layer contains a single neuron that predicts the dimensionless self-diffusion coefficient (D*). The complete set of model parameters and implementation details are provided in the SI.
Fig. 1 Schematic representation of the hybrid PCP-SAFT + ANN framework used for self-diffusion coefficient prediction.
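
The layer structure described above (7 inputs, hidden layers of 14 and 7 neurons, one linear output) can be written as a small forward pass. This sketch assumes a fully connected network with biases and a hyperbolic tangent in the hidden layers (tansig is mathematically identical to tanh), although the text does not state which of the tested activations was finally selected. The weights here are random placeholders, not the fitted parameters reported in the SI.

```python
import math
import random

LAYER_SIZES = [7, 14, 7, 1]  # inputs, hidden 1, hidden 2, output

def init_network(seed=0):
    """Create placeholder weights and biases for each layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(LAYER_SIZES[:-1], LAYER_SIZES[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers

def forward(layers, x):
    """Propagate one scaled input vector; tanh in hidden layers, linear output."""
    activation = list(x)
    last = len(layers) - 1
    for k, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activation)) + b
             for row, b in zip(weights, biases)]
        activation = z if k == last else [math.tanh(v) for v in z]
    return activation[0]

def count_parameters(layers):
    """Total number of weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)

network = init_network()
n_params = count_parameters(network)          # 112 + 105 + 8 = 225
d_star_scaled = forward(network, [0.0] * 7)   # output in the scaled space
```

Under these assumptions the network has 225 adjustable parameters, which is small relative to the 1586 training points.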
Fig. 2 presents the parity plot comparing the predicted and experimental values of the self-diffusion coefficient (D) for both the training and testing datasets. The close alignment of the data points along the diagonal reference line (y = x) indicates a high level of agreement between the model predictions and experimental measurements. The model maintains a strong predictive accuracy over a broad range of self-diffusion coefficients, spanning approximately five orders of magnitude (10−12 to 10−7 m2 s−1). This wide coverage highlights the capability of the hybrid PCP-SAFT + ANN framework to reliably capture diffusion behavior for systems with significantly different transport characteristics. Data points corresponding to the training set (blue circles) are densely distributed around the parity line, confirming that the model has effectively learned the underlying nonlinear relationships between the input descriptors and the target property. More importantly, the testing dataset (green triangles), which includes compounds not involved in the training process, also follows the parity line closely. This demonstrates that the model retains its high predictive accuracy when applied to unseen data. No noticeable systematic bias or deviation is observed across the entire range of D values, further supporting the robustness and stability of the selected network architecture. Overall, the results confirm that the proposed hybrid model provides reliable and consistent predictions of self-diffusion coefficients across diverse chemical systems and thermodynamic conditions.
Fig. 2 Parity plot comparing the estimated and experimental self-diffusion coefficients for the training (1586 points) and testing (677 points) datasets.
Fig. 3 shows the distribution of relative error residuals for both the training and testing datasets as a function of the experimental self-diffusion coefficient. The error profiles provide additional insight into the consistency and reliability of the developed model. For the training dataset, the relative errors are predominantly distributed around zero across the entire range of diffusion coefficients. The narrow spread of the data indicates that the model achieves high accuracy with minimal dispersion, confirming its ability to represent the underlying relationships within the training data. In the case of the testing dataset, a slightly broader distribution of errors is observed, which is expected for data not included during the model calibration. Nevertheless, the errors are randomly distributed around zero, and no systematic overprediction or underprediction trends are observed. This relatively low error and its random distribution indicate that the model preserves its predictive capability when applied to unseen compounds. Importantly, the absence of any noticeable bias or trend in the error magnitude with respect to the diffusion coefficient suggests that the model performance is reliable across the investigated range. Overall, the results demonstrate that the hybrid PCP-SAFT + ANN model delivers unbiased and robust predictions, with no indication of overfitting and with strong generalization across diverse chemical systems and operating conditions.
Fig. 3 Relative prediction error (%) as a function of experimental self-diffusion coefficient for the training (top panel) and testing (bottom panel) datasets.
Fig. 4 presents the histograms of relative prediction errors for both the training and testing datasets, providing a statistical perspective on the model accuracy and error distribution. For the training dataset, the error distribution is sharply centered around zero, with about 70% of the residuals falling within ±10% and over 90% within ±20%. This narrow and symmetric distribution confirms the high prediction accuracy of the model and demonstrates its ability to represent the nonlinear dependence of the self-diffusion coefficient on the selected descriptors. Only a small fraction of the data exhibits larger deviations, which are mainly associated with conditions at very low diffusion coefficients, where sensitivity to input parameters and experimental uncertainty are typically higher. For the testing dataset, the error distribution remains centered close to zero, indicating that the model predictions are essentially unbiased for unseen compounds. Despite a slightly broader spread compared to the training data, about 50% of the residuals lie within ±10% and 70% within ±20%, which are practically acceptable error bounds. The symmetric shape of the histogram further confirms the absence of systematic overprediction or underprediction. Overall, these quantitative error distributions demonstrate that the hybrid PCP-SAFT + ANN model achieves both high accuracy and strong generalization capability, maintaining reliable performance across a wide range of self-diffusion coefficients and thermodynamic conditions.
Fig. 4 Histograms of relative prediction errors (%) for the self-diffusion coefficient for the training (left panel) and testing (right panel) datasets.
Fig. 5 presents the variation of the average absolute relative deviation (AARD%) across different ranges of self-diffusion coefficient for both the training and testing datasets. Each interval corresponds to one order of magnitude of D, enabling a consistent assessment of model performance across the entire diffusion range. For the training dataset, the AARD shows a clear decreasing trend with increasing self-diffusion coefficient. The highest deviations are observed at the lowest diffusion range (10−12–10−11 m2 s−1); the AARD steadily declines as D increases, reaching its minimum values at the highest diffusion ranges. The increase in the relative deviation observed at very low diffusion coefficients can be attributed to both numerical sensitivity and experimental uncertainty. From a mathematical standpoint, the AARD involves normalization by the experimental value; therefore, when D is very small, even minor absolute differences between the predicted and experimental values can result in larger relative errors.
Fig. 5 Variation of AARD (%) of the ANN model across self-diffusion coefficient intervals for the training and testing datasets.
In addition, low-diffusivity conditions typically correspond to high-density or highly structured fluid states, where molecular mobility is significantly restricted. Under such conditions, experimental measurements of self-diffusion coefficients are inherently more challenging and may be associated with higher uncertainty due to limitations in measurement techniques and sensitivity to temperature and pressure control.
From a modeling perspective, the calculation of residual entropy using PCP-SAFT also becomes more sensitive under these conditions.
Despite these challenges, the model maintains consistently low absolute errors across the full range of diffusion coefficients, and the observed increase in relative deviation at very low values is primarily a consequence of normalization effects and data sensitivity rather than a systematic limitation of the proposed framework. A similar trend is observed for the testing dataset, although with higher deviations, as expected for unseen data. Overall, Fig. 5 confirms that the hybrid PCP-SAFT + ANN model provides reliable predictions of the self-diffusion coefficient across the full range studied, which spans roughly five orders of magnitude.
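The decade-wise analysis of Fig. 5 can be reproduced with a few lines. The sketch below groups points by the order of magnitude of the experimental value and averages the absolute relative deviation in each group; the function name and sample values are hypothetical. It also makes the normalization effect explicit: an absolute error of 1 × 10−12 m2 s−1 is a 50% deviation when D is 2 × 10−12 m2 s−1, but would be negligible at 10−8 m2 s−1.

```python
import math
from collections import defaultdict

def aard_by_decade(d_exp, d_pred):
    """Return {decade exponent: AARD%} with decades defined by floor(log10(D_exp))."""
    groups = defaultdict(list)
    for e, p in zip(d_exp, d_pred):
        decade = math.floor(math.log10(e))
        groups[decade].append(100.0 * abs(p - e) / e)
    return {dec: sum(v) / len(v) for dec, v in sorted(groups.items())}

# Illustrative values in m^2/s (not taken from the dataset)
d_exp = [2e-12, 5e-12, 3e-9, 2e-7]
d_pred = [3e-12, 5e-12, 3.3e-9, 2e-7]
per_decade = aard_by_decade(d_exp, d_pred)
```

Here the key −12 collects the points between 10−12 and 10−11 m2 s−1, and so on for the other decades.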
Fig. 6 presents the cumulative coverage curves for the training, testing, and overall datasets as a function of the AARD threshold. This representation provides a comprehensive assessment of the predictive reliability of the hybrid PCP-SAFT + ANN model across the full range of self-diffusion coefficients. For the training dataset, the curve increases sharply at low AARD thresholds, indicating that a large fraction of the data is predicted with high accuracy. About 90% of the training data fall within 20% AARD, and complete coverage is reached at about 54%, consistent with the maximum ARD of 54.38% reported in Table 5. This steep rise confirms the strong fitting capability of the model. For the testing dataset, the cumulative curve exhibits a more gradual increase, reflecting a greater variability as expected for unseen data. Nevertheless, the model still demonstrates solid predictive performance, with 70% of the data within 20% AARD and over 90% within 40% AARD. This behavior highlights the ability of the model to generalize effectively across different compounds and thermodynamic conditions. The curve corresponding to the full dataset lies between the training and testing curves, as expected, and reflects the overall predictive performance of the model. All three curves reach 100% coverage at AARD values below 62%, indicating that even the largest deviations remain bounded. Overall, this cumulative analysis confirms that the developed hybrid model provides reliable and consistent predictions of the self-diffusion coefficient, with a high proportion of results falling within practically acceptable error limits for both known and unseen systems.
Fig. 6 Cumulative fraction of data within a given AARD threshold for the training, testing, and overall datasets.
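
A cumulative coverage curve of the kind shown in Fig. 6 is the percentage of points whose absolute relative deviation does not exceed a given threshold. The sketch below is a minimal illustration with hypothetical deviations and function names; it is not the plotting code used for the figure.

```python
def coverage(ard_values, threshold):
    """Percentage of points with absolute relative deviation <= threshold."""
    within = sum(1 for v in ard_values if v <= threshold)
    return 100.0 * within / len(ard_values)

def coverage_curve(ard_values, thresholds):
    """Coverage evaluated at each threshold, in the order given."""
    return [(t, coverage(ard_values, t)) for t in thresholds]

# Hypothetical per-point deviations in percent
ard = [2, 8, 12, 18, 25, 40, 55, 5, 9, 15]
curve = coverage_curve(ard, thresholds=[10, 20, 40, 60])
```

By construction the curve is non-decreasing and reaches 100% at the maximum deviation in the set, so its end point can be read directly against the maximum ARD reported for each dataset.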
Table 6 presents the predictive performance of the hybrid PCP-SAFT + ANN model for different classes of compounds, categorized based on their intermolecular interaction type. This classification enables a more detailed assessment of the model's capability across fluids with fundamentally different interaction mechanisms, including dispersion-dominated (nonpolar), dipolar (polar non-associating), and hydrogen-bonding (associating) systems.
| Set | No. of compounds | No. of data | R2 | MAE (m2 s−1) | AARD% | Max ARD% |
|---|---|---|---|---|---|---|
| Nonpolar | 34 | 1108 | 0.9931 | 1.15 × 10−9 | 8.81 | 54.38 |
| Polar non-associating | 19 | 550 | 0.9343 | 4.57 × 10−10 | 16.57 | 58.72 |
| Associating | 14 | 605 | 0.9959 | 3.81 × 10−10 | 9.87 | 61.14 |
For nonpolar compounds, the model achieves a high level of accuracy, with an R2 value of 0.9931 and an AARD of 8.81%. These systems are primarily governed by dispersion interactions, which are well described by PCP-SAFT. As a result, the residual entropy and density provide a consistent representation of the thermodynamic state, enabling the ANN to accurately capture the diffusion behavior across a wide range of conditions.
Similarly, associating compounds exhibit excellent predictive performance, with the highest R2 value of 0.9959 and an AARD of 9.87%. This indicates that the hybrid framework successfully captures the effect of hydrogen-bonding interactions. The inclusion of association terms in PCP-SAFT, together with hydrogen-bond-related descriptors derived from COSMO-SAC, provides sufficient information to the ANN to capture the additional complexities introduced by specific intermolecular interactions.
In contrast, the performance for polar non-associating compounds is noticeably lower, with an R2 value of 0.9343 and an AARD of 16.57%, nearly twice that of the nonpolar group. This can be attributed to the more complex behavior of dipolar interactions, which are generally weaker and more sensitive to molecular orientation compared to hydrogen bonding. In such systems, the relationship between the thermodynamic properties and diffusion behavior is less direct, leading to increased variability in the data and reduced predictive accuracy. In addition, polar compounds often exhibit a wider range of molecular structures and dipole moments, which may not be fully captured by the selected descriptors.
Despite these differences, the model maintains reasonable accuracy across all categories, demonstrating its robustness and general applicability. The maximum absolute relative deviation (max ARD) remains within a similar range for all groups (54.38% to 61.14%), indicating that extreme deviations are not systematically associated with any specific class of compounds.
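The per-class statistics reported in Table 6 can be reproduced from paired experimental and predicted values with a few lines of code. The following is a minimal sketch, assuming the usual definitions of R2, MAE, AARD and max ARD evaluated directly on the diffusion coefficients; the source does not state whether the coefficients were transformed before the metrics were computed, and the function names `regression_metrics` and `metrics_by_class` are hypothetical.

```python
import numpy as np

def regression_metrics(d_exp, d_pred):
    """Return R2, MAE, AARD% and max ARD% for paired arrays of
    experimental and predicted self-diffusion coefficients."""
    d_exp = np.asarray(d_exp, dtype=float)
    d_pred = np.asarray(d_pred, dtype=float)
    # Absolute relative deviation of each point, in percent.
    ard = np.abs(d_pred - d_exp) / d_exp * 100.0
    ss_res = np.sum((d_exp - d_pred) ** 2)
    ss_tot = np.sum((d_exp - d_exp.mean()) ** 2)
    return {
        "R2": float(1.0 - ss_res / ss_tot),
        "MAE": float(np.mean(np.abs(d_pred - d_exp))),
        "AARD%": float(ard.mean()),
        "maxARD%": float(ard.max()),
    }

def metrics_by_class(d_exp, d_pred, classes):
    """Evaluate the same metrics separately for each compound class
    (e.g. nonpolar, polar non-associating, associating)."""
    d_exp = np.asarray(d_exp, dtype=float)
    d_pred = np.asarray(d_pred, dtype=float)
    classes = np.asarray(classes)
    return {
        str(c): regression_metrics(d_exp[classes == c], d_pred[classes == c])
        for c in np.unique(classes)
    }
```

Because AARD weights every point by its own magnitude whereas MAE does not, the two metrics can rank the classes differently, as they do in Table 6.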
Fig. 7 Comparison of estimated and experimental self-diffusion coefficients for four representative compounds from the training dataset across a wide range of temperatures and pressures. Lines represent predictions, while symbols denote the literature data: benzene,34,44,48–50,61–64 toluene,34,61 cyclohexane,34,44–46,48,49 and propane.34,39
Fig. 8 shows the predictive performance of the developed hybrid PCP-SAFT + ANN model for four representative compounds (ammonia, methanol, isopentane, and chloroform) that were not included in the training dataset. The comparison between predicted values and reported experimental data supports the model's generalization capability across different chemical systems and thermodynamic conditions. As can be seen in Fig. 8, the predicted curves closely follow the reported data over the full range of temperatures and pressures, reproducing both the magnitude and the variation of the self-diffusion coefficient. The agreement is consistent across different pressure levels, and the model captures the separation between isobars at each temperature. The deviations between the predicted and experimental values are generally small and show no systematic trend. Because none of the systems in Fig. 8 was used for training, this comparison is a particularly rigorous validation. Overall, Fig. 8 demonstrates that the hybrid PCP-SAFT + ANN model provides accurate and reliable predictions for unseen compounds over a wide range of thermodynamic conditions, highlighting its suitability for practical applications.
Fig. 8 Comparison of estimated and experimental self-diffusion coefficients for four representative compounds from the test dataset across a wide range of temperatures and pressures: ammonia,34,70,71 methanol,50–52,72 isopentane,74 and chloroform.79
The permutation-based sensitivity analysis (Fig. 9) identifies the dimensionless residual entropy, sres/R, as the most influential input, with a relative contribution of approximately 35%, indicating that intermolecular interactions, as reflected in the residual entropy, play the dominant role in determining diffusion behavior. This is followed by Atot (17%), the first sigma-profile moment, M1 (14%), and the second sigma-profile moment, M2 (12%), all of which make notable contributions to the model prediction. The hydrogen-bond donor descriptor (MHBD1) also shows a meaningful effect, with a relative importance of approximately 9%, while density (ρ) contributes around 8%, and the hydrogen-bond acceptor descriptor (MHBA1) exhibits the smallest contribution, at approximately 5%. These results highlight that both thermodynamic properties and molecular surface characteristics play key roles in governing molecular diffusion. Overall, the sensitivity analysis indicates that the ANN model relies primarily on residual entropy, while the molecular descriptors also contribute significantly.
Fig. 9 Contribution of input variables to the ANN predictions of self-diffusion coefficients, evaluated via permutation-based sensitivity analysis.
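A permutation-based sensitivity analysis of the kind summarized in Fig. 9 can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the procedure used in this work: importance is taken as the mean increase in mean squared error after shuffling one input column, normalized so that the contributions sum to one. The error measure, the number of repeats and the name `permutation_importance` are assumptions, and `predict` stands for any trained model that maps an input matrix to predictions.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Relative importance of each input column, estimated from the
    increase in mean squared error when that column is shuffled."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    base_mse = np.mean((predict(X) - y) ** 2)
    increase = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between input j and the target
            # while leaving its marginal distribution unchanged.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increase[j] += np.mean((predict(X_perm) - y) ** 2) - base_mse
    # Average over repeats; negative values (noise) are clipped to zero.
    increase = np.clip(increase / n_repeats, 0.0, None)
    total = increase.sum()
    return increase / total if total > 0 else increase
```

With seven inputs, the returned vector would correspond to the seven bars of a plot such as Fig. 9 once multiplied by 100. Correlated inputs tend to share importance under this scheme, so the values are best read alongside the correlation matrix of the inputs.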
Table 7 presents the Pearson correlation matrix for the seven input variables used in the ANN model. The results show that most descriptors exhibit weak to moderate correlations, while some COSMO-SAC-derived descriptors show stronger correlations with each other. For example, M2 is strongly correlated with MHBD1 and MHBA1, with correlation coefficients of 0.80 and 0.68, respectively. This is expected because these descriptors are all derived from the sigma-profile and represent related aspects of molecular surface charge distribution and hydrogen-bonding tendency.
| | Atot | M1 | M2 | MHBD1 | MHBA1 | sres/R | ρ |
|---|---|---|---|---|---|---|---|
| Atot | 1.00 | −0.68 | −0.33 | −0.35 | −0.40 | −0.59 | −0.53 |
| M1 | −0.68 | 1.00 | 0.12 | 0.06 | 0.01 | 0.53 | 0.12 |
| M2 | −0.33 | 0.12 | 1.00 | 0.80 | 0.68 | −0.23 | 0.55 |
| MHBD1 | −0.35 | 0.06 | 0.80 | 1.00 | 0.64 | −0.20 | 0.54 |
| MHBA1 | −0.40 | 0.01 | 0.68 | 0.64 | 1.00 | −0.09 | 0.77 |
| sres/R | −0.59 | 0.53 | −0.23 | −0.20 | −0.09 | 1.00 | 0.12 |
| ρ | −0.53 | 0.12 | 0.55 | 0.54 | 0.77 | 0.12 | 1.00 |
A strong correlation is also observed between MHBA1 and density, with a correlation coefficient of 0.77. This suggests that compounds with stronger hydrogen-bond acceptor characteristics in the present dataset tend to be associated with higher-density conditions or molecular classes. However, having correlations among the inputs does not necessarily imply redundancy, because, for example, density is a thermodynamic-state variable, while MHBA1 is a molecular descriptor.
Notably, the dimensionless residual entropy, sres/R, shows weak to moderate correlations with most COSMO-SAC descriptors. Its strongest correlations are with Atot and M1, with coefficients of −0.59 and 0.53, respectively, while its correlations with M2, MHBD1, MHBA1, and density are relatively weak.
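The screening of Table 7 for strongly correlated input pairs can be expressed compactly in code. The sketch below computes a Pearson matrix from a data array with `numpy.corrcoef` and then lists the pairs whose absolute coefficient exceeds a threshold; the threshold of 0.6 and the function names are illustrative assumptions, while the matrix entries are copied from Table 7.

```python
import numpy as np

NAMES = ["Atot", "M1", "M2", "MHBD1", "MHBA1", "sres/R", "rho"]

# Pearson coefficients as reported in Table 7.
TABLE7 = np.array([
    [ 1.00, -0.68, -0.33, -0.35, -0.40, -0.59, -0.53],
    [-0.68,  1.00,  0.12,  0.06,  0.01,  0.53,  0.12],
    [-0.33,  0.12,  1.00,  0.80,  0.68, -0.23,  0.55],
    [-0.35,  0.06,  0.80,  1.00,  0.64, -0.20,  0.54],
    [-0.40,  0.01,  0.68,  0.64,  1.00, -0.09,  0.77],
    [-0.59,  0.53, -0.23, -0.20, -0.09,  1.00,  0.12],
    [-0.53,  0.12,  0.55,  0.54,  0.77,  0.12,  1.00],
])

def pearson_matrix(X):
    """Pearson correlation matrix of the columns of X (rows = data points)."""
    return np.corrcoef(np.asarray(X, dtype=float), rowvar=False)

def strong_pairs(r, names, threshold=0.6):
    """List (name_i, name_j, r_ij) for every pair with |r_ij| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs

for a, b, value in strong_pairs(TABLE7, NAMES):
    print(f"{a} / {b}: {value:+.2f}")
```

At this threshold the flagged pairs are the ones among the sigma-profile descriptors, Atot with M1, and MHBA1 with density, while sres/R is not flagged against any other input.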
To further evaluate the contribution of residual entropy to the predictive capability of the model, an ablation study was performed in which the dimensionless residual entropy (sres/R) was removed from the input set and the ANN was retrained using the same training/testing protocol.
As seen in Table 8, the resulting model exhibited a marked deterioration in predictive performance, particularly for the independent testing dataset, where the AARD increased from 15.89% to 51.69% and R2 decreased from 0.9763 to 0.8390. This result confirms that residual entropy provides essential thermodynamic information that cannot be fully replaced by the remaining molecular descriptors and density alone.
| Model | Train R2 | Train AARD% | Test R2 | Test AARD% |
|---|---|---|---|---|
| With sres/R | 0.9937 | 8.89 | 0.9763 | 15.89 |
| Without sres/R | 0.9491 | 29.60 | 0.8390 | 51.69 |
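The logic of such an ablation study is to retrain the same model with one input column removed and to compare test errors under an identical data split. The following sketch illustrates the procedure only; it substitutes an ordinary least-squares fit for the ANN so that it remains short and self-contained, and the helper names `fit_linear` and `ablation_rmse` are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept; a simple stand-in for the ANN."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda X_new: np.column_stack([X_new, np.ones(len(X_new))]) @ coef

def ablation_rmse(X_train, y_train, X_test, y_test, drop_col):
    """Test RMSE of the full model and of a model retrained without
    column `drop_col`, using the same train/test split for both."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    all_cols = list(range(X_train.shape[1]))
    kept_cols = [j for j in all_cols if j != drop_col]
    results = {}
    for label, cols in (("full", all_cols), ("ablated", kept_cols)):
        model = fit_linear(X_train[:, cols], y_train)
        residual = model(X_test[:, cols]) - y_test
        results[label] = float(np.sqrt(np.mean(residual ** 2)))
    return results
```

Keeping the split, the architecture and the training settings fixed is what allows the change in test error to be attributed to the removed input alone.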
Molecular descriptors derived from COSMO-SAC sigma-profiles, including the total surface area, polarity-related parameters, and hydrogen-bonding contributions, were employed alongside thermodynamic properties (dimensionless form of residual entropy and density) calculated using PCP-SAFT. These inputs enabled the ANN model to capture the underlying relationships governing diffusion behavior.
To ensure the model's robustness and generalization capability, the dataset was partitioned on a compound basis, with 45 compounds (1586 data points) used for training and 24 compounds (677 data points) reserved exclusively for independent testing. A systematic exploration of ANN architectures, activation functions, and training algorithms was conducted to identify the optimal model configuration. The selected architecture, consisting of two hidden layers with 14 and 7 neurons, demonstrated excellent predictive performance, achieving R2 values of 0.9937 and 0.9763 for the training and testing datasets, respectively, along with AARD values of 8.89% and 15.89%. The largest relative deviations were observed at very low diffusion coefficients, where small absolute differences lead to amplified percentage errors.
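A compound-based partition assigns all data points of a given compound to either the training or the testing subset, so that no test compound is seen during training. A minimal sketch is given below, assuming a random selection of test compounds by a target fraction of compounds; how the compounds were actually assigned in this work is not stated here, and the function name `split_by_compound` is hypothetical.

```python
import numpy as np

def split_by_compound(compound_ids, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) such that every compound appears
    in exactly one of the two subsets."""
    compound_ids = np.asarray(compound_ids)
    compounds = np.unique(compound_ids)
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(compounds)
    # At least one compound is always reserved for testing.
    n_test = max(1, int(round(test_fraction * len(compounds))))
    test_mask = np.isin(compound_ids, shuffled[:n_test])
    return np.where(~test_mask)[0], np.where(test_mask)[0]
```

Since compounds contribute different numbers of measurements, the fraction of data points in the test subset generally differs from the fraction of compounds, which is why both counts are worth reporting.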
Permutation-based sensitivity analysis indicates that the dimensionless residual entropy is the most influential variable in predicting self-diffusion coefficients, followed by total surface area and polarity-related descriptors. In contrast, hydrogen-bonding descriptors and density exhibit comparatively lower contributions. This highlights the dominant role of thermodynamic state representation, particularly residual entropy, in governing diffusion behavior. Such a finding underscores the effectiveness of integrating PCP-SAFT-derived thermodynamic properties with molecular descriptors, providing a robust basis for predictive modeling.
The strong predictive capability of the model highlights the effectiveness of integrating physically grounded thermodynamic inputs with data-driven approaches. The proposed PCP-SAFT + ANN framework provides a reliable and generalizable tool for estimating self-diffusion coefficients across a wide range of compounds and conditions. This approach can support process design, transport modeling, and simulation tasks where accurate diffusion data are required. Future work may extend this framework to multicomponent systems and incorporate it into process simulation platforms.
Supplementary information related to this study is available online to support the reproducibility and promote the transparency of the proposed model. It comprises three files. dimensionless_self_diffusion_ANN_2026.mat contains the trained ANN model, input standardization parameters, and model configuration metadata. Dimensionless_self_diffusion_Predictor.m is a MATLAB script that uses the trained ANN model to predict the self-diffusion coefficient of a chemical from user-supplied COSMO-SAC-derived molecular descriptors, the dimensionless residual entropy, and the molar density. COSMO_SAC_derived_molecular_descriptors.xlsx contains the COSMO-SAC-derived molecular descriptors for 69 chemicals. See DOI: https://doi.org/10.1039/d6cp01425a.
| This journal is © the Owner Societies 2026 |