Benedikt Winter,a Philipp Rehner,a Timm Esper,b Johannes Schillinga and André Bardow*a
aEnergy and Process Systems Engineering, ETH Zurich, Switzerland. E-mail: abardow@ethz.ch
bInstitute of Thermodynamics and Thermal Process Engineering, University of Stuttgart, Germany
First published on 29th January 2025
A major bottleneck in developing sustainable processes and materials is a lack of property data. Recently, machine learning approaches have vastly improved previous methods for predicting molecular properties. However, these machine learning models are often not able to handle thermodynamic constraints adequately. In this work, we present a machine learning model based on natural language processing to predict pure-component parameters for the perturbed-chain statistical associating fluid theory (PC-SAFT) equation of state. The model is based on our previously proposed SMILES-to-Properties-Transformer (SPT). By incorporating PC-SAFT into the neural network architecture, the machine learning model is trained directly on experimental vapor pressure and liquid density data. Combining established physical modeling approaches with state-of-the-art machine learning methods enables high-accuracy predictions across a wide range of pressures and temperatures, while keeping the thermodynamic consistency of an equation of state like PC-SAFT. SPTPC-SAFT demonstrates exceptional prediction accuracy even for complex molecules with various functional groups, outperforming traditional group contribution methods by a factor of four in the mean average percentage deviation. Moreover, SPTPC-SAFT captures the behavior of stereoisomers without any special consideration. To facilitate the application of our model, we provide predicted PC-SAFT parameters of 13279 components, making PC-SAFT accessible to all researchers.
Over the years, the research on predicting molecular properties has led to many approaches based on, e.g., quantitative structure–property relationships (QSPRs),1,2 group contribution (GC) methods3–6 and quantum mechanics.7–9 However, many of these classical methods either have low accuracy, are limited to certain functional groups, or require large computational resources. As a recent addition to these approaches, machine learning methods have emerged as a powerful tool due to their ability to learn complex patterns and generalize from data, overcoming some of the shortcomings of the classical methods. Some recent examples of machine learning approaches include methods for the prediction of binary properties such as activity coefficients10–13 or a large range of pure component properties.14–17
However, the majority of recent machine learning approaches focus on singular properties, not a holistic description of a system. Thermodynamics teaches that equilibrium properties of fluids are not independent but rather related through an equation of state. Modern equations of state are expressed as a thermodynamic potential, usually the Helmholtz energy, as a function of its characteristic variables. All equilibrium properties are then available as partial derivatives of the thermodynamic potential. Equations of state can be broadly classified into three categories: (1) cubic equations of state (such as the Peng–Robinson18 and the Soave–Redlich–Kwong19 equation of state), (2) highly accurate reference equations for specific systems (including water,20 carbon dioxide,21 nitrogen,22 and natural gas components23), and (3) molecular equations of state (such as the SAFT family24–27). The main distinction among these categories lies in the data required for parameterization, with cubic equations of state necessitating the fewest parameters and reference equations of state demanding the most.
Parameterizing equations of state typically relies on experimental data, which is often unavailable for novel molecules or expensive to obtain from commercial databases or experiments. In the absence of experimental data, various predictive methods have been developed for equations of state, primarily based on GC methods.28,29 Since group contribution methods rely on a predefined set of functional groups and their respective contributions, those methods are limited to certain subsets of the molecular space and often struggle to predict the properties of more complex molecules accurately. Furthermore, capturing effects linked to isomers or more intricate intermolecular forces requires the definition of higher-order groups, for which adequate parametrization is more data-demanding,30 or fundamental improvements to the PC-SAFT theory.
Recently, machine learning (ML) methods have been developed to predict pure component parameters for equations of state. The focus has been on the perturbed-chain statistical associating fluid theory (PC-SAFT) equation of state developed by Gross and Sadowski.25 The ML models use as input either group counts,31 molecular fingerprints,32 or a variety of molecular descriptors.33 However, these methods are not trained directly on experimental property data but on previously fitted pure component parameters of PC-SAFT. This reliance on previously fitted pure component parameters vastly constrains the amount of available training data, thus likely limiting the applicability domain of these models. Moreover, small errors in predicted pure component parameters can have large effects on the final predicted fluid properties. Consequently, training machine learning models directly on experimental property data is preferred.
In previous work, we demonstrated how explicit physical equations could be integrated into a machine learning framework, using the NRTL equation as an example.34 However, integrating PC-SAFT into a machine learning framework presents two additional challenges: Firstly, PC-SAFT is not explicit in measurable properties like vapor pressures and liquid densities. Instead, vapor pressures and liquid densities have to be determined iteratively from partial derivatives of the Helmholtz energy, requiring a more sophisticated approach than a straightforward integration into the neural network. Secondly, the physical significance of the pure component parameters of PC-SAFT is the basis of its robust extrapolation, in particular to mixtures. Therefore, any predictive method should ensure that the obtained parameters retain their physical significance.
In this work, we present a natural language-based machine learning model for predicting pure component parameters of PC-SAFT trained directly on experimental data. For this purpose, the PC-SAFT equation of state is directly integrated into our previously proposed SMILES-to-Properties-Transformer (SPT).11,34 The resulting SPTPC-SAFT model exhibits high prediction performance, accurately predicting thermophysical properties for complex molecules with various functional groups. Remarkably, our model also correctly predicts the behavior of stereoisomers.
Fig. 1 illustrates the overall structure of the proposed SPTPC-SAFT model: first, molecules are represented as SMILES codes, which are fed into a natural language processing model that predicts the pure component parameters. These parameters are used within the PC-SAFT equation of state to compute vapor pressures psat at a given temperature and liquid densities ρL at a given temperature and pressure. To avoid assigning dipole moments and association parameters to non-polar or non-associating molecules, SPTPC-SAFT also predicts the likelihood that a component is associating (λassoc) or polar (λpolar), and molecules are only assigned association or polar parameters if they are predicted to be associating or polar. During model training, the PC-SAFT equation of state is incorporated into the forward and backward pass, allowing for the calculation of analytical gradients of the loss (target function) with respect to the model parameters. This integration enables us to train a machine learning model end-to-end on experimental data and not only on previously fitted parameters.
In the following sections, the model and training procedure of SPTPC-SAFT are described in detail: Section 2.1 introduces the architecture of the machine learning model and the integration of the PC-SAFT equation. Section 2.2 describes the data sources, data processing, and the definition of training and validation sets. In Section 2.3, we describe the selection of hyper-parameters and the training process of SPTPC-SAFT.
In the following, we present the SPTPC-SAFT architecture in three sections: input embedding (Section 2.1.1), multi-head attention (Section 2.1.2), and head (Section 2.1.3).
The input of SPTPC-SAFT consists of the SMILES codes representing the molecule of interest with special characters denoting the start of the sequence <SOS>, and the end of the sequence <EOS>. The remainder of the input sequence is filled up to a maximum sequence length nseq of 128 with padding <PAD>:
<SOS>, SMILES, <EOS>, <PAD>, …
To render the input string suitable for the machine learning model, the string is tokenized, breaking the sequence into tokens that can each be represented by a unique number. Generally, tokens may comprise multiple characters, but in this work, each token consists of a single character. The tokenization process for SMILES can be compared to assigning first-order groups in group contribution methods. The complete vocabulary containing all tokens can be found in the ESI Section 1.†
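As a concrete illustration, a character-level tokenizer of the kind described above can be sketched in a few lines of Python; the vocabulary below is an illustrative subset, not the model's actual token set:

```python
# Minimal character-level SMILES tokenizer in the spirit described above.
# The vocabulary is an illustrative subset, not the model's actual one.
VOCAB = ["<PAD>", "<SOS>", "<EOS>"] + list("CNOSclnos()[]=#123456789@/\\+-")
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def tokenize(smiles: str, n_seq: int = 128) -> list[int]:
    """Wrap a SMILES string in <SOS>/<EOS>, split it into single
    characters, map each token to an integer id, and pad the sequence
    to the fixed length n_seq."""
    tokens = ["<SOS>"] + list(smiles) + ["<EOS>"]
    tokens += ["<PAD>"] * (n_seq - len(tokens))
    return [TOKEN_TO_ID[tok] for tok in tokens]

ids = tokenize("CCO")  # ethanol
```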
The input sequence undergoes one-hot encoding, where each token is represented by a learned vector of size nemb = 512. An input matrix of size nemb × nseq is generated by concatenating the vectors representing the tokens of the input sequence. After encoding the input sequence, an additional vector is appended to the right of the input matrix, which holds a linear projection of continuous variables into the embedding space. In the case of the original SPT model,11 temperature information is encoded in this vector. In SPTPC-SAFT, no continuous variables are supplied here, as temperature and pressure information is only introduced in the final stage (see Fig. 2), and thus, the continuous variable vector only contains zeros. After adding the continuous variables, the resulting input matrix has a size of nemb × nseq + 1. Subsequently, a learned positional encoding, which contains a learned embedding for each position, of size nemb × nseq + 1 is added to the input matrix. At this stage, the input matrix contains information on all atoms and bonds in the molecule and their positions. However, each token lacks information about its surroundings, as no information has been exchanged between tokens yet. This information sharing between tokens is discussed in the following multi-head attention section.
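The embedding assembly can be sketched as follows. Dimensions are reduced for illustration, rows hold tokens rather than columns, and the parameter matrices are random stand-ins for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_emb, n_seq, n_vocab = 8, 6, 32   # reduced from 512/128 for illustration

# Learned parameters (random stand-ins): token embeddings and a learned
# positional encoding covering the sequence plus the continuous-variable slot.
tok_emb = rng.standard_normal((n_vocab, n_emb))
pos_emb = rng.standard_normal((n_seq + 1, n_emb))

token_ids = np.array([1, 3, 3, 5, 2, 0])   # e.g. <SOS> C C O <EOS> <PAD>
x = tok_emb[token_ids]                      # (n_seq, n_emb); equivalent to
                                            # one-hot @ tok_emb
cont = np.zeros((1, n_emb))                 # continuous-variable vector,
                                            # all zeros in SPTPC-SAFT
x = np.concatenate([x, cont], axis=0)       # (n_seq + 1, n_emb)
x = x + pos_emb                             # add positional information
```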
For a more comprehensive and visual explanation, readers are directed to the blog of Alammar46 or the comprehensive description in the ESI of our previous work.34
First, the pure component parameters of PC-SAFT have inherent physical meaning, and preserving this physical meaning cannot be guaranteed in a simple regression model. Second, the target properties used for training the model, i.e., vapor pressures and liquid densities, are not direct outputs of PC-SAFT; instead, these target properties must be iteratively converged. While software packages are available that provide robust computations of bulk and phase equilibrium properties with PC-SAFT,47 it is crucial to ensure that the neural network maintains an intact computational graph to allow the network to obtain a derivative of the target value with respect to all model parameters. An intact computational graph can be ensured when all calculations are conducted within a consistent framework like PyTorch.
Fig. 3 Head section of the model. The natural language processing section of the SPT model returns a vector of length 8. This vector contains six auxiliary pure component parameters of PC-SAFT (m̃, σ̃, ε̃, ε̃AB, κ̃AB, μ̃) and the auxiliary association and polarity likelihood parameters (Ã, P̃), from which the pure component parameters ϕ and the likelihoods λassoc and λpolar are computed and passed to the PC-SAFT equation of state.
After leaving the multi-head attention section, the model has an output of size nemb × nseq. To reduce the dimensionality, a max function is first applied across the sequence dimension, resulting in a vector of size nemb × 1. Afterward, a linear layer projects this vector to a vector of auxiliary parameters of size 8, which contains the auxiliary pure component parameters of PC-SAFT ϕ̃ = [m̃ σ̃ ε̃ ε̃AB κ̃AB μ̃] and the auxiliary association and polarity likelihood parameters (Ã, P̃). From the auxiliary parameters ϕ̃, the pure component parameters of PC-SAFT ϕ are calculated using the following equation:
ϕ = Λ ∘ ϕmean ∘ exp(ϕ̃) | (1)
Here, ϕmean is an externally set hyperparameter determined via a hyperparameter scan. The auxiliary parameters ensure that reasonable values for the pure component parameters of PC-SAFT are reached at the beginning of the training, when the elements of ϕ̃ can be expected to be small values around 0, effectively serving as starting values for the model. Properly setting the ϕmean parameters ensures quicker convergence. The factor Λ = [1 1 1 λassoc λassoc (1 − λassoc)λpolar] is used to activate or deactivate the association parameters and the dipole moment using the association and polarity likelihoods λassoc and λpolar. To calculate the likelihoods, the auxiliary likelihood parameters Ã and P̃ are passed through a sigmoid function that normalizes them between 0 and 1:
λassoc = 1/(1 + exp(−Ã)) | (2)
λpolar = 1/(1 + exp(−P̃)) | (3)
For associating molecules, we assume that the association contribution dominates the polar contribution. Thus, the dipole moment parameter is set to 0 by multiplying with (1 − λassoc)λpolar.
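A minimal sketch of this head transform, assuming the elementwise form of eqn (1) with the gate vector Λ and the sigmoid likelihoods of eqn (2) and (3); the ϕmean values are taken from Table 1 and ordered as [m, σ, ε, εAB, κAB, μ]:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pcsaft_parameters(phi_tilde, A, P, phi_mean):
    """Map auxiliary network outputs to PC-SAFT parameters
    phi = [m, sigma, eps, epsAB, kappaAB, mu] via eqn (1)-(3)."""
    lam_assoc = sigmoid(A)   # association likelihood, eqn (2)
    lam_polar = sigmoid(P)   # polarity likelihood, eqn (3)
    # Gate vector Lambda: chain parameters always active; association
    # parameters scaled by lam_assoc; dipole moment only active for
    # non-associating but polar molecules.
    Lam = np.array([1.0, 1.0, 1.0, lam_assoc, lam_assoc,
                    (1.0 - lam_assoc) * lam_polar])
    return Lam * phi_mean * np.exp(phi_tilde)

# ordered [m, sigma, eps, epsAB, kappaAB, mu], values from Table 1
phi_mean = np.array([2.0, 5.0, 300.0, 1500.0, 0.005, 3.0])
# At the start of training phi_tilde ~ 0, so exp(phi_tilde) ~ 1 and the
# prediction falls back to the gated phi_mean values.
phi = pcsaft_parameters(np.zeros(6), A=10.0, P=0.0, phi_mean=phi_mean)
```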
The parameters ϕ = [m σ ε εAB κAB μ] are then passed into the PC-SAFT equation of state to compute either saturation pressures psat or liquid densities ρL. The resulting vapor pressures and liquid densities are subsequently passed into the target function along with the associating and polar likelihood λassoc and λpolar, respectively. Including the likelihoods in the target function helps with distinguishing between different intermolecular interactions. A more comprehensive assessment of the strength of different intermolecular interactions and more general association schemes beyond 2B require, in our view, the integration of mixture data into the parameter prediction (cf. ref. 48).
However, the pure component vapor pressure is not directly accessible via a derivative of the Helmholtz energy. Instead, the pure component vapor pressure is implicitly defined as the solution of three nonlinear equations,
μ(T,ρV) = μ(T,ρL) | (4) |
p(T,ρV) = psat | (5) |
p(T,ρL) = psat | (6) |
In general, the derivatives of an implicitly defined function x(ϕ) that depends on parameters ϕ via f(x, ϕ) = 0 can be found by calculating a single step of a Newton iteration starting from an already converged solution x*:
x(ϕ) = x* − (∂f/∂x|x*)⁻¹ f(x*, ϕ) | (7)
Applying the concept to the calculation of liquid densities leads to:
ρL(T, p, ϕ) = ρ* − (∂p/∂ρ|T,ρ*)⁻¹ (p(T, ρ*, ϕ) − p) | (8)
For the vapor pressures, after solving the system of three equations shown above, the last Newton step is:
psat(T, ϕ) = p* − (μ(T, ρV*) − μ(T, ρL*))/(1/ρV* − 1/ρL*) | (9)
Implementing eqn (8) and (9) in the neural network ensures a fully connected computational graph that PyTorch can use to evaluate derivatives of the loss function, while still allowing efficient external routines to converge the states. While we developed this method for equations of state, it could also be applied to a wider range of problems where parameters of implicit equations have to be determined using neural networks.
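The trick of eqn (7) can be demonstrated on a toy problem, here f(x, ϕ) = x² − ϕ = 0, solved externally (detached from the graph) and reconnected with a single Newton step; PyTorch then recovers the analytic derivative dx/dϕ = 1/(2√ϕ):

```python
import math
import torch

def solve_detached(phi):
    # External, non-differentiable solver; stands in for a robust
    # equation-of-state routine that converges the state outside the graph.
    return math.sqrt(phi.item())

phi = torch.tensor(4.0, requires_grad=True)
x_star = torch.tensor(solve_detached(phi))  # converged root of f(x, phi) = x**2 - phi
f = x_star**2 - phi                         # residual, numerically ~0 at x*
dfdx = 2.0 * x_star                         # df/dx evaluated at x*
x = x_star - f / dfdx                       # one Newton step reconnects the graph
x.backward()                                # phi.grad == 1/(2*sqrt(phi)) == 0.25
```

Because the residual is (numerically) zero at the converged solution, the Newton step leaves the value of x unchanged while attaching the correct sensitivity with respect to ϕ.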
From this large data collection, all molecules that do not contain at least one carbon atom are removed, as well as most metal complexes except silicon-containing compounds. The remaining data is then split into two sets depending on data quality: the clean and the remaining dataset. The clean dataset contains molecules that have already been used for the fitting of pure component parameters of PC-SAFT by Esper et al.53 and comprises 1103 components, 189504 vapor pressure data points, and 282642 liquid density data points. The pressure data in the clean dataset have undergone a significant effort to eliminate outliers.53 Only data from the clean dataset is used for validation.

The remaining dataset includes the data of the aforementioned databases that is not suitable for directly fitting pure component PC-SAFT parameters, as not sufficiently many vapor pressures and liquid densities are available for a given component. However, this data can still be used in SPTPC-SAFT due to the end-to-end training approach. The remaining dataset has a lower data quality than the clean dataset but contains a larger variety of molecules. Several steps were conducted to clean the remaining dataset: first, all data points at a vapor pressure of 1.0 ± 0.1 bar at 298.15 ± 1.00 K are excluded, as these appear to be erroneously entered data points. Then, we removed data points that could not be fitted using PC-SAFT. To identify these data points, we trained eight SPTPC-SAFT models on the clean and remaining data for 15 epochs using a SmoothL1 loss, thus giving less weight to outliers than an MSE loss. Eight models were used for convenience, since eight GPUs were available to us, while providing a good compromise between robustness and performance. Afterward, we removed all data points from the remaining dataset with a training loss larger than 0.5. In total, 21456 of 233988 data points were removed from the remaining data. Fig. S3 in the ESI† illustrates typical examples of errors identified using our data-cleaning method. Manual review of the removed data points showed that mostly unreasonable-looking data points were removed from the remaining data. The large deviations can be attributed either to scattering of the experimental data, especially at low pressures, or to systematic deviations, due either to limitations of PC-SAFT or to erroneously reported experimental results. Overall, 160186 data points for vapor pressure and 52343 data points for liquid densities remain in the dataset, covering 12019 and 2067 molecules, respectively.
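The two screening steps can be sketched as a simple filter; the record fields (p, T, loss) and the example values are illustrative, with the thresholds taken from the text:

```python
def clean_remaining(points):
    """Apply the two screening steps described above to a list of
    vapor-pressure records. Field names are illustrative: p in bar,
    T in K, loss = SmoothL1 training loss from the preliminary models."""
    kept = []
    for pt in points:
        # Step 1: drop suspicious entries at ~1.0 bar and ~298.15 K
        if abs(pt["p"] - 1.0) <= 0.1 and abs(pt["T"] - 298.15) <= 1.0:
            continue
        # Step 2: drop points the preliminary models could not fit
        if pt["loss"] > 0.5:
            continue
        kept.append(pt)
    return kept

data = [
    {"p": 1.0, "T": 298.2, "loss": 0.1},   # removed by step 1
    {"p": 5.0, "T": 350.0, "loss": 0.9},   # removed by step 2
    {"p": 5.0, "T": 350.0, "loss": 0.1},   # kept
]
kept = clean_remaining(data)
```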
As our model was employed to clean the remaining data, it is important to note that the remaining dataset is solely used for training the model and not for any form of model validation. For model validation, only the clean dataset is used.53 Thereby, we ensure that our model's performance evaluation is based on reliable and high-quality data and unbiased by our data cleaning steps.
Some of the molecules in the training data are isomers such as cis-2-butene and trans-2-butene. SPT uses isomeric SMILES codes and can thus distinguish between the cis and trans versions of molecules. However, for some isomeric molecules, our training data also contains data labeled only with the non-isomeric SMILES. In these cases, the data represents either one unknown isomer, a mixture of isomers with very similar properties, or mislabeled data of two differently behaving isomers. To avoid ambiguities, we dropped any data related to non-isomeric SMILES codes for components for which isomeric SMILES are present.
To train the model to recognize if a component is associating or polar, the training data is labeled. To label molecules as associating or polar, we use the following approaches: for associating components, we use RDKit to identify molecules with at least one hydrogen bond donor site and one hydrogen bond acceptor site.45 Components that meet this criterion are labeled as associating. To label molecules as polar, a consistent database of dipole information is needed. Here, we use the COSMO-Therm database 2020, where the dipole moment is available for 12182 molecules in the energy files. If the dipole moment is above 0.35 D, the molecule is labeled as polar. The limit is set semi-arbitrarily by looking at molecules close to the limit and judging whether they are polar. Examples of molecules around this polarity threshold are shown in Fig. 4. If a component in the training data is unavailable in the COSMO-Therm database, its polarity likelihood is masked in the loss function and thus ignored during training. We thus only train the polarity classifier on the subset of molecules with known polarity. Polarity information is available for around 95% of all molecules in the clean dataset and 50% of the molecules in the remaining dataset.
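The associating label can be obtained with RDKit's Lipinski descriptors; a sketch of the criterion described above:

```python
from rdkit import Chem
from rdkit.Chem import Lipinski

def is_associating(smiles: str) -> bool:
    """Label a molecule as associating if it has at least one
    hydrogen-bond donor and one hydrogen-bond acceptor site."""
    mol = Chem.MolFromSmiles(smiles)
    return Lipinski.NumHDonors(mol) >= 1 and Lipinski.NumHAcceptors(mol) >= 1

# ethanol can both donate and accept hydrogen bonds; butane can do neither
labels = {s: is_associating(s) for s in ["CCO", "CCCC"]}
```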
However, we impose certain restrictions on the data used for validation. Only components with at least three carbon atoms are included in the validation set, as extrapolation from larger molecules towards very small molecules, such as methane and carbon dioxide, works poorly and the space of small molecules is already well-explored experimentally. Thus, pure component parameters of PC-SAFT are generally available for small molecules.53 Additionally, structural isomers are treated as one component with respect to training/validation splits. Therefore, if the trans version of a molecule is in the validation set, the cis version is also included in the validation set, and vice versa. The same workflow is applied for enantiomers.
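The isomer-aware splitting can be sketched with a crude grouping key that strips SMILES stereo markers so that cis/trans pairs and enantiomers always land on the same side of the split; a robust implementation would instead canonicalize with RDKit (isomericSmiles=False). The helper names are our own:

```python
import random

def stereo_family(smiles: str) -> str:
    """Crude grouping key that strips SMILES stereo markers (/, \\, @)
    so that cis/trans pairs and enantiomers share one key."""
    return smiles.replace("/", "").replace("\\", "").replace("@", "")

def split_by_family(smiles_list, val_fraction=0.2, seed=0):
    """Assign whole isomer families to either train or validation."""
    families = sorted({stereo_family(s) for s in smiles_list})
    rng = random.Random(seed)
    rng.shuffle(families)
    n_val = max(1, int(val_fraction * len(families)))
    val_keys = set(families[:n_val])
    train = [s for s in smiles_list if stereo_family(s) not in val_keys]
    val = [s for s in smiles_list if stereo_family(s) in val_keys]
    return train, val

mols = ["C/C=C/CCC", "C/C=C\\CCC", "CCCCO", "CCCCCC"]
train, val = split_by_family(mols)
```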
In previous work, it was demonstrated that the prediction error of molecular properties tends to exhibit a roughly log-linear relationship with the amount of training data for the prediction of activity coefficients.11 Although it would be interesting to explore similar data scaling for PC-SAFT, the significant computational resources required are beyond the scope of this paper.
To identify good values for ϕmean, we generated a synthetic training dataset with 1494 pure component parameters of PC-SAFT from the work of Esper et al.53 and used these parameters to calculate 100 pressure and density values. To validate our model's performance, we reserved 5% of the components as a separate validation set. Over this set, a scan was conducted using the parameter values listed in Table 1, and the set of parameters leading to the lowest loss on this validation set was chosen.
Parameter | m | σ/Å | ε/k/K | μ/D | κAB | εAB/k/K
---|---|---|---|---|---|---
ϕmean | 2 | 5 | 300 | 3 | 0.005 | 1500
During the hyperparameter scan, we found that values for ϕmean that overestimate the critical point help with the convergence. The overestimation ensures that most calculations return valid results in the initial stages of the model training, speeding up the training and avoiding divergence of the model. Vapor pressure data for temperatures above the predicted critical point are excluded from the calculation of the loss function to avoid poisoning the gradients with NaN values. This treatment is particularly relevant at the beginning of the training, where deviations are large. For highly converged models, failures in the calculation of vapor pressures are unlikely due to PC-SAFT's inherent tendency to overestimate critical points.
The training was performed on 4 RTX-3090s using a learning rate of 10−4 and 50 epochs. Training takes about 10 h for 8 training/validation splits running two models per GPU in parallel.
APD = 1/N ∑i |xipred − xiexp|/xiexp | (10)
Fig. 5 illustrates additionally how the APD translates into pressure-temperature (p/T) plots and demonstrates the diverse set of molecules for which SPTPC-SAFT can account. These examples are cyclohexylamine with an APD of 2%, ethyl cyanoacetate with an APD of 9%, octamethyl-1,3,5,7,2,4,6,8-tetraoxatetrasilocane with an APD of 19%, and triacetin with an APD of 51%.
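Assuming the APD is the mean absolute relative deviation with respect to the experimental values, it can be computed as:

```python
import numpy as np

def apd(predicted, experimental):
    """Average percentage deviation between predicted and experimental
    values, expressed as a fraction relative to experiment."""
    predicted = np.asarray(predicted, dtype=float)
    experimental = np.asarray(experimental, dtype=float)
    return float(np.mean(np.abs((predicted - experimental) / experimental)))

# toy vapor-pressure values in kPa (illustrative numbers only)
deviation = apd([102.0, 49.0], [100.0, 50.0])  # 2% mean deviation
```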
The relationship between APD, molecule size, and vapor pressure range is further illustrated in Fig. 6, which displays the APD in vapor pressure prediction as a function of the number of heavy atoms and pressure. A region of relatively low APD is achieved for molecules containing between 4 and 20 heavy atoms within a vapor pressure range of 1 kPa to 100 MPa. In contrast, high deviation predominantly occurs at the edges of the data space, particularly for large molecules at low pressures. This behavior might be due to a lower density of data and higher uncertainty when measuring low-pressure systems.
Fig. 6 Average percentage deviation in vapor pressure as a function of experimental vapor pressure and the number of heavy atoms in the molecules. Deviations larger than 0.5 are truncated at 0.5. |
In Fig. 7, the relationship between the APD and molecular families is explored. The classification of the molecular families is based on the DIPPR database,51 which contains families for 609 out of the 870 components in the validation set. Molecules not assigned to a family are excluded from this analysis. A noticeable correlation is obtained between the expected prediction error and the molecular families. Notably, molecular families composed solely of oxygen and carbon exhibit above-average prediction accuracy. In contrast, fluorinated, halogenated (bromine and iodine), and particularly nitrogen-containing compounds present challenges in prediction. A comprehensive list of the validation set, categorized by molecular group, can be found in the ESI.† Overall, SPTPC-SAFT performs well for the majority of molecular families.
Fig. 7 Average percentage deviation in vapor pressure as a function of the molecular family. Molecular families are assigned according to the DIPPR database.51 Of the 870 components in the validation set, 609 components could be assigned a molecular family. Green boxes show families with a median APD more than 2.5% below the overall mean APD of 13.5%; red boxes show families with a median APD more than 2.5% above the overall mean APD.
The APD in liquid density is generally lower than the deviation in vapor pressure. A comparison of the numerical values for the two quantities is difficult due to the different range and quality of the data. The trend is in line with the general behavior of PC-SAFT, as demonstrated by the large-scale parameterization of Esper et al.53 For densities, our SPTPC-SAFT model achieves a mean APD of 3.1%. Predicted liquid densities at 1 bar are shown for a range of alkanes and alcohols in Fig. 8, generally demonstrating a good agreement with the measured data.
Fig. 8 Prediction of molar density of C4 to C10 alkanes (a) and alcohols (b) at 1 bar over a range of temperatures using SPTPC-SAFT (lines). Experimental data (crosses) are taken from the DDB. |
Name | SMILES | m | σ/Å | ε/k/K | μ/D | κAB | εAB/k/K
---|---|---|---|---|---|---|---
Butane | CCCC | 2.3 | 3.7 | 224 | | |
Hexane | CCCCCC | 2.9 | 3.9 | 244 | | |
Octane | CCCCCCCC | 3.6 | 3.9 | 248 | | |
1-Butanol | CCCCO | 3.2 | 3.5 | 247 | | 0.006 | 2409
1-Hexanol | CCCCCCO | 3.7 | 3.6 | 258 | | 0.005 | 2498
1-Ethoxypentane | CCCCCOCC | 3.9 | 3.7 | 236 | 2.5 | |
Diethoxymethane | CCOCOCC | 3.6 | 3.5 | 231 | | |
The ESI† presents the receiver operating characteristic (ROC) curves of the association and polarity likelihood parameters, illustrating the trade-off between true positives and false positives. SPTPC-SAFT achieves a 100% true positive rate for associating molecules and approximately a 90% true positive rate for polarity. Given that polarity is a continuous spectrum onto which we impose a binary classification, a 100% true positive rate is not expected. Overall, our model architecture enables SPTPC-SAFT to learn when molecules exhibit associating or polar interactions and to assign appropriate pure component parameters.
The comparison between SPTPC-SAFT and GC-Sauer on the two sets of molecules indicates a substantial difference between the performance of the GC-Sauer and SPTPC-SAFT methods when extrapolating beyond the interpolation set (Fig. 9): while the GC method performs decently within the interpolation set, with a mean APD of 12.8% compared to 7.3% for SPTPC-SAFT for the vapor pressure, it falls short when extrapolating to more complex molecules, resulting in a much larger mean APD of 48.0% compared to 11.1% for SPTPC-SAFT. Similar performance benefits are observed for SPTPC-SAFT in predicting liquid densities: for the interpolation set, SPTPC-SAFT has a mean APD of 4.0% compared to 6.4% for GC-Sauer and, for the extrapolation set, 3.5% compared to 11.9% for GC-Sauer.
Our results demonstrate that the much simpler GC method of Sauer et al.6 performs reasonably well for molecules similar or identical to those it was parameterized on, but its extrapolation capabilities are limited for more complex molecules. To cover a more comprehensive molecular space without manually defining an extensive set of (potentially higher-order) groups, an approach that captures the complexity of molecules, like SPTPC-SAFT, is required.
Compared to the recently published methods by Felton et al.33 and Habicht et al.,32 SPTPC-SAFT performs favorably, although, since no consistent validation set is used across the studies, this comparison carries some uncertainty. Felton et al.33 report an average relative percentage error in vapor pressure of 39% based on a dataset similar to our clean dataset, compared to a mean APD of 13.5% for SPTPC-SAFT. Habicht et al.32 report average relative percentage deviations below 20% for many molecular families, however limited to non-polar, non-associating molecules, for which SPTPC-SAFT has a mean deviation of 10%. The better performance of SPTPC-SAFT likely stems from the direct training on experimental data rather than on previously fitted PC-SAFT parameters. Thus, SPTPC-SAFT is able to use a larger amount of data points and avoids error accumulation through the additional regression step.
For four example isomer pairs, i.e., the cis and trans isomers of 1,1,1,4,4,4-hexafluorobutene, stilbene, 2-hexene, and 2-hexenedinitrile, the predicted vapor pressure is shown in Fig. 10. Due to their different polarity, the isomers of 1,1,1,4,4,4-hexafluorobutene and stilbene have measurably different vapor pressures. SPTPC-SAFT is able to predict the trend in vapor pressures, which is remarkable considering that the majority of isomers in the training data are similar to 2-hexene, which shows no significant difference between the two isomers. However, 2-hexenedinitrile presents a challenge for the model, as it fails to distinguish between isomers even though there is a difference in vapor pressure between the cis and trans versions. When and why SPTPC-SAFT fails in distinguishing specific isomers should be subject to further research. We observed some instances within our training data of likely mislabeling between isomers, which may impede the model's performance. Overall, the results concerning stereoisomer differentiation are encouraging, but more and better data on stereoisomers is required to unlock the full capability of the model.
Fig. 10 Pressure–temperature plots of the isomer pairs (a) 1,1,1,4,4,4-hexafluorobutene, (b) stilbene, (c) 2-hexene and (d) 2-hexenedinitrile.
By making these pre-computed pure component parameters available, we aim to facilitate broader adoption and utilization of the PC-SAFT equation of state across various applications and allow for exploring vast molecular spaces.
Our model demonstrates excellent predictive performance on a validation set of 870 components, achieving a mean APD of 13.5% for vapor pressures and 3% for liquid densities. Remarkably, 99.6% of the predictions fall within a factor of 2, indicating a minimal presence of outliers.
Compared to the homo-segmented group contribution method for PC-SAFT by Sauer et al.,6 our SPTPC-SAFT model provides significantly higher quality predictions for both vapor pressures and liquid densities and compares favorably to more recent ML models. In particular, for more complex molecules, the prediction accuracy of SPTPC-SAFT is four times higher than that of the group contribution method. Moreover, our model can differentiate between stereoisomers, highlighting its potential for accurately predicting properties governed by subtle molecular effects. We believe that SPTPC-SAFT offers a versatile and robust approach for predicting equilibrium thermodynamic properties and the corresponding pure component parameters of PC-SAFT, allowing for applications in thermodynamics, process engineering, and material science.
However, the current formulation for the prediction of dipole moments only allows assigning dipole moments on a physical basis, but not predicting their magnitude. Furthermore, a more in-depth study of the relationship between the amount and quality of the training data and the final prediction quality, as well as of the uncertainty of the predictions with respect to the data, is still lacking and will be part of future research.
To make our model more accessible to researchers and industry professionals, we have precomputed pure component parameters of PC-SAFT for a large number of components.
The SPTPC-SAFT model presents a significant advancement in the prediction of equilibrium properties and corresponding pure component parameters of PC-SAFT. By leveraging machine learning techniques, our model offers improved accuracy in predicting the properties of various molecules while being capable of handling complex molecular structures and subtle differences in isomers. The availability of precomputed pure component parameters of PC-SAFT will further facilitate the adoption of our model and enable its use in a broad range of research and industry applications.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4dd00077c |
This journal is © The Royal Society of Chemistry 2025 |