Shiyi
Qin
^{a},
Shengli
Jiang
^{a},
Jianping
Li
^{a},
Prasanna
Balaprakash
^{bc},
Reid C.
Van Lehn
^{a} and
Victor M.
Zavala
*^{ab}
^{a}Department of Chemical and Biological Engineering, University of Wisconsin–Madison, 1415 Engineering Dr, Madison, WI 53706, USA. E-mail: victor.zavala@wisc.edu
^{b}Mathematics and Computer Science Division, Argonne National Laboratory, 9700 S Cass Ave, Lemont, IL 60439, USA
^{c}Leadership Computing Facility, Argonne National Laboratory, 9700 S Cass Ave, Lemont, IL 60439, USA
First published on 30th November 2022
Graph neural networks (GNNs) have been widely used for predicting molecular properties, especially for single molecules. However, when treating multi-component systems, GNNs have mostly used simple data representations (concatenation, averaging, or self-attention on features of individual components) that might fail to capture molecular interactions and potentially limit prediction accuracy. In this work, we propose a GNN architecture that captures molecular interactions in an explicit manner by combining atomic-level (local) graph convolution and molecular-level (global) message passing through a molecular interaction network. We tested the architecture (which we call SolvGNN) on a comprehensive phase equilibrium case study that aims to predict activity coefficients for a wide range of binary and ternary mixtures; we built this large dataset using the COnductor-like Screening MOdel for Real Solvation (COSMO-RS). We show that SolvGNN can predict composition-dependent activity coefficients with high accuracy and show that it outperforms a previously-developed GNN used for predicting only infinite-dilution activity coefficients. We performed counterfactual analysis on the SolvGNN model that allowed us to explore the impact of functional groups and composition on equilibrium behavior. We also used the SolvGNN model for the development of a computational framework that automatically creates phase diagrams for a diverse set of complex mixtures. All scripts needed to reproduce the results are shared as open-source code.
In a typical GNN architecture for prediction of molecular properties,^{36} the characteristics of the atoms and of the bonds are propagated based on the chemical structure of a single input molecule, followed by featuring embedding via nonlinear transformation. The embedded features are then fed to fully-connected layers to construct predictive models. GNNs have achieved better performance than conventional descriptor-based approaches in various benchmark datasets.^{37,38} When dealing with multiple components, several approaches have been devised; a typical way to encode multi-molecule information is to simply average or concatenate the features of individual molecules and to use these as system-level features for property inference with fully connected or attentive layers.^{14,15,19} Previous studies have also incorporated weighted sums or concatenation to take into account composition information when needed.^{19} However, these approaches do not capture molecular interactions in an explicit manner, which may limit the predictive power of GNNs for systems in which intermolecular interactions play an important role.
In this work, we present a GNN architecture that explicitly incorporates molecular interactions via the combination of atomic-level (local) graph convolution and molecular-level (global) message passing for property prediction of multi-component chemical systems. To connect local features with global features, we construct a molecular interaction network as an intermediate step. The molecular interaction network is a complete graph in which each node represents a molecule and each edge represents a hypothetical intermolecular interaction (e.g., hydrogen bonding). This representation serves as a physics-informed topological prior that aids feature extraction from multi-component systems. The composition information is also encoded in the architecture as additional node feature for the molecular interaction network. We hypothesize that, with this type of data representation and feature propagation guided by physical intuition, the proposed architecture may better model mixture properties while taking composition information into consideration.
We evaluate the proposed GNN architecture through a comprehensive case study on miscibility calculations for multi-component systems. We choose activity coefficients as the target thermodynamic property of interest, which measures the deviation of a liquid mixture from ideal solution behavior. Activity coefficients are one of the fundamental properties of a mixture and therefore can lead to the derivation of equilibrium conditions (e.g., phase diagrams), which are important in physical chemistry and engineering for understanding and optimizing chemical separations.^{39} Previous studies have developed ML-based methods to predict infinite-dilution activity coefficients for binary mixtures, including matrix completion on the activity coefficient matrix^{28,40} and multilayer perceptrons on the system descriptors.^{41} However, these methods did not account for molecular structural information directly, and the latter is limited to systems of water in ionic liquids. A more recent study by Medina et al.^{29} used GNN models to tackle this problem; this approach, however, used a data representation that involves a simple concatenation of individual graph features after local embedding (i.e., the GNN architecture does not explicitly captures intermolecular interactions). Furthermore, all these previous studies have focused on predicting infinite-dilution coefficients, which do not take into consideration composition information (this limits their use in more sophisticated thermodynamic predictions such as phase diagrams). To the best of our knowledge, GNNs have not been explored as a method to predict composition-dependent activity coefficients nor have they been extended to predict activity coefficients for systems with more than two components. The proposed GNN architecture is generalizable to multiple component systems and captures composition.
Through our case study, we demonstrate that the proposed GNN (which we call SolvGNN) outperforms prior architectures (that lack an explicit graph representation of molecular interactions) in terms of prediction accuracy. Our study leverages a large dataset that was developed using the COnductor-like Screening MOdel for Real Solvation (COSMO-RS). SolvGNN also enables better modeling of mixture compositions due to the incorporation of global message passing on the molecular interaction network with hydrogen bonding information. We also show that SolvGNN can be applied to both binary and ternary liquid-phase mixtures to predict composition-dependent activity coefficients. To interpret our SolvGNN predictions, we perform counterfactual analysis^{42} to identify the impact of functional groups on activity coefficients. To demonstrate the applicability of SolvGNN, we developed a framework that can automatically predict phase behavior for complex binary and ternary mixtures. Example outcomes of the framework, such as binary P–x–y diagrams, can be used to study solvent miscibility and to help identify azeotrope compositions to guide the design of targeted mixtures and chemical separations. We share scripts and datasets as open-source code to enable the reproduction of the results and to conduct benchmarks.
To visualize the coverage of the chemical space, the solvents were grouped into 22 categories based on a predefined list of functional groups (details are provided in the ESI Section 1†). The visualization is provided in Fig. 1a. The sampled binary mixtures are represented by connections between nodes. The number of solvents in each category and the number of sampled pairs are reflected by node size and edge thickness; this illustrates that our random mixture sampling covers a wide range of solvent pairs in different categories. We also visualized the chemical space of the solvents by performing a t-distributed stochastic neighbor embedding (t-SNE)^{44} dimensionality reduction technique on the Morgan fingerprints,^{9} also known as extended connectivity fingerprints^{10} in Fig. 1b. The 2D map from t-SNE shows separation between some solvent categories, such as nitriles and aromatics. However, because some solvents contain more than one identifiable functional group, they may potentially be grouped into another category. As a result, the clustering in a few other categories is less clear, but in general the scattered distribution here suggests the inclusion of diverse and complex chemical structures.
Fig. 1 Dataset visualization. (a) Solvent categories based on the primary functional group of individual solvents with examples. The categorization method is detailed in ESI.† The size of a node reflects the number of solvents in that category, and the thickness of an edge reflects the number of sampled pairs between two categories. (b) 2D map obtained by t-SNE dimensionality reduction^{44} applied to molecular fingerprints. |
We categorized the sampled binary and ternary mixtures based on whether each component in the mixture is polar or nonpolar (obtained from RDKit^{45}), as summarized in Table 1. We computed the percentage of each mixture type; this information was used for stratified sampling, which creates training/validation folds by preserving the percentage of samples for each mixture type (this ensure that the model learns different types of molecular interactions). Overall, most mixtures contain at least one polar component, indicating the presence of strong intermolecular interactions (e.g., dipole–dipole forces).
Mixture type | Count | Percentage | |
---|---|---|---|
Binary (280000) | Polar–polar (p–p) | 162547 | 58% |
Polar–nonpolar (p–n) | 101913 | 36% | |
Nonpolar–nonpolar (n–n) | 15540 | 6% | |
Ternary (160000) | Polar-polar–polar (p–p–p) | 71136 | 45% |
Polar–polar–nonpolar (p–p–n) | 66040 | 41% | |
Polar–nonpolar–nonpolar (p–n–n) | 20848 | 13% | |
Nonpolar–nonpolar–nonpolar (n–n–n) | 1976 | 1% |
COSMOtherm^{47} (version 2019), a software that implements COSMO-RS, was used to obtain composition-dependent activity coefficients for the individual components of each sampled mixture. Prior to COSMO-RS calculations, chemical structures were generated from CirPy (version 1.0.2), a Python library that serves as the interface for the Chemical Identifier Resolver (CIR);^{48} this searches the National Institutes of Health (NIH) database for the chemical structures and provides the optimized coordinates for the atoms. We next conducted DFT calculations using TURBOMOLE^{49} (version 7.5) at the BP-TZVP theory level with the Becke–Perdew (BP) functional and the resolution of identity approximation under ideal screening condition (ϵ^{∞}, COSMO continuum solvation model). A single-point calculation was then conducted with the def2-TZVPD basis set and fine cavity parameter to create the σ-profiles. Activity coefficients were then calculated given the σ-profiles of individual components, the mixture compositions in the liquid phase, and temperature (298 K).
As shown in Fig. 2, each input mixture is represented as molecular graphs of individual components with nodes v ∈ V, edges e ∈ E, and node feature matrix that encodes atom and bond information such as atom types and degrees.^{50} A local graph convolution^{51} was applied to each of the input molecular graphs, and the node features were updated through
(1) |
(2) |
We compared several approaches to capture molecular interactions. The first approach is illustrated in Fig. 2a and referred to as SolvCAT. In this approach, ’s undergo a feature concatenation with composition x to form a fixed-length latent feature vector. For a ternary system, for example, . The system-level feature vector is then sent to fully connected neural network layers for activity coefficient predictions. The second approach is illustrated in Fig. 2b and referred to as SolvGCN. In this approach, a molecular interaction network was constructed to explicitly simulate molecular interactions between the components in a system. The molecular interaction network is a complete graph where each node v_{mol} ∈ V_{mol} denotes a molecule, each edge e_{int} ∈ E_{int} denotes the existence of certain intermolecular interaction, and molecular-level node feature matrix:
(3) |
A global graph convolution is applied using the same updating rules as eqn (2); in this case, u_{mix} is obtained by concatenating the latent node features h_{mol}'s after global graph convolution.
The third approach is illustrated in Fig. 2c and referred to as SolvGNN. Building on SolvGCN, for this approach we developed a more informative representation of the molecular interaction network; we encoded hydrogen-bond (H-bond) information, one of the strongest form of dipole–dipole interactions, as the edge feature. For a ternary system, this feature is formulated as:
(4) |
(5) |
(6) |
In all three cases mentioned above, the embedded features after “intermolecular interactions” are sent to the fully connected readout layers for the final activity coefficient (γ_{i}) prediction.
Because the data sets contain a large number of binary or ternary mixtures at different compositions, it is computationally expensive to generate the corresponding molecular graphs for every training/validation instance. As a result, we designed our data loading and model training algorithm to lower the training time. Upon data set initiation, we generated and stored all 700 molecular graphs at once in a dictionary format. When a training/validation instance was passed to the algorithm, the molecular graphs were obtained from the dictionary using the index and only require simple manipulation (e.g., calculation for intermolecular H-bond) to form the desired mixture data. Doing so largely reduced redundant calculations and saved time (from days to a couple of hours).
We trained separate models for binary and ternary mixtures for simplicity and for a fair comparison between different GNN architectures. In the cases of SolvGNN and SolvGCN, the model architectures are the same (with the same number of learnable parameters) for binary and ternary mixtures given the permutation invariance nature of graph convolutions that perform node-level computation. However, in the case of SolvCAT, the model architectures vary for binary and ternary mixtures as molecule-level embeddings are concatenated together for the final inference, which results in a larger number of learnable parameters for ternary mixtures. Applying the same read-out layers to each molecule-level embedding can guarantee permutation invariance and interchangeability of binary/ternary inputs but worsens the model performance drastically, so we decided to keep the concatenation design choice for SolvCAT while augmenting the data through random permutation of component orders during training. Potentially, the binary and ternary data sets can be merged together as one data set with a single model trained to predict either case easily if we use SolvGCN or SolvGNN, but requires setting the features of one of the components to zero for binary mixtures and keeping the ternary architecture if we use SolvCAT, which may result in biased selection of the component to be masked. Therefore, the integration of binary and ternary mixtures is beyond the scope of the current project, and we would like to consider this as future work in the further development of the tool.
(7) |
(8) |
In this study, the saturation pressure P^{sat}_{i} for each component was obtained using the Antoine Equation with coefficients collected from the National Institute of Standards and Technology (NIST) via web scraping.^{56} We sampled the liquid-phase compositions x_{i} and calculated the equilibrium pressures P with the specified compositions at 298 K. For ternary systems, we computed phase behavior following the same method for the binary systems by sampling the mixture compositions followed by equilibrium pressure calculations.
The performance of the three GNN architectures was evaluated on the binary mixture data set by the cumulative frequency plot, as shown in Fig. 3a. The infinite dilution activity coefficients for these systems were included as extreme concentrations. More specifically, we assigned a mole fraction of 0 to the infinitely dilute component and a mole fraction of 1 to the other component (values were also reversed for each pair as well to capture both infinite dilution activity coefficients). In the cumulative frequency plot, the absolute errors of the natural logarithms of the activity coefficients, lnγ_{1} and lnγ_{2}, (between true and predicted values from CV) for each data point were first averaged, and the cumulative frequencies for the averaged error values were then plotted in the ascending order. Among the three GNN architectures, SolvGNN exhibits the best performance; specifically, it shows that almost 97% of the data points are predicted with an error of less than 0.1. SolvCAT performs slightly worse, with 91% of the data points falling within the 0.1 error range. SolvGCN shows the worst performance, with only around 45% of the data points predicted with an error less than 0.1. These observations are also supported by the mean absolute errors (MAEs), which are 0.03, 0.05, and 0.31 for SolvGNN, SolvCAT, and SolvGCN (respectively). We also performed the same experiments on the binary mixture data set without infinite dilution activity coefficients, and the results are comparable (R^{2} = 0.98, MAE = 0.03, RMSE = 0.08; see Fig. S1†). Additionally, we developed a baseline model using XGBoost (R^{2} = 0.64, MAE = 0.21, RMSE = 0.50; see Fig. S2†), which was substantially less accurate than SolvGNN. These results are detailed in the ESI.†
Fig. 3 Model comparison and parity plots for binary and ternary mixtures. (a) Cumulative frequency plots for the average lnγ_{i} errors for binary (black) and ternary (red) mixtures to compare SolvCAT, SolvGCN, and SolvGNN. Additionally, the parity plots for individual lnγ_{i}'s between the true (COSMO-RS) and predicted (SolvGNN) values from CV are displayed for binary (b) and ternary (c) mixtures. The points are colored by the type of mixtures defined in Table 1 based on polarity. |
The above results indicate that the inclusion of the global interaction network with H-bond information in SolvGNN provides an effective method for improving the prediction accuracy for activity coefficients. When H-bond information is excluded, the pure global graph convolution worsens the model performance, possibly due to the unbiased “averaging” without any physics-informed resemblance to intermolecular interactions. Additionally, when setting all the edge features to one in SolvGNN, the CV MAE was increased by 9% and the CV MSE was increased by 15%, suggesting the significance of the physics-informed edge features in the interaction network (more details in ESI†). The added model complexity of SolvGNN is also a decisive factor for the performance difference, since message passing enables and propagates edge features through an edge neural network. On the other hand, SolvCAT, despite the lack of explicit global graph convolution that depicts intermolecular interactions, still exhibits satisfactory predictive power. This is consistent with an earlier study,^{19} which has found that aggregation over latent features provides an effective approach to handle information of mixture composition. However, SolvCAT is not strictly permutation invariant to the order of the input components, even though the data were augmented by random order switching during training.
Fig. 3b shows the parity plot of lnγ_{i}'s from SolvGNN for binary mixtures. All the predictions shown are from the validation process yet still exhibit high accuracy, with average lnγ_{i} MAE being 0.03 and average lnγ_{i} RMSE being 0.10. The data points are colored by the mixture type defined earlier. In general, the values of lnγ for nonpolar–nonpolar interactions are close to 0 (ideal behavior) and have smaller MAE, while the values of lnγ for mixtures with polar components spread across the entire data range and have slightly larger MAE. With respect to composition, mixtures that are rich in one of the components (10%/90%) exhibit a slightly higher MAE (∼0.032), whereas the mixtures with equimolar components exhibit a relatively lower MAE (∼0.025). We also identified a couple of outliers in the plots; these mixtures contain amines with hydrogen-bonding solutes or solvents. The extreme lnγ values of these mixtures can be the result of limitations of COSMO-RS, which has been specifically noted to incorrectly simulate the interactions of secondary and tertiary amines when hydrogen-bond donors or acceptors are present in the system.^{57}
Besides the regular CV using stratified sampling that relies on the type of mixture, we also tested the generalizability of the SolvGNN using an alternative CV method. Here, for each CV fold, we trained the model on only two of the three mixture types (polar–polar, polar–nonpolar, or nonpolar–nonpolar; see Table 1) and validated the rest. Results have shown that, although the model could achieve similar training losses to the regular CV, the validation accuracy was reduced accordingly. For the case where we trained the model with polar–polar and polar-nonpolar mixtures (94% of the data set) while validating on nonpolar–nonpolar mixtures, the model demonstrated suitable transferability by comparable validation losses (MAE = 0.04). However, when we trained the model on only polar–polar and nonpolar–nonpolar samples (64%) while validating on polar-nonpolar samples, the validation MAE increased by 0.16, which indicates the distinct nature of polar–nonpolar interactions and suggests that it is non-trivial and therefore cannot be omitted during model training. Additionally, we examined the condition when the model was trained on only polar–nonpolar and nonpolar–nonpolar mixtures (42%) and validated on polar–polar mixtures. The convergence plot (Fig. S3†) indicates over-training with high validation losses; such behavior was expected because the training size is less than half of the data set, and the majority of the training samples lack H-bond acceptors or donors, which are present in most of the validation set. This result again suggests that polar–polar and polar–nonpolar mixtures, despite possessing strong intermolecular interactions such as H-bonding in both cases, are intrinsically different and therefore are both required in the training process. These results are in general agreement with chemical intuition.
To further demonstrate the generalizability of the proposed SolvGNN, we conducted two additional data splitting methods that enforce all validation data to be unseen systems or components. sFor both experiments, SolvGNN still outperforms other architectures and still exhibits strong predictive performance based on the parity plots (detailed in the ESI†).
For SolvGNN, comparable model accuracy was obtained even though the number of training/validation samples was reduced for the ternary mixture data set compared to the binary mixture data set, as shown in Fig. 3c. The CV R^{2}, MAE, and RMSE are similar to the results from binary systems, with corresponding values around 0.99, 0.03, and 0.10. When breaking down the predicted values based on the mixture type, we found that samples containing only nonpolar components tend to have smaller errors and systems containing only polar components have larger errors. Mixtures with both polar and non–polar components have MAEs and RMSEs lying somewhere in between the extremes. When grouping by composition, mixtures that are rich in one of the components tend to have slightly higher prediction errors than equimolar mixtures. This observation is consistent with model performance on binary mixtures without infinite dilution data and could be caused by the fact that the majority of the training data are not equimolar systems.
Overall, SolvGNN exhibited satisfactory performance in making predictions for activity coefficients of binary and ternary systems, given the advantage of explicitly including H-bond information (as a representative and primary intermolecule force) via global message passing on the molecular interaction network. To the best of our knowledge, this is the first time that such a graph-based architecture (permutation invariant to the component order) is used to make predictions for composition-dependent activity coefficients (compared to models that predict infinite-dilution activity coefficients only) and for ternary systems (compared to binary systems).
Model | MAE | SDEP | MSE | RMSE | MAPE | R ^{2} |
---|---|---|---|---|---|---|
Previous GNN^{29} | 3.91 | 26.73 | 729.69 | 27.01 | 22.66 | 0.82 |
SolvGNN | 3.25 | 19.52 | 391.45 | 19.79 | 11.40 | 0.89 |
% Difference | −17% | −27% | −46% | −27% | −50% | +9% |
In general, our results provide evidence that SolvGNN can be used to predict infinite-dilution activity coefficients in a satisfactory manner, thus illustrating that the architecture is versatile. Comparison of the evaluation metrics indicates that there is a benefit in including intermolecular interactions in the GNN architecture. These results also provide evidence that the SolvGNN architecture can be used to learn not only from simulation data (e.g., COSMO-RS) but also from experimental data.
Fig. 4 Counterfactual analysis. Type I (red) shows mixtures with the most similar structures but the most different activity coefficients from the base mixture whereas Type II (green) shows the opposite. The corresponding solvents are labeled on the 2D t-SNE map introduced in Fig. 1 to help illustrate similarity. |
On the other hand, Type II counterfactuals also reveal interesting trends to identify mixtures with dissimilar chemical structures but similar γ_{i}'s. When fixing one of the components, counterfactuals 4 and 5 both acquire an alternative component that is nonpolar. In both cases, one of the aromatic components is replaced by a non-aromatic structure as the result of the effort to minimize similarity, but since the replacement is also nonpolar, the mixtures exhibit near-ideal behavior as reflected by the activity coefficients. Lastly, when we relaxed the constraint and allowed both components to vary, counterfactual 6 picks out the mixture from the data set that shows two nonpolar yet unlike chemical structures with near-ideal behavior.
In general, the counterfactual analysis has shown coherent physical insights regarding how compositions and structural features may lead to variations in activity coefficient, and these findings in turn agree with our chemical understanding of mixture behavior. Such interpretation, especially Type II counterfactuals, can be used to apply SolvGNN to procedures such as the selection of a candidate good solvent for a desired solute. For example, counterfactuals could be used to identify an antisolvent given a known good solvent for a specific polymer for polymer recycling applications.^{58,59} The antisolvent is expected to be miscible with the solvent while immiscible with the polymer, and therefore counterfactual Type I may be identified as the candidate antisolvent.
Fig. 5 Example phase diagrams generated from SolvGNN. (a–c) P–x–y diagrams of three binary mixtures, each representing a different type of mixture (polar–polar, nonpolar–nonpolar, and polar–nonpolar). The equilibrium pressure is computed with modified Raoult's Law using the predicted activity coefficients from SolvGNN. The phase diagrams are compared with those generated from two other state-of-the-art tools, including COSMOtherm that implements COSMO-RS^{57} and Aspen Plus that implements UNIFAC^{60} (as well as other activity models^{61–65}). The vapor compositions (y_{i}) are represented as circles and liquid compositions (x_{i}) are represented as squares. (d–f) Predicted activity coefficients for individual components at different compositions from SolvGNN, COSMOtherm, and Aspen. “x” denotes activity coefficients at infinite dilution. In all phase diagram calculations, the lnγ_{i}'s are obtained by averaging the predictions of SolvGNN trained from each CV fold, and the standard deviations are visualized as the error bars. |
Fig. 5a–c includes representative example phase diagrams of polar–polar, nonpolar–nonpolar, and polar–nonpolar binary mixtures. At 298 K, a water–methanol mixture deviates positively from ideal solution behavior and shows higher equilibrium bubble pressure as a result of unfavorable unlike-molecule interactions. By contrast, a benzene–toluene mixture exhibits near-ideal behavior, as indicated by a bubble line that is almost linear, which suggests a homogeneous solution where molecular interactions between like and unlike components are viewed the same. Additionally, we showcase a cyclohexane-ethanol mixture that forms an azeotrope, which was successfully identified by SolvGNN, and the predicted azeotrope composition (x_{cyclohexane} ∼0.65) is consistent with the estimates from COSMOtherm (COSMO-RS) and Aspen Plus (UNIFAC). In all three cases, the predicted phase diagrams obtained by SolvGNN are consistent with the phase diagrams generated using COSMOtherm (COSMO-RS) or Aspen Plus (UNIFAC), and the MAE values in the equilibrium pressure range from 0.001 to 0.004 bar. We also observed that, compared to Aspen Plus (UNIFAC), SolvGNN tends to underestimate equilibrium pressure values, whereas COSMOtherm tends to overestimate these values. Upon inspecting the activity coefficients for the sample mixtures (Fig. 5d–f), we found that, although the activity coefficients were trained only on four compositions plus infinite dilution, SolvGNN was able to make relatively accurate predictions for compositions in a continuous space. We also compared experimental equilibrium data^{66} for cyclohexane-ethanol at similar temperatures for which data are available (293 K and 303 K) and found similar behavior and a similar azeotrope composition; these data as well as a few additional phase diagram examples along with their activity coefficient predictions are shown in ESI.†
Next, we computed vapor-liquid equilibrium (VLE) data for a ternary mixture of water–acetone–methyl isobutyl ketone (MIBK). Similar to the phase diagram calculations for binary systems, we sampled different liquid compositions and calculated the equilibrium pressures using modified Raoult's Law. For simplicity, we picked two pressures and computed corresponding liquid and vapor compositions; numerical comparisons between SolvGNN and COSMOtherm (COSMO-RS) are summarized in Table S4.† For the selected pressures, the predicted vapor-phase compositions (y_{i}) have an MAE around 0.02 when comparing SolvGNN predictions to COSMOtherm data.
In summary, we were able to create binary phase diagrams (at 298 K) with a full range of compositions using SolvGNN that was only trained on a few sampled input ratios. The provided framework has shown great potential for high-throughput screening of mixtures for use cases including azeotrope identification and non-ideal behavior investigation for liquid mixtures. Incorporating SolvGNN into such phase equilibrium calculations bypasses the need to identify functional groups with human expertise and obtain interaction parameters (as needed in UNIFAC) or to conduct DFT calculations (as needed in COSMO-RS), especially when the chemicals in a mixture are relatively uncommon. Moreover, this framework could be used in conjunction with open-source process models (e.g., BioSteam^{67}) as an addition to the existing computational models (e.g., UNIFAC) for generating thermodynamic data.
Compared to the current state-of-the-art approach for general activity coefficient estimations (e.g., UNIFAC and COSMO-RS), SolvGNN achieves comparable model performance and is easy to use without any additional calculations for missing parameters or DFT. We also benchmarked SolvGNN on the same experimental dataset that was used in an earlier study for developing a GNN that predicts only the infinite-dilution activity coefficients of binary mixtures;^{29} SolvGNN outperforms the previously developed GNN in almost all evaluation metrics, proving the importance to use prior knowledge (in this case explicit topological prior pertinent to intermolecular interactions) when designing GNN architectures. These findings demonstrate the ability of SolvGNN to learn from simulation (e.g., COSMO-RS) and experimental data.
Moreover, we provided an open-source computational tool for creating phase diagrams (P–x–y) using SolvGNN as an example to show its potential for real-world applications. The generated phase diagrams were consistent with those obtained from COSMOtherm and Aspen Plus (with the selection of UNIFAC as the thermodynamic method), which further illustrated the generalization ability of SolvGNN that was only trained on a minimal subset of composition cases. Besides phase diagrams, we provided algorithms to obtain counterfactuals to aid model interpretation, which may help extract physical insights that are less known and help design solvent mixtures.
The architecture and study can be expanded in a number of ways. For example, so far we have only obtained activity coefficients at room temperature, and thus SolvGNN does not have temperature dependence. However, obtaining temperature-dependent activity coefficients from COSMO-RS and re-training SolvGNN with an additional temperature variable would be a relatively trivial, given that the computational framework is in place. Another limitation for phase equilibrium predictions is related to the availability of Antoine coefficients; in circumstances where Antoine coefficients are missing e.g., no measurements for the substance or outside of the temperature range, we cannot compute the corresponding phase diagrams. A potential solution could be to develop another GNN architecture for Antoine coefficient predictions or expand the output dimension of our current SolvGNN to make these predictions. Additionally, since the model weights that are related to concentrations are not constrained, unphysical activity coefficient trend may be present from the current SolvGNN predictions. This issue may be addressed in future studies through conditioning constraints on the model weights or theory-infused network architecture. Furthermore, the presented counterfactual analysis only searches the chemical space within the data set, and therefore to obtain more meaningful results, we will adapt some of the more established chemical search algorithms^{42,68} that have been designed for single chemicals to the case of mixtures. Future studies will also explore the use of SolvGNN for other mixture properties and investigate different possible representations of intermolecular interactions (e.g., Lennard-Jones potentials as additional edge features or replacing edge features by molecular-level node features). We are also interested in using these types of architectures to design solvents that can selectively solubilize target molecules in combination with generative models.^{69–71}
Footnote |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2dd00045h |
This journal is © The Royal Society of Chemistry 2023 |