Yunsung Lim and Jihan Kim*
Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology (KAIST), 291, Daehak-ro, Yuseong-gu, Daejeon 34141, Republic of Korea. E-mail: jihankim@kaist.ac.kr
First published on 21st June 2022
Transfer learning (TL) enables a model to learn effectively from small amounts of data by sharing knowledge from a model pre-trained on a relatively large dataset. In this work, we applied TL to test whether knowledge gained from methane adsorption properties can improve a model that predicts methane diffusion properties within metal–organic frameworks (MOFs). Because Monte Carlo (MC) simulations of gas molecules in MOFs are far cheaper computationally than molecular dynamics (MD) simulations, the relatively inexpensive MC simulations were leveraged to help predict the diffusion properties, and we demonstrate the performance improvement obtained with this method. Furthermore, we conducted a feature importance analysis to identify how the knowledge from the source task enhances the model for the target task, which elucidates the process and can help in choosing the optimal source task for TL.
Design, System, Application
Metal–organic frameworks (MOFs) are promising materials for various applications, and due to their highly tunable nature, there is a potentially infinite number of MOFs that can be synthesized by combining metal nodes and organic linkers. Recently, with the advent of various deep learning methods, the process of selecting and designing the best candidate materials for a given application has been explored by many research groups. However, most of these deep learning methods require large amounts of data, which limits their scope to studies where the data can be obtained quickly. Transfer learning (TL) is a way to overcome this limitation: one can in principle use the knowledge gained from a problem where data are cheap to obtain to solve a different but related problem where data are expensive. As a test case, we used the methane gas uptakes of over 20000 hypothetical MOFs to improve the prediction of methane diffusion coefficients of MOFs, where the latter require long computational times. In our opinion, this approach can be expanded to accelerate the discovery and design of new MOF candidates for various applications where data are sparse or difficult to obtain.
In this context, various deep learning methods have recently been developed and applied as principal methods to analyze these MOF databases.12–19 Burner et al. predicted CO2-related adsorption properties using a deep neural network that combines geometric and chemical descriptors.20 Rosen et al. performed quantum chemical calculations on MOFs to construct the QMOF database and applied machine learning (ML) to it, demonstrating the effectiveness of the ML approach in discovering MOFs with exceptional electronic properties.21 Lee et al. utilized a deep neural network as an efficient tool to explore the vast MOF space and design novel MOFs as methane adsorbents.11 However, these conventional deep learning methods all require sufficiently large datasets to train a model to an acceptable level of accuracy. For computational simulations with relatively short wall times, such as grand canonical Monte Carlo (GCMC) simulations, this is not much of a concern. However, certain simulations such as molecular dynamics (MD) and quantum chemical calculations demand significant computational resources and can therefore become the major bottleneck for machine learning studies.22 By this line of reasoning, it would be useful to take datasets from simulations with low computational cost and transfer that knowledge into machine learning models that predict properties requiring large computational resources.
This procedure of exploiting the knowledge from a large dataset so that a model can be trained properly on a smaller dataset is called transfer learning (TL).23 Recently, Yamada et al. developed XenonPy.MDL, a Python library for TL tasks across various types of materials (e.g. small organic molecules, polymers, and inorganic crystalline materials). The library holds more than 140000 models pre-trained on various property data, and they used these pre-trained models to improve prediction performance on small datasets by leveraging the knowledge of the big data.24 Jha et al. applied TL to develop a model that accurately predicts experimental formation energies.25 For MOFs, Ma et al. tested the transfer of knowledge between the same type of physical property (adsorption) within hypothetical MOFs using a deep neural network, and observed that TL can work between two different guest molecules, H2 and CH4.26
In this work, we seek to develop a model that predicts diffusion properties in MOFs by transferring knowledge learned from adsorption properties. As mentioned previously, diffusion properties require long MD simulations, so compared to adsorption property simulations (via GCMC), the computational cost of preparing a dataset for machine learning purposes is very large. As a case study, methane (CH4) was used as the test gas molecule to demonstrate the capabilities of our workflow. Methane is one of the most widely used target molecules for chemical separations (e.g. H2/CH4, N2/CH4, CO2/CH4, CH4/other hydrocarbons),27–30 so it is important to compute the diffusion coefficients of these small molecules. Moreover, given that many porous materials have large diffusion barriers separating their pores, many self-diffusion coefficients are very small, which requires running very long MD simulations.31 In this computational work, we find that the prediction of self-diffusion coefficients (a diffusion property) with a small dataset (<500) can be improved by up to 25% using knowledge from the gas uptakes (an adsorption property) of a relatively large dataset (>20000). Moreover, a feature importance analysis is conducted to elucidate why transfer learning is effective between diffusion and adsorption properties. This transferability can open up new opportunities to build machine learning models for properties with high computational cost, such as results from ab initio quantum chemistry methods or experiments.
Fig. 2 Scatter plot between the methane uptake and self-diffusion coefficient under the dilute condition. The color bar denotes the density of MOFs within the vicinity of the points.
To leverage knowledge in predicting the target property (self-diffusion coefficient), a model pre-trained on the source property (gas uptake) must first be prepared. As such, the methane gas uptakes of hMOFs at 12 different pressures from 0.25 bar to 100 bar were obtained. 23845 hMOFs from the PORMAKE and ToBaCCo databases were used, and the dataset was divided into training, validation, and test sets in the ratio 72:8:20 for cross-validation. The uptake values were normalized to between 0 and 1 during training. A separate machine learning model was trained for each pressure, and all of the models showed high performance (R2 score >0.9) (see Fig. 3). One can also see a slight increase in the R2 score at higher pressures, which follows the results of previous work.34,35 We can therefore conclude that our pre-trained models were trained properly on the source domain (hMOFs and methane gas uptake) and were ready to be used for training on the target domain (experimentally synthesized MOFs and self-diffusion coefficients).
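As a concrete illustration of this pre-training step, the sketch below trains a small multilayer perceptron on descriptor vectors and methane uptakes. It is a minimal sketch, not the authors' code: the file names, layer widths, and training schedule are assumptions, and the 55-dimensional input assumes the 5 geometric plus 50 energy-histogram descriptors described in the Methods.

```python
# Minimal pre-training sketch (illustrative, not the authors' exact pipeline).
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split

X = np.load("hmof_descriptors.npy").astype(np.float32)  # hypothetical file, shape (23845, 55)
y = np.load("uptake_100bar.npy").astype(np.float32)     # hypothetical file, shape (23845,)
y = (y - y.min()) / (y.max() - y.min())                 # normalize uptakes to [0, 1]

# 72:8:20 split: carve out the 20% test set, then 10% of the remainder (8% overall) for validation
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.10, random_state=0)

model = nn.Sequential(
    nn.Linear(X.shape[1], 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):  # full-batch training; early stopping on the validation set omitted
    opt.zero_grad()
    loss = loss_fn(model(torch.from_numpy(X_tr)).squeeze(-1), torch.from_numpy(y_tr))
    loss.backward()
    opt.step()

torch.save(model.state_dict(), "pretrained_100bar.pt")  # reused in the fine-tuning step below
```

One model of this form would be trained per pressure, giving 12 candidate source models.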
For the performance evaluation of both the TL and direct learning (DL, training without a pre-trained model) models, the R2 score was mainly used. There were enhancements in the performance of predicting diffusion properties when leveraging knowledge from adsorption properties at several pressures for sizes 300 and 500. In the case of size 100, however, no significant improvements were found despite using the pre-trained model, and at certain pressures the performance deteriorated compared to DL. As shown in Fig. 4a, the highest R2 score of the TL models among the 12 pressures for size 100 is 0.175, a 19.8% improvement over the R2 score of the DL model (R2DL,size100 = 0.146). Although there are improvements in R2 scores at certain pressures among the size-100 models, the R2 scores of the worst cases were negative, meaning that there was no significant relationship between the true and predicted values. Meanwhile, the highest R2 scores among the TL models at sizes 300 and 500 improved by 24.8% and 15.7%, respectively, over the R2 scores of the DL models (R2DL,size300 = 0.406, R2TL(100bar),size300 = 0.507; R2DL,size500 = 0.491, R2TL(2bar),size500 = 0.568), and the lowest R2 scores were still positive.

This type of improvement was retained when the performance metric was changed to the root mean squared error (RMSE) and mean absolute error (MAE). RMSE and MAE decreased for the top two TL models among the 12 pressures compared to the DL model (Fig. 4b and ESI† Fig. S2). Considering that a small RMSE denotes high performance, the highest RMSE among the TL models at sizes 300 and 500 is only 1.05% and 2.32% higher, respectively, than the RMSE of the DL model (RMSEDL,size300 = 0.568, RMSEDL,size500 = 0.527), whereas at size 100 there was a 19.7% increase in RMSE compared to the DL model (RMSEDL,size100 = 0.631, see Fig. 4b). Likewise, with MAE as the evaluation metric, the highest MAE among the TL models at size 100 is 16.0% higher than the MAE of the DL model (MAEDL,size100 = 0.463), while those at sizes 300 and 500 show no increase (MAEDL,size300 = 0.416, MAEDL,size500 = 0.377, see ESI† Fig. S2). The failure of the pre-trained model at the smallest data size is ascribed to the difference between gas uptake and the self-diffusion coefficient: because the first layer of the pre-trained model was frozen during the TL task, the model cannot find a meaningful relationship between the MOF representation from the first layer and the self-diffusion coefficients with such a small dataset. As the dataset grows, however, the model can learn the patterns unique to the relationship between the MOF representation and the self-diffusion coefficients.
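For reference, the three metrics quoted throughout this section can be computed as in the short sketch below (standard scikit-learn calls; `y_true` and `y_pred` stand for the simulated and predicted self-diffusion coefficients of a test set).

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def evaluate(y_true, y_pred):
    """R2, RMSE, and MAE for one model on one test set."""
    r2 = r2_score(y_true, y_pred)    # negative when the model is worse than predicting the mean
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    mae = mean_absolute_error(y_true, y_pred)
    return r2, rmse, mae
```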
For further analysis, we compared the R2 scores of the TL and DL models over 50 random draws for the best and worst TL cases (see Fig. 4c and d). The models pre-trained on gas uptakes at 100 bar and 0.25 bar turned out to be the best and worst source models, respectively, for accurate self-diffusion coefficient prediction: the 100 bar model ranked at least second in all nine cases, while the 0.25 bar model ranked last or second-to-last in eight out of nine cases (the nine cases being the R2, RMSE, and MAE results for sizes 100, 300, and 500). Of these results, the size-300 data were analyzed further because this size showed the largest improvement in prediction performance from applying the pre-trained model compared to DL. As shown in Fig. 4c, 300-TL-100 (navy) showed a higher R2 score than 300-TL-0.25 (green) in 39 of the 50 randomly drawn sets. The R2TL,test of 300-TL-100 reached 0.722, much higher than that of 300-TL-0.25, which reached 0.658. The trend was maintained when the evaluation metric was changed to RMSE and MAE: 300-TL-100 (navy) showed a lower value than 300-TL-0.25 (green) in 39 of the 50 randomly drawn sets with respect to RMSE, and in 43 with respect to MAE (see ESI† Fig. S3). To identify how often the TL models outperform the DL models, R2TL − R2DL was calculated (see Fig. 4d), where R2TL − R2DL > 0 means that the TL model predicts better than the DL model. The knowledge from the gas uptake at 100 bar helped the model surpass the model without any prior knowledge in 37 of the 50 randomly drawn sets. 300-TL-0.25, however, surpassed the DL models in only 25 sets; borrowing knowledge from the gas uptake at 0.25 bar therefore made no significant difference on average and at times even disturbed the prediction of the self-diffusion coefficient.
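The bookkeeping behind such a comparison is simple, and a hedged sketch is given below. The callables `train_tl` and `train_dl` are hypothetical stand-ins for the two training pipelines; each takes the indices of one random draw and returns the resulting test-set R2 score.

```python
import numpy as np

def compare_draws(train_tl, train_dl, n_samples, size=300, n_draws=50, seed=42):
    """Count how often TL beats DL (R2_TL - R2_DL > 0) over repeated random training draws."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_draws):
        idx = rng.choice(n_samples, size=size, replace=False)  # one random draw of training MOFs
        wins += train_tl(idx) > train_dl(idx)                  # same draw for both pipelines
    return wins
```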
Furthermore, to investigate the effectiveness of TL as the data size increases, we compared the previous results with much larger data sizes of 1000 and 1500. For these sizes, only 5 random draws were conducted, considering that the raw dataset contains self-diffusion coefficients for only 1563 frameworks and the randomly drawn sets would therefore overlap substantially. Overall, although prediction performance was still enhanced at data sizes larger than 300, the performance gap between TL and DL shrinks as the data size increases (see Fig. 5). The best TL model (300-TL-100) showed a 25.0% improvement in the R2 score over the DL model at size 300, but the improvement fell to 13.4% at size 500 and approximately converged by size 1500. Nevertheless, since the R2 score of the TL model at size 300 is higher than that of the DL model at size 500, TL remains valuable in the small-data regime, where it can reduce the data collection that is the bottleneck of deep learning. Moreover, taking an R2 score >0.5 as the standard for moderate prediction accuracy,36 the model trained on only 300 data points was unacceptable for predicting the self-diffusion coefficient, but with the help of the pre-trained model it gains moderate predictive power. Even with different evaluation metrics (RMSE and MAE), the TL model at size 300 still performs better than the DL model at size 500 (see ESI† Fig. S4). Considering that the simulation time for a self-diffusion coefficient is >1800 times larger than that for a gas uptake, better performance can be achieved at 62.62% of the computational cost when 200 self-diffusion coefficient calculations are replaced with the 23845 gas uptake calculations (more details are given in Section S3 of the ESI†).
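The cost figure follows from simple bookkeeping; the sketch below reproduces it under the stated assumption of an MD:GCMC wall-time ratio of about 1800:1 (the exact percentage in the text comes from the measured timings in ESI† Section S3).

```python
# Back-of-the-envelope check of the quoted computational savings.
md_cost, gcmc_cost = 1800.0, 1.0                 # assumed relative wall times per calculation
dl_cost = 500 * md_cost                          # DL route: 500 MD calculations
tl_cost = 300 * md_cost + 23845 * gcmc_cost      # TL route: 300 MD + 23845 GCMC calculations
print(tl_cost / dl_cost)                         # ~0.626, i.e. roughly 62.6% of the DL cost
```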
Using the PFI results, we tried to explain why there is a performance difference among the TL models even though they share the same type of physical property from the GCMC simulations (albeit at different pressures). The PFI result from the DL model trained only on the self-diffusion coefficient data was set as a reference and compared with the PFI results from the pre-trained models regarded as the best (100 bar) and worst (0.25 bar) in the previous section. To construct a robust PFI reference, the model with the highest R2 score among the 5 models at size 1500 was selected. Comparing the importance of the top 10 features ranked by the PFI values of the reference model, the reference and best models share "void fraction" as the most important feature, whereas other features were more important for the worst model. It appears, then, that when the tendencies of the PFI values are similar, a larger performance improvement can be expected from TL. Interestingly, the "gravimetric surface area" (Grav ASA in Fig. 6a and b) and the "energy range at −220k–40k", which were not regarded as impactful features in the prediction of the self-diffusion coefficient, nevertheless show a high importance (see Fig. 6a and b).
In addition, to quantify the similarity of the PFI results, we calculated the Euclidean distance between two vectors consisting of the PFI values of the previously selected 10 features, one from the reference model and one from the pre-trained model; a small distance implies a high similarity between the two vectors. The distance for the best model is 0.305, much lower than the distance for the worst model, 0.409. From these results, we can say that the best model resembles the reference model more closely than the worst model does in terms of feature importance. Thus, PFI can serve as one way to identify whether a pre-trained model can provide meaningful knowledge for the TL task (see Section S4, ESI†).
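A minimal sketch of this similarity check is given below, assuming scikit-learn-style regressors; the models in this work are neural networks, so in practice the equivalent permutation loop would be implemented around the trained networks.

```python
import numpy as np
from sklearn.inspection import permutation_importance

def pfi_distance(ref_model, pre_model, X_ref, y_ref, X_pre, y_pre, k=10):
    """Euclidean distance between two models' PFI vectors on the reference
    model's top-k features; a small distance suggests transferable knowledge."""
    ref = permutation_importance(ref_model, X_ref, y_ref, n_repeats=10, random_state=0)
    pre = permutation_importance(pre_model, X_pre, y_pre, n_repeats=10, random_state=0)
    top_k = np.argsort(ref.importances_mean)[::-1][:k]   # reference model's top-k features
    return np.linalg.norm(ref.importances_mean[top_k] - pre.importances_mean[top_k])
```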
The hMOFs were generated using two top-down MOF construction packages: PORMAKE11 and ToBaCCo.10 PORMAKE excels at generating structures with a high degree of diversity thanks to its large building-block database, which contains 719 node building blocks, 234 edge building blocks, and 1775 topologies, and can thus properly reflect the high tunability of MOFs. ToBaCCo, on the other hand, generates more synthesizable structures because it restricts its database to 41 highly symmetric topologies and to building blocks already used in synthesized MOFs.10,40 Altogether, a total of 23845 hMOFs were obtained from the two construction tools (12605 from PORMAKE and 11240 from ToBaCCo). Since the hMOFs generated by PORMAKE and ToBaCCo contain no solvents within their structures, the solvent-removed CoRE MOF 2019 dataset was selected to impose the same conditions on both databases.
The self-diffusion coefficients were obtained from MD simulations using the LAMMPS software package.44 To calculate the self-diffusion coefficient, the mean squared displacement (MSD) of the methane molecules was recorded and the Einstein relation was used (eqn (1)).
$$D_\mathrm{s} = \lim_{t \to \infty} \frac{1}{6Nt} \sum_{i=1}^{N} \left\langle \left| \mathbf{r}_i(t) - \mathbf{r}_i(0) \right|^2 \right\rangle \qquad (1)$$

where N is the number of methane molecules and r_i(t) is the position of molecule i at time t.
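In practice, the coefficient is extracted from the slope of the MSD at long times. The function below is a hedged sketch of this analysis (a single time origin, without the averaging over origins that production analyses usually include); `positions` is assumed to be an unwrapped trajectory of shape (n_frames, n_molecules, 3).

```python
import numpy as np

def self_diffusion(positions, dt):
    """Self-diffusion coefficient from the Einstein relation; units follow the inputs."""
    disp = positions - positions[0]                 # displacement of every molecule from t = 0
    msd = (disp ** 2).sum(axis=2).mean(axis=1)      # squared displacement averaged over molecules
    t = np.arange(len(msd)) * dt
    half = len(t) // 2                              # fit only the late, linear (diffusive) regime
    slope = np.polyfit(t[half:], msd[half:], 1)[0]
    return slope / 6.0                              # MSD ~ 6*D*t in three dimensions
```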
Five geometric descriptors (the largest cavity diameter, pore limiting diameter, volumetric surface area, gravimetric surface area, and void fraction) were obtained with the Zeo++ software.51 Surface areas were calculated with a nitrogen probe and void fractions with a helium probe. Moreover, energy descriptors were used to bring chemical factors into the prediction.52 The energy descriptors were created in two steps (see Section S5 of the ESI† for details). First, energy grids (energy values calculated on a grid at specified regular intervals) were generated with a spacing of 1 angstrom using the GRIDAY algorithm.53 Then, the values within each energy grid were collected into a histogram with 50 bins, and the normalized bin counts were used as features for the deep neural network.
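The histogramming step might look like the sketch below; the bin range and the clipping of extreme grid values are assumptions for illustration, as the actual bin edges are given in ESI† Section S5.

```python
import numpy as np

def energy_histogram(grid_energies, n_bins=50, e_range=(-5000.0, 5000.0)):
    """Convert raw energy-grid values into 50 normalized histogram counts."""
    clipped = np.clip(grid_energies, *e_range)               # assumed handling of extreme values
    counts, _ = np.histogram(clipped, bins=n_bins, range=e_range)
    return counts / counts.sum()                             # normalized counts used as features

grid = np.random.normal(0.0, 1000.0, size=40 ** 3)           # stand-in for a real GRIDAY grid
features = energy_histogram(grid)                            # 50 values, summing to 1
```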
To transfer the knowledge, the model was first trained with the hypothetical MOF databases (PORMAKE and ToBaCCo) as inputs and the gas uptakes at a given pressure as outputs. Then, the pre-trained model was fine-tuned with the experimentally synthesized MOFs (CoRE MOF database) as inputs and the self-diffusion coefficients as outputs. During fine-tuning, the weights between the input layer and the first hidden layer are frozen while the remaining weights are tuned to find the optimal values for predicting the self-diffusion coefficient. The data and associated scripts for the TL models in this work are available at https://github.com/YunsungLim/TL-from-adsorption-to-diffusion-in-MOFs.
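Continuing the hypothetical pre-training sketch from above, the fine-tuning step would look like the following: load the pre-trained weights, freeze the first layer, and train the rest on the small diffusion dataset (random tensors stand in for the CoRE MOF descriptors and diffusion targets here).

```python
import torch
import torch.nn as nn

model = nn.Sequential(                    # same architecture as the pre-training sketch
    nn.Linear(55, 128), nn.ReLU(),        # 55 = 5 geometric + 50 energy-histogram features (assumed)
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
model.load_state_dict(torch.load("pretrained_100bar.pt"))

for p in model[0].parameters():           # freeze input -> first hidden layer weights
    p.requires_grad = False

opt = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.MSELoss()

X_ft = torch.randn(300, 55)               # placeholder: descriptors of 300 CoRE MOFs
y_ft = torch.randn(300)                   # placeholder: their self-diffusion targets
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(model(X_ft).squeeze(-1), y_ft)
    loss.backward()
    opt.step()
```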
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2me00082b