Discovery of Novel Reticular Materials for Carbon Dioxide Capture using GFlowNets

Artificial intelligence holds promise to improve materials discovery. GFlowNets are an emerging deep learning algorithm with many applications in AI-assisted discovery. Using GFlowNets, we generate porous reticular materials, such as metal organic frameworks and covalent organic frameworks, for applications in carbon dioxide capture. We introduce a new Python package (matgfn) to train and sample GFlowNets. We use matgfn to generate the matgfn-rm dataset of novel and diverse reticular materials with gravimetric surface area above 5000 m$^2$/g. We calculate single- and two-component gas adsorption isotherms for the top-100 candidates in matgfn-rm. These candidates are novel compared to the state-of-the-art ARC-MOF dataset and rank in the 90th percentile in terms of working capacity compared to the CoRE2019 dataset. We discover 15 materials outperforming all materials in CoRE2019.


Introduction
Artificial intelligence holds promise to improve the scientific method [1,2] and to accelerate scientific discovery. Applied to materials, AI unlocks vast search spaces and enables novel applications in pharmaceuticals [3,4,5,6], batteries or carbon capture [7].
Reticular materials [8] such as metal organic frameworks (MOFs) and covalent organic frameworks (COFs) are extended periodic structures connected via strong bonds [9]. They are synthesized by connecting building blocks known as secondary building units to form three-dimensional periodic structures [10]. By choosing the building blocks, the properties of a reticular material can be tuned to support many applications [8].
Reticular materials with high gravimetric surface area are particularly useful for applications in carbon capture, since carbon dioxide molecules adsorb at the internal surface [11]. The larger the gravimetric surface area, the more gas molecules can be adsorbed per gram of material.
In this work, we use GFlowNets to generate reticular materials with high gravimetric surface area for applications in carbon capture. Our key contributions are:
1. The matgfn Python library for training and sampling using GFlowNets.
2. A workflow using matgfn to generate reticular materials using secondary building units.
3. The matgfn-rm dataset of diverse and novel reticular materials with gravimetric surface area higher than 5000 m$^2$/g.
The top-100 reticular material candidates are novel compared to the reference ARC-MOF dataset and rank in the 90th percentile in terms of working capacity compared to the CoRE2019 dataset. We discover 15 materials outperforming all materials in CoRE2019.

Background and related work
Generative Flow Networks. GFlowNets [12,13] are an emerging machine learning algorithm with many applications in AI-assisted materials discovery [14]. GFlowNets learn to generate composite objects $x$ by sampling from an unnormalised distribution $p(x) \propto R(x)$, where $R(x)$ is a user-specified positive reward function. A composite object $x$ consists of symbols drawn from a vocabulary $V$ and relationships between those symbols. For example, $x$ can be a sequence $x = [x_1, x_2, \ldots, x_n]$ or a graph.
Building hypothetical reticular frameworks. Trillions of hypothetical frameworks such as MOFs or COFs can be generated by placing secondary building units [10] into the nodes and edges of a three-dimensional topology [15]. A secondary building unit is an organic molecule or a coordination compound (a metal linked to organic atoms). A topology is a three-dimensional arrangement of nodes and edges. Replacing nodes and edges with secondary building units results in a three-dimensional point cloud of atoms connected by covalent or metal-organic bonds. We use the pormake secondary building units [16] and topology codes from the Reticular Chemistry Structure Resource [15]. Previously, deep autoencoders [17] and evolutionary methods [16] have been used to generate frameworks using this approach.
Reference datasets. We use two reference datasets in this work. These datasets are not used for training models; they serve only as a comparison once training is done, since a GFlowNet generates candidates using just a reward function. The CoRE2019 dataset [18] consists of 12,023 metal-organic frameworks with carbon dioxide uptake properties calculated by Moosavi et al. [19] using Grand Canonical Monte Carlo. ARC-MOF (reported in 2022) [20] is a collection of 279,610 MOFs from previous MOF datasets. It contains both experimental and hypothetical MOFs.

Generating reticular frameworks with GFlowNets
Python package. We built a Python library called matgfn to train and sample GFlowNets. The library is built on top of PyTorch [21] and Gymnasium [22] and prioritises ease of use and code readability. We intend for matgfn to be a general Python package for the generation of diverse types of materials, from small molecules to framework materials. Architecturally, matgfn separates sampling, loss calculation, optimisation, and environment definition into modular Python classes. Each can be modified individually, for example to implement off-policy training or to use improved losses. We note that torchgfn makes similar architectural choices.

Environment for reticular framework generation
We configure a GFlowNet environment to build string sequences of the form ["N577", "N238", "N194", "E5", "E3", "E74"]. Here N represents node building blocks and E represents edge building blocks in the pormake database. The string sequences are then transformed into a Crystallographic Information File (*.cif) by pormake to create a reticular framework. Not all strings create valid materials, so during generation the building blocks are restricted such that (a) each topology has the correct number of nodes and edges, (b) the building blocks are placed in the correct order and (c) each slot has a compatible building block.
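As an illustration of constraints (a) and (b), the following minimal sketch checks a candidate token sequence against the number of node and edge slots of a topology. The function names are hypothetical and are not part of matgfn or pormake. The sketch assumes, following the example sequence above, that all node blocks precede all edge blocks. Constraint (c) requires building block data from pormake and is not covered.

```python
from typing import List


def is_valid_sequence(tokens: List[str], n_nodes: int, n_edges: int) -> bool:
    """Check slot counts (a) and node-before-edge ordering (b).

    Hypothetical helper: slot compatibility (c) is deliberately omitted.
    """
    if len(tokens) != n_nodes + n_edges:
        return False
    node_part, edge_part = tokens[:n_nodes], tokens[n_nodes:]
    # Node tokens look like "N577", edge tokens like "E74".
    nodes_ok = all(t.startswith("N") and t[1:].isdigit() for t in node_part)
    edges_ok = all(t.startswith("E") and t[1:].isdigit() for t in edge_part)
    return nodes_ok and edges_ok


def allowed_block_type(position: int, n_nodes: int) -> str:
    """Return which block type ("N" or "E") may fill the slot at `position`."""
    return "N" if position < n_nodes else "E"
```

During generation, a check of this kind would typically be applied as an action mask, so that the sampler only proposes building blocks of the type returned by `allowed_block_type` for the current slot.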
Reward. We calculate the gravimetric surface area (GSA) in m$^2$/g with Zeo++ [23] during the training loop of the GFlowNet. We configure Zeo++ with a probe radius of 1.525 Å and 2000 samples. The reward given to the GFlowNet is defined in terms of the GSA, the Heaviside step function $H(x)$, with $H(x) = 0$ if $x < 0$ and $H(x) = 1$ if $x \ge 0$, and a cutoff $C$. Zeo++ and pormake sometimes raise errors due to large distances between atoms. The reward is set to zero when an error occurs, to encourage the GFlowNet to avoid materials with unrealistic bond lengths.
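The sketch below shows one way a cutoff-based reward with error handling could be written. The functional form used here, the GSA multiplied by $H(\mathrm{GSA} - C)$, is an assumption made for illustration and may differ from the reward used in this work. The `compute_gsa` callable is a hypothetical stand-in for the pormake and Zeo++ pipeline.

```python
from typing import Callable, Sequence


def heaviside(x: float) -> float:
    """Heaviside step function: 0 for x < 0, 1 for x >= 0."""
    return 0.0 if x < 0 else 1.0


def reward(
    sequence: Sequence[str],
    compute_gsa: Callable[[Sequence[str]], float],
    cutoff: float,
) -> float:
    """Cutoff-based reward (ASSUMED form: GSA * H(GSA - cutoff)).

    `compute_gsa` is a hypothetical callable that builds the framework and
    returns its gravimetric surface area in m^2/g, raising on failure.
    """
    try:
        gsa = compute_gsa(sequence)
    except Exception:
        # Structure building or surface-area calculation failed: zero reward.
        return 0.0
    return gsa * heaviside(gsa - cutoff)
```

Under this assumed form, any material whose surface area falls below the cutoff receives the same zero reward as a failed calculation.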
Relationship with CO$_2$ capture. We check whether the gravimetric surface area predicts CO$_2$ uptake by analysing approximately 30,000 MOFs from three databases: CoRE2019 [24], ARABG [25] and BW20K [26]. We performed univariate linear regression of CO$_2$ uptake at 16 bar on each of the geometric and chemical descriptors. The best performing descriptor was the gravimetric surface area, with a coefficient of determination of 0.88, an RMSE of 2.41 mol kg$^{-1}$ and a Spearman's rank correlation coefficient of 0.97. Figure 1 shows the CO$_2$ uptake as a function of gravimetric surface area. We validated the regression using 50 rounds of 10-fold cross validation, with each cross-validation consisting of an 80-20 split between training and test data. The mean coefficient of determination is 0.88 ± 0.0002 and the mean RMSE is 2.41 ± 0.022 mol kg$^{-1}$. The training and test values of the coefficient of determination and RMSE agree to two decimal places, and the standard deviations of these metrics during cross validation are very small, which shows that the correlation is robust and stable.
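The regression and cross-validation procedure can be sketched in plain Python as follows. This is a minimal illustration using an ordinary least-squares line and a generic k-fold split; it is not the code used to produce the reported numbers, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r2_and_rmse(x, y, slope: float, intercept: float) -> Tuple[float, float]:
    """Coefficient of determination and root-mean-square error."""
    n = len(y)
    my = sum(y) / n
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot, math.sqrt(ss_res / n)


def k_fold_scores(x, y, k: int = 10, seed: int = 0) -> List[Tuple[float, float]]:
    """Fit on k-1 folds, score (R^2, RMSE) on the held-out fold, for each fold."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        slope, intercept = fit_line(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        scores.append(
            r2_and_rmse(
                [x[i] for i in test_idx], [y[i] for i in test_idx],
                slope, intercept,
            )
        )
    return scores
```

Repeating `k_fold_scores` with different seeds, for example `seed` from 0 to 49, gives the repeated cross-validation rounds from which a mean and standard deviation of each metric can be computed.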

The matgfn-rm dataset
Training. We trained a GFlowNet using the Trajectory Balance loss [27] and an LSTM flow model. We use a learning rate of $5 \times 10^{-3}$ for both the flow model and the partition function. We train for a maximum of 100,000 episodes and stop when the mean loss over 10,000 episodes is lower than 1.8. Eleven topologies were chosen: CDZ-E, CLD-E, EFT, FFC, TSG, TFF, ASC, DMG, DNQ, FSO, URJ. For each topology, two GFlowNets were trained, one with edges and one without. The performance is shown in the Supplementary Information. Once trained, the GFlowNets were sampled to generate the matgfn-rm dataset of over 1 million hypothetical reticular frameworks.
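For reference, the Trajectory Balance objective for a single trajectory can be written as a short function. This is a generic sketch of the loss from [27] in plain Python, not the matgfn implementation; the function and argument names are hypothetical. When sequences are built by appending one token at a time, each state has a single parent, so the backward log-probabilities are all zero and can be omitted.

```python
import math
from typing import Sequence


def trajectory_balance_loss(
    log_z: float,
    log_pf_steps: Sequence[float],
    log_reward: float,
    log_pb_steps: Sequence[float] = (),
) -> float:
    """Squared log-ratio between forward flow and reward-weighted backward flow.

    loss = (log Z + sum log P_F - log R - sum log P_B)^2
    """
    forward = log_z + sum(log_pf_steps)
    backward = log_reward + sum(log_pb_steps)
    return (forward - backward) ** 2
```

The loss is zero exactly when the learned partition function times the forward trajectory probability equals the reward times the backward trajectory probability. In training, `log_z` would be a learnable scalar, which is consistent with the partition function having its own learning rate.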

Diversity analysis
We compare the top-100 and top-100,000 candidates from matgfn-rm to the ARC-MOF dataset. For each CIF file, we compute the average minimum distance (AMD) descriptor [28] of length 100. We then perform dimensionality reduction to two dimensions using t-SNE as implemented in scikit-learn [29]. Figure 2 shows the result. There, the top-100 and top-100,000 matgfn-rm materials are separated from most materials in ARC-MOF. This separation indicates that the generated materials are structurally distinct from those in existing datasets.
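To make the AMD descriptor concrete, the following brute-force sketch computes it for a small periodic point set: for each point in the unit cell, the distances to its k nearest neighbours in the infinite crystal are sorted, and each position is then averaged over the points of the cell. This is an illustrative implementation with a hypothetical function name, not the one used in this work. It is only valid when the k nearest neighbours lie within `reach` unit cells of the origin cell.

```python
import itertools
import math
from typing import List, Sequence


def average_minimum_distance(
    lattice: Sequence[Sequence[float]],
    frac_motif: Sequence[Sequence[float]],
    k: int,
    reach: int = 3,
) -> List[float]:
    """Brute-force AMD of length k for a 3D periodic point set.

    `lattice` holds three lattice vectors as rows; `frac_motif` holds the
    fractional coordinates of the points in one unit cell.
    """

    def to_cartesian(frac: Sequence[float]) -> List[float]:
        return [sum(frac[i] * lattice[i][j] for i in range(3)) for j in range(3)]

    shifts = list(itertools.product(range(-reach, reach + 1), repeat=3))
    rows = []
    for p in frac_motif:
        origin = to_cartesian(p)
        distances = []
        for q in frac_motif:
            for s in shifts:
                image = to_cartesian([q[i] + s[i] for i in range(3)])
                d = math.dist(origin, image)
                if d > 1e-9:  # skip the point itself
                    distances.append(d)
        distances.sort()
        rows.append(distances[:k])
    # Average the j-th nearest-neighbour distance over all motif points.
    return [sum(row[j] for row in rows) / len(rows) for j in range(k)]
```

For a primitive cubic lattice with unit edge length, each point has six neighbours at distance 1 followed by twelve at distance $\sqrt{2}$, which provides a simple check of the function.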
Simulated CO$_2$ capture performance. In order to confirm the expectation of efficient CO$_2$ capture from an adsorption proxy (i.e., the gravimetric surface area), we run physics-based Grand Canonical Monte Carlo simulations for the top-100 generated materials in the matgfn-rm dataset [30,31]. We simulated single-component adsorption isotherms for pure CO$_2$, from which we extract the CO$_2$ working capacity, and dual-component adsorption isotherms for dry flue gas (15% CO$_2$ and 85% N$_2$), from which we extract the CO$_2$/N$_2$ selectivity. All simulations were performed at 300 K, with pressures ranging from 0.15 to 16 bar. The working capacity was calculated as the difference in uptake of (single-component) CO$_2$ between 16 and 0.15 bar, while the selectivity was calculated as $S = \frac{Q_{\mathrm{CO_2}} / f_{\mathrm{CO_2}}}{Q_{\mathrm{N_2}} / f_{\mathrm{N_2}}}$, where $Q_i$ is the uptake of species $i$ at 0.15 bar and $f_i$ is the concentration of species $i$ in the input flue gas stream. Figure 3 shows the distribution of absolute (working capacity) and relative (selectivity) capture metrics for the top-100 matgfn-rm materials. All top-100 materials are (modestly) more selective towards CO$_2$ than N$_2$ and exhibit very high CO$_2$ working capacities, corresponding to the 90th percentile of the experimentally-realised CoRE2019 dataset [19]. Fifteen of the top-100 matgfn-rm materials have working capacities that are higher than all materials found in the CoRE2019 dataset. In particular, we highlight in Figure 4 the covalent organic framework 005-ffc-10217, which achieved the highest CO$_2$ working capacity of the top-100 matgfn-rm materials, around 44 mol/kg.
Relaxation and validity check. Due to the hypothetical nature of the generated MOFs, the crystalline structures are not guaranteed to be perfect. We therefore used the mofchecker library [32] to perform basic consistency checks on the generated CIFs. According to mofchecker, all of the top-100 matgfn-rm materials are porous (metal-)organic materials. However, because the hypothetical interatomic distances are sometimes larger (or shorter) than typical bond lengths, some atoms are flagged as either over- or under-coordinated. In order to obtain more realistic structures, we performed atomic coordinate and unit cell relaxation using the M3GNet [33] interatomic potential. Relaxing the structures solves most of the structural problems, with 98% presenting neither atomic overlaps nor over-coordination of C, N and H atoms. In particular, for the high-performing 005-ffc-10217 structure, relaxation led to a 23% reduction in the unit cell volume, bringing the CO$_2$ working capacity down to 37.5 mol/kg, which is still larger than those found in the CoRE2019 dataset. The relaxed pore size of 005-ffc-10217 is approximately 87 Å.
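The two capture metrics used in the simulated CO$_2$ capture analysis can be sketched as simple functions. The working capacity follows the definition in the text. The selectivity uses the standard adsorption selectivity, the ratio of uptakes divided by the ratio of feed concentrations, which is assumed here to be the definition intended. The function names and the numbers in the usage note are hypothetical.

```python
def working_capacity(uptake_high: float, uptake_low: float) -> float:
    """CO2 working capacity: uptake at 16 bar minus uptake at 0.15 bar (mol/kg)."""
    return uptake_high - uptake_low


def selectivity(
    q_co2: float,
    q_n2: float,
    f_co2: float = 0.15,
    f_n2: float = 0.85,
) -> float:
    """CO2/N2 selectivity (ASSUMED standard form): (Q_CO2 / Q_N2) / (f_CO2 / f_N2).

    q_* are uptakes from the dual-component isotherm at 0.15 bar;
    f_* are the concentrations in the dry flue gas feed.
    """
    return (q_co2 / q_n2) / (f_co2 / f_n2)
```

A selectivity of 1 means the adsorbed phase has the same composition as the feed; values above 1 indicate preferential adsorption of CO$_2$. For example, with made-up uptakes, `selectivity(0.30, 0.85)` is approximately 2.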

Conclusion
In summary, we built a workflow using GFlowNets to generate diverse and novel reticular frameworks with gravimetric surface area greater than 5000 m$^2$/g. As a key result, the top-100 candidates of the resulting matgfn-rm dataset have working capacities in the 90th percentile of the CoRE2019 reference dataset. Moreover, 15 of the top-100 matgfn-rm materials have working capacities that are higher than all materials found in the CoRE2019 dataset. Further tests are underway to confirm the stability and synthesizability of the materials generated in our study. Nevertheless, our results demonstrate the potential of GFlowNets for materials discovery in carbon capture applications.

Figure 1: Regression of simulated high-pressure CO$_2$ uptake on gravimetric surface area.

Figure 2: Two-dimensional t-SNE embedding of the average minimum distance of ARC-MOF (green), the top-100,000 (gray) and top-100 (orange) materials from matgfn-rm.

Figure 3: Simulated CO$_2$ working capacity and CO$_2$/N$_2$ selectivity for the top-100 matgfn-rm materials. The (red) dashed line represents the highest working capacity found in the CoRE2019 dataset, which is surpassed by 15 of the top-100 matgfn-rm materials.

Figure 4: A render of the relaxed structure of 005-ffc-10217, the highest performing structure in the matgfn-rm dataset.

Figure 7: Performance of the GFlowNet trained on the CDZ-E topology.

Figure 8: Performance of the GFlowNet trained on the CDL-E topology.

Figure 9: Performance of the GFlowNet trained on the EFT topology.

Figure 10: Performance of the GFlowNet trained on the FFC topology.

Figure 11: Performance of the GFlowNet trained on the TSG topology.

Figure 12: Performance of the GFlowNet trained on the TFF topology.

Figure 13: Performance of the GFlowNet trained on the ASC topology.

Figure 14: Performance of the GFlowNet trained on the DMG topology.

Figure 15: Performance of the GFlowNet trained on the DNQ topology.

Figure 16: Performance of the GFlowNet trained on the FSO topology.

Figure 17: Performance of the GFlowNet trained on the URJ topology.
A GFlowNet object $x$ is a sequence $[x_1, x_2, \ldots, x_n]$ or a graph. The object is built step by step through a Markov Decision Process restricted to a directed acyclic graph. Transition probabilities $p(x_{i+1} \mid x)$ are approximated by a neural network called a flow model. GFlowNets need fewer evaluations of the reward function to generate samples with high reward, novelty and diversity when compared to alternatives such as Markov Chain Monte Carlo, Proximal Policy Optimisation or Bayesian Optimisation [12].
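The sampling target $p(x) \propto R(x)$ can be illustrated on a toy sequence space that is small enough to enumerate. The sketch below normalises a reward over all sequences of a given length; the vocabulary and reward are made up for illustration. A GFlowNet is trained to sample from this distribution without enumerating the space, which is what makes it applicable to very large spaces of candidate materials.

```python
import itertools
from typing import Callable, Dict, Sequence, Tuple


def target_distribution(
    vocabulary: Sequence[str],
    length: int,
    reward: Callable[[Tuple[str, ...]], float],
) -> Dict[Tuple[str, ...], float]:
    """Enumerate all sequences and normalise a positive reward into p(x)."""
    objects = list(itertools.product(vocabulary, repeat=length))
    rewards = [reward(x) for x in objects]
    total = sum(rewards)
    return {x: r / total for x, r in zip(objects, rewards)}


def toy_reward(x: Tuple[str, ...]) -> float:
    """Made-up positive reward: one plus the number of "A" symbols."""
    return 1.0 + x.count("A")
```

With the vocabulary `["A", "B"]` and length 2, the four sequences have rewards 3, 2, 2 and 1, so the sequence `("A", "A")` is sampled with probability 3/8.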