Identifying porous cage subsets in the Cambridge Structural Database using topological data analysis

Because metal–organic cages (MOCs) and organic cages (OCs) are rationally designable materials, the variety and number of synthesised cages in the Cambridge Structural Database (CSD) are expected to grow. In this regard, two of the most important questions are: which structures are already present in the CSD, and how can they be identified? Here, we present a cage mining methodology based on topological data analysis and a combination of supervised and unsupervised learning. It led to the derivation of what is, to the best of our knowledge, the first and only MOC dataset, containing 1839 structures, and the largest experimental OC dataset, containing 7736 cages, as of March 2022. We illustrate the use of these datasets with a high-throughput screening of MOCs and OCs for xenon/krypton separation; both gases are important in multiple industries, including healthcare.


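For a matched pair of points $p = (b, d) \in P$ and $q = (b', d') \in Q$, the sup norm is:

$$\lVert p - q \rVert_\infty = \max\{\lvert b - b' \rvert,\ \lvert d - d' \rvert\}$$

We can, therefore, compute the sup norm of all the partially matched points. For unmatched points of coordinates $(b, d)$, we take the sup norm to their closest point on the diagonal. The closest point has coordinates $\left(\tfrac{1}{2}(b + d),\ \tfrac{1}{2}(b + d)\right)$. For the unmatched points, we therefore take:

$$\left\lVert (b, d) - \left(\tfrac{1}{2}(b + d),\ \tfrac{1}{2}(b + d)\right) \right\rVert_\infty = \tfrac{1}{2}\lvert d - b \rvert$$

Given $P$, $Q$, $M$ and the defined sup norm, we define the cost of $M$ as:

$$c(M) = \max\left\{ \max_{(p, q) \in M} \lVert p - q \rVert_\infty,\ \max_{(b, d)\ \text{unmatched}} \tfrac{1}{2}\lvert d - b \rvert \right\}$$

Finally, the bottleneck distance $d_b$ between $P$ and $Q$ is defined as the cost of the most efficient partial match:

$$d_b(P, Q) = \min_{M} c(M)$$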
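The definitions above can be illustrated with a minimal brute-force sketch. It is not the implementation used in this work: it assumes each persistence diagram is a short list of (birth, death) pairs, and it enumerates every partial match by padding both diagrams with "diagonal slots", which is only practical for a handful of points. The function names `sup_norm`, `diagonal_cost` and `bottleneck_distance` are hypothetical.

```python
from itertools import permutations


def sup_norm(p, q):
    """Sup norm between two diagram points p = (b, d) and q = (b', d')."""
    return max(abs(p[0] - q[0]), abs(p[1] - q[1]))


def diagonal_cost(p):
    """Sup norm from an unmatched point (b, d) to its closest diagonal point."""
    return abs(p[1] - p[0]) / 2.0


def bottleneck_distance(P, Q):
    """Brute-force bottleneck distance between two small diagrams.

    Rows are the points of P followed by len(Q) diagonal slots; columns are
    the points of Q followed by len(P) diagonal slots. A permutation of the
    columns then encodes one partial match M, with unmatched points sent to
    the diagonal.
    """
    n, m = len(P), len(Q)
    size = n + m
    if size == 0:
        return 0.0

    def cost(i, j):
        if i < n and j < m:      # p_i matched with q_j
            return sup_norm(P[i], Q[j])
        if i < n:                # p_i left unmatched
            return diagonal_cost(P[i])
        if j < m:                # q_j left unmatched
            return diagonal_cost(Q[j])
        return 0.0               # diagonal slot paired with diagonal slot

    best = float("inf")
    for perm in permutations(range(size)):
        match_cost = max(cost(i, perm[i]) for i in range(size))
        best = min(best, match_cost)
    return best


if __name__ == "__main__":
    # Illustrative diagrams only, not data from this work.
    print(bottleneck_distance([(0.0, 4.0), (1.0, 1.5)], [(0.0, 5.0)]))
```

In the usage example, matching (0, 4) with (0, 5) and leaving (1, 1.5) unmatched is the most efficient partial match. Diagrams of realistic size call for a dedicated topological data analysis library, since the number of permutations grows factorially.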

Imidazole-based cages
The imidazole-based cages describe structures where the metal atoms are connected to the organic ligands via at least four nitrogen atoms, two of which should be part of an imidazole. As most targeted cages have at least four metal atoms, four identical units of such atoms connected to an imidazole are repeated. Figure S3 gives three examples of structures obtained with this query. Note the variety of shapes: EHIHIN 1 is a tetrahedral cage, LAVMOM 2 has the shape of a funnel, and ZULJAT 3 is a helicate. 1,878 hits were obtained from this search.

Pyridine-based cages
The pyridine-based query describes structures where the metal atoms are connected to four pyridine compounds, each of which is then connected to a carbon atom. In addition, each entry should have at least three metal atoms. The green dashed line in Figure 5 separates the queries above and below, meaning only one of these pyridine units is necessary. These two queries should be combined in ConQuest as an AND statement. Figure S4 shows two examples of structures targeted with this query. 116 hits were obtained from this search.

Banana-shaped cages
The term 'banana' was coined by Han et al. to describe the shape of the ligands, not the overall cage. For the sake of simplicity, we will refer to these structures as banana-shaped. An example of such a cage is shown in Figure S5a. Other, non-banana-shaped cages can also be found with this query; an example of a spherical cage is given in Figure S5b. 379 hits were obtained with this search.

Bis(imino)pyridyl-based cages
The bis(imino)pyridyl-derived query describes structures where the metal atoms are part of a group containing two imidazole units which share the metal-nitrogen bond, and a pyridine unit which shares a bond and a nitrogen atom with each of the imidazole units. Figure S6 gives two examples of cages obtained with this query. Note that ZOKDEL 8 in Figure S6b is referred to by the original authors as a macrocycle. Because this macrocycle contains a hollow, it qualifies as a cage under our definition. 192 hits were obtained with this search.

Dioxolane/dioxane-based cages
The dioxolane/dioxane-based query addresses the case of structures where the metal atoms are connected to either a 1,3-dioxolane or a 1,3-dioxane, as well as variants of these heterocycles where certain carbon atoms can be replaced with nitrogen atoms. Figure S7 shows three examples of cages of different shapes obtained with this query: a cylinder (ADODUS 10), a helicate (ANITAT 11) and a tetrahedron (BOBZUP 12). 525 hits were obtained with this search.

Cyclotriveratrylene-derived cages
This query specifically targets the emerging field of cyclotriveratrylene-derived coordination cages. As the essence of these cages lies in their organic ligand, the query consists of a description of the cyclotriveratrylene ligand, accompanied by the presence of at least two metal atoms. Figure S8 gives two examples of such cages with different shapes. These cages are prone to forming structures with multiple cavities. Figure S8c gives an example of such a structure, where two cages, each with two distinct pores, are linked via an organic ligand. 85 hits were obtained with this search.

Large cages
Some cages are too large and do not have an assigned 2D chemical diagram, which means a substructure search in ConQuest will miss them. However, these structures have the word 'exceeded' in their textual description. This search returned 612 hits.

S4 Additional ConQuest queries used for reducing the search space of OCs in the CSD
The following queries for organic cages and rings were added to the general queries for OCs. Dotted lines correspond to 'any' type of bond. Superscript c means the corresponding atom should be cyclic. Superscript a means the corresponding atom should be acyclic. Subqueries highlighted in a red box refer to 'must not have' criteria. 'TN' means the corresponding atom is attached to N other atoms only.

S5 GCMC simulations
We used the multi-purpose code RASPA to perform GCMC simulations of the said mixture in the selected MOCs and OCs. 29 We used an atomistic model of each clean structure in which the atoms were kept fixed at their crystallographic positions. We used the standard Lennard-Jones (LJ) 12-6 potential to model the interactions between the framework and fluid atoms. The parameters for the framework atoms were taken from the Dreiding Force Field (DFF) 30 and, when not available there, from the Universal Force Field (UFF). 31 The Lorentz-Berthelot mixing rules were employed to calculate fluid-solid LJ parameters, and LJ interactions beyond the cutoff value of 12.8 Å were neglected. The simulation box for each structure was defined so that the cell lengths were larger than twice the cutoff distance. 20,000 Monte Carlo cycles were performed, the first third of which were used for equilibration and the remaining cycles for production. Monte Carlo moves consisted of insertions, deletions and displacements. In a cycle, N Monte Carlo moves were attempted, where N is defined as the maximum of 20 or the number of adsorbates in the simulation box. To calculate the gas-phase fugacity, we used the Peng-Robinson equation of state. 32
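The bookkeeping described above can be sketched in a few lines of Python. This is an illustrative sketch only, not RASPA code or its API: the function names are hypothetical, the LJ parameters in the usage example are placeholder numbers rather than DFF/UFF values, the replica count assumes an orthogonal cell, the potential is assumed to be simply truncated at the cutoff (no shift or tail correction), and "the first third" of the cycles is assumed to be rounded down.

```python
import math

CUTOFF = 12.8  # Å, cutoff value stated in the text


def lorentz_berthelot(sigma_i, sigma_j, eps_i, eps_j):
    """Lorentz-Berthelot rules: arithmetic mean of sigma, geometric mean of epsilon."""
    return 0.5 * (sigma_i + sigma_j), math.sqrt(eps_i * eps_j)


def lj_12_6(r, sigma, eps, cutoff=CUTOFF):
    """LJ 12-6 pair energy, neglected beyond the cutoff (plain truncation assumed)."""
    if r > cutoff:
        return 0.0
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 * sr6 - sr6)


def replicas_needed(cell_length, cutoff=CUTOFF):
    """Smallest number of unit cells along one axis so that the box length
    is strictly larger than twice the cutoff (orthogonal cell assumed)."""
    return math.floor(2.0 * cutoff / cell_length) + 1


def moves_per_cycle(n_adsorbates):
    """Number of Monte Carlo moves attempted in one cycle: max(20, N adsorbates)."""
    return max(20, n_adsorbates)


def split_cycles(total_cycles=20_000):
    """Split cycles into (equilibration, production); the first third,
    rounded down (assumption), is used for equilibration."""
    equilibration = total_cycles // 3
    return equilibration, total_cycles - equilibration


if __name__ == "__main__":
    # Placeholder parameters (sigma in Å, epsilon in K), for illustration only.
    sigma, eps = lorentz_berthelot(3.0, 4.0, 100.0, 25.0)
    print(sigma, eps, lj_12_6(4.0, sigma, eps))
    print(replicas_needed(10.0), moves_per_cycle(5), split_cycles())
```

For non-orthogonal cells, the replica count should be based on the perpendicular widths of the cell rather than on the cell lengths; that case is left out of this sketch.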

S6 CC3 vs M6L4
In this section, we attempt to explain the high variance in selectivity observed in Figure 11 for the same M6L4 cages (as opposed to the low variance shown by the CC3 cages). For this, we visually compared two M6L4 structures: one with a relatively low selectivity of 25 (CSD refcode: COPPAA 33 ) and the structure with the highest selectivity (CSD refcode: AJENIO 34 ). The two structures are presented in Figure S13. Although the individual cages share the same ligands, metal nodes and space groups, the size of the cells and the void fraction differ. Using Mercury, the CCDC software for structure visualisation and analysis, 35 we computed the surface surrounding the porous areas in both structures. The result in Figure S13 shows that the two surfaces differ significantly in shape. While a continuous channel runs through COPPAA from left to right, this channel is cut short in AJENIO. By comparing the two structures, we found that this difference in channel morphology is due to a difference in the bending of the organic ligands. To go from Figure S13a to Figure S13b, one can imagine pulling on the ligands at their centre, perpendicular to their length. This movement is indicated in Figure S13a by the yellow arrows. This difference in ligand bending possibly caused the observed differences in cell lengths, leading to an overall larger cell in the case of AJENIO, as well as larger pore volumes. These structural differences seem to have a large impact on the observed selectivities: a difference of 1 to 3% in cell lengths is associated with a 33% difference in void fraction and a selectivity that is 21 times higher in one structure than in the other. While the exact mechanism behind the difference in selectivity could be further investigated, the main take-away from this example is that slight differences in ligand bending lead to differences in pore morphology that can have a significant impact on the calculated performance of the structures.
These different bending angles could themselves be caused by different synthesis conditions, or could correspond to different states of a flexible structure.
Such high-impact structural variations were, however, not observed in the CC3-type structures. There are two possible reasons for this: (1) as shown in the corresponding figure (panels b and c), CC3 structures have shorter ligands, which are therefore harder to bend; (2) CC3 structures crystallise in cubic systems, which provide more efficient packing and less leeway for structural variations. Figure S14 shows the differences in packing in the two systems. This results in cages that are structurally extremely close, despite having been obtained under different conditions. The low structural variance in turn explains the observed low selectivity variance. While we were able to shed some light on the spread of selectivity values observed for M6L4 cages, this case study revealed how sensitive simulations can be to slight structural differences among similar or identical structures obtained under different conditions, or captured in different flexibility states. These cases show the limits of treating a host structure as rigid in molecular simulations, and they also illustrate the distribution of possible states for a given structure.

Figure S14
Differences in packing between CC3-type structures and M6L4-type structures. a. CC3 in its cubic system and b. COPPAA 33 in its tetragonal system. The cages are coloured for easier visualisation. The corresponding adsorption sites (obtained with SITES ANALYZER) 36 are shown in c. for CC3 and d. for COPPAA.