Ranking the synthesizability of hypothetical zeolites with the sorting hat

Zeolites are nanoporous alumino-silicate frameworks widely used as catalysts and adsorbents. Even though millions of siliceous networks can be generated by computer-aided searches, no new hypothetical framework has yet been synthesized. The needle-in-a-haystack problem of finding promising candidates among large databases of predicted structures has intrigued materials scientists for decades; yet, most work to date on the zeolite problem has been limited to intuitive structural descriptors. Here, we tackle this problem through a rigorous data science scheme—the “Zeolite Sorting Hat”—that exploits interatomic correlations to discriminate between real and hypothetical zeolites and to partition real zeolites into compositional classes that guide synthetic strategies for a given hypothetical framework. We find that, regardless of the structural descriptor used by the Zeolite Sorting Hat, there remain hypothetical frameworks that are incorrectly classified as real ones, suggesting that they might be good candidates for synthesis. We seek to minimize the number of such misclassified frameworks by using as complete a structural descriptor as possible, thus focusing on truly viable synthetic targets, while discovering structural features that distinguish real and hypothetical frameworks as an output of the Zeolite Sorting Hat. Further ranking of the candidates can be achieved based on thermodynamic stability and/or their suitability for the desired applications. Based on this workflow, we propose three hypothetical frameworks differing in their molar volume range as the top targets for synthesis, each with a composition suggested by the Zeolite Sorting Hat. 
Finally, we analyze the behavior of the Zeolite Sorting Hat with a hierarchy of structural descriptors including intuitive descriptors reported in previous studies, finding that intuitive descriptors produce significantly more misclassified hypothetical frameworks, and that more rigorous interatomic correlations point to second-neighbor Si–O distances around 3.2–3.4 Å as the key discriminatory factor.

calculated from the derivative of the Buckingham contributions at the inflection points. To compute the energies, we performed a constant-pressure optimization of the cell, atom cores, and atom shells. In the cases where the constant-pressure optimization did not converge, we performed a constant-volume optimization from scratch, in which only the geometry of the atom cores and shells was optimized.
For the hypothetical frameworks in the DEEM database, we generally found very good agreement between our energies (obtained by optimizing only the atom shells) and those calculated by Pophale et al. S2 , with a mean absolute error of 0.08 kJ/mol Si, a median absolute error of 0.07 kJ/mol Si, and a standard deviation in the discrepancy of 0.9 kJ/mol Si.
We computed the SOAP vector for each atomic environment centered on a Si atom, considering two cutoffs: 3.5Å and 6.0Å. For each environment cutoff, we computed both the two-body radial spectrum and the three-body power spectrum with a Gaussian atom width of 0.3 and a cutoff transition width of 0.3. For both the power spectrum and radial spectrum we used a basis of 8 radial functions, truncated through a PCA of the density coefficients S3 from a radial basis of 32 Legendre polynomials in the discrete variable representation (DVR). The power spectrum representations additionally use 9 angular functions in the density expansion.
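As a concrete illustration, the four descriptor configurations above can be written out as hyperparameter dictionaries. The key names below follow librascal's SphericalInvariants interface but are an assumption for illustration, not a reproduction of the authors' actual inputs:

```python
# Hypothetical sketch of the four SOAP configurations described above.
# Key names mimic librascal's SphericalInvariants interface (assumed here).
def soap_hypers(cutoff, soap_type):
    return {
        "soap_type": soap_type,            # "RadialSpectrum" (2-body) or "PowerSpectrum" (3-body)
        "interaction_cutoff": cutoff,      # environment cutoff: 3.5 or 6.0 (Angstrom)
        "cutoff_smooth_width": 0.3,        # cutoff transition width
        "gaussian_sigma_constant": 0.3,    # Gaussian atom width
        "gaussian_sigma_type": "Constant",
        "max_radial": 8,                   # PCA-truncated from 32 DVR radial functions
        # "9 angular functions" from the text; whether this maps to
        # max_angular = 9 or l_max = 8 depends on the library's convention
        "max_angular": 9 if soap_type == "PowerSpectrum" else 0,
        "radial_basis": "DVR",
    }

configs = [soap_hypers(rc, st)
           for rc in (3.5, 6.0)
           for st in ("RadialSpectrum", "PowerSpectrum")]
```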

Data Cleaning
Fig. S1 shows a histogram of the discrepancy between our calculated GULP energies and those reported for the frameworks in the DEEM database. The five frameworks for which the discrepancy was greater than 10 kJ/mol Si are highlighted with their database ID numbers and were subsequently discarded from all of our analyses.
Figure S1: Histogram of the discrepancy between our GULP calculations of the framework molar energy and the reported values for the approximately 330k structures in the DEEM database of hypothetical zeolites. Structures with energy discrepancies larger than 10 kJ/mol Si are highlighted in the histogram and were discarded from the machine learning analyses.
The distance between a DEEM and an IZA framework is taken as the Euclidean distance between their SOAP vectors averaged over the Si-centered environments within the structure. The purpose of the histogram is to identify those frameworks within the DEEM database that are "identical" to an IZA framework, which is suggested by very small DEEM-IZA distances. The histogram in Fig. S2 shows two peaks, one between 10⁻⁷ and 10⁻⁶ and the other at 10⁻³. As a result of this observation, we set the distance cutoff for identifying "identical" frameworks in the "no-man's land" between the two peaks. Following this reasoning, 5 × 10⁻⁶ appears to be a reasonable choice.
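A minimal sketch of this duplicate-detection step, assuming each framework is represented by its SOAP vector averaged over Si-centered environments (the function and variable names are hypothetical):

```python
import numpy as np
from scipy.spatial.distance import cdist

CUTOFF = 5e-6  # sits in the "no-man's land" between the two histogram peaks

def identical_to_iza(deem_soap, iza_soap, cutoff=CUTOFF):
    """Flag DEEM frameworks whose distance to the nearest IZA framework
    falls below the cutoff.  Each row of deem_soap/iza_soap is one
    framework's SOAP vector averaged over its Si environments."""
    d = cdist(deem_soap, iza_soap)       # pairwise Euclidean distances
    return d.min(axis=1) < cutoff

# toy check: a DEEM vector equal to an IZA vector is flagged, a distant one is not
iza = np.array([[0.0, 1.0], [1.0, 0.0]])
deem = np.array([[0.0, 1.0], [0.5, 0.5]])
mask = identical_to_iza(deem, iza)
```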

Train/Test Split
The train/test splits used in the different parts of the machine learning workflow vary slightly based on the ultimate goal of each particular exercise.
Figure S2: Histogram of distances between the 330k frameworks in the DEEM database of hypothetical zeolites and the 230 IZA structures. The distance is taken to be the Euclidean distance between the frameworks' average SOAP vectors (using a 6.0Å cutoff). The distance cutoff for declaring structures "identical" is 5 × 10⁻⁶.
Linear regression of molar volumes and energies, PCA decomposition
The training set includes all 10,000 frameworks used in Helfrecht et al. S4 , and the linear regression test set includes all 230 IZA frameworks and 250 DEEM frameworks selected randomly from the full 330k dataset that are not present in the train set, excluding the five frameworks with large energy discrepancies. None of the frameworks in the training set of 10,000 is subject to elimination based on its energy discrepancy.

Support vector machine, PCovR
The training set includes the DEEM frameworks used in Helfrecht et al. S4 that are not also determined to be identical to an IZA framework (through the Euclidean distance between SOAP vectors, see above) and that have an energy discrepancy less than 10 kJ/mol Si, and

Convex hull construction
The convex hull is built using all of the IZA and DEEM frameworks not eliminated during the data cleaning step, i.e., the union of the train and test sets used for the SVM and PCovR models. The construction is based on that of the Generalized Convex Hull, S5 though it is deterministic rather than probabilistic in nature. Specifically, we construct the convex hull in the space defined by the framework energies and the first two components of the PCovR model based on the full 6.0Å SOAP power spectrum feature vectors and the corresponding four-class decision functions from the classification exercise on these same feature vectors. In contrast to the original GCH framework, however, no repeated sampling or pruning of the hull vertices is performed.
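A minimal 2D sketch of the deterministic hull-distance idea, using a single latent coordinate plus the energy (the actual construction uses the first two PCovR components); `energy_above_hull` is a hypothetical helper, not the authors' code:

```python
import numpy as np
from scipy.spatial import ConvexHull

def energy_above_hull(x, energy):
    """Vertical (energy-axis) distance of each point to the lower convex
    hull of the (x, energy) point cloud -- a 2D sketch of the construction."""
    pts = np.column_stack([x, energy])
    hull = ConvexHull(pts)
    # facets whose outward normal points downward in energy form the lower hull
    lower = {i for simplex, eq in zip(hull.simplices, hull.equations)
             if eq[1] < 0 for i in simplex}
    lower_pts = pts[sorted(lower, key=lambda i: pts[i, 0])]
    # interpolate the lower envelope at each point's x and subtract
    e_hull = np.interp(x, lower_pts[:, 0], lower_pts[:, 1])
    return energy - e_hull

# toy data: two hull vertices at zero energy, two points above the hull
x = np.array([0.0, 2.0, 1.0, 1.0])
e = np.array([0.0, 0.0, 1.0, 0.5])
d_hull = energy_above_hull(x, e)   # points on the lower hull get distance 0
```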

IZA Predictions from DEEM
The overlap between the IZA and DEEM datasets can be assessed with both supervised and unsupervised learning. In Fig. S3, we show a projection of all IZA and DEEM frameworks (minus the DEEM frameworks with large energy discrepancies) onto the first three principal components of the 10,000 DEEM frameworks from Ref. S4 , with a full 3D projection in the upper panel and individual projections onto the xy, yz, and xz planes in the bottom panels.
The IZA frameworks, colored by class designation, appear clustered toward one end of the PCA space, though still largely within the space occupied by the DEEM frameworks. This overlap suggests that the structural diversity of the DEEM database largely encompasses that of the IZA database. However, the PCA projection only provides a qualitative means of assessing the "IZA-ness", or proximity of a DEEM framework to the IZA space. In contrast, a support vector machine provides this notion of proximity quantitatively through its decision function.
Figure S3: PCA projection of IZA and DEEM structures onto the first three principal components defined by a subset of 10,000 DEEM frameworks.
It is also possible to obtain a baseline understanding of the overlap between the IZA and DEEM spaces through a regression exercise. Table S1 shows the mean absolute errors (MAE) of molar volume and molar energy predictions on the IZA frameworks and a comparably sized, randomly selected set of DEEM frameworks from a ridge regression model again trained on the 10,000 DEEM frameworks from Ref. S4 . Both 3.5Å and 6.0Å SOAP power spectrum feature vectors are used to learn the molar volumes and energies. The ridge regression models are able to predict the molar energies of the IZA frameworks across all three subcategories (with the exception of IZA4, i.e., RWY) despite being trained only on DEEM frameworks.
Relatively speaking, the models have more difficulty in predicting the molar volumes of the IZA frameworks compared to the energies, though the predictions themselves are not entirely unreasonable, particularly in the 6.0Å case, again suggesting that there is substantial overlap between the IZA and DEEM SOAP feature spaces.

Linear Regression Setup

Energies and volumes
Predictor and target data were centered by the (columnwise) mean of the train set and scaled (divided) by the L2 (Frobenius) norm of the centered data divided by the square root of the number of samples. The regularization parameter of the regression is optimized by an order-of-magnitude grid search using five-fold cross-validation with a search range of 10⁻¹⁰ to 10⁰. The optimization targets are the mean absolute errors of the energy and molar volume predictions, respectively.
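The preprocessing and regularization search described above might be sketched as follows on synthetic stand-in data (the actual models use SOAP feature vectors and GULP energies/volumes):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))    # stand-in for SOAP features
y = X @ rng.normal(size=8)       # stand-in for molar energies/volumes

def center_scale(A, mean=None, norm=None):
    """Center by the train-set columnwise mean; scale by ||A_centered||_F / sqrt(n)."""
    A = np.atleast_2d(A)
    if mean is None:
        mean = A.mean(axis=0)
    Ac = A - mean
    if norm is None:
        norm = np.linalg.norm(Ac) / np.sqrt(A.shape[0])
    return Ac / norm, mean, norm

Xs, x_mean, x_norm = center_scale(X)
ys, y_mean, y_norm = center_scale(y.reshape(-1, 1))

# order-of-magnitude grid search for the regularization, five-fold CV,
# targeting the mean absolute error as in the text
search = GridSearchCV(Ridge(), {"alpha": np.logspace(-10, 0, 11)},
                      cv=5, scoring="neg_mean_absolute_error")
search.fit(Xs, ys.ravel())
```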

Compositions
The data pre-processing was performed in the same manner as in the regression exercise for the molar volumes and energies, but the regularization parameter was optimized with a search range of 10⁻¹⁰ to 10⁵ and two-fold cross-validation. The optimization target is the mean absolute error of the composition predictions.

Principal Component Analysis Setup
For the SOAP-based descriptors, the data were pre-processed in the same way as in the linear regression exercise for molar volumes and energies. The first three principal components were computed. For the "classical" descriptors (energy-volume and LIDs), the individual columns were standardized (i.e., to have mean zero and unit standard deviation) relative to the training set. The first two principal components were computed.
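A sketch of the classical-descriptor route, assuming scikit-learn's StandardScaler and PCA stand in for the standardization and decomposition described above (the data here are synthetic):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# toy stand-in for a "classical" descriptor table with unevenly scaled columns
train = rng.normal(size=(500, 6)) * np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

# standardize columns on the train set, then keep the first two components
scaler = StandardScaler().fit(train)
pca = PCA(n_components=2).fit(scaler.transform(train))
projected = pca.transform(scaler.transform(train))
```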

Support Vector Machine Models Setup
The construction of SVM models accounting for specific correlations or combinations of correlations amounts to using subsets of the SOAP feature vectors (computed with librascal, S6 which includes a utility for identifying which features correspond to which correlations) for the model training and evaluation. For the four-class case, we constructed one-vs-rest SVM models.
The SOAP feature data are additionally columnwise centered relative to the train set with a weighted mean (using the same weights as for the balanced, class-specific SVM penalties) before being scaled (divided) by the L2 (Frobenius) norm of the centered data divided by the square root of the number of samples in the train set. For the classical descriptors, the individual columns were standardized relative to the training set. We used SVMs from scikit-learn S7 with a convergence tolerance of 10⁻³. The regularization parameter C (the scale factor for the misclassification cost) was optimized with an order-of-magnitude grid search using a stratified two-fold cross-validation (to ensure each fold contained roughly the same proportion of frameworks from each class) with a search range of 10⁻⁴ to 10⁴. In the multi-class case, all the individual binary classifiers composing the model share the same hyperparameters; they were not optimized individually. The same folds are used in the cross-validation procedure for each model in the ensemble.
Figure S5: Receiver operating characteristic (ROC) curves for the two-class IZA vs. DEEM SVM classification exercise where subsets of the power spectrum or radial spectrum were used for the SVM training and classification. Panel (a) shows the ROC curves for SVM models based on SOAP features with a 3.5Å environment cutoff, and Panel (b) shows the ROC curves for SVM models based on SOAP features with a 6.0Å cutoff. The power spectrum features yield better predictions than the radial spectrum, with Si-O-Si correlations being particularly important for the classification.
8,000 to 17,000 for the 6.0Å models and from 19,000 to 30,000 for the 3.5Å models.
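The hyperparameter search described above can be sketched on toy data as follows (the class sizes and features are synthetic stand-ins, not the IZA/DEEM data):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# toy stand-in for an imbalanced two-class problem (label 1 is the minority)
X = np.vstack([rng.normal(0.0, 1.0, size=(180, 4)),
               rng.normal(3.0, 1.0, size=(20, 4))])
y = np.array([0] * 180 + [1] * 20)

search = GridSearchCV(
    SVC(kernel="linear", tol=1e-3, class_weight="balanced"),
    {"C": np.logspace(-4, 4, 9)},          # order-of-magnitude grid for C
    cv=StratifiedKFold(n_splits=2),        # keeps class proportions per fold
    scoring="balanced_accuracy",
)
search.fit(X, y)
```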

Confusion Matrices
The radial spectrum (two-body) models that include information on O correlations also correctly classify approximately 100 IZA frameworks, but tend to misclassify many more DEEM frameworks than the power spectrum models. As noted in the discussion of Fig. S5, the radial spectrum models that account only for Si atom information show the worst performance, correctly classifying fewer than 100 IZA frameworks and misclassifying more than 60,000 DEEM frameworks.
Figure S6: Confusion matrices from the two-class SVM classification where subsets of the power spectrum or radial spectrum were used for the SVM training and classification. The rows of the confusion matrices correspond to the true class labels, while the columns correspond to the predicted class labels and are denoted by the superscript †. The matrix entries are colored according to the proportion of the true labels that have been predicted as a particular class, while the interior text gives the absolute number of such classifications.
While the SVM sometimes classifies an IZA1 framework as belonging to IZA2 (Si, O, and heteroatoms), rarely does the SVM classify an IZA1 framework as IZA3 (containing O but no Si). Similarly, IZA3 frameworks are often misclassified as IZA2, but are much less often misclassified as IZA1. This suggests that the composition of a framework indeed affects its topology, and that such effects are preserved even upon conversion to an all-silica form.
Figure S7: Confusion matrices from the four-class SVM classification where subsets of the power spectrum or radial spectrum were used for the SVM training and classification. The rows of the confusion matrices correspond to the true class labels, while the columns correspond to the predicted class labels and are denoted by the superscript †. The matrix entries are colored according to the proportion of the true labels that have been predicted as a particular class, while the interior text gives the absolute number of such classifications. While the classifier often correctly classifies the DEEM frameworks as such, it has more difficulty distinguishing between the IZA subcategories.
As a simple check to ascertain whether the SVM classification predictions we observe are made based on genuine structural differences between the IZA and DEEM frameworks, we also performed two- and four-class classification exercises on a random split of only the DEEM frameworks: the DEEM structures were divided randomly into two (or four) artificial classes of approximately the same size, and an SVM was trained on the randomly assigned labels for the DEEM frameworks in the training set for each model in the ablation study. In particular, we take the DEEM frameworks from the train and test sets used to build the SVM and PCovR models and assign classes for these frameworks randomly as either 1 or 2 (in the two-class case) or 1 to 4 (in the four-class case) from a discrete uniform distribution, so that the class populations are approximately equal. We then construct the SVM models as described previously. The resulting confusion matrices for predictions of these randomly assigned labels for DEEM frameworks in the test set are provided in Figs. S8 and S9. For every model in both the two- and four-class cases, the SVM is not able to assign class labels more accurately than a random guess. The inability of the SVM to distinguish between the arbitrarily assigned artificial DEEM classifications suggests that the classification predictions we observe in the IZA vs. DEEM exercises are based on genuine structural differences between the IZA and DEEM frameworks in our dataset.
Figure S8: SVM confusion matrices for a random two-class split of a subset of frameworks from the DEEM database. The SVM cannot distinguish between the arbitrarily assigned classes.
Figure S9: SVM confusion matrices for a random four-class split of a subset of frameworks from the DEEM database. The SVM cannot distinguish between the arbitrarily assigned classes.
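The spirit of this control can be reproduced in a few lines: with features and labels drawn independently at random, a linear SVM should score near the 50% chance level on held-out data (all data here are synthetic stand-ins, not the DEEM frameworks):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 10))       # featurized frameworks (toy stand-in)
y = rng.integers(0, 2, size=400)     # arbitrary labels, no structure to learn

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)          # should hover near the 0.5 chance level
```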

SVM "Brains"
Fig. S10 provides the real-space representation of the SVM decision-making process for the 3.5Å and 6.0Å radial spectrum models; Fig. S10(e) plots the same data as in Fig. 4 of the main text.
Figure S10: Decision traces d(r) for 25 random IZA and 25 random DEEM frameworks and the class-averaged radial density ρ(r) reconstructed from the radial spectrum SOAP features for frameworks in the train set. The plot background is colored according to the value of the SVM weights w(r), and is truncated to better show sign changes: weights falling outside the colorbar limits are colored corresponding to the appropriate end of the colorbar. The SVM decision boundary d(r) = 0 is given by the horizontal dashed line; the SOAP environment cutoff is given by the vertical dashed line. All subplots are labeled with the model cutoff and the particular correlations used to train the corresponding SVM. The labels "Si+O*" and "Si*+O" are used to indicate that the SVM was trained on both Si-Si and Si-O correlations, but the plot shows only the decision traces for the Si-O or Si-Si correlations, respectively. In these cases, the decision traces for the Si-Si correlation plots (labeled "Si*+O", subplots (d) and (h)) do not begin at zero, but rather at the r → ∞ limit of the Si-O correlations (labeled "Si+O*", subplots (c) and (g)), with the r → ∞ limit of the Si-Si correlations providing the final decision function values for the SVM models trained on both Si-Si and Si-O correlations. This arrangement allows one to compare the relative importance of the species-wise correlations in the SVM decision-making process.

PCovR setup
To address the class imbalance in the dataset, we chose to replicate samples from the minority class to achieve approximate class parity rather than use a minority oversampling technique like SMOTE, S8 as the latter would introduce hypothetical samples into the set of IZA frameworks, distorting our premise that the IZA class contains only structures that are known to be experimentally synthesizable. The preprocessing of the feature and target data is performed on the replicated data and in the same way as in the SVM case. It should be noted that the target data in the multi-class case (the multi-class SVM decision functions) are scaled globally (i.e., by the Frobenius norm) rather than columnwise; they are still centered columnwise. The PCovR mixing and regression regularization are optimized using a two-fold cross-validation scheme. The optimal regularization is determined through an order-of-magnitude grid search with a search range of 10⁻¹⁰ to 10⁰, and the optimal mixing is similarly determined through a grid search over the set {0.0, 0.1, 0.2, . . . , 0.9, 1.0}.
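The minority-replication strategy (as opposed to SMOTE's synthetic interpolation) might be sketched as follows; `replicate_minority` is a hypothetical helper, and the toy data stand in for the IZA/DEEM split:

```python
import numpy as np

def replicate_minority(X, y, minority_label, seed=0):
    """Replicate minority-class rows (sampling with replacement) until the
    two classes are the same size -- unlike SMOTE, no synthetic samples are
    created, only verbatim copies of real minority-class rows."""
    rng = np.random.default_rng(seed)
    idx_min = np.flatnonzero(y == minority_label)
    idx_maj = np.flatnonzero(y != minority_label)
    extra = rng.choice(idx_min, size=len(idx_maj) - len(idx_min), replace=True)
    keep = np.concatenate([idx_maj, idx_min, extra])
    return X[keep], y[keep]

X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)     # 8 majority vs 2 minority samples (toy)
Xb, yb = replicate_minority(X, y, minority_label=1)
```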

Candidate Selection
Fig. S11 shows a schematic of the workflow used to identify the synthesis candidates from the database of DEEM frameworks. The resulting hierarchy of candidates is given in Table S4, which shows the five DEEM frameworks in each of the 55–60 Å³/Si, 60–65 Å³/Si, and 65–70 Å³/Si volume categories ordered by their distance to the convex hull along the energy direction.
The database ID of the framework in each volume category that is closest to the hull is given in bold. These highlighted frameworks are shown in Fig. S12.
Figure S11: Schematic of the SVM-PCovR-GCH infrastructure. The GULP energies and SOAP descriptors are computed for each framework, and the SOAP descriptors are used as input to both the SVM and PCovR models. The decision functions resulting from the SVM classification are additionally used as input to the PCovR model, where they are combined with the SOAP features to develop a latent-space projection that serves as the basis for a GCH construction using the GULP energies as a measure of thermodynamic stability. The structures near the convex hull can then be compared against the SVM classification predictions to create a hierarchy of synthesis candidates.
Table S4: Candidate DEEM frameworks having molar volumes between 55–60 Å³/Si, 60–65 Å³/Si, and 65–70 Å³/Si. The five candidates closest to the convex hull in each volume category are listed. For each DEEM candidate, its ID, molar volume V (units Å³/Si), predicted silicon fraction f_DEEM, distance from the convex hull along the energy axis E_hull (units kJ/mol Si), and two-class decision function value d are listed, alongside the IZA code, true Si fraction f_IZA, and true house designation of the three nearest IZA frameworks in SOAP space with corresponding distance d_IZA. IDs listed in bold correspond to those shown in Fig. S12.

Computational tools
The machine learning analyses were performed in Python S10,S11 with generous use of the numpy, S12-S14 scipy, S15 scikit-learn, S7 chemiscope, S9 and ase S16 packages. Plots were generated with matplotlib S17 and plotly, S18 and atomic snapshots were generated with VESTA. S19
Figure S12: DEEM candidates for the three volume groups (panel labels: THO 8158735, SBN 8054476, RHO 8312395) and their nearest-neighbor IZA frameworks in the full 6.0Å power spectrum SOAP space.