Inorganic synthesis-structure maps in zeolites with machine learning and crystallographic distances

Zeolites are inorganic materials known for their diversity of applications, synthesis conditions, and resulting polymorphs. Although their synthesis is controlled both by inorganic and organic synthesis conditions, computational studies of zeolite synthesis have focused mostly on organic template design. In this work, we use a strong distance metric between crystal structures and machine learning (ML) to create inorganic synthesis maps in zeolites. Starting with 253 known zeolites, we show how the continuous distances between frameworks reproduce inorganic synthesis conditions from the literature without using labels such as building units. An unsupervised learning analysis shows that neighboring zeolites according to our metric often share similar inorganic synthesis conditions, even in template-based routes. In combination with ML classifiers, we find synthesis-structure relationships for 14 common inorganic conditions in zeolites, namely Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si, and Zn. By explaining the model predictions, we demonstrate how (dis)similarities towards known structures can be used as features for the synthesis space. Finally, we show how these methods can be used to predict inorganic synthesis conditions for unrealized frameworks in hypothetical databases and interpret the outcomes by extracting local structural patterns from zeolites. In combination with template design, this work can accelerate the exploration of the space of synthesis conditions for zeolites.


Introduction
Zeolites are inorganic porous materials widely recognized for their rich polymorphism and numerous applications. [1][2][3] Their porous structure provides unique opportunities to tailor materials performance in catalysis, gas adsorption, selective membranes, and more. [4][5][6] In principle, the performance of zeolites for each application can be controlled by adequate selection of polymorph and composition. However, this selection is often hindered by the high-dimensional synthesis routes required to produce the materials. 7 Zeolites are often synthesized with hydrothermal treatments, with inorganic and organic precursors cooperating to crystallize the nanoporous structure. 8 Organic templates are known to direct the formation of certain topologies, thus biasing the phase competition landscape to favor the best-matching topology instead of another. 8,9 Because of this effect, design of organic structure-directing agents (OSDAs) has led to multiple successful examples of phase-selective zeolite synthesis and control of catalytic properties, [10][11][12] especially when used in combination with computational methods. [13][14][15][16][17][18][19] On the other hand, computational design of inorganic synthesis conditions for zeolites has not yet achieved the same impact as template design. Despite their promise in controlling active site distribution, 20 phase selectivity, 21 Si/Al ratio, 22 morphology, 23 or lowering the cost of syntheses, 24 selection of inorganic conditions capable of synthesizing existing and novel zeolites is not easily modeled. 25 Recent progress in quantifying the role of inorganic synthesis conditions in zeolites includes: coupling machine learning and literature extraction; 26,27 obtaining structure-synthesis correlations from synthesis routes; 21,28 predicting effects of inorganic cations on heteroatom distributions; 17,20 or using ML to control composition and particle sizes from template-free syntheses. 22 Nevertheless, the reliance of these approaches on reported data prevents them from proposing inorganic conditions for the synthesis of novel or hypothetical frameworks. Whereas some inorganic synthesis-structure relationships can be derived from building units 27,29,30 or alternative structural descriptors, 28 automatically screening for new structures in hypothetical zeolite databases requires bypassing human-crafted labels such as building units. Furthermore, although graph-theoretical methods can detect composite building units (CBUs) in arbitrary structures, their computational cost may be prohibitive when exploring large datasets. Data-driven methods based on the topology of the structure also provide information on key factors that govern the kinetics of zeolite crystallization, 31,32 but do not immediately inform their synthesis conditions. Finally, aggregate framework information such as density-energy plots 33,34 or local interatomic distances 35 provides few correlations between different inorganic synthesis conditions and targeted frameworks, which motivates new data-driven approaches to synthesizability prediction. 36 Thus, advancing towards a priori discovery of novel zeolite frameworks requires developing methods to: (1) uncover synthesis-structure relationships in zeolites; (2) efficiently explore the inorganic synthesis space of zeolites; and (3) bypass the absence of labeled data in hypothetical zeolite databases.
In this work, we correlate inorganic synthesis conditions to zeolite structure using a mathematically strong representation for comparing periodic crystals, the Average Minimum Distance (AMD), 37 derived from the Pointwise Distance Distribution (PDD) 38 of crystal structures (see Fig. 1). The PDD is independent of a unit cell, continuous under small perturbations, theoretically complete for generic crystals, and distinguished all periodic crystals in the Cambridge Structural Database. Importantly, it can be computed much more efficiently than graph-based approaches, only requiring a fast nearest neighbor search. 39 We show that the AMD vectors of zeolites can be used to predict inorganic synthesis conditions and recall a comprehensive dataset of synthesis conditions from the literature. Then, we demonstrate that unsupervised and supervised machine learning (ML) methods can be used to create structure-synthesis relationships independently from OSDA design. Finally, we propose inorganic synthesis conditions to realize hypothetical frameworks based on distances toward structures whose synthesis is known, thus proposing interpretable synthesis-structure models to guide the synthesis of new zeolites.
Figure 1: Computational methods used to extract relationships between zeolite structures and their associated inorganic synthesis conditions. a, Using the concept of AMD and the distance between these invariants, we compute a distance matrix between known zeolites. b, This information is combined with literature data and ML methods to correlate structural patterns with inorganic conditions.

Results and Discussion
Inorganic synthesis maps from unsupervised learning
To address these problems, we created a procedure to compare zeolites and extract synthesis-structure relationships without relying on any structural labels except for the atomic positions (Fig. 1). Using the concept of PDDs and AMDs, we calculated the distance between two zeolite structures by comparing their AMD (see Methods). This comparison between average local environments is computationally efficient but descriptive enough to distinguish all crystals in the CSD. 38 Then, we assumed that zeolites sharing similar local structures exhibit similar inorganic synthesis conditions. To test this hypothesis, we computed the distance matrix between 253 known frameworks in the International Zeolite Association (IZA) database using AMD, and then performed a qualitative analysis of the results. We found that the AMDs correlated weakly with differences of density and with the SOAP distance between structures, but showed almost no correlation with graph-based distances from previous work 42 (Fig. S1). Moreover, we noted that zeolites sharing the lowest distances according to our metric have often been synthesized together (see Table S1 in the Supporting Information). Recovering pairs of structurally similar frameworks such as ITH-ITR, SBS-SBT, or MWF-PAU at low distance already suggests that the similarity metric is qualitatively sound. To generalize this observation, we charted a map of zeolite structures based on their distances. Figure 2 shows the minimum spanning tree created by converting the AMD distance matrix into a graph with weighted edges. Although the tree shows discrete connections and may not be accurate in the presence of outliers, it facilitates a qualitative interpretation of the results and may provide insights about synthesis-structure maps. Even without considering synthesis labels of the data in Fig. 2, known relationships between zeolites emerge naturally from the structural tree map. Zeolites known for their similar building patterns are clustered together in the minimum spanning tree, demonstrating that their AMD values capture the space of zeolites without learnable features.
Examples of such clusters include the ABC-6 zeolites, structures containing lov building units, six-membered ring frameworks (e.g., GIU cluster), Ge- or boron-containing zeolites (e.g., BEC or IRR and SFN-SSF branch, respectively), to name a few (see also distances, thus providing a more quantitative view to the minimum spanning tree of Fig. 2.
Then, to create labels for synthesis conditions, we started with a dataset of extensive synthesis conditions extracted from the zeolite literature from Jensen et al. 43 After augmenting the data with frameworks not typically reported in publications, such as those found as minerals, we analyzed the frequency of occurrence of each synthesis condition for each framework.
Although the initial dataset had information on both organic and inorganic conditions, we disregarded the organic templates when labeling the data, thus assuming that inorganic and organic conditions can, to an extent, be predicted independently of each other. Furthermore, given the scarcity of data for some synthesis conditions, we focused only on the 14 inorganic conditions that have been used to synthesize at least 10 zeolites, namely Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si, and Zn. Finally, we verified whether flat clusters, formed by points within a maximum distance of each other, share the same positive labels. This intuition is quantified by computing the homogeneity between data points given clusters formed by a given distance threshold 44 (see Methods). If all clusters had only positive labels, their homogeneity would be 1, whereas zero homogeneity indicates perfect mixing of positive and negative labels. Figure 3a shows that clusters with at least one positive data point become more homogeneous as the distance threshold decreases. This supports the qualitative view that structures considered similar according to the AMD values also share similar synthesis conditions more often than not. On the other hand, as clusters become larger and increasingly dissimilar structures are grouped together, the homogeneity decreases. Whereas the distributions of labels for some inorganic agents such as Al, Si, Be, F, or Na exhibit higher homogeneity at low distances (see Fig. S7), others such as Co, Mg, or Zn show little predictive power. This suggests that structural distances computed with the AMD have stronger correlations with certain synthesis conditions than with others. Although the lack of true negative data points may also influence the computed homogeneity, these results demonstrate quantitatively that some synthesis-structure relationships can be established in zeolites using AMDs.
Additional investigations of the data explain the patterns in homogeneity obtained above. Groups formed by zeolites such as NAT, EDI, and THO, or AFO, AEL, AHT show that non-silica zeolites also share structural patterns that may be harder to obtain in silica-based structures.
This unsupervised analysis demonstrates that our distance metric relates zeolites with similar inorganic synthesis conditions without any learnable parameters. Although zeolite structures contain several outliers and lack true negative data, the structural patterns still provide a strong prior for exploring the synthesis conditions. In particular, as inorganic synthesis conditions can be inferred by the similarity between crystal structures, they can also help downselect structures for zeolites yet to be realized.

Interpretable classifiers for predicting inorganic synthesis conditions
One disadvantage of the purely unsupervised learning approach is the inefficient utilization of the available labels. Although similarity between crystal structures is a good indicator of common synthesis conditions, the dissimilarity between structures can also provide insights on which structures are less likely to be synthesized with a given composition. To perform this analysis, we use the labeled data to train supervised learning methods that predict the synthesis conditions of a zeolite given its distances to known frameworks. Specifically, we trained logistic regression, random forest, and XGBoost classifiers on literature data to predict each class label individually. However, training models on the literature labels has two caveats: (1) the data is often unbalanced, i.e., the number of positive data points is much smaller than the number of negative data points; and (2) the negative data is not truly negative, as the absence of a literature report does not imply that a zeolite cannot be synthesized under the synthesis conditions being analyzed. To account for these problems, we trained balanced classifiers by subsampling the dataset for each synthesis condition, thus ensuring that training sets had the same proportion of positive and negative data points, but validation/test sets were allowed to have more negative samples than positive ones. In that case, because models were tested on different negative splits, they were prevented from memorizing "negative" data points as truly negative, as exemplified by the case of MEI When evaluated against a held-out test set, the model with the best set of hyperparameters still exhibits high ROC and PR AUCs for a variety of synthesis conditions (Fig. S10). Nevertheless, this set of hyperparameters is far from being the only one that performs well in these conditions (Fig. S11). 
As discussed in the analysis using unsupervised learning, the ability to correctly label zeolites whose synthesis involves Co or Zn is lower than for other labels, as indicated by the worse performance of all classifiers in labeling these conditions. However, some synthesis conditions that were not well-predicted by the unsupervised learning method, such as Mg, can now be predicted using XGBoost models, despite its low recall (

Proposing inorganic synthesis conditions for hypothetical zeolites
Given that structural similarity is correlated to inorganic synthesis in zeolites and that supervised learning methods are able to predict synthesis using only distances as inputs, structures against all known zeolites, we created a distance matrix that is used as input for the unsupervised and supervised learning methods shown in the previous section. As in the case of known zeolites, AMDs are correlated with differences of density, but are not solely determined by them (Fig. S14). Using AMDs, a low-dimensional map can be created for all hypothetical structures, thus providing an intuitive way to visualize the space of structures. Figure S15 shows a 2D projection of the distribution of hypothetical zeolites based on their distance matrix using UMAP. This plot shows that distance features are able to sort the space of zeolites according to energy and density despite not using this information as explicit inputs. The visualization also illustrates that most hypothetical frameworks do As demonstrated in this work, zeolites in the neighborhood of known frameworks are likely to share synthesis conditions with those known structures. Thus, downselecting frameworks for given synthesis conditions can benefit from the unsupervised and supervised methods developed here. This approach can be used in combination with previous "synthesizability metrics" of zeolites, such as local interatomic distances 35 or other data-driven predictions. 46 However, we chose to evaluate them independently, as these synthesizability predictions do not take into account that certain known frameworks may be considered "unfeasible" depending on the synthesis conditions. 34,47 For instance, structures containing three-connected rings, such as those with building units lov or vsv, could be ranked as "unsynthesizable," despite being achieved with beryllium or borogermanate conditions. 
Thus, to propose synthesis conditions for zeolites, we evaluated all hypothetical frameworks for all synthesis conditions using an ensemble of 100 binary classifiers per inorganic condition (see Methods). As each classifier is trained on different negative data splits, the resulting classification varies for each model, allowing us to assess the degree of agreement between the models. By taking the average of the predictions, we obtain the probability of synthesizing the zeolite with that synthesis condition. Fig. S6), structure #308,105 is predicted to be more likely to be synthesized as a silicate than #313,030. Both contain the lta and sod cages characteristic of the LTA zeolite, but differ by the presence of a second cage similar to sod, shown in Fig. 5b. Whereas this new building unit resembles an expanded sod cage with distorted six-membered rings in #308,105, hypothetical framework #313,030 shows a new cage, formed by the merging of two sod cages, not seen in known zeolites. This increased distance towards known structural patterns drives the prediction of feasible synthesis using Si as unlikely, even when the distance towards the LTA zeolite is lower. This example shows how the combination of AMD values and classifier predictions facilitates the exploration of hypothetical zeolites using reference structures.
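The ensemble averaging described at the start of this paragraph can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses tuned classifiers such as XGBoost, whereas the nearest-centroid model below is a hypothetical stand-in chosen so that the example runs with the Python standard library alone. The function names (`fit_centroids`, `predict`, `ensemble_probability`) and the toy feature vectors are assumptions made for illustration.

```python
import random


def fit_centroids(X, y):
    """Hypothetical stand-in classifier: store the mean feature vector of each class."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return 1 if x is closer to the positive centroid than to the negative one."""
    sq = {label: sum((a - b) ** 2 for a, b in zip(x, c))
          for label, c in centroids.items()}
    return 1 if sq[1] < sq[0] else 0


def ensemble_probability(X, y, x_new, n_models=100, seed=0):
    """Average the binary votes of models trained on different negative subsamples.

    ASSUMPTION: there are at least as many negative as positive points.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    votes = 0
    for _ in range(n_models):
        # Each model sees all positives and a fresh, equally sized draw of negatives.
        subset = pos + rng.sample(neg, len(pos))
        model = fit_centroids([X[i] for i in subset], [y[i] for i in subset])
        votes += predict(model, x_new)
    return votes / n_models


# Toy features standing in for distances towards two known frameworks.
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [1.0, 0.0]]
y = [1, 1, 0, 0, 0, 0]
p_similar = ensemble_probability(X, y, [0.15, 0.85])
p_dissimilar = ensemble_probability(X, y, [0.9, 0.1])
```

Because every model votes 0 or 1, the returned value is the fraction of models that agree on a positive label, which is how the averaged prediction can be read as a degree of agreement between models.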
Beyond exploration of the zeolite space, the models also uncover existing and new synthesis-structure relationships. Figure 5c shows three examples of hypothetical frameworks predicted to be synthesized using three different elements: Be, Ge, and K. To obtain these frameworks, we kept only frameworks with densities between 14 and 17 T/1000 Å³ that are predicted to have 100% probability of synthesis with the given element. Then, we ranked the frameworks by their relative energy. Despite not using explicit labels on the CBUs, the supervised learning models recovered the known heuristics of building units and inorganic synthesis conditions. For instance, framework #261,338, predicted to be synthesized in the presence of Be, is formed mostly by lov building units, as found in other Be-zeolites such as RSN, LOV, or NAB. This same framework is predicted to be unlikely as a silicate, possibly following the trends seen in JSR or NPT structures. Hypothetical zeolite #64,550, predicted to be synthesized with germanium, also shows features similar to known ones. In addition to its three-dimensional pore structure, with 12×12×10 intersecting pores, the structure shows the d4r CBU typical of other structurally similar germanosilicates, such as POS or UOV, but with 7 symmetrically inequivalent T sites. Finally, one unrealized framework predicted to be synthesized with potassium is structure #303,768. Although this hypothetical structure does not exhibit typical CBUs, the local structures similar to d8r CBUs are predicted to be favored by K, in analogy with similar relationships in known zeolites. This demonstrates how data-driven models can not only recover known relationships between CBUs and inorganic conditions, but also propose new synthesis-structure relationships in zeolites based on distance patterns towards known structures. 
When used to analyze the entire space of hypothetical frameworks, the models show that the distribution of predicted inorganic synthesis conditions is uneven across the space of zeolites (Fig. S19)

Conclusions
Mapping the space of inorganic conditions in materials synthesis is an outstanding challenge due to the complexity of chemical interactions during synthesis. In the case of zeolites, synthesis conditions are known to affect structural patterns in the materials, but finding correlations between structural patterns and inorganic syntheses often relies on heuristics.
In this work, we used unsupervised and supervised learning methods to propose inorganic synthesis conditions for zeolite synthesis. In particular, we showed how a mathematically strong distance metric between crystals can predict inorganic synthesis conditions in zeolites.
This enables structural comparisons beyond human-crafted labels of building units or pore

Pointwise Distance Distributions and Average Minimum Distances
Any periodic crystal structure is modeled as a periodic set S of atomic centers considered as zero-sized points, with atomic types as optional labels. Any linear basis of vectors $v_1, v_2, v_3$ in 3-dimensional space generates a lattice $\Lambda = \{c_1 v_1 + c_2 v_2 + c_3 v_3 \mid c_i \in \mathbb{Z}\}$ and unit The resulting $m \times (k + 1)$ matrix PDD(S; k) is called the Pointwise Distance Distribution, a statistical distribution of rows with weights describing each point's environment. As an example, Fig. 6 shows the computation for a point in the square lattice S whose first $k = 8$ neighbours have distances 1, 1, 1, 1 (in green) and $\sqrt{2}$,
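To make the construction concrete, the following is a minimal sketch, not the authors' code, that follows the published definitions of the PDD and AMD: each PDD row holds the sorted distances from one motif point to its k nearest neighbours in the periodic set, and the AMD averages each column over the m motif points. Several choices here are assumptions for illustration: uniform row weights 1/m without merging identical rows, a brute-force neighbour search over a finite block of cells (parameter `reps`), and the Chebyshev norm for comparing AMD vectors, since this excerpt does not state which norm was used.

```python
import math
from itertools import product


def pdd_rows(motif, basis, k, reps=3):
    """Sorted distances from each motif point to its k nearest neighbours.

    motif: Cartesian coordinates of the points in one unit cell.
    basis: lattice vectors, one per row.
    ASSUMPTION: `reps` cells in each direction contain the k nearest neighbours.
    """
    dim = len(basis)
    # All lattice translations c_1 v_1 + ... + c_dim v_dim with |c_i| <= reps.
    shifts = [
        tuple(sum(c * basis[i][d] for i, c in enumerate(coeffs)) for d in range(dim))
        for coeffs in product(range(-reps, reps + 1), repeat=dim)
    ]
    rows = []
    for p in motif:
        dists = []
        for q in motif:
            for s in shifts:
                d = math.dist(p, tuple(q[j] + s[j] for j in range(dim)))
                if d > 1e-12:  # skip the point itself
                    dists.append(d)
        dists.sort()
        rows.append(dists[:k])
    return rows


def amd(motif, basis, k, reps=3):
    """Average Minimum Distance vector: column-wise mean of the PDD rows."""
    rows = pdd_rows(motif, basis, k, reps)
    m = len(rows)
    return [sum(row[j] for row in rows) / m for j in range(k)]


def amd_distance(a, b):
    """Chebyshev (L-infinity) distance between two AMD vectors (assumed norm)."""
    return max(abs(x - y) for x, y in zip(a, b))


# Unit square lattice from the example above, and the same lattice scaled by 2.
square = amd([(0.0, 0.0)], [(1.0, 0.0), (0.0, 1.0)], k=8)
scaled = amd([(0.0, 0.0)], [(2.0, 0.0), (0.0, 2.0)], k=8)
```

For the unit square lattice, the first four entries of the AMD vector equal 1 and the next four equal √2, matching the example in the text. A production implementation would replace the brute-force loop with a nearest-neighbour search structure, as noted in the Introduction.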

Zeolite structures data
The dataset of 253 known zeolite structures used in the unsupervised learning method was obtained from the International Zeolite Association (IZA) database. 55 The dataset of hypothetical frameworks used in this work was developed by Pophale et al., 33 and re-optimized using a neural network force field trained on DFT-SCAN data by Erlebach et al. 45 Because not all of the 253 known zeolites used previously were optimized by Erlebach et al., we used their subset of 236 known frameworks when computing distance matrices from the hypothetical frameworks and the known frameworks. In the literature analysis, a zeolite is classified as having a certain synthesis condition when at least 25% of its synthesis recipes exhibit that condition (excluding OSDAs). This label is used as a categorical variable when performing the classification task.
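The 25% labeling rule above can be sketched as follows. This is an illustrative reading of the rule, not the authors' pipeline: the data layout (a dictionary mapping each framework to a list of recipes, each recipe being a set of inorganic components), the function name `label_frameworks`, and the placeholder framework name `FW1` are all hypothetical.

```python
def label_frameworks(recipes, threshold=0.25):
    """Assign a condition to a framework when it appears in at least
    `threshold` of that framework's recipes.

    recipes: dict mapping framework -> list of sets of inorganic components
             (ASSUMED layout; organic templates are taken as already removed).
    """
    labels = {}
    for framework, recipe_list in recipes.items():
        n = len(recipe_list)
        counts = {}
        for recipe in recipe_list:
            for element in set(recipe):
                counts[element] = counts.get(element, 0) + 1
        # Frameworks without recipes receive no positive labels.
        labels[framework] = {
            el for el, c in counts.items() if n > 0 and c / n >= threshold
        }
    return labels


# Toy example with a placeholder framework name.
toy = {
    "FW1": [
        {"Si", "Al", "Na"},
        {"Si", "Al"},
        {"Si", "Al"},
        {"Si", "Ge"},
        {"Si", "Ge"},
    ],
    "FW2": [],
}
toy_labels = label_frameworks(toy)
```

In the toy data, Na appears in one of five recipes (20%) and is therefore not assigned, whereas Ge appears in two of five (40%) and is.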

Unsupervised learning
A minimum spanning tree between zeolites was constructed by first creating a fully connected, undirected graph with weighted edges, where weights correspond to the distances between two structures. The tree was then obtained using NetworkX's (v. 2.5) 56 minimum spanning tree algorithm, which minimizes the total length of the tree.
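As a hedged illustration of this step, the sketch below builds a minimum spanning tree directly from a dense distance matrix with Prim's algorithm. The paper used NetworkX for this purpose; the pure-Python routine here is a stand-in so the example has no dependencies, and the node names and distances are invented toy values.

```python
def minimum_spanning_tree(names, dist):
    """Prim's algorithm on a dense, symmetric distance matrix.

    Returns a list of (node_a, node_b, weight) edges.
    """
    n = len(names)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known connection to the tree
    parent = [None] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the node outside the tree with the cheapest connection.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] is not None:
            edges.append((names[parent[u]], names[u], dist[parent[u]][u]))
        # Update the cheapest connections through the newly added node.
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges


# Toy distance matrix between four placeholder structures.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0.0, 1.0, 4.0, 3.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 1.5],
    [3.0, 5.0, 1.5, 0.0],
]
tree = minimum_spanning_tree(toy_names, toy_dist)
total_length = sum(w for _, _, w in tree)
```

The tree always has one edge fewer than the number of structures, and its total length is the smallest among all spanning trees of the fully connected graph.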
The dendrogram of known zeolites was produced by creating a linkage matrix from the distance matrix using the Ward algorithm as implemented in SciPy (v. 1.10.0). 57 The resulting clusters in Fig. S6 were obtained by forming flat clusters with maximum AMD distance of a given threshold.
The homogeneity of the clustering was computed by calculating the Shannon entropy of flat clusters created with a given threshold, 44 as implemented in scikit-learn (v. 1.2.0). 58 As the literature dataset is not balanced and lacks true negative points, the homogeneity was only computed for clusters containing at least one positive data point. This ensures that a large homogeneity corresponds to recall of positive data points, which prevents biasing this metric in imbalanced datasets.
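The entropy-based homogeneity can be written out in a few lines. The sketch below uses the standard definition h = 1 - H(C|K)/H(C), where C denotes the class labels and K the cluster assignments, which is the quantity scikit-learn reports as its homogeneity score. The restriction to clusters with at least one positive point follows the description above, but the helper name `homogeneity_positive_clusters` and the handling of the empty case are assumptions.

```python
import math
from collections import Counter


def homogeneity(labels, clusters):
    """Homogeneity h = 1 - H(C|K) / H(C) for class labels C and clusters K."""
    n = len(labels)
    class_counts = Counter(labels)
    h_c = -sum(c / n * math.log(c / n) for c in class_counts.values())
    if h_c == 0.0:
        return 1.0  # a single class is trivially homogeneous
    cluster_counts = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    h_c_given_k = -sum(
        c / n * math.log(c / cluster_counts[k]) for (_, k), c in joint.items()
    )
    return 1.0 - h_c_given_k / h_c


def homogeneity_positive_clusters(labels, clusters):
    """Homogeneity restricted to clusters holding at least one positive label.

    ASSUMPTION: returns NaN when no cluster contains a positive point.
    """
    keep = {k for l, k in zip(labels, clusters) if l == 1}
    pairs = [(l, k) for l, k in zip(labels, clusters) if k in keep]
    if not pairs:
        return float("nan")
    kept_labels, kept_clusters = zip(*pairs)
    return homogeneity(kept_labels, kept_clusters)


pure = homogeneity_positive_clusters([1, 1, 0, 0], [0, 0, 1, 1])
mixed = homogeneity_positive_clusters([1, 0, 1, 0], [0, 0, 1, 1])
```

In the first toy case the only cluster with positives contains nothing else, giving a homogeneity of 1. In the second, every cluster is an even mix of positive and negative labels, giving 0.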
Dimensionality reduction was performed using UMAP, 59 as implemented in the umap-learn package in Python (v. 0.5.3). The 2D UMAP plot was produced by comparing hypothetical frameworks using the cosine distance of their normalized distances to IZA structures, and using 10 neighbors as parameter.

Supervised learning
Classification of inorganic synthesis conditions was performed by training separate classifiers for each synthesis condition. The features used during training were the distances towards the 253 known frameworks, as computed with the AMD method described above.
To obtain a statistically meaningful result, only elements used to synthesize at least 10 zeolites were considered. In particular, 14 inorganic conditions are considered: Al, B, Be, Ca, Co, F, Ga, Ge, K, Mg, Na, P, Si, and Zn.
Train-validation-test sets were created starting with a 60-20-20 ratio, respectively, then subsampling the training set to have an equal number of points with positive and negative labels. Although techniques such as reweighting or resampling could have been employed to obtain balanced training sets, removing data points is a simple approach that prevents classifiers from treating negative data as "true negative", resembling positive-unlabeled learning strategies.
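A minimal sketch of this splitting scheme is given below. It assumes a plain random (unstratified) 60-20-20 split, which the text does not specify, and the function name `split_and_balance` and the toy label vector are hypothetical. Only the training indices are subsampled; validation and test sets keep their original class imbalance.

```python
import random


def split_and_balance(labels, seed=0, fractions=(0.6, 0.2, 0.2)):
    """Random train/validation/test split followed by balancing the training set.

    ASSUMPTION: the split is not stratified; the source does not say either way.
    Returns three lists of indices into `labels`.
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_train = round(fractions[0] * len(idx))
    n_val = round(fractions[1] * len(idx))
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    # Drop points from the larger class so both classes are equally represented.
    pos = [i for i in train if labels[i] == 1]
    neg = [i for i in train if labels[i] == 0]
    n_keep = min(len(pos), len(neg))
    balanced_train = rng.sample(pos, n_keep) + rng.sample(neg, n_keep)
    return balanced_train, val, test


# Toy labels: 20 positive and 80 "negative" (unlabeled) frameworks.
toy_y = [1] * 20 + [0] * 80
train_idx, val_idx, test_idx = split_and_balance(toy_y, seed=1)
n_pos = sum(toy_y[i] for i in train_idx)
```

Repeating the call with different seeds yields different negative subsets, which is what exposes each model in an ensemble to a different view of the unlabeled data.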
Hyperparameter optimization of synthesis classifiers was performed using a grid search method over relevant spaces of hyperparameters for logistic regression, random forest, and XGBoost 60 methods. The full range of hyperparameters investigated in this hyperparameter search is shown in Tables S2, S3 and S4, following the notation in the scikit-learn
Figure S13: Zeolite tree map labeled with the Pearson correlation coefficient between AMD and SHAP values per synthesis condition. A negative correlation (shown in green) indicates that smaller AMD distances (i.e., high similarity) lead to higher SHAP values (i.e., higher likelihood of a positive classification). The correlation coefficient is the average correlation of AMD and SHAP values for 100 XGBoost models for each synthesis condition.
Figure S14: Relationship between AMD distances computed between known and hypothetical frameworks, and their density difference. Both known and hypothetical zeolites were optimized with the NNPscan method by Erlebach et al.
Figure S15: Low-dimensional projection of the hypothetical zeolite space using their distance towards known zeolites as features. The red dots indicate zeolites present in the IZA database. The low-dimensional plot was obtained using UMAP (see Methods), and recovers the density and energy of the frameworks simply by comparing them against known structures.
Figure S16: Energy-density plots for zeolites, reproducing the plot from Erlebach et al. The analysis of the data using these two variables shows a high concentration of mid-energy, mid-density zeolites in the hypothetical structures database.