Satyanarayana Bonakala‡^{a}, Michael Aupetit‡^{b}, Halima Bensmail^{b} and Fedwa El-Mellouhi*^{a}
^{a}Qatar Environment and Energy Institute, Hamad Bin Khalifa University, PO Box 34110, Doha, Qatar. E-mail: felmellouhi@hbku.edu.qa
^{b}Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
First published on 2nd February 2024
Data-to-knowledge has started to reveal significant promise in materials science. Still, some classes of materials, such as Metal–Organic Frameworks (MOFs), possess multi-dimensional interrelated physicochemical properties that pose challenges in using data clustering methods. We considered an in-house generated database of MOFs consisting of geometrical (pore size and dimensions), chemical (atomic charge of the framework), and adsorption properties (CO_{2} uptake, heat of adsorption) to evaluate the challenges and limitations of various clustering techniques and propose a solution based on visual clustering. As a starting step, we examined data via principal component analysis (PCA) to understand the interrelationships among a set of dimensions without prior knowledge. This dimensionality reduction method was unsuccessful in visually discovering clusters of MOFs. Then, we tested two combinations of data projection and clustering methods: t-distributed stochastic neighbour embedding (t-SNE) and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) on the original dimension input data (t-SNE//DBSCAN), and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) clustering the 2D embedding data obtained from Uniform Manifold Approximation and Projection (UMAP) (UMAP → HDBSCAN). Both the t-SNE//DBSCAN and UMAP → HDBSCAN pipelines are found to produce overlapping clusters, to lack reproducibility, and to be parameter-sensitive. In contrast, we relied on a Gaussian mixture model (GMM) that uses the eigenvalue decomposition discriminant analysis (EDDA) method. This method is stable and not strongly dependent on the prior definition of the hyperparameters. We propose a novel interactive divide-and-conquer approach that combines GMM-EDDA with a form of linear discriminant analysis to enable visual split or merge decisions for each pair of Gaussian clusters. 
The end-user engages in the clustering process using trustworthy visualization where clusters appear as separated only if they are also well separated in the data space. Further, the identified meta-clusters were characterized using correlation heatmaps and violin plots of their distribution along each data dimension. Our methodology paves the way to address the clustering and data visualization challenges of highly overlapped and correlated databases.
Among the current machine learning techniques, clustering^{11} has the advantage of focusing entirely on the material's feature space. This helps in discovering patterns within data that may be hidden or counter-intuitive to researchers, making it particularly valuable in areas such as materials discovery or design. Recently, Baird et al.^{12} identified potential and chemically unique compositions among the existing inorganic chemical compounds of the Materials Project using a Python tool, Stochastic Clustering Variance Regression (DiSCoVeR). This tool was developed by combining the chemical distance metric Element Mover's Distance (ElMD),^{13} clustering via density-dependent dimensionality reduction to a 2D embedding (Uniform Manifold Approximation and Projection (UMAP)^{14} or t-distributed stochastic neighbor embedding (t-SNE)^{15}), and a regression model.^{16} Focusing on MOFs, Thomas et al.^{17} extracted a quantitative understanding of structure–property relationships of an AB2 MOF data set with the help of data visualisation methods such as UMAP and t-SNE. In addition, Seyed et al.^{18} and Sauradeep et al.^{19} used the t-SNE projection method by considering topological and molecular chemistry features. However, all these studies resulted in overlapping clusters of the MOF and other inorganic materials datasets, without clearly delineated isolated clusters. Visualization embeddings were used as a validation tool to present the clustering results^{17–19} and as part of the clustering process itself.^{12,20} This demonstrates the importance of visualization for domain expert end-users when clustering high-dimensional data and illustrates the difficulty in interpreting clustering validation indices^{21} like Silhouette^{22} or Calinski-Harabasz,^{23} which could be used instead of these visualizations. 
However, as we will show in Section 3, embedding distortions^{24} can strongly impair the quality of embedding-based clustering and the visual interpretation of clustering results.^{25,26} Moreover, humans and computational clustering techniques can strongly diverge even on a simple cluster counting task in two-dimensional scatterplots,^{27} emphasizing the importance of including humans in the clustering process. Hence, we propose a human-in-the-loop visual clustering process that overcomes these issues for extracting the structure–property relationships.
Dimensionality reduction (DR) techniques like t-SNE, UMAP, and Pairwise Controlled Manifold Approximation Projection (PaCMAP)^{28} have shown effective visualization results on various real-world datasets. As already mentioned, the main issue with DR techniques is the loss of information, or embedding distortions,^{24,29} which can cause actual data clusters to appear overlapping in the embedding, as is typical of PCA,^{30,31} while nonlinear neighbor embedding techniques like t-SNE, UMAP or PaCMAP may also split a single cluster into several parts.^{32} As a trustworthy visualization of the cluster structure is essential to support visual clustering by end-users, we propose an approach based on a set of logistic-regression-based linear projections specifically designed to avoid cluster overlap when embedding each pair of pre-computed clusters.
Another challenge in visual clustering is the amount of manual and unguided operations the end-user requires. Typical tools let the end-users explore the data by navigating various embedding spaces, tuning their parameters, and interactively selecting the data.^{33,34} Here, we propose a simpler hybrid process where data are first clustered, and only pairs of clusters are visualized and subject to a simple binary merge-or-split decision by the end-user. The resulting visualization-based decisions are also easier to share with other experts to reach a consensus clustering.
Finally, a combination of high-throughput atomistic simulations and data visualization methodology was developed to find the clusters in the MOF database that possess correlated physicochemical properties. MOF structures were taken from the widely studied Computation-Ready Experimental Metal–Organic Framework (CoreMOF) database, which was refined from the experimentally synthesized structures of the Cambridge Structural Database (CSD).^{35} A top-down approach was followed, and only MOF candidates containing Cu as the single metal were examined, as Cu is abundant, low-cost, non-toxic and, most importantly, has high complexation strength.^{36} Cu-MOFs can be synthesized with commercially available reagents and possess a high surface area.^{37} The excellent stability of Cu-based MOFs is complemented by antimicrobial activity, which makes them suitable for environmental and biomedical applications.^{38}
In this work, we executed a series of data projection methods, PCA, t-SNE, and UMAP, as well as data clustering tools, DBSCAN and HDBSCAN, to understand the clustering mechanism in our MOF dataset. After a careful understanding of the pitfalls of these studies, we propose a hybrid approach, a combination of GMM-EDDA clustering and linear projections, to unravel the existence of the non-linear overlapping clusters. The outline of our workflow of dataset building and the different tools used for data clustering, projection, and visualization are shown in Fig. 1.
$$z = A^{T}x \tag{1}$$
The two PCs with the highest eigenvalues are used to visualize the projected data as a scatterplot.
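As a minimal sketch of eqn (1), assuming the columns of A are the leading eigenvectors of the sample covariance matrix of the centred data (the usual PCA convention), the projection can be written with NumPy as follows. The helper name `pca_project` and the synthetic 5D data are illustrative assumptions, not part of the original workflow.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components (z = A^T x)."""
    Xc = X - X.mean(axis=0)                    # centre each feature
    cov = np.cov(Xc, rowvar=False)             # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]          # re-order to descending
    A = eigvecs[:, order[:n_components]]       # d x n_components projection matrix
    Z = Xc @ A                                 # each row of Z is z = A^T x
    explained = eigvals[order] / eigvals.sum() # fraction of variance per PC
    return Z, A, explained

# Synthetic 5D data with decreasing spread per feature (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])
Z, A, explained = pca_project(X)
```

The two columns of `Z` are the coordinates used for the scatterplot, and `explained` gives the fraction of variance carried by each component.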
Fig. 2 Two pipelines used for clustering and data visualization: (t-SNE//DBSCAN) DBSCAN clustering of the 5D data and independent t-SNE projection of these clusters in 2D space following the process used by Roter et al.^{51} (top); (UMAP → HDBSCAN) UMAP projection of the 5D data in 2D space, then HDBSCAN clustering of the 2D embedded data following the process used by Baird et al.^{12} (bottom).
$$\Sigma_{k} = \lambda_{k} D_{k} A_{k} D_{k}^{t}$$
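The decomposition above can be checked numerically. The sketch below assumes the usual convention for this parameterization, namely that λ_{k} is the volume |Σ_{k}|^{1/d}, D_{k} holds the eigenvectors (orientation), and A_{k} is a diagonal shape matrix with unit determinant. The function name and the example matrix are illustrative.

```python
import numpy as np

def decompose_covariance(Sigma):
    """Split Sigma into volume lam, orientation D and shape A with det(A) = 1."""
    eigvals, D = np.linalg.eigh(Sigma)     # Sigma = D diag(eigvals) D^t
    d = Sigma.shape[0]
    lam = np.prod(eigvals) ** (1.0 / d)    # volume: det(Sigma)^(1/d)
    A = np.diag(eigvals / lam)             # shape: eigenvalues normalised by the volume
    return lam, D, A

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])             # illustrative symmetric positive-definite matrix
lam, D, A = decompose_covariance(Sigma)
Sigma_rebuilt = lam * D @ A @ D.T          # lambda * D A D^t
```

Constraining λ_{k}, A_{k} or D_{k} to be equal across components, or letting them vary, yields the different covariance structures.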
Fourteen possible models are available in MCLUST^{57} and displayed in Fig. 3. A subset is detailed in Table 1.
Fig. 3 The fourteen MCLUST GMM models. For two groups in two dimensions, this graphic displays the typical ellipse of constant density per group for each of the several models proposed by EDDA. Courtesy of Bensmail et al. (1996).^{58}
Model | EM | Distribution | Volume | Shape | Orientation |
---|---|---|---|---|---|
λI | • | Spherical | Equal | Equal | NA |
λ_{k}I | • | Spherical | Variable | Equal | NA |
λDAD^{t} | • | Ellipsoidal | Equal | Equal | Equal |
λ_{k}DAD^{t} | • | Ellipsoidal | Variable | Equal | Equal |
λD_{k}AD^{t}_{k} | • | Ellipsoidal | Equal | Equal | Variable |
λ_{k}D_{k}AD^{t}_{k} | • | Ellipsoidal | Variable | Equal | Variable |
λ_{k}D_{k}A_{k}D^{t}_{k} | • | Ellipsoidal | Variable | Variable | Variable |
The Bayesian information criterion (BIC)^{59} can be used to select automatically the optimal number of clusters and the best covariance structure (EDDA) of their components.
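To make the trade-off behind BIC concrete, the sketch below counts the free parameters implied by the Volume, Shape and Orientation columns of Table 1 and evaluates the criterion. It assumes the "larger is better" sign convention, BIC = 2 log L − p log n; other software uses the opposite sign. The function names are hypothetical, and no log-likelihood from the paper is reproduced here.

```python
import math

def n_free_parameters(K, d, spherical, vol, shape="equal", orient="equal"):
    """Count free GMM parameters for one covariance structure.

    vol, shape and orient take the values 'equal' or 'variable';
    shape and orient are ignored for spherical models.
    """
    n_means, n_weights = K * d, K - 1
    n_vol = K if vol == "variable" else 1
    if spherical:
        return n_means + n_weights + n_vol
    n_shape = (d - 1) * (K if shape == "variable" else 1)
    n_orient = (d * (d - 1) // 2) * (K if orient == "variable" else 1)
    return n_means + n_weights + n_vol + n_shape + n_orient

def bic(log_likelihood, n_params, n_samples):
    """BIC in the 'larger is better' convention (assumed): 2 log L - p log n."""
    return 2.0 * log_likelihood - n_params * math.log(n_samples)

# K = 10 components in d = 5 dimensions, the setting discussed in this work.
p_simplest = n_free_parameters(10, 5, spherical=True, vol="equal")
p_richest = n_free_parameters(10, 5, spherical=False, vol="variable",
                              shape="variable", orient="variable")
```

Under these assumptions the simplest model (λI) has 60 free parameters and the most flexible one (λ_{k}D_{k}A_{k}D^{t}_{k}) has 209, so the latter must improve the log-likelihood substantially to be preferred.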
Unfortunately, both approaches suffer from distortions of the projection technique due to the reduction of the dimensionality.^{24} For instance, PCA does not allow the discovery of clusters with non-convex shapes or non-linear separation structures. Separated clusters in the data can appear to be overlapping and indistinguishable in PCA plots.^{30} In contrast, t-SNE and UMAP are more likely to shatter clusters in separate components.^{32} Thus, the clusters we can see in t-SNE or UMAP embeddings are not trustworthy.^{25,26} Moreover, these embeddings suffer from a lack of reproducibility^{61} due to the stochastic and non-convex nature of their optimization process. Recent work proposes projection techniques^{29} and quality measures,^{26} which help mitigate these issues, but they cannot be entirely avoided. Although DBSCAN and HDBSCAN are deterministic clustering methods, their results strongly depend on the choice of their parameters. For these reasons, the use of t-SNE//DBSCAN^{51} and UMAP → HDBSCAN^{12} for clustering is not reliable for discovering actual clusters in the data space (Fig. 6).
In contrast, GMM directly extracts clusters in the multidimensional data space and gives insight into the local covariance structure of the clusters in that space. It is stable across multiple runs and uses a grounded model selection process to determine the best number of clusters. However, when the clusters do not follow Gaussian distributions, GMM may partition these clusters into several overlapping Gaussian components. Interpreting the GMM result then becomes challenging.
We propose to complement the GMM clustering with a visualization step to support the clustering decision of the analyst. Our approach is summarized in Fig. 4.
Fig. 4 Our methodology follows five steps: (A) clustering of the data with GMM-EDDA in n-dimensional data space; (B) projection of each of the K(K − 1)/2 pairs of clusters independently using logistic regression and PCA forming “LogDA” scatterplots; (C) visual inspection of these LogDA plots without color-coded classes by the end-user to decide about the cluster overlap (the two classes could be merged) or separation (the two classes could remain split); (D) collection of the merge (M) or split (S) decisions into a ClassMat^{62} adjacency matrix; (E) representation of the cluster structure as a node-link diagram. All processes are automatic except the crucial visual decision step (C).
As already mentioned, there is no global projection that would provide a trustworthy overview of the topology of data clusters.^{24,61} However, if we consider each pair of Gaussian clusters found by the GMM, we can apply a local linear projection to visualize the data they represent and decide how they are separated or overlap in the data space. Then, we can reconstruct the overall topology of the clusters by aggregating these pairwise decisions. In contrast to automatic decision methods,^{63} the pairwise visualization allows the analyst to take ownership and responsibility of the clustering process. The decision for each pair of GMM clusters is objective and can be discussed between several analysts to reach a consensus.
Finally, our methodology is as follows: we first cluster the data using GMM-EDDA (Fig. 4A). For each pair of GMM clusters, we treat each cluster of the pair as a distinct class of data, and we apply the logistic regression model^{64} to separate them. This model provides a probabilistic output indicating the probability for a data point to belong to either one of the clusters. We use the decision axis of the logistic regression as the first visualization axis, and the first principal component in the subspace orthogonal to the logistic axis as the second visualization axis. As a result, we get a scatterplot that we call a LogDA plot, where the pair of GMM clusters tends to be maximally separated along the logistic x-axis and maximally spread along the PCA y-axis (Fig. 4B).
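The construction of the two LogDA axes for one pair of clusters can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the logistic regression is fitted with plain gradient descent in NumPy, and the function name `logda_axes`, the learning rate, the iteration count and the synthetic data are all hypothetical.

```python
import numpy as np

def logda_axes(X, y, n_iter=500, lr=0.1):
    """Return (u, v, mu): logistic axis u, orthogonal first PC v, data mean mu.

    X: (n, d) points of one pair of clusters; y: (n,) labels in {0, 1}.
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):                      # gradient descent on the log-loss
        z = np.clip(Xc @ w + b, -30.0, 30.0)     # clip to avoid overflow in exp
        p = 1.0 / (1.0 + np.exp(-z))
        w -= lr * (Xc.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    u = w / np.linalg.norm(w)                    # unit vector of the decision axis
    R = Xc - np.outer(Xc @ u, u)                 # remove the component along u
    cov = R.T @ R / (len(R) - 1)
    _, eigvecs = np.linalg.eigh(cov)             # ascending eigenvalues
    v = eigvecs[:, -1]                           # first PC in the orthogonal subspace
    return u, v, mu

# Two synthetic 5D Gaussian clusters, shifted along the first feature (illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 5)),
               rng.normal(size=(100, 5)) + np.array([8.0, 0, 0, 0, 0])])
y = np.repeat([0.0, 1.0], 100)
u, v, mu = logda_axes(X, y)
x_coord, y_coord = (X - mu) @ u, (X - mu) @ v    # LogDA scatterplot coordinates
```

Because both axes are linear and orthonormal, a gap seen along `x_coord` corresponds to a separating hyperplane in the original feature space.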
We gather all pairs of LogDA plots into a class-wise pairplot called ClassMat^{62} (see Fig. 4C and 7), to give the analyst an overview of the visual clustering process. ClassMat is similar to a pairplot but focused on classes instead of dimensions.
For each LogDA plot, the analyst can decide visually if two classes (GMM clusters) are separated enough to form valid clusters, or if they should be merged to form a single cluster instead. In contrast to other projection techniques like t-SNE or UMAP, if two classes appear separated in a LogDA plot, they must also be separated in the data space: visual class separation is trustworthy. The analyst can trust the class separation in these plots and infer the overall cluster structure in the data space by collecting all pairwise decisions from each LogDA plot (Fig. 4C). The decisions gathered in ClassMat (Fig. 4D) form an adjacency matrix of a graph connecting or not the components of the GMM. This graph is summarized visually as a node-link diagram (Fig. 4E) to understand the overall data cluster structure.
This work employs logistic regression to minimize the probability of overlap between clusters. It is worth mentioning some limitations related to the fact that we consider linear separation between clusters only, so we may merge two clusters if they are not linearly separable enough. However, because we are considering clusters from a Gaussian mixture model, the clusters are Gaussians hence they are naturally convex (ellipsoids), and they become non-convex only when two clusters overlap (one dense Gaussian distribution inside the area of a larger less dense Gaussian distribution for instance).
Further, we applied a dimensionality reduction algorithm to capture significant multi-dimensional data structures. In this regard, we tried the favoured method, principal component analysis (PCA),^{66} that can be fed raw data and is independent of prior data labels. The basic principle of PCA is described in Section 2.2.1. The generated principal components (PCs) are ordered by their decreasing eigenvalues. The top three principal components are used to display the Cu-MOF-10 bar dataset in Fig. 5b.
Three principal components were calculated for the Cu-MOF-10 bar data, explaining the following percentages of the total variance: PC1: 57.09%, PC2: 21.18%, and PC3: 12.14%. To get a clear visualization of the 3D plot, we added the projections onto the XY (PC1 × PC2), YZ (PC2 × PC3), and XZ (PC1 × PC3) planes. None of the projections portrays any separation of the data into clusters. Hence, we extended our study to the non-linear dimensionality reduction methods t-SNE and UMAP in combination with the DBSCAN and HDBSCAN clustering methods^{12,51} (Fig. 2), which are more likely to help discover clusters in the data. The implementation details of the t-SNE//DBSCAN and UMAP → HDBSCAN methods are described in Section 2.2.2. First, our Cu-MOF-10 bar dataset with five features was supplied to the t-SNE and UMAP algorithms for 2D embedding. Then, the data clusters were computed in the original input data space using DBSCAN and in the embedded data using HDBSCAN. The Scikit-learn and Bioinfokit machine-learning Python libraries were used for the t-SNE,^{67} UMAP,^{68} DBSCAN,^{69} and HDBSCAN^{70} implementations. The 2D plots using the spatial coordinates either from the input features or from the embedded data, together with the clustered data of our Cu-MOF-10 bar dataset, are shown in Fig. 6a–h. The left-side and right-side columns show the outcomes of the t-SNE//DBSCAN and UMAP → HDBSCAN procedures in Fig. 6a–d and e–h, respectively. We started with the parameters mentioned in the developer webpage^{12,71} to get the t-SNE and UMAP 2D embeddings, and we used the default parameters for DBSCAN and HDBSCAN given in the off-the-shelf Scikit-learn webpage.^{67,71} These methods were repeated twice to check their reproducibility. Fig. 6a, b and e, f show the resulting data clusters. We observe that all four trials produced different numbers of clusters (colors) with various shapes. This demonstrates that the t-SNE//DBSCAN and UMAP → HDBSCAN methodologies lack reproducibility and stability. 
They can be trapped in the local minima of the embedding process. In addition, we tried to find better parameter values for both the embedding and the clustering techniques, but it led to completely different clusters, as shown in Fig. 6b–d and f–h, demonstrating the sensitivity of these approaches to their parameterization.
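The parameter sensitivity of density-based clustering can be illustrated on a deterministic toy dataset. The sketch below is a minimal NumPy re-implementation of the DBSCAN idea written for illustration only; it is not the Scikit-learn implementation used in this work, and it does not reproduce the stochastic variability of the t-SNE and UMAP embeddings. The grid data and parameter values are assumptions.

```python
import numpy as np

def dbscan(X, eps, min_samples):
    """Minimal DBSCAN; returns integer labels with -1 for outliers."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbours = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # includes i itself
    core = np.array([len(nb) >= min_samples for nb in neighbours])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster                 # start a new cluster from a core point
        stack = [i]
        while stack:
            j = stack.pop()
            if not core[j]:
                continue                    # border points join but do not expand
            for k in neighbours[j]:
                if labels[k] == -1:
                    labels[k] = cluster
                    stack.append(k)
        cluster += 1
    return labels

# Two 5 x 5 unit grids separated by a gap of 10 (synthetic and deterministic).
g = np.array([(x, y) for x in range(5) for y in range(5)], dtype=float)
X = np.vstack([g, g + np.array([14.0, 0.0])])

n_clusters = {eps: len(set(dbscan(X, eps, 3).tolist()) - {-1})
              for eps in (0.5, 1.1, 20.0)}
```

On these data, the same algorithm reports zero, two, or one cluster depending only on `eps`, which mirrors the kind of parameter dependence discussed above.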
Fig. 6 Parameter tuning of existing clustering pipelines from Fig. 2 is challenging. All plots show the same Cu-MOF-10 bar data either clustered with DBSCAN and projected with t-SNE (left column), or projected with UMAP then clustered with HDBSCAN (right column). Projections provide the position of the points and clustering gives their color (black or purple (−1) codes for (H)DBSCAN outliers). We ran t-SNE and DBSCAN (a, b) and UMAP and HDBSCAN (e, f) several times with the Scikit-learn and Bioinfokit default parameters. The stochastic nature of t-SNE and UMAP leads to high variability of the projections; it prevents reproducibility and hinders an objective analysis. When trying to optimize the parameters (c, d) and (g, h), the results remain highly sensitive to the chosen parameters. In (a–d), clustering performed in the data space may be correct but it does not match the cluster patterns generated by the non-trustworthy projections, confusing the analyst. In (e–h), there is a better match between clusters (proximity patterns) and class labels (colors) because clustering occurs in the projection space itself. However, we cannot confirm whether the cluster patterns generated by the projection are trustworthy or not; this time the good visual matching of clusters and classes is misleading. Finally, these two pipelines are sensitive to parameters and not trustworthy due to the distortions of the projection techniques. These observations confirm the same misleading or deceptive results when using such projection and clustering techniques in the domain of single-cell genomics.^{61}
Fig. 7 (a) Cu-MOF-10 bar dataset projected onto the first two principal directions of the GMM-EDDA model showing the decision boundaries with uncertainties. (b) ClassMat^{62} gives an overview of the two-dimensional LogDA plots displayed for each pair of classes obtained from the GMM-EDDA model. Colors code for the GMM clusters. For instance, the LogDA plot with red and pink points at the crossing of column C3 and row C6 shows that classes C3 (red) and C6 (pink) form distinct clusters in that linear projection (c), from which we conclude C3 and C6 are well separated forming distinct clusters in the data space as well. For that reason, the split (S) decision has been marked by the analyst in the cell at the crossing of row C3 and column C6. In contrast, classes C3 and C4 overlap forming a single cluster clearly visible from the monochrome version of the LogDA plot (d), leading to a merge (M) decision marked in the cell at the crossing of row C3 and column C4. The remaining LogDA plots and decisions are given in the ESI.†
At this stage, despite all these different approaches, our Cu-MOF-10 bar data did not seem to form distinct clusters in the five-dimensional feature space.
We obtained ten clusters in the final BIC-optimal GMM-EDDA shown using a linear projection in Fig. 7a. The grey boundaries represent the cluster uncertainties. As it is a linear projection of all the data, we cannot infer the cluster structure from this single plot^{24} where most clusters seem to overlap.
We apply the logistic-regression-based discriminant analysis and PCA to each pair of GMM-EDDA clusters (Fig. 4B) to find the actual separation between the clusters. We arrange all these LogDA scatterplots into a matrix called ClassMat^{62} (Fig. 4C).
The ten GMM-EDDA clusters (Fig. 7a) result in 45 LogDA projections displayed in Fig. 7b as scatter plots arranged in the lower triangular part of the ClassMat matrix. The individual clusters in each LogDA plot are shown in a distinctive color to ease visual analysis (see Fig. S1–S5†). As per Fig. 7b and S1–S5,† the next step is for the end-user to visually decide the clusters' separation in each LogDA plot. In order not to be biased by the color of the GMM clusters in LogDA plots, we plotted them in black-and-white. Two of these plots are shown adjacent to their colored counterpart in Fig. 7c and d (all of them in Fig. S1–S5†).
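The count of 45 plots follows from K(K − 1)/2 with K = 10. A short sketch of the pair enumeration and of the lower-triangular placement, in which the later cluster indexes the row and the earlier one the column (as for row C6 and column C3 in Fig. 7b), is given below; the variable names are illustrative.

```python
from itertools import combinations

K = 10
names = [f"C{i}" for i in range(1, K + 1)]
pairs = list(combinations(names, 2))   # unordered pairs: K(K - 1)/2 of them

# Lower-triangular ClassMat cell (row index, column index) for each pair.
cells = {(b, a): (names.index(b), names.index(a)) for a, b in pairs}
```

The mirrored upper-triangular cells remain free for recording the merge or split decisions.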
All 45 pairwise monochrome LogDA plots were inspected visually to assess the presence of cluster separation leading to a split decision (e.g. the (C3, C6) pair in Fig. 7c) or cluster overlap leading to a merge decision (e.g. the (C3, C4) pair in Fig. 7d). The analyst considered some pairs of clusters to be separated based on a difference in density rather than a wide empty space between them (e.g. C6 and C7 in Fig. S4†). The cluster pairs that are displayed in the lower triangular part of the ClassMat in Fig. 7b were marked in the upper triangular part of ClassMat as ‘M’ or ‘S’ for merge or split decisions, respectively.
The ensemble of merge or split decisions gathered in ClassMat forms the adjacency matrix of a graph connecting the 10 initial GMM clusters (M codes for the presence of a link and S for its absence). The connected components of this graph form the meta-clusters of our data. Further, we computed the node-link representation of this graph shown in Fig. 8 using the Pyvis Python interface.^{72}
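Extracting meta-clusters from the merge decisions amounts to finding the connected components of the graph. A minimal union-find sketch is given below. The function name is hypothetical and the five-cluster decision list is a toy example, not the ClassMat decisions reported in this work.

```python
def meta_clusters(n_clusters, merge_pairs):
    """Connected components of the graph whose edges are the merge (M) decisions."""
    parent = list(range(n_clusters))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving keeps the trees shallow
            i = parent[i]
        return i

    for a, b in merge_pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra                 # union the two components

    groups = {}
    for i in range(n_clusters):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# Toy decisions for five base clusters (indices 0..4); every unlisted pair is a split (S).
merges = [(0, 1), (1, 2), (0, 2)]
components = meta_clusters(5, merges)
```

Here clusters 0, 1 and 2 form one meta-cluster, while clusters 3 and 4 stay isolated because no merge decision links them to any other cluster.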
Fig. 8 The decisions recorded in ClassMat (Fig. 7b) form an adjacency matrix where M stands for a link between two GMM-EDDA components and S for no such link. A node-link diagram is used to represent these adjacency data. It summarizes the visual decisions of the analyst in an easy-to-read network whose connected components form meta-clusters: a large cluster made of GMM-EDDA components C1–C2–C3–C4–C5–C6–C7–C9 and two isolated components C8 and C10.
Finally we discovered three meta-clusters in our data, visible as the three connected components of this graph: C_{A} = (C1, C2, C3, C4, C5, C6, C7, C9) (blue), and the isolated GMM clusters C8 (gold) and C10 (orange). The node-link diagram also suggests the possible existence of some branching structures (C5, C7, C9) and a cycle (C1–C3–C2–C6) within these data that could be investigated further in future work.
The pairwise LogDA projections of the aggregated clusters and the individual clusters are shown in Fig. 9a–c.
Fig. 9 (a–c) LogDA projections of the three pairs of final clusters discovered. Although C8 and C10 are clearly separated from each other, it would have been difficult to distinguish them from the cluster C_{A} but the LogDA plots between C8 or C10 and any of the components of C_{A} show that they form distinct clusters (ESI†).
The LogDA plots of C_{A}-C8 and C_{A}-C10 (Fig. 9a and b) show these pairs are not linearly separated (a black-and-white version of these plots would not let us discover the two clusters). But we know that every pair formed by a component of C_{A} and either C8 or C10 shows well-separated clusters (Fig. 7b and S1–S5†), which indicates C8 and C10 are possibly nested into C_{A} or at least non-linearly separable from it. This is a case similar to the illustrative example (Fig. 4A) where no linear projection can separate the two composite clusters (gold-green) and (red-grass-yellow-brown-blue), but every pair of components taking one component from each composite cluster is linearly separable. Thanks to our divide-and-conquer methodology, we can discover complex clusters using multiple class-pairwise linear projections.
We characterize the resulting aggregated clusters individually by computing the distribution of their data along the five different features (Violin plots in Fig. 10) and computing their correlation coefficient between every pair of features (Heatmaps in Fig. 11).
Visually, all three clusters, C_{A}, C8 and C10, show different violin plot distributions along each feature. The tendencies below are given with respect to the normalized values. The data from the cluster C_{A} are well concentrated along medium MC and QH, and low PS and WS nv, but are not very dependent on UPTAKE. The data from C8 are mostly determined by their concentration on medium QH, and low UPTAKE and WS. Lastly, cluster C10 gathers data with medium MC, high QH, and low PS and WS, with not much dependency on UPTAKE. These characteristics further validate the singularity and distinctiveness of the clusters discovered with our methodology.
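The per-cluster summaries behind violin plots and correlation heatmaps can be computed as in the sketch below. It uses the feature labels quoted above, but the data and the meta-cluster labels are synthetic placeholders, and the helper name `describe_clusters` is hypothetical.

```python
import numpy as np

FEATURES = ["MC", "QH", "PS", "WS", "UPTAKE"]

def describe_clusters(X, labels):
    """Per-cluster feature quartiles and feature-feature Pearson correlations."""
    summary = {}
    for c in np.unique(labels):
        Xc = X[labels == c]
        summary[int(c)] = {
            "quartiles": np.percentile(Xc, [25, 50, 75], axis=0),  # 3 x n_features
            "corr": np.corrcoef(Xc, rowvar=False),                 # n_features x n_features
        }
    return summary

rng = np.random.default_rng(1)
X = rng.random((300, len(FEATURES)))     # synthetic normalised features in [0, 1)
labels = rng.integers(0, 3, size=300)    # synthetic meta-cluster labels
summary = describe_clusters(X, labels)
```

The quartiles summarise where each cluster concentrates along a feature, and each `corr` matrix is the quantity displayed by one heatmap.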
If two clusters appear separated along the LogDA axis, they must be separated in the original data space too. However, the converse may be false: two nonlinearly separated clusters may appear as overlapping in the LogDA plot, leading to a merge decision, for example, two interlocked banana-shaped clusters or a narrow Gaussian cluster nested into a spherical-shell cluster. Still, as we consider clusters from a Gaussian mixture model, which is essentially a density model, we assume that the GMM components either capture nearly Gaussian clusters or cover a continuous region of the data distribution forming a single data cluster; hence they are likely convex (ellipsoids) or at least simply connected (not in multiple parts and with no hole). As a result, in the meta-cluster network representation (Fig. 4E and 8), the absent links are the most trustworthy and can be checked visually in the corresponding LogDA plots, while some clusters (nodes) might need to be split further, and some links might not exist in the actual data. If our assumption is not valid, refined analysis tools like topological data analysis techniques^{62,63} could be used to complement this pipeline.
Other clustering techniques such as DBSCAN, HDBSCAN, or the well-known K-means^{74} could be used instead of GMM. However, DBSCAN or HDBSCAN would not be good candidates as they can form non-convex clusters, which would violate our base assumption for interpreting LogDA plots for split and merge decisions. K-means, instead, would satisfy the convexity assumption as it partitions the data into K Voronoi cells. However, in contrast to K-means, the GMM benefits from a well-grounded statistical framework^{75} equipped with Bayesian criteria to select the number of clusters. Moreover, it has been shown that K-means is a special case of GMM if we use a hard class assignment between the Expectation and Maximization (EM) steps of the GMM optimization process.^{76} As an exercise, though, we ran our methodology using K-means with K set to 10 as found by the optimal GMM. Results are displayed in ESI Fig. S6 and S7.†
The GMM model plays a central role in our approach. The number of parameters to estimate grows linearly with the number of components K (means and prior weights) and quadratically with the dimension n (covariance matrix). Thus, the main technical limitations are the data sample size, as we expect at least a few data values to estimate each parameter, and the data dimension n, as covariance matrices can become numerically unstable. In practice, the Mclust toolbox can handle data with size up to ∼10000 and dimensions up to ∼20. Random subsampling of the data and PCA are usually applied to reduce the data size and dimension to technically manageable values without losing much information. Otherwise, scalable approaches have been proposed for training GMM.^{77,78} We considered a continuous feature space, but ordinal and categorical features can also be handled with variants of GMM^{75} and would also require some adaptation of the LogDA plots in our method. They are left as future work.
Our experiments showed that the t-SNE//DBSCAN and UMAP → HDBSCAN clustering pipelines (Fig. 6) are sensitive to parameter settings. This indicates that developing a workflow to optimize the hyperparameters of the t-SNE, UMAP, DBSCAN, and HDBSCAN methods is worthwhile future work.
Our method involves visual checking of possibly many scatterplots. There are perceptual limitations in terms of the size of the ClassMat visualization to render more than about 20 clusters (20 × 20 matrix of scatterplots) on a standard screen display. Several features could enhance our approach and are left as future work: allowing for interactive exploration of ClassMat using pan and zoom; using ClassMat directly to annotate interactively the split-or-merge decisions; re-ordering rows and columns automatically based on these decisions so meta-clusters appear as blocks along the diagonal as in Fig. 4D; or using indicators of visual quality measures^{31,79–82} or clustering indices^{21} to explore in priority the scatterplots with the most ambiguous cluster patterns.
Clustering indices^{21} could be used to quantify cluster patterns in LogDA plots, and the resulting meta-clusters. However, it is important to note that clustering quality metrics are designed to compare alternative clusterings of the same set of points, so they cannot be used to compare patterns from two different LogDA plots, although research is ongoing in that direction.^{83} Moreover, they cannot output a value if there are fewer than two clusters. As a result, it is not possible to use such a score to decide if two clusters should remain split or be merged into one, as no score is available in the latter case.
Finally, our proposal emphasizes the importance of human-in-the-loop decisions in clustering. Clustering techniques and clustering quality metrics^{21} are both a form of human knowledge embedded in a computational function, except they are predefined and generic, while our approach lets the end-user decide for each pair of clusters based on a wide range of possible visual patterns. We argue that letting the end-user decide by visual analysis of carefully chosen two-dimensional linear projections of cluster pairs can be used to improve the base clustering technique to detect non-linear and more complex cluster topologies, and engage the end-user in the decision. The meta-clusters obtained by aggregating base clusters come as complementary insights of the cluster structure that the base clusters could not reveal alone (Fig. 4E). Moreover, the individual decisions can still be discussed between several experts visualizing the same data (the LogDA plots), to come to a consensus decision if needed. This makes our method a novel form of interpretable clustering^{84,85} instead of a black-box clustering model the end-user must trust blindly. Still, a future study with multiple end-users and various datasets will be run to validate our method more quantitatively.
Regarding complex overlapped MOF data, the use of this methodology allowed us to discover three clusters with non-linear separation in the data space. We could characterize these clusters by their correlation and distribution among the five features. This work demonstrates that MOF materials can have highly correlated properties and cannot be categorized based on knowing only geometrical or electronic descriptors using standard clustering and projection pipelines. Our future aim is to classify the MOF materials based on the topological and electronic descriptors using the current clustering methodology as a starting step. The present method could also be applied to virtually any dataset beyond materials science.
The methodology itself could benefit from using indicators to support the analyst in examining the challenging LogDA plots first, while others with obvious class separation could be decided automatically. Developing a fully interactive tool to record the analyst's decisions and draw the corresponding node-link diagram would also be a plus. Supporting the analysis of branching structures and cycles formed by the resulting graph is also of interest.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3dd00179b
‡ These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2024