Open Access Article
Junhao Cao,ab Nicolas Folastre,ab Gozde Oney,c Edgar Rauch,d Stavros Nicolopoulos,e Partha Pratim Das,e and Arnaud Demortière*abf
aLaboratoire de Réactivité et de Chimie des Solides (LRCS), CNRS UMR 7314, Université de Picardie Jules Verne, Hub de l’Energie, Rue Baudelocque, 80039 Amiens Cedex, France. E-mail: arnaud.demortiere@cnrs.fr
bRéseau sur le Stockage Electrochimique de l’Energie (RS2E), CNRS FR 3459, Hub de l’Energie, Rue Baudelocque, 80039 Amiens Cedex, France
cInstitut de Chimie de la Matière Condensée de Bordeaux (ICMCB), Bordeaux, France
dUniversité Grenoble Alpes, CNRS, Grenoble INP, SIMAP, 38000 Grenoble, France
eNanoMegas Company, Belgium
fALISTORE-European Research Institute, CNRS FR 3104, Hub de l’Energie, Rue Baudelocque, 80039 Amiens Cedex, France
First published on 30th October 2025
This study presents a novel integration of unsupervised learning and decision-making strategies for the advanced analysis of 4D-STEM datasets, with a focus on non-negative matrix factorization (NMF) as the primary clustering method. Our approach introduces a systematic framework to determine the optimal number of components (k) required for robust and interpretable orientation mapping. By leveraging the K-component loss method and Image Quality Assessment (IQA) metrics, we effectively balance reconstruction fidelity and model complexity. Additionally, we highlight the critical role of dataset preprocessing in improving clustering stability and accuracy. Furthermore, our spatial weight matrix analysis provides insights into overlapping regions within the dataset by employing threshold-based visualization, facilitating a detailed understanding of cluster interactions. The results demonstrate the potential of combining NMF with advanced IQA metrics and preprocessing techniques for reliable orientation mapping and structural analysis in 4D-STEM datasets, paving the way for future applications in multi-dimensional material characterization.
Pattern matching strategies based on pixel-to-pixel cross-correlation coefficients between experimental patterns and simulated patterns, generated from known crystallographic structure data in Crystallographic Information File (CIF) format,2a have been extensively employed for the analysis of 4D-STEM datasets. This method, which facilitates the extraction of orientation and phase maps, has been implemented in several software packages, including Astar,2b py4D-STEM,2c and pyXEM.2d The automated crystal orientation mapping (ACOM) procedure determines the orientation of each diffraction pattern, enabling accurate crystallographic analysis of materials. However, electron diffraction patterns are inherently sparse datasets, with fewer than 10% of the pixels containing meaningful signals. Thus, the implementation of data reduction strategies, which convert sparse data into dense representations, can significantly enhance post-processing efficiency for feature extraction, clustering, and reconstruction, as demonstrated in the development of ePattern (see in SI Algorithm SI_6).42
Clustering and data reduction strategies are standard techniques for handling large and high-dimensional datasets. Their objective is to enhance data interpretability while preserving the most relevant information from the original dataset.3,4 For instance, Principal Component Analysis (PCA) is a widely used unsupervised learning technique for dimensionality reduction, transforming data into a new coordinate system to capture most of the variance in fewer dimensions.4–6 While effective in applications like image processing, noise reduction, and data compression, PCA has limitations, including its inability to capture non-linear data structures and the interpretability challenges posed by negative component values.11,12 Furthermore, when combined with clustering algorithms, PCA's results can be sensitive to the user-defined number of clusters, potentially affecting analysis robustness.13
In contrast, Non-negative Matrix Factorization (NMF)10a offers several advantages over PCA in the context of unsupervised learning and dimensionality reduction. Unlike PCA, which allows both positive and negative components, NMF imposes non-negativity constraints on the factorized matrices. This constraint yields a parts-based representation of the data, making NMF highly effective for interpreting and extracting meaningful features in applications such as image processing, text mining, and spectral data analysis. Although both PCA and NMF are linear factorization methods, NMF's non-negativity constraints and additive structure enable it to better approximate the non-linear and non-convex data patterns common in practical applications, making it well suited to tasks where the data is composed of localized, interpretable parts, even if the overall manifold is non-linear. The additive nature of NMF components can capture the underlying patterns more effectively when the data consists of overlapping or additive features. NMF's powerful ability to extract subtle orientation variations has been utilized to enhance the accuracy and reliability of detecting different crystal orientations in 4D-STEM datasets.27
In traditional clustering methods, the determination of the optimal number of clusters is inherently challenging due to several factors.27b The intrinsic complexity of the data can make the natural separations between clusters unclear, especially in the presence of overlapping clusters, noise, or varying density and shape.10b The absence of ground truth in many clustering applications requires reliance on data-driven methods to estimate the optimal number of clusters.28 To tackle these challenges, various methods have been proposed. The elbow method entails plotting the within-cluster sum of squares (WCSS) against the number of clusters to identify a point where adding more clusters yields diminishing returns.10c Silhouette analysis assesses cluster compactness and separation, selecting the number of clusters that maximizes the silhouette score.10d Incorporating domain knowledge can also guide and validate the clustering process, ensuring alignment with practical expectations.10e By integrating these approaches and validating results across multiple criteria, the determination of the optimal number of components in clustering becomes more robust and reliable.
Brute-force or sophisticated methods for determining the optimal number of clusters usually involve running the clustering algorithm multiple times, each with a different number of clusters, and selecting the configuration that yields the most favorable results. These approaches are computationally intensive. To address this issue more effectively, integrating decision-making approaches, such as multi-criteria decision-making techniques, can provide substantial advantages by automating the selection process and enhancing the robustness of the clustering outcomes. Decision-making can be considered as a problem-solving method providing an optimal solution to a specific event.14,15 After analyzing a finite set of alternative solutions, the objective is to categorize these alternatives to establish a priority ranking among them. Generally, the conception of decision-making in unsupervised learning17 is related to extracting significant patterns, features, or underlying information,16 without specific labels, revealing the inherent characteristics or relationships hidden in the raw data.18,19 In the 4D-STEM data clustering process, decision-making involves several considerations specific to the qualities and attributes of electron diffraction pattern datasets, which encompass both crystal orientation and crystallographic phase information.20
An additional significant challenge in 4D-STEM mapping is the overlap of patterns from different crystals.20a,b In 4D-STEM, diffraction patterns are generated as the probe scans crystals that may be adjacent and/or superimposed.2,20 Assigning the correct crystallographic orientation therefore becomes difficult when overlap occurs.25 Overlapping diffraction patterns can exhibit complicated features and superposed spots, and this ambiguity demands accurate interpretation of the orientation.24 Such complexity can introduce errors or uncertainties in determining crystal orientations.26,27 Efficient overlap-detection algorithms are thus required to pinpoint the precise location of each individual diffraction pattern.25a,b
In this study, we develop a clustering approach using Non-negative Matrix Factorization (NMF) to analyze four-dimensional scanning transmission electron microscopy (4D-STEM) datasets for orientation mapping. We introduce an efficient method termed “K-component loss,” which, when combined with Image Quality Assessment (IQA), enables the automatic and effective detection of material characteristics and clustering within large datasets. Our methodology begins with an evaluation phase (level one) to determine initial NMF parameters. Then, we employ a k-metric derived from IQA to ascertain the optimal number of clusters (k) in a subsequent phase (level two). This approach is particularly advantageous for processing overlapping diffraction patterns, as it leverages advanced data analysis techniques to separate overlapping signals, assess the similarity of each component, and accurately extract pertinent features from the dataset. By integrating NMF with IQA, our decision-making method offers a robust framework for the analysis of complex 4D-STEM data, facilitating enhanced material characterization and more precise orientation mapping.
In the latent space, essential features of the original matrix V (of size m × n) are extracted by selecting a number of components k that is significantly smaller than the rank of V (k ≪ min(m, n)). V is factorized into two relatively small matrices, W and H, of dimensions m × k and k × n, respectively.30 Their linear combination generates an approximated matrix V′ = W × H. The W matrix can be interpreted as the feature matrix, whose k columns represent the k most relevant features of the original matrix V.31 H can be interpreted as the coefficient matrix, whose elements are the weights associated with the columns of W. The approximate matrix V′ is obtained by minimizing a loss function.29
Lee and Seung introduced an alternating optimization method for NMF.29 Starting from random non-negative initializations of the matrices W and H, the algorithm iteratively minimizes the loss function ‖V − WH‖ using multiplicative update rules (Alternating Least-Squares (ALS) schemes are a common variant). In each iteration, H is updated while keeping W fixed, followed by updating W with H fixed, ensuring that both matrices remain non-negative throughout the process. This procedure continues until the difference between V and its approximation WH falls below a predefined threshold.30
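As a minimal sketch, this factorization step can be reproduced with scikit-learn's `NMF` (an assumption for illustration; the paper's own implementation is not specified), arranging the 4D-STEM data as a matrix whose columns are flattened diffraction patterns:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy stand-in for a 4D-STEM dataset: an M x N scan of P x P diffraction patterns.
M, N, P = 8, 8, 16
rng = np.random.default_rng(0)
V4d = rng.random((M, N, P, P))

# Flatten to a non-negative 2D matrix V: one column per probe position.
V = V4d.reshape(M * N, P * P).T        # shape (P*P, M*N)

k = 4                                  # number of components, k << min(V.shape)
model = NMF(n_components=k, init="random", max_iter=500, random_state=0)
W = model.fit_transform(V)             # (P*P, k): basis diffraction patterns
H = model.components_                  # (k, M*N): per-position weights

V_approx = W @ H                       # reconstruction V' = W x H
```

`W` then holds the basis diffraction patterns and `H` the per-position weights used throughout the subsequent analysis.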
To ensure dataset integrity following Non-negative Matrix Factorization (NMF), it is essential to assess information loss between the original and factorized matrices. Incorporating L1 regularization,32 commonly utilized in machine learning to enhance model sparsity, can effectively select pertinent features, particularly in high-dimensional datasets.33 This study calculates the difference between the original and factorized matrices, resulting in a K-component loss matrix. We then compute the mean of the absolute values of its elements to quantify information loss.
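The K-component loss described above, the mean absolute difference between the original matrix and its rank-k approximation, can be sketched as follows (scikit-learn's `NMF` and a random stand-in matrix are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(1)
V = rng.random((64, 100))              # stand-in data matrix

def k_component_loss(V, k):
    """Mean absolute element-wise difference between V and its rank-k NMF approximation."""
    model = NMF(n_components=k, init="nndsvda", max_iter=400, random_state=0)
    W = model.fit_transform(V)
    H = model.components_
    return float(np.mean(np.abs(V - W @ H)))

# Scan a range of k values; the flattening of this curve guides the choice of k.
losses = {k: k_component_loss(V, k) for k in range(2, 8)}
```

Plotting `losses` against k reproduces the declining trend discussed below, where the curve flattening marks the preliminary choice of the number of components.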
According to the K-component loss, the declining trend reflects the loss variation between the NMF results and the original dataset, serving as a reference for evaluating dataset quality. As shown in Fig. 2, the curve is flatter after k = 10, indicating that the NMF reaches its performance limit, beyond which further processing offers minimal benefit. Thus, k = 10 is identified as a preliminary choice for the number of components. However, to ensure this selection does not lead to overfitting, a secondary evaluation using Image Quality Assessment (IQA) is conducted. For holistic evaluation, the K-component loss can be integrated with perceptual metrics. This combination ensures optimization aligns not only with pixel-wise accuracy but also with human-interpretable quality, making it particularly effective for applications like image denoising, hyperspectral unmixing, or document topic modeling.
IQA objectively analyzes and quantifies image quality through algorithms that estimate perceptual quality based on various features.34 Its goal is to provide mathematical metrics aligned with human visual perception.35,36 IQA facilitates the evaluation of image quality, performance analysis of image processing algorithms, and supports decision-making for quality enhancement.37 IQA methods are generally categorized into Full-Reference (FR) and No-Reference (NR) approaches.38 FR-IQA, being more established, is commonly used in machine learning for image quality evaluation. It compares a reference (original) image with a target (processed or distorted) image using metrics such as Structural Similarity Index (SSIM), Peak Signal-to-Noise Ratio (PSNR), and Mean Squared Error (MSE), which are widely applied in FR-IQA.36
To quantify the fidelity of, and differences between, each pair of diffraction patterns from the NMF-reconstructed component maps, we computed four full-reference IQA indices to identify the most distinct orientations in the material. The Structural Similarity Index (SSIM) measures perceptual similarity in luminance, contrast and structure; SSIM ∈ [−1, 1], where 1.00 denotes perfect structural agreement, values ≥0.90 are considered excellent, 0.80–0.90 good, and values close to −1 indicate notable dissimilarity. Peak Signal-to-Noise Ratio (PSNR), expressed in decibels (dB), reflects the ratio between the maximum possible pixel intensity and the mean squared error; PSNR ≥ 30 dB generally signifies high-fidelity reconstruction, whereas PSNR < 25 dB suggests significant loss, providing a global view of the contribution of each clustering. Gradient Magnitude Similarity Deviation (GMSD) assesses local gradient (edge) consistency; lower scores indicate better edge preservation, with values ≤0.05 indicating excellent gradient fidelity, 0.05–0.10 good, and >0.10 degraded sharpness or larger differences between signals. The Mean Deviation Similarity Index (MDSI) combines color, luminance and gradient information into a single deviation measure; MDSI ∈ [0, 1], where 0.00 is perfect, values ≤0.05 denote excellent overall similarity, 0.05–0.10 good, and >0.10 poor similarity or total difference.
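SSIM and PSNR are available in `skimage.metrics`; GMSD is simple enough to sketch directly from its definition, while MDSI is more involved and omitted here. The Prewitt gradients and the constant `c` below are illustrative choices, not the paper's exact settings:

```python
import numpy as np
from scipy import ndimage
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def gmsd(ref, dist, c=0.0026):
    """Gradient Magnitude Similarity Deviation (lower = better edge agreement)."""
    gm_ref = np.hypot(ndimage.prewitt(ref, axis=0), ndimage.prewitt(ref, axis=1))
    gm_dst = np.hypot(ndimage.prewitt(dist, axis=0), ndimage.prewitt(dist, axis=1))
    gms = (2 * gm_ref * gm_dst + c) / (gm_ref**2 + gm_dst**2 + c)
    return float(np.std(gms))

rng = np.random.default_rng(2)
ref = rng.random((64, 64))                                     # reference "pattern"
noisy = np.clip(ref + 0.2 * rng.standard_normal(ref.shape), 0, 1)

ssim_same = structural_similarity(ref, ref.copy(), data_range=1.0)
psnr_noisy = peak_signal_noise_ratio(ref, noisy, data_range=1.0)
gmsd_same = gmsd(ref, ref.copy())
gmsd_noisy = gmsd(ref, noisy)
```

As expected, an identical pair scores SSIM ≈ 1 and GMSD ≈ 0, while the noisy pair degrades on all three metrics.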
4D-STEM data are first transformed into a 2D array and subsequently processed using NMF to obtain the W and H matrices. The H matrix (k, M × N) represents the contribution of each basis vector from the W matrix to reconstructing the original matrix V.40 Each row of H corresponds to the weights for specific data points in V, reflecting the extent to which each basis vector contributes to the reconstruction. In this context, H encapsulates how the features represented by W are combined to describe the original data. Here, k denotes the number of clusters within the dataset, with each column indicating the weight (or probability) of a data point (diffraction pattern) belonging to a given cluster. Higher weights signify a greater likelihood of association with a specific cluster.
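Reshaping the H matrix into per-cluster spatial weight maps, and deriving a hard cluster label per probe position from the highest weight, can be sketched as follows (shapes are illustrative):

```python
import numpy as np

M, N, k = 6, 5, 3                      # scan size and number of clusters
rng = np.random.default_rng(3)
H = rng.random((k, M * N))             # weight of each cluster at each probe position

weight_maps = H.reshape(k, M, N)       # one 2D spatial weight map per cluster
labels = weight_maps.argmax(axis=0)    # hard assignment: dominant cluster per pixel
```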
NMF exhibits a strong sensitivity to pertinent features during matrix factorization, effectively capturing overlapping structures within the dataset. In this context, overlaps are represented as secondary weights, while the original clusters correspond to primary weights. Utilizing the H matrix, which encapsulates all weight contributions, we extract the first and second weights and define a threshold to differentiate them. This threshold facilitates the visualization of overlapping regions within the dataset, as shown in Fig. 5.
Three different filtering approaches are compared: (1) raw data: the dataset as acquired from 4D-STEM without any processing. (2) Mean filtering: this method processes the raw data by normalizing the sum of neighboring images using a 3 × 3 kernel sliding across the scan.42 This averaging technique produces a scan of unchanged size, where each Diffraction Pattern (DP) image is the average of neighboring images.42 (3) ePattern algorithm: proposed in our team, this novel algorithm focuses on dimensionality reduction and reconstruction of DP.42 It employs a neural network-like structure consisting of an encoder, which extracts the most relevant features into a latent space, and a decoder, which reconstructs the diffraction patterns from the latent space representation.42 These filtering methods highlight the importance of preprocessing in enhancing the quality and reliability of NMF results, particularly in the context of 4D-STEM data analysis.
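The mean-filtering step (2), averaging each diffraction pattern with its 3 × 3 neighbourhood in scan space, can be sketched with `scipy.ndimage.uniform_filter` applied over the scan axes only (an illustrative implementation, not the authors' exact code):

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(4)
data = rng.random((10, 10, 8, 8))      # (scan_y, scan_x, det_y, det_x)

# 3x3 averaging over the scan axes only; the detector axes are left untouched,
# so each output pattern is the mean of its 3x3 neighbourhood of patterns.
smoothed = uniform_filter(data, size=(3, 3, 1, 1), mode="nearest")
```

The scan size is unchanged, matching the description above: every diffraction pattern is replaced by the average of itself and its neighbours.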
The two proposed methods aim to enhance data quality by eliminating noisy or irrelevant data points from the dataset. Fig. 2a illustrates the Noise Standard Deviation (NSD) values corresponding to each representative pattern extracted using NMF. Among the evaluated datasets, the ePattern dataset demonstrates the lowest NSD value (2.568), indicating that low-variance features have been effectively removed. Fig. 2c further compares the NSD values across various K-component clusters (Cluster1 ∼ Clusterk) for different methods. The intensity of the heatmap corresponds to the magnitude of NSD, with ePattern consistently showing the lowest values (depicted by red and lightest colors). This reduction in noise enables NMF to achieve more efficient factorization and enhances the interpretability of the resulting components.
Fig. 2b visualizes the impact of dataset filtering on convergence and computational efficiency during NMF processing. A comparison between raw data and preprocessed datasets (mean function and ePattern) highlights the advantages of the latter. The ideal factorization result (V′ = W × H) is closer to the original matrix (V), with minimal deviation. Notably, the loss curve for the ePattern dataset exhibits a smooth and consistent downward trend, unlike the raw data and mean function, which show numerous outliers. At the critical point of the steepest gradient change (k = 10), the ePattern curve demonstrates minimal fluctuation, underscoring its stability and robustness in noise handling. This improved convergence behavior facilitates more reliable and accurate NMF performance.
Moreover, by removing irrelevant features, the ePattern dataset enables NMF to produce more interpretable factors. When applied to the ePattern dataset, the resulting components represent distinct and meaningful patterns that are easier to interpret and analyze (Fig. 5). In addition to superior noise reduction, the ePattern dataset enhances the stability of NMF, reduces the risk of overfitting, and prevents the model from capturing artificial patterns originating from noise.45
When applied to image clustering, the quality of the reconstructed images and the accuracy of the decomposition are pivotal in determining the optimal k.22,23 IQA metrics are utilized to evaluate the fidelity and differences in the reconstructed images. The reconstructed data V′ is expressed as V′ = W × H, where each clustering operation corresponds to (Clustering1 = W1 × H1, …, Clusteringk = Wk × Hk).
Fig. 3a demonstrates that increasing k generally reduces reconstruction loss, as a larger number of components can theoretically capture more details of the original data. However, this also introduces the risk of overfitting, where the model begins to capture noise along with the signal. Higher values of k tend to improve IQA metrics, such as SSIM, up to a threshold, after which additional components may not enhance quality and might even degrade it due to overfitting.
The range of interest identified in Fig. 3a suggests that k values between approximately 6 and 14 (centered around k = 10) achieve an optimal balance between underfitting and overfitting. Within this range, reconstruction loss decreases significantly while avoiding overfitting. This range also reflects a trade-off between capturing essential features and minimizing the incorporation of noise.
Fig. 3b–e analyze k based on four IQA algorithms. For instance, Fig. 3e examines PSNR, a metric used to measure the fidelity of reconstructed images by comparing them with the original. Higher PSNR values indicate reduced distortion and noise, signifying that the NMF components have effectively captured the essential features of the original data.34 In image compression and reconstruction contexts, PSNR values above 40 dB are considered excellent, whereas values below 20 dB are deemed unacceptable.34 For NMF, PSNR values higher than 40 dB indicate that the reconstructed images retain a high degree of similarity to the original data, which is crucial for determining the optimal k.
The results suggest that k = 10 (or slightly below this value) achieves an equilibrium between preserving essential features and avoiding noise overfitting. While PSNR provides a global perspective on the fidelity of image reconstruction, other metrics like MDSI, GMSD, and SSIM complement the analysis by focusing on different aspects of image quality.
Fig. 3c presents the results of MDSI, which evaluates global differences between images, including intensity and spatial information.47 MDSI values range from 0 to 1, with lower values indicating greater similarity and higher values greater dissimilarity.47 For NMF clustering, the goal is to maximize the distinctiveness of clusters, ensuring that diffraction patterns within clusters are noticeably distinct. At k = 8, MDSI (value = 0.3062) captures global features effectively while minimizing distortion.
In contrast, Fig. 3b and d focus on GMSD and SSIM, which measure localized and structural differences. GMSD quantifies deviations in gradient magnitudes between reference and reconstructed images, making it suitable for capturing changes in image structure caused by distortions.48,49 Lower GMSD values indicate greater similarity, while higher values highlight increased dissimilarity.50 At k = 8, GMSD achieves an ideal value of 0.1139, signifying effective structural fidelity.
Concurrently, the SSIM provides a robust evaluation of the similarity between two images by assessing their structural information.46 Notably, SSIM is highly sensitive to subtle structural differences, making it an effective tool for detecting slight variations between images. The SSIM index ranges from −1 to 1, where a value of 1 represents perfect structural similarity, and −1 indicates complete dissimilarity.46,51
In the context of clustering optimization, the analysis aims to minimize redundancy among clustering points on a global scale, with the objective of maximizing the sum of distinctly different clustering points.52,53 As illustrated in Fig. 3b and d, based on the ultimate values for GMSD = 0.1139 and SSIM = 0.718, the analysis indicates that k = 8 represents an optimal choice for k, as further validated in Fig. 4.
The integrity of the 4D-STEM dataset, characterized by well-defined and distinct diffraction patterns, is paramount for achieving accurate component separation. Fig. 4 demonstrates the robustness of NMF, validated by quantitative image quality metrics, in extracting and mapping structural features in complex materials such as cathode materials, where lattice parameters of different phases can be very close to each other. The choice of k = 8 was guided by a systematic evaluation of the trade-off between capturing essential structural details and mitigating overfitting. The PSNR values are consistently high (mostly >40 dB), demonstrating that NMF reconstructions retain fine diffraction features with minimal distortion. These high values range from ∼37 dB to ∼57 dB, with most above 40 dB. Such high PSNR values indicate that the NMF reconstructions closely approximate the raw patterns while preserving the high-frequency details that are crucial for identifying weak Bragg reflections. Similarly, the SSIM scores are very close to 1 (0.979–0.993), confirming excellent structural similarity between NMF and raw patterns, which also demonstrates that the structural information in the diffraction patterns, particularly the form and relative intensity distribution of Bragg disks, is well preserved in the NMF outputs, despite the denoising process. Meanwhile, the MDSI and GMSD values remain low across all clusters, indicating negligible perceptual differences. For the MDSI evaluation, the low values (0.095–0.199) further support the conclusion that perceptual differences between raw and reconstructed patterns are minimal. MDSI is sensitive to contrast and luminance changes, and the low deviations here suggest that NMF maintains intensity relationships in the diffraction patterns.
In parallel, in terms of GMSD, with values consistently below 0.03, GMSD confirms that the local gradient structures (edges and sharp intensity transitions in diffraction spots) are highly consistent between raw and reconstructed data. This metric is particularly relevant for diffraction analysis, where preserving the sharpness of Bragg disks is essential for accurate reciprocal space mapping.
The visualization in Fig. 4 encapsulates the outcome of NMF applied to the dataset, where diffraction patterns from various sample regions are color-coded to represent distinct components or orientations. Each region is associated with the most representative diffraction pattern derived from the clustering process, as shown in the bottom row of the figure (labeled 1 through 8). This mapping confirms that NMF differentiates regions based on their structural similarity. The distinct colors and their corresponding diffraction patterns further validate that the selected k = 8 captures the essential crystallographic orientations and phases in the sample.
Moreover, Fig. 4 underscores the critical role of dataset quality in enabling accurate component identification. High-quality diffraction patterns, characterized by sharp and well-defined features, enhance the ability of NMF to discern subtle variations in orientations with local disorientations. The sensitivity of the model to structural and orientation features at k = 8 ensures a precise balance between capturing intricate details and minimizing noise. Consequently, this approach facilitates meaningful and reliable orientation mapping, emphasizing the synergy between advanced computational techniques and high-quality experimental data.
Fig. 4 illustrates results from NMF applied to the filtered dataset (via ePattern), while Fig. SI_7 shows unprocessed raw data. Comparison with the raw dataset reveals the importance of preprocessing: unprocessed patterns suffer from noise, obscuring weak reflections and complicating segmentation. Filtering enhances diffraction spot visibility, reduces background, and improves both interpretability and clustering accuracy.
NMF on the filtered data yields clearer, more consistent reconstructions than on raw inputs. IQA metrics confirm this: PSNR values (∼37–57 dB, mostly >40 dB) show reconstructions approximate raw patterns while suppressing noise, SSIM scores (0.979–0.993) indicate strong preservation of Bragg disk features and low MDSI (0.095–0.199) and GMSD (<0.03) values show minimal perceptual or gradient differences.
Filtered clustering maps display sharper domain boundaries and better phase separation than noisy raw maps, directly improving structural insight. Overall, dataset reduction, through filtering and NMF, balances denoising with structural fidelity, ensuring both visual clarity and quantitative reliability for tasks such as strain mapping, orientation classification, and phase identification. It is thus a prerequisite for extracting robust physical insights from 4D-STEM via unsupervised clustering.
In the context of NMF, V ≈ W × H, and the H matrix encodes the weight information for each cluster. Each element H(i, j) represents the probability that a specific pixel belongs to a given cluster. By reshaping H into k individual weight matrices (H1, H2, …, Hk), each matrix corresponds to a unique cluster and captures its spatial distribution as a 2D representation with dimensions (x, y).
To evaluate cluster overlap, the method systematically compares the maximum weight and the second-highest weight at each pixel location across all clusters. A ratio is computed as second weight/first weight, with thresholding parameters ranging from 75% to 95% to delineate regions where the second-highest weight contributes significantly. This enables the detection of areas where clusters are not well-separated, highlighting potential overlaps.
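This second-to-first weight ratio and its thresholding can be sketched as follows (random weights stand in for a real H matrix):

```python
import numpy as np

rng = np.random.default_rng(5)
k, M, N = 4, 6, 6
H = rng.random((k, M * N))
weights = H.reshape(k, M, N)           # per-cluster spatial weight maps

# Sort weights at each pixel to get the largest and second-largest contributions.
sorted_w = np.sort(weights, axis=0)
first = sorted_w[-1]                   # maximum weight per pixel
second = sorted_w[-2]                  # second-highest weight per pixel

ratio = second / first                 # close to 1 => two clusters compete
overlap_75 = ratio >= 0.75             # boolean overlap mask at the 75% threshold
```

Repeating the last line with thresholds of 0.80, 0.85, 0.90 and 0.95 yields the nested overlap masks used for the color-coded visualization.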
Fig. 5a illustrates the structure of H, represented as a matrix with dimensions (k, x × y), where k is the number of clusters and x × y represents the flattened spatial dimensions of the dataset. For each spatial position (x, y), a corresponding weight vector in H indicates the likelihood of that position belonging to each cluster. For instance, if H(1, 1) = 0.5, it signifies that the first diffraction pattern has a 50% likelihood of belonging to the first cluster, while H(k, 1) reflects the probability of the same diffraction pattern belonging to the k-th cluster.
Following the reshaping of H into individual cluster weight matrices, Fig. 5b displays these matrices (Cluster1, Cluster2, …, Clusterk), each showing weights specific to a single cluster. Threshold values of 75%, 80%, 85%, 90%, and 95% are applied to identify regions of significant overlap. The corresponding spatial regions are then visualized using color-coded overlays to represent varying degrees of overlap.
In Fig. 5c, the resulting map highlights regions of cluster overlap based on the second-weight thresholding. Different colors denote the degree of overlap, with red (75%), orange (80%), yellow (85%), and blue (95%) representing increasing thresholds. This visualization clearly delineates areas where the second-highest cluster weight plays a significant role, providing critical insights into the spatial complexity and potential interactions between clusters within the dataset. This section highlights the application of NMF to decompose 4D-STEM data, resolve cluster overlaps using second-to-maximum weight ratios, and visualize spatial interactions through color-coded maps.
As shown in Fig. SI_8, the results in Fig. 5 highlight that pre-processing, through denoising pre-treatment, is an indispensable step in the workflow. It suppresses noise, preserves meaningful secondary contributions, and enables the decomposition of overlapping diffraction patterns into distinct components. This treatment transforms ambiguous boundary regions into valuable sources of information, thereby allowing a more robust and physically meaningful analysis of structural complexity.
For instance, if H(1,1) > H(i,1) (i = 2, 3, 4, …, k), this indicates that the first diffraction pattern, located at position (1, 1) in the original dataset, belongs to the first cluster. Using this approach, all diffraction patterns associated with a given cluster can be identified and subsequently organized into a 3D array, where each layer corresponds to an individual diffraction image. For example, if there are N diffraction patterns of dimensions 512 × 512 in the first cluster, the resulting array will have dimensions 512 × 512 × N.
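Collecting all diffraction patterns assigned to one cluster into such a 3D array can be sketched as follows (toy shapes; `patterns` stands in for the original dataset):

```python
import numpy as np

rng = np.random.default_rng(8)
k, n_pos, P = 3, 20, 16
H = rng.random((k, n_pos))             # cluster weights per diffraction pattern
patterns = rng.random((n_pos, P, P))   # all diffraction patterns of the scan

labels = H.argmax(axis=0)              # dominant cluster of each pattern
cluster0 = patterns[labels == 0]       # patterns assigned to the first cluster
stack = np.moveaxis(cluster0, 0, -1)   # (P, P, N0): one layer per diffraction image
```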
To analyze these diffraction patterns further, the mean pixel intensity can be computed at each position (i, j) across all images in the cluster. This involves averaging the pixel values at position (i, j) across all N diffraction patterns. Mathematically, the mean intensity at position (i, j) is given by mean(i, j) = (1/N) Σn=1…N In(i, j), where In denotes the n-th diffraction pattern in the cluster.
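This averaging is a single reduction over the cluster's 3D array of stacked patterns (toy shapes for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
cluster_stack = rng.random((32, 32, 5))   # N = 5 diffraction patterns of one cluster

# Mean intensity at each detector position (i, j), averaged over the N patterns.
mean_pattern = cluster_stack.mean(axis=2)
```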
Similarly, in the NMF results, the element with the maximum weight in each column of H, H(k, j), is identified. This maximum weight is then used to scale its corresponding basis column W(i, k), and the resulting products reconstruct the diffraction pattern for the current cluster. Repeating this for all diffraction patterns within the cluster yields a new diffraction pattern that encapsulates the characteristic information of that cluster.
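This dominant-component reconstruction can be written compactly with numpy (a sketch under our reading of the procedure: each pattern j is rebuilt from its winning component only, as W[:, k*] · H[k*, j] with k* = argmax over the column; names are illustrative):

```python
import numpy as np

def dominant_reconstruction(W, H):
    """Rank-one reconstruction of each pattern from its dominant
    component only: column j becomes W[:, k*] * H[k*, j], where
    k* = argmax_k H[k, j].

    W : (n_pixels, k) basis patterns; H : (k, n_patterns) weights.
    """
    k_star = np.argmax(H, axis=0)                         # dominant component per column
    return W[:, k_star] * H[k_star, np.arange(H.shape[1])]

# toy example: 2-pixel patterns, 2 components, 2 patterns
W = np.array([[1.0, 0.0],
              [0.0, 2.0]])
H = np.array([[3.0, 0.5],
              [1.0, 4.0]])
R = dominant_reconstruction(W, H)
print(R)   # column 0 uses component 0 (x3), column 1 uses component 1 (x4)
```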
Returning to the original dataset enables a comparative analysis between the initial orientations and the NMF results (see in SI). This comparison not only validates the proposed method but also reinforces its effectiveness (Fig. 4). Furthermore, it contributes to a deeper understanding of material characterization within the framework of clustering analysis.
The integration of unsupervised multi-clustering strategies is pivotal in this context, as it facilitates a nuanced understanding of overlapping cluster structures inherent in 4D-STEM datasets. By analyzing spatial weight matrices and applying threshold-based visualization techniques, this study identified regions with significant overlap, thus enabling the identification of interaction zones and structural patterns within the data. These insights provide a more granular perspective of cluster distributions and inter-cluster relationships, which are crucial for refining decision-making processes in NMF-based analysis pipelines.
Moreover, this study underscores the importance of data preprocessing in enhancing the robustness and interpretability of unsupervised clustering results. Three data treatments (raw data, the mean function, and ePattern) were evaluated, with the ePattern method yielding the most consistent and reliable outcomes by significantly reducing noise (lower NSD values) and removing low-variance features. This demonstrates that high-quality datasets not only improve the stability of NMF results but also enable more effective multi-clustering strategies by focusing on meaningful data patterns.
Decision-making strategies in this study were further strengthened by employing IQA metrics as quantitative tools to guide the determination of k. The metrics reveal that while higher k values initially improve reconstruction accuracy, there is a threshold beyond which additional components contribute negligible quality improvements and risk overfitting. This informed decision-making approach ensures that NMF-derived results remain both computationally efficient and scientifically interpretable.
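The decision rule described here (stop increasing k once the quality gain per added component falls below a tolerance) can be sketched as follows. The function and the PSNR values are hypothetical illustrations, not figures from the paper; in practice the dictionary would be filled by running NMF at each k and scoring the reconstruction with an IQA metric such as PSNR or SSIM:

```python
def select_k(psnr_by_k, tol=0.5):
    """Pick the smallest k after which the IQA metric (here PSNR, in dB)
    improves by less than `tol` per added component: past that point,
    extra components mainly fit noise and risk overfitting.

    psnr_by_k : dict {k: PSNR of the rank-k NMF reconstruction}.
    """
    ks = sorted(psnr_by_k)
    for k_prev, k_next in zip(ks, ks[1:]):
        if psnr_by_k[k_next] - psnr_by_k[k_prev] < tol:
            return k_prev
    return ks[-1]

# hypothetical metric values: quality saturates after k = 4
psnr = {2: 24.1, 3: 27.8, 4: 30.2, 5: 30.4, 6: 30.5}
print(select_k(psnr))   # -> 4
```

The same plateau test applies to similarity metrics such as SSIM; for distance-like metrics (MDSI, GMSD), the sign of the improvement check is reversed.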
In conclusion, our study highlights a comprehensive framework that combines dataset preprocessing, unsupervised multi-clustering, and decision-making strategies to optimize NMF-based analysis of 4D-STEM datasets. By addressing overlapping cluster structures and leveraging data quality enhancements, this methodology not only improves the robustness and reliability of factorization results but also provides actionable insights into complex structural properties of cathode crystals in the 4D-STEM data. These findings establish a foundational approach for future research leveraging NMF in complex, multi-dimensional datasets and reinforce the significance of systematic preprocessing and decision-making frameworks in achieving reliable and interpretable outcomes.
Code availability: The ePattern_Clustering is available for free download at https://doi.org/10.5281/zenodo.17214464.
Supplementary information (SI): provides detailed descriptions of the Non-Negative Matrix Factorization (NMF) algorithm, the dataset processing workflow from 4D to 2D, and the calculation methods for the Noise Standard Deviation (NSD), Peak Signal-to-Noise Ratio (PSNR), Mean Deviation Similarity Index (MDSI), Gradient Magnitude Similarity Deviation (GMSD), and Structural Similarity Index (SSIM). It also includes the global scheme of the NMF-clustering analysis and the implementation details of the ePattern algorithm. See DOI: https://doi.org/10.1039/d5dd00071h.
This journal is © The Royal Society of Chemistry 2025