Open Access Article

J. Vrábel,*ab E. Képeš,ab P. Nedělník,a A. Záděra,c P. Pořízka*ab and J. Kaiser ab
aCEITEC, Brno University of Technology, Purkyňova 123, 612 00 Brno, Czech Republic. E-mail: pavel.porizka@ceitec.vutbr.cz; jakub.vrabel@ceitec.vutbr.cz
bInstitute of Physical Engineering, Brno University of Technology, Technická 2, 61669 Brno, Czech Republic
cInstitute of Manufacturing Technology, Brno University of Technology, Technická 2, 61669 Brno, Czech Republic
First published on 22nd May 2025
The ability to measure similarity between high-dimensional spectra is crucial for numerous data processing tasks in spectroscopy. Many popular machine learning algorithms depend on, or directly implement, a form of similarity or distance metric. Despite its profound influence on algorithm performance and sensitivity to signal fluctuations, the selection of an appropriate metric often remains neglected within the spectroscopic community. This work aims to shed light on the metric selection process in Laser-Induced Breakdown Spectroscopy (LIBS) and to study the consequences for data analysis and analytical performance in selected applications. We studied six relevant distance metrics: Euclidean, Manhattan, cosine, Siamese, fractional, and mutual information. We assessed their response to changes in sample composition, additive noise, and signal intensity. Our results reveal specific vulnerabilities of commonly used metrics, such as the Euclidean metric's high sensitivity to additive noise and the cosine metric's sensitivity to spectral shifts. The Siamese metric stood out in the majority of the studied cases and outperformed the others in a direct comparison within the spectra classification task. This work provides basic guidelines for selecting metrics in various contexts. The methodology is general and can be directly extended to other spectroscopic techniques with comparable data properties.
The rapid advancement in instrumentation and growing demands for the analytical capabilities of spectroscopic techniques necessitated the broad adoption of machine learning (ML) techniques,13,14 especially artificial neural networks (ANNs). This is particularly evident in LIBS, where large datasets (∼millions of measurements) containing spectra with a strongly non-linear signal response are common. Examples of prominent ML techniques successfully implemented in LIBS include PCA,15,16 SOM,9,17 SVM,18,19 ANNs,20–23 CNNs,24,25 SIMCA,26 ICA,27 and PLS-DA.28,29 An equally substantial focus on ML is present in complementary spectroscopic techniques; e.g., for Raman, some impactful studies are reported in ref. 30–32 and for IR spectroscopies, ref. 33–35.
Generally, a considerable portion of ML models use a form of similarity‡ computation. In supervised learning, we may need to compute the distance between unknown spectra and labeled representatives to determine class correspondence. In unsupervised learning, for example, a reconstruction error can be considered (in autoencoders36 or RBM37,38). Prior to computing the distance (or more generally, the similarity), a metric must be selected.39 It is crucial to recognize that no single distance metric is universally optimal for all types of data or analysis objectives. Despite the widespread use of Euclidean distance in spectroscopic applications, it often proves inadequate for high-dimensional, sparse datasets, where alternative metrics may better capture domain-specific structures. Selecting the right metric can markedly influence a model's behavior and lead to substantial performance gains, underscoring the need to tailor distance measures to the characteristics of each problem.
The authors of ref. 40 proved that for high-dimensional spaces and arbitrarily distributed data, the concept of proximity between two points becomes less meaningful (when using traditional distance metrics such as Euclidean or Manhattan). This phenomenon is one of the aspects of the curse of dimensionality (COD), which indicates that high-dimensional spaces are inherently sparse.§ As a result, the contrast diminishes because the ratio of distances between the nearest and farthest points to a given reference tends to approach one. In spectroscopy, the effects of the COD are further amplified by the feature sparsity of the data, where only a fraction of the whole spectra usually contains unique information.38,41
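The shrinking distance contrast described above is easy to reproduce numerically. The following minimal sketch (NumPy; the data and function name are ours, chosen purely for illustration) estimates the relative contrast (d_max − d_min)/d_min of Euclidean distances from a reference point to uniformly distributed data as the dimensionality grows:

```python
import numpy as np

def relative_contrast(dim, n_points=1000, seed=0):
    """Relative contrast (d_max - d_min) / d_min of Euclidean distances
    from a random reference point to n_points uniform points in [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    ref = rng.random(dim)
    pts = rng.random((n_points, dim))
    d = np.linalg.norm(pts - ref, axis=1)
    return (d.max() - d.min()) / d.min()

# The contrast diminishes as dimensionality grows (curse of dimensionality).
for dim in (2, 10, 100, 10_000):
    print(dim, relative_contrast(dim))
```

For such arbitrarily distributed data, the ratio of the nearest to the farthest distance approaches one in high dimensions, which is the effect reported in ref. 40.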
Furthermore, from a spectroscopic perspective, the concept of similarity between two distinct spectra is poorly defined or entirely absent in the literature. It is worth questioning whether a slight change in total spectral intensity or the removal of a single spectral line makes two spectra more dissimilar. The answer to such questions depends on the specific task being addressed, which motivated us to study the behavior of selected similarity metrics across various case scenarios that allowed us to isolate individual effects on the metrics. Ultimately, we introduce a novel similarity metric (in the context of LIBS), based on Siamese networks, that outperforms other metrics in the majority of studied tasks.
We use LIBS as a representative spectroscopic technique due to its ability to measure large datasets in a very short time and the rich information contained in its spectra. However, the presented methodology can be generalized to other spectroscopic techniques (e.g., Raman and FTIR), provided that the data exhibit the relevant properties studied in our prior work,13 such as high dimensionality, sparsity, and redundancy. The range of applicability is not strictly defined, but as a rule of thumb, we consider spectra with dimensions larger than about 1000 channels to be sufficiently high-dimensional. This heuristic is based on our preliminary experiments, which show consistent behavior at 1024 channels, a common configuration in single-channel Czerny–Turner spectrometers. Determining an exact threshold is beyond the scope of this work. In this study, we focus on broadband echelle spectra with dimensions far exceeding the threshold.
The number of spectroscopically relevant features (spectral lines) is just a fraction of the total wavelength variables, which represents the sparsity. The redundancy property has two forms: value redundancy, where a selected line is represented by many mutually correlated wavelengths (variables), and line redundancy, which refers to the possibility of multiple spectral lines representing the same physical property (e.g., the presence of a chemical element).
Some of the most common metrics in spectroscopy include the Euclidean distance, Manhattan distance, Spectral Angle Mapper (SAM, equivalent to cosine similarity), Mahalanobis distance (MD), and the information-theoretic Spectral Information Divergence (SID). In vis-NIR spectroscopy, the work reported in ref. 45 compares Euclidean, MD, SAM, SID, and principal component (PC)-based alternatives for soil spectra, with the aim of relating the distance response to sample composition. This study found MD to be the least effective, while PC-based methods outperformed the rest. More recently, a similar study46 (also in vis-NIR) utilized Euclidean, MD, SAM, and PC-based alternatives, with Euclidean, SAM, and PC-MD selected as near-optimal. However, these results are not directly transferable to LIBS due to the substantially different nature of the studied NIR spectra, which contained only a fraction of the spectroscopic features/lines found in LIBS spectra (with hundreds of spectral lines). Furthermore, a PC-based transformation preserves the Euclidean distances and angles of the original spectral space unless too many higher-order components are omitted. For LIBS spectra, keeping just 10 principal components usually captures ∼99% of the dataset's variance.15 Consequently, the Euclidean distance is largely unaffected by such a transformation.
A comprehensive review of the similarity metrics relevant to hyperspectral imaging was done in ref. 47. This included Euclidean, Manhattan, fractional, cosine, and several more exotic metrics with limited practical use cases. Similar to our approach, they used both synthetic data (consisting of Gaussians) and real, measured reflectance spectra of pigment patches. Despite the number of studied details and effects (e.g., peak translation and peak intensity change), the study is inconclusive for LIBS data due to the considerably lower complexity of the utilized spectra and the missing quantitative comparisons.
Siamese networks were recently used in mass spectrometry,48 but traditional metrics such as cosine similarity (referred to by a different term in the original paper) were shown to outperform them. We extend the Siamese network architecture by using the triplet loss and demonstrate that it can significantly outperform all standard metrics in most of the studied scenarios. Unlike previous work, we directly compare selected metrics in a classification task. Furthermore, we introduce a novel LIBS distance dataset specifically designed to study metric sensitivity to changes in sample composition, marking a unique contribution to the field.
(a) d(x, y) = 0 if and only if x = y.
(b) Symmetry d(x, y) = d(y, x).
(c) Triangle inequality d(x, y) + d(y, z) ≥ d(x, z) (in certain cases, a more general condition, such as the Schwarz inequality, needs to be used37).
On any set X containing at least two elements, we can define an arbitrary number of distance functions. Therefore, it is essential to specify which metric is used when discussing the distance between two points. Examples of metrics defined on Rm, the m-dimensional real space, include:
(a) Minkowski metric is a broad class of metrics defined as:
$$ d_p(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^p \right)^{1/p} \qquad (1) $$
(b) Manhattan metric is a special case of eqn (1), where p = 1:
$$ d_1(x, y) = \sum_{i=1}^{m} |x_i - y_i| \qquad (2) $$
(c) Euclidean metric is a special case of eqn (1), where p = 2:
$$ d_2(x, y) = \left( \sum_{i=1}^{m} (x_i - y_i)^2 \right)^{1/2} \qquad (3) $$
The Euclidean metric represents the natural notion of distance, corresponding to the length of the straight line connecting two points; this matches everyday intuition, as the space we inhabit is, in the classical limit, 3-dimensional Euclidean. An important property of the Euclidean distance is that it is invariant to rotations of the m-dimensional space.
In many applications, it is advantageous to relax one or more of the metric conditions (e.g., the triangle inequality) and utilize a pseudo-metric. Examples of pseudo-metrics are fractional metrics (i.e., Minkowski with p ∈ (0,1)) or the cosine similarity.
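For concreteness, the Minkowski family of eqn (1)–(3), including its fractional pseudo-metric variant, can be sketched in a few lines of NumPy (a minimal illustration on toy vectors of our own choosing, not code from the original study):

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p (eqn (1)); p = 1 gives Manhattan,
    p = 2 Euclidean, and 0 < p < 1 the fractional pseudo-metric."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** p) ** (1.0 / p))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

print(minkowski(x, y, 1))    # Manhattan: |3| + |4| + |0| = 7
print(minkowski(x, y, 2))    # Euclidean: sqrt(9 + 16) = 5
print(minkowski(x, y, 0.5))  # fractional: (sqrt(3) + sqrt(4))^2 ≈ 13.93
```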
(d) Cosine similarity is a pseudo-metric that relies on the dot product of two vectors. When considering a spectrum as a point in m-dimensional space, connecting this point to the origin yields a vector. The normalized dot product of two such vectors is
$$ d_{\mathrm{CS}}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{m} x_i y_i}{\sqrt{\sum_{i=1}^{m} x_i^2} \sqrt{\sum_{i=1}^{m} y_i^2}} \qquad (4) $$
Note that this is equivalent to the cosine of the angle between the two vectors; therefore, dCS(x, y) = cos θ. A complementary quantity, the cosine distance, is often defined as 1 − dCS(x, y).
In the spectroscopic literature, several metrics equivalent to cosine similarity (such as SAM, normalized correlation, etc.) are commonly used, with no qualitative difference in performance.45,46 A distinct property of the cosine similarity is its invariance to a total spectral intensity change. This is exceptionally useful for dealing with laser energy fluctuations in LIBS.
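The intensity invariance of the cosine similarity (eqn (4)) is easy to verify numerically. The following minimal sketch (toy vectors of our own choosing) shows that rescaling a spectrum, e.g., to mimic a laser-energy fluctuation, leaves the similarity unchanged:

```python
import numpy as np

def cosine_similarity(x, y):
    """Normalized dot product (eqn (4)): the cosine of the angle between x and y."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

spectrum = np.array([0.0, 5.0, 1.0, 0.2, 3.0])
scaled = 1.7 * spectrum  # a total intensity change, e.g., laser-energy fluctuation
other = np.array([0.1, 4.0, 0.5, 0.0, 2.0])

# The similarity is invariant to the scaling of either argument.
print(cosine_similarity(spectrum, other))
print(cosine_similarity(scaled, other))
print(1 - cosine_similarity(spectrum, other))  # the complementary cosine distance
```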
(e) Mutual information (MI) quantifies the amount of information that one distribution provides about another.49 The formal definition of MI for discrete random variables is
$$ I(X; Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (5) $$
MI is closely related to the Shannon entropy. While the entropy quantifies the uncertainty within a single variable, MI measures how the entropy of one variable is reduced by knowing the other variable. If the variables are independent, their shared information content is zero. In contrast, if they are highly dependent, knowing one variable significantly reduces the uncertainty about the other.
We use a simplified approach to calculate MI based on an image registration algorithm.50 In the first step, a joint histogram of both spectra and two individual histograms are computed and normalized. Then, these are used as joint and marginal probability distributions, respectively. Note that the binning parameter in the histogram computation significantly affects the result and should be optimized for a given task (we use 100 bins). MI is then directly computed using these quantities and provided definitions.
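The histogram-based estimate described above can be sketched as follows. This is a minimal NumPy version under our own simplifying assumptions; the actual algorithm in ref. 50 may differ in details such as histogram normalization:

```python
import numpy as np

def mutual_information(x, y, bins=100):
    """Histogram-based MI estimate (eqn (5)): the normalized joint histogram
    of intensity values serves as p(x, y), and its row/column sums as the
    marginals p(x), p(y). The bin count should be optimized per task."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    mask = p_xy > 0                         # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))

rng = np.random.default_rng(1)
a = rng.random(4000)
b = rng.random(4000)
print(mutual_information(a, a))  # identical "spectra": maximal shared information
print(mutual_information(a, b))  # independent "spectra": near zero
```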
(f) Siamese neural networks (SNNs) are versatile ANN-based models designed for similarity comparison.51,52 SNNs consist of two or more identical subnetworks that share parameters. These subnetworks are trained to create an embedding that minimizes the difference between similar inputs and maximizes the difference between dissimilar inputs. To enhance the possibility of discriminating between similar and dissimilar examples we used the triplet loss:
$$ \mathcal{L}(a, p, n) = \max\!\left( \lVert f(a) - f(p) \rVert^2 - \lVert f(a) - f(n) \rVert^2 + \alpha,\; 0 \right) \qquad (6) $$
The triplet loss consists of embeddings f(·), where a is an anchor, p a positive example (similar input to the anchor), and n a negative example (dissimilar to the anchor). By minimizing the triplet loss, the model learns to embed the positive example closer to the anchor than the negative example, within a specified margin α. Details of how the Siamese network was trained are provided below.
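A plain NumPy version of the triplet loss in eqn (6) is given below for illustration; the actual training operates on the full network embeddings f(·), not on the toy 2-D vectors used here:

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss (eqn (6)) on precomputed embeddings: penalizes the
    model unless the positive is closer to the anchor than the negative
    by at least the margin alpha."""
    d_pos = np.sum((f_a - f_p) ** 2)  # squared distance anchor-positive
    d_neg = np.sum((f_a - f_n) ** 2)  # squared distance anchor-negative
    return float(max(d_pos - d_neg + alpha, 0.0))

anchor   = np.array([0.0, 1.0])
positive = np.array([0.1, 0.9])   # same class: embedded near the anchor
negative = np.array([2.0, -1.0])  # different class: embedded far away

print(triplet_loss(anchor, positive, negative))  # margin satisfied: loss is 0
print(triplet_loss(anchor, negative, positive))  # violated ordering: large loss
```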
(b) Fe & Al standards: nine samples from three different manufacturers were used: SPL LABMAT (CZ), Bundesanstalt für Materialforschung und -prüfung (BAM, DE), and ERM Certified Reference Materials (BE). Among these, six samples were steel standards with varying compositions of minor elements and three samples were aluminum alloys. The samples and their compositions are listed in Table 1. We specifically selected Fe- and Al-dominated matrices due to their distinct differences in spectral signals.
| Producer | ID | Alloy | Fe | Al | C | Cr | Co | Mn | Mo | Ni | Si |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SPL | 19/6 | Steel | 81.02 | 0.01 | 0.03 | 13.08 | 0.02 | 0.66 | 0.44 | 3.91 | 0.6 |
| SPL | 20/6 | Steel | 70.95 | — | 0.04 | 18.26 | 0.15 | 1.43 | 0.27 | 7.93 | 0.38 |
| SPL | 21/6 | Steel | 96.99 | 0.02 | 0.36 | 0.08 | — | 1.21 | 0.01 | 0.02 | 1.26 |
| ERM | EB313 | Al | 0.39 | 94.73 | — | 0.12 | — | 0.5 | 0 | — | 0.36 |
| ERM | EB316 | Al | 0.11 | 87.39 | — | 0.01 | — | 0.2 | — | 0.02 | 11.98 |
| BAM | 310 | Al | 0.07 | 98.81 | — | 0 | — | 0 | — | 0 | 0.08 |
| BAM | C2 | Steel | 78.06 | — | 0.01 | 14.72 | — | 0.69 | 0.01 | 6.12 | 0.37 |
| BAM | C3 | Steel | 74.01 | — | 0.03 | 11.89 | — | 0.72 | 0.03 | 12.85 | 0.46 |
| BAM | C8 | Steel | 69.87 | — | 0.14 | 17.96 | 0.02 | 1.7 | — | 8.9 | 1.41 |
Each spectrum consists of 40 002 values, representing the intensity at the corresponding wavelength, starting at 200 nm with an equidistant step of 0.02 nm. The system parameters were chosen according to prior experience with LIBS experiments; the gate delay of the camera was 1 μs and the gate width was 50 μs (the minimum for the camera used). An ablation energy of 20 mJ was used for the Fe–Co sample set and three energies (10, 20, and 30 mJ) for the Fe & Al standards.
(b) The Fe–Co generated dataset contains simulated spectra replicating the compositions of their measured counterparts. The raw spectra were then modified by adding noise drawn from the normal distributions N(0, 0.01²), N(0, 0.02²), and N(0, 0.05²). We use the notation N(μ, σ²), where μ is the mean, σ² the variance, and σ the standard deviation. The standard deviations are scaled relative to the maximum value in the corresponding dataset.
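The noise-addition scheme can be sketched as follows (the toy spectra and function name are ours; only the scaling of σ to the dataset maximum follows the text):

```python
import numpy as np

def add_scaled_noise(spectra, rel_sigma, seed=0):
    """Add N(0, (rel_sigma * max)^2) noise, with the standard deviation
    scaled relative to the maximum intensity in the dataset."""
    rng = np.random.default_rng(seed)
    sigma = rel_sigma * np.max(spectra)
    return spectra + rng.normal(0.0, sigma, size=spectra.shape)

# Toy "spectrum" with a maximum intensity of about 100.
clean = np.abs(np.sin(np.linspace(0, 20, 1000))) * 100
for rel_sigma in (0.01, 0.02, 0.05):  # low / medium / high noise levels
    noisy = add_scaled_noise(clean, rel_sigma)
    print(rel_sigma, np.std(noisy - clean))
```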
(c) The Fe & Al dataset consists of 27 sub-categories, each defined by a combination of sample and experimental setup. These sub-categories can be distinguished either by class, based on the predominant composition (e.g., steel or Al alloy), or by experimental conditions (such as the three laser energy levels). Each sample/setup includes 50 available spectra. It is important to note that the Fe and Al matrices were chosen for their fundamental difference in the number of spectral lines: Fe spectra are characterized by a large number of lines, whereas Al spectra have relatively few. The signal intensity is influenced by the laser energy.
(d) A LIBS benchmark classification dataset (available at ref. 57) was originally designed for the challenging out-of-sample classification of LIBS spectra. The dataset contains spectra from 138 soil samples (500 spectra per sample for training and 20 000 spectra for testing in total), which are grouped into 12 distinct classes. The dataset was introduced for the EMSLIBS 2019 contest and serves as a benchmark for comparing classification algorithms in the LIBS community. Elemental compositions of the samples are provided in the metadata.
In total, 13 800 labeled spectra (100 per sample) were used for training. Spectra were normalized by the total emissivity of the plasma, estimated by summing all intensity values in the spectrum. Triplets were constructed based on class correspondence, with positive examples drawn from the same class and negatives from randomly selected distinct classes. The model was selected through heuristic pseudo-optimization, informed by prior experience with ANN-based models in spectroscopy. The input size of the model is 40 000. It has two convolutional layers with kernel sizes of 50 and 10, strides of 2 and 2, and paddings of 1, each producing 50 output channels. After the first convolutional layer, a max-pooling layer with a kernel size of 7 and a stride of 3 is applied. Each convolutional layer is followed by a ReLU activation function. The output is flattened and processed by a fully connected layer with 256 hidden units, followed by an output layer with 10 units. The model is trained for 50 epochs with a batch size of 128 and a learning rate of 1 × 10−4. The predictions of the model (embeddings) are compared using the L2 norm.
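As a sanity check on the architecture described above, the standard 1-D convolution/pooling output-length formula can be used to trace the tensor size through the layers. This assumes the usual floor-based formula and the stated hyperparameters; exact sizes may differ with other framework defaults:

```python
def out_len(l_in, kernel, stride, padding=0):
    """Standard 1-D conv/pool output length: floor((l_in + 2p - k) / s) + 1."""
    return (l_in + 2 * padding - kernel) // stride + 1

l = 40_000                                      # stated model input size
l = out_len(l, kernel=50, stride=2, padding=1)  # conv1 (50 output channels)
l = out_len(l, kernel=7, stride=3)              # max-pooling
l = out_len(l, kernel=10, stride=2, padding=1)  # conv2 (50 output channels)
print(l, 50 * l)                                # per-channel length, flattened size
```

The flattened vector of size 50 × l then feeds the 256-unit fully connected layer.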
The models were trained using cloud GPU services (Azure and Google Colab), while predictions, which require considerably less computational power, were performed locally on a CPU. It is important to note that simulated spectra were resampled to match the model's input resolution before applying the Siamese metric.
Fig. 2 Example spectra from the Fe–Co measured dataset (right). Details of a selected spectral range. The measured signal exhibits a monotonic but nonlinear dependence on composition.
The magnitude of additive noise in measured spectra is depicted in Fig. 3. For low and medium noise, the majority of relevant lines remain detectable, either by a trained expert or by an appropriate algorithm. For high noise, a substantial number of lines become indistinguishable from the noise. A similar phenomenon is visible in Fig. 4 for simulated spectra. The discrepancies between simulated and measured spectra originate from multiple factors (a non-ideal model, an incomplete database of transitions, atmospheric conditions, calibration of the spectrometer, etc.).
Unless indicated otherwise, spectra were normalized by the total emissivity prior to distance computations. For measured spectra, no intensity calibration or background subtraction was performed, apart from dark image subtraction. The simulated spectra were multiplied by an efficiency function to suppress intensity in the UV region. This was done to better match the measured spectra, as UV wavelengths are strongly absorbed by the atmosphere. In the classification task, spectra were normalized by their maximal value.
For generated spectra, only a single spectrum per composition was used. Distance matrices are presented in Fig. 6. Despite the simplicity of the employed spectrum generation algorithm, the results from simulated data are qualitatively comparable to those from measured data. Discrepancies could be attributed to the absence of noise in the simulated spectra. Simulated spectra enabled a more controlled study of similarity, as they omitted signal contributions from minor elements (that are always present in measurements) and experimental noise. This fact was subsequently used to isolate the effect of the additive noise.
To examine finer details, a reference spectrum (pure Fe) was selected and used to calculate pairwise distances between the reference and all remaining compositions in the Fe–Co dataset (Fig. 7). This is essentially a line plot of the first column of each heatmap. The analysis further supports the claim that the Euclidean metric exhibits an almost linear response to composition changes. While the Manhattan and fractional distances showed slight deviations from linearity for simulated spectra, they closely aligned with the Euclidean metric for measured spectra. The non-smoothness of the fractional distance can be attributed to experimental noise and to numerical errors stemming from the instability of the algorithm. Both the cosine and Siamese distances exhibited similar trends and, for simulated data, resembled a sigmoid function. In contrast, mutual information showed a sharp increase toward higher dissimilarity, followed by an almost linear progression. This behavior is likely due to the additional entropy introduced by new spectral lines from mixed-composition samples, which were absent in the pure Fe spectrum. These trends in metric behavior are crucial for the discussion of classification performance in the following sections. While the Euclidean metric is the most human-interpretable metric (owing to its correlation with compositional change), it is not necessarily optimal for a classification algorithm. For classification tasks, in contrast, it is advantageous to overlook minor compositional changes that correspond to intra-class spectral variability.
In measured spectra (see Fig. 8), the mutual information metric was significantly compromised even by low noise. This is attributable to the employed algorithm for mutual information estimation, which relies on histograms. Given that the noise distribution remained consistent in all spectra, a considerable fraction of the information was washed out. The remaining lines of higher intensity were not sufficient to provide the necessary contrast. Minkowski metrics with lower p parameters were more prone to the noise-related performance decrease. While the Manhattan metric was still usable for low-noise setup, its effectiveness decreased for medium noise. The Euclidean metric lost the majority of its contrast in the high noise setups. The cosine metric demonstrated resilience to small and medium noises but started to fail for high noise. The Siamese metric proved resilient to all tested noise levels. This remarkable property was most likely a consequence of the utilized architecture of the Siamese neural network, where the dimensionality reduction at the output layer served as a denoising function.
For simulated spectra (see Fig. 9), the majority of results were analogous to those for measured spectra, with the exception of the Siamese metric. The response curve of the Siamese metric was non-smooth and discontinuous at certain concentrations. This is a consequence of the fundamental differences between simulated and measured data, as the Siamese network model was trained solely on measured data. To enable the use of the Siamese metric, the simulated spectra were resampled to match the dimensionality and resolution of the measured spectra, which consist of 40 002 points spanning from 200 nm in 0.02 nm increments. Such resampling cannot correct for differences in spectral intensities or for the vastly different number of spectroscopic features present in real measurements. This limiting factor of the Siamese metric could potentially be treated either by fine-tuning the model on the target task or by more advanced spectra transfer approaches that can correct signal discrepancies (see ref. 58 and 59).
The noise sensitivity study revealed that Minkowski family metrics are highly susceptible to noise. Therefore, we dropped them from further studies and kept only Euclidean as the best-performing representative. Note that we also provide additional distance heatmaps for noisy spectra in the ESI.†
Distances between the reference spectrum and each of the remaining spectra were computed individually using the selected metrics. For brevity, we present only the Euclidean, cosine, mutual information, and Siamese metrics; the remaining Minkowski-family metrics performed worse than Euclidean, as detailed in the ESI.†
Fig. 10 schematically shows how these pairwise distances were computed: the reference spectrum is compared to subsequent spectra from three samples, each measured at three energies, yielding 50 spectra per sample-energy combination. Error bars in the following figures represent standard deviations across repeated acquisitions for each condition.
Computed distances based on the diagram (Fig. 10) for the Euclidean metric are shown in Fig. 11. The outcome is counter-intuitive, as demonstrated by marked distances in Fig. 11 (red dashed ellipses). The Euclidean distance between the reference spectrum (steel sample, laser energy 10 mJ) and certain other steel spectra (measured at higher laser energies) was greater than the distance between the reference and marked Al alloy spectra. This could potentially lead to misclassification in a distance-based classification algorithm. While this behavior can be mitigated through proper spectral normalization (as detailed in the ESI†), preserving the original shape of spectra is sometimes necessary for specific applications (e.g., imaging) to retain spatial information.
Distances obtained from the cosine metric are shown in Fig. 12. In contrast to the Euclidean metric, spectra from the steel matrix were clearly separable from those of the Al alloys. Moreover, this separation was not affected by changes in laser energy and the corresponding changes in intensity. This is a consequence of the intrinsic data normalization in the cosine metric (as discussed in Section 2.1).
Fig. 12 Cosine distances between the reference spectrum (steel sample, laser energy 10 mJ) and corresponding spectra from the Fe & Al dataset.
The mutual information-based metric was capable of separating the matrices without normalizing the data, owing to its scaling invariance (see Fig. 13). However, the contrast between the steel and Al alloy matrices was lower than that for the cosine metric. Note that mutual information is natively a similarity measure, so the distance was computed as 1 − MI. The advantage of MI is its capability to compare spectra with non-matching resolutions or intensity levels (as it depends only on histograms). The considerably higher error bar of the first bin was caused by the presence of the reference spectrum in the spectrum batch corresponding to the first sample/energy bin. The presence of an identical spectrum maximized the MI, which biased the mean and standard deviation of the bin.
Fig. 13 MI distances between the reference spectrum (steel sample, laser energy 10 mJ) and corresponding spectra from the Fe & Al dataset.
The highest contrast was achieved using the Siamese metric (Fig. 14). This result underlined the validity of utilizing alternative metrics. Note that the employed Siamese network model was trained on a different dataset (originally designed for the classification of soil spectra) but performed well on metal spectra.
Fig. 14 Siamese distances between the reference spectrum (steel sample, laser energy 10 mJ) and corresponding spectra from the Fe & Al dataset.
The KNN model was trained on the training data subset (50 spectra per sample, 5000 in total) and was later used to predict the test data (20 000 spectra). Optimal values of the k parameter were determined on the validation data, from the values (2, 5, 10, 15, 20, 30, …, 90, 100, 150, 200, 250, and 300), for each metric. In Table 2, we compare the validation and test performances of the selected metrics. Note that we omitted the Manhattan and fractional metrics, as they achieved significantly worse performance during the preliminary validation evaluation. The mutual information metric was also excluded, owing to the extremely high computational cost of calculating the Gram matrix for KNNs and because it was outperformed by other metrics during validation. The Siamese metric performed best, particularly when a higher number of neighbors k was considered, compared to the lower k values for standard metrics. Notably, for the Siamese metric, comparable test performance was achieved across a range of k values (up to k = 200), though we report only the lowest such k value. This raises new questions to be explored in future research.
| Metric | Euclidean | Cosine | Siamese | Eucl. + shift | Cos. + shift | Siam. + shift |
|---|---|---|---|---|---|---|
| Validation acc. (%) | 83.6 | 86.9 | 95.4 | 70.1 | 71.9 | 72.0 |
| Test acc. (%) | 59.2 | 61.0 | 64.7 | 53.8 | 53.3 | 54.0 |
| Best k valid | 5 | 5 | 7 | — | — | — |
| Best k test | 9 | 10 | 40 | — | — | — |
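A distance-based classifier of the kind compared in Table 2 can be sketched with a pluggable metric. This is a minimal k-NN on toy two-class "spectra" of our own making; it only illustrates the mechanism, not the benchmark itself:

```python
import numpy as np

def knn_predict(train_X, train_y, query, k, dist):
    """Minimal k-NN classifier with a pluggable distance function:
    majority vote among the k training spectra nearest to the query."""
    d = np.array([dist(query, x) for x in train_X])
    nearest = train_y[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

euclidean = lambda a, b: np.linalg.norm(a - b)
cosine_d  = lambda a, b: 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "spectra": class 0 is intense in the first half of the channels,
# class 1 in the second half.
rng = np.random.default_rng(2)
X0 = np.hstack([rng.random((20, 5)) + 2, rng.random((20, 5))])
X1 = np.hstack([rng.random((20, 5)), rng.random((20, 5)) + 2])
X = np.vstack([X0, X1])
y = np.array([0] * 20 + [1] * 20)

q = np.hstack([np.full(5, 2.5), np.full(5, 0.5)])  # a class-0-like query
print(knn_predict(X, y, q, k=5, dist=euclidean))   # → 0
print(knn_predict(X, y, q, k=5, dist=cosine_d))    # → 0
```

Swapping `dist` for another metric (e.g., the L2 norm of Siamese embeddings) changes the classifier's behavior without touching the k-NN logic, which is exactly the comparison performed in Table 2.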
To study the impact of spectral shifts, validation and test spectra were randomly shifted by s pixels within the range of −3 to 3. The drop in classification performance due to these shifts was less pronounced in the test data, likely owing to the inherent complexity of the (out-of-sample) classification task. While the Siamese metric still outperformed the other metrics, the gap was significantly smaller for shifted spectra. The largest performance drop due to the shift was observed for the Siamese metric, followed by the cosine metric. It is worth noting that the Siamese network architecture used in this study was not optimized to handle spectral shifts. This limitation could potentially be mitigated by incorporating additional convolutional and max-pooling layers, which will be explored in future work.
Future research could address the task optimization of metrics based on Siamese networks or the development of a more advanced, task-universal metric based on a foundation model.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ja00377b
‡ For the purposes of this work, we use the terms distance and (dis-)similarity interchangeably, but we state the formal requirements for a proper distance metric in Section 2.
§ By sparse in this context, we mean that the data occupy only a tiny fraction of the space, with most of it being effectively empty.
| This journal is © The Royal Society of Chemistry 2025 |