Open Access Article
Umberto Michelucci *ac and Francesca Venturini bc
aLucerne University of Applied Sciences and Arts, Computer Science Department, 6343 Risch-Rotkreuz, Switzerland. E-mail: umberto.michelucci@hslu.ch
bInstitute of Applied Mathematics and Physics, ZHAW Zurich University of Applied Sciences, Winterthur, 8400, Switzerland. E-mail: vent@zhaw.ch
cTOELT LLC, Research and Development, Duebendorf, Switzerland
First published on 13th April 2026
Machine learning (ML) models have achieved strikingly high accuracies in spectroscopic classification tasks, often without a clear proof that those models used chemically meaningful features. Existing studies have linked these results to data preprocessing choices, noise sensitivity, and model complexity, but no unifying explanation is available so far. In this work, we show that these phenomena arise naturally from the intrinsic high dimensionality of spectral data. Using a theoretical analysis grounded in the Feldman–Hájek theorem and the concentration of measure, we show that even infinitesimal distributional differences, caused by noise, normalisation, or instrumental artefacts, may become perfectly separable in high-dimensional spaces. Through a series of specific experiments on synthetic and real fluorescence spectra, we illustrate how models can achieve near-perfect accuracy even when chemical distinctions are absent, and why feature-importance maps may highlight spectrally irrelevant regions. We provide a rigorous theoretical framework, confirm the effect experimentally, and conclude with practical recommendations for building and interpreting ML models in spectroscopy.
Extracting chemical and physical information from spectra is usually complex and requires convoluted data-processing pipelines that include steps such as baseline subtraction or smoothing.1–3 Interpreting the output of machine learning (ML) models applied to spectra presents several challenges: the high number of wavelengths complicates interpretation, and complex models might capture nonlinear interactions, making it difficult to connect features in spectra with chemical information about the sample (such models are often black boxes, lacking interpretability). Complex models may also fit noise rather than signal, rendering predictions useless.4 Furthermore, it has become clear to the research community that attributing a prediction to specific wavelength bands does not have a unique solution; different methods lead to different attributions.5 This is because explainability approaches measure how a specific model responds to changes in intensity at individual wavelengths, which very often include regions far from chemically significant peaks that the model learns to exploit owing to subtle statistical differences.6 Zehtabvar et al.7 found that data normalisation has a strong influence on the accuracy of ML models, a surprising result, since normalisation bears no relationship to the physico-chemical information in the measurements. Contreras et al.6 clearly show how feature-importance algorithms are susceptible to noise-induced fluctuations, although they do not offer an explanation for this observation. They also note that, because of the high dimensionality of the data, interpretation of the results is challenging.
Steinmann et al.8 studied the problem of spurious correlations in an extensive article and compared it to the “Clever Hans behaviour”. The term Clever Hans comes from animal psychology, named after the horse Hans that apparently had learnt to understand human language. Hans had in fact learnt to rely on the facial expressions of the humans asking the questions and was unable to give correct answers when it could not see a human face (see ref. 9 for more details on the case). To paraphrase Steinmann et al., in some cases ML algorithms in spectroscopy are as dumb as a horse.† The fact that ML is seemingly capable of classifying any spectral dataset with high accuracy has sparked the appearance of countless articles that use ML to extract the elusive chemical or physical parameters sought, without the need for convoluted data-analysis pipelines (see ref. 10 for an overview).
How such models really generalise to new measurement setups or datasets is an open question that cannot be answered uniquely today. The answer depends on whether a model learns from physically meaningful features (e.g., an absorption or emission line) or from artefacts of the measurement process (e.g., the noise introduced by specific electronics11). In the first case, the model is likely to generalise well to new measurements, whereas in the latter case the model simply overfits specific characteristics of the measurement apparatus, rendering it unreliable.
This article, for the first time, explains why the high dimensionality of spectroscopy (the number of intensity values in a spectrum is usually of the order of 10³) is responsible for the ability of ML to classify almost any kind of spectroscopic dataset, even when the data contain no discernible feature to distinguish between classes (e.g., no spectral region in which the intensities of one class are noticeably higher than those of the other). Fundamentally, this work demonstrates that the effectiveness of ML models applied to spectroscopy may in many cases be due to the high dimensionality of the spectral data rather than to chemical–physical spectral features. Especially flexible models, such as random forests, can obtain almost perfect classification accuracy even using spectral regions that contain no relevant physico-chemical features, because of the high dimensionality of the data. This work demonstrates that, in high-dimensional spaces, even subtle differences in the statistical properties of the signal across classes can enable a model to achieve seemingly perfect classification accuracy, despite the spectra themselves lacking sufficient information to justify such performance. This may be particularly relevant in reflectance and fluorescence spectroscopy, where spectra usually have broad features rather than sharp specific signatures such as those in Raman spectroscopy.
The contributions of this article are the following. (i) We present a mathematical discussion of the role of high dimensionality in the finite- and infinite-dimensional cases, with a specific discussion of how to translate those abstract results to spectroscopy. (ii) We present a series of experiments on synthetic data to show under which conditions high dimensionality becomes relevant in classification. (iii) We show how this phenomenon appears in real fluorescence spectroscopy data. (iv) Finally, we describe how spectroscopists should change their way of analysing spectra to take the effect of high dimensionality into account.
This article is structured in the following way. In section 2 we discuss the mathematical reasons for the behaviour of Gaussian-distributed data in finite- and infinite-dimensional spaces (the Feldman–Hájek theorem12,13). We then generalise the findings to non-Gaussian data and explain why they are applicable to spectroscopy data. In section 3 we describe a series of experiments on synthetic and real data used to investigate the effect of dimensionality. In section 4 we present the results obtained on synthetic and real data. In section 5 we discuss the relevance of the results for spectroscopy. Finally, in section 6 we present conclusions and limitations.
The implications of the Feldman–Hájek theorem are highly relevant for spectroscopy. A spectrum can be viewed as a point in a high-dimensional space, where the intensity at each wavelength (or pixel) represents one coordinate. When this dimensionality is large (for example, 10³ intensity values), the geometry of the space in which spectra are defined changes dramatically. The theorem states that, under assumptions verified in the case of spectroscopy, in finite dimensions two Gaussian distributions with slightly different means or variances always overlap to some extent and can never be perfectly classified. In contrast, in infinite (or, to a good approximation, very high) dimensions, even the smallest difference in mean or covariance makes the two distributions mutually singular, meaning that they occupy disjoint regions of the space, and as such they can be perfectly classified by an appropriate algorithm. For a spectroscopist, this provides a rigorous explanation for a common observation: ML models often achieve very high accuracy even when spectra appear indistinguishable. The Feldman–Hájek theorem shows that this behaviour is a geometric consequence of high dimensionality: tiny instrumental artefacts, baseline shifts, or preprocessing differences can make two classes of spectra perfectly separable, even in the absence of any genuine physico-chemical distinction. Thus, the theorem clarifies why models may “succeed” mathematically while not learning from physico-chemically meaningful information.
Basically, as dimensions are added, the “volume” of a shape migrates away from the centre and concentrates almost entirely in the outer shell. In 1024 dimensions, a “solid” ball (think of an orange) is essentially empty: 99.9% of its contents lie in a paper-thin layer at the surface. This means that if you pick a random point inside the ball, its distance from the centre will almost always be the same, since the point will almost certainly lie on the shell (the orange peel), something that completely defies intuition! In other words, nearly all points of the ball have a norm ∥x∥2 (the length of the vector from the origin to a point x of the ball) close to a typical value.
This phenomenon shows itself in high-dimensional spaces: for a Gaussian-distributed dataset in n dimensions, the probability mass (intuitively, the values of the norm ∥x∥2 = (x1² + ⋯ + xn²)^(1/2)) tends to concentrate around σ√n (assuming for simplicity that μ = 0). For a visualisation of this phenomenon, in Fig. 1 we show the distributions of ∥x∥2 (the length of vectors x) sampled from the two Gaussian distributions N(0, σ1²In) and N(0, σ2²In), where In is the identity matrix of dimensions n × n. For illustrative reasons, we consider the case of isotropic covariances, but the phenomenon also occurs for generic covariances. In panel (A) of Fig. 1 it can be seen that in dimension 2 the ∥x∥2 distributions overlap strongly (as intuitively expected, since the two distributions have the same mean and only slightly different covariances). As the dimension increases (panels (B), (C), and (D)), the overlap of the distributions decreases, until the dimensionality is high enough (panel (D)) that the two have almost no overlap.
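The concentration of ∥x∥2 around σ√n, and the resulting loss of overlap between the two norm distributions, can be checked numerically. The following is a minimal sketch (not the code used for Fig. 1); the sample size, the variance gap σ2 = 1.1, and the midpoint threshold are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def norm_threshold_accuracy(n, sigma1=1.0, sigma2=1.1, N=2000):
    """Accuracy of the trivial classifier that thresholds ||x||_2 at the
    midpoint between the typical norms sigma1*sqrt(n) and sigma2*sqrt(n)."""
    r1 = np.linalg.norm(rng.normal(0.0, sigma1, (N, n)), axis=1)
    r2 = np.linalg.norm(rng.normal(0.0, sigma2, (N, n)), axis=1)
    t = 0.5 * (sigma1 + sigma2) * np.sqrt(n)
    return 0.5 * ((r1 < t).mean() + (r2 > t).mean())

# Overlap shrinks (accuracy grows) purely because n grows:
accs = {n: norm_threshold_accuracy(n) for n in (2, 50, 500, 5000)}
```

In dimension 2 the threshold barely beats chance, while for n = 5000 the same 10% variance gap yields near-perfect separation, mirroring the progression from panel (A) to panel (D) of Fig. 1.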
In general, random variables with finite variance exhibit extremely small relative fluctuations even when not Gaussian: most realisations (the measured values) lie very close to their expected value. This implies that the geometry of high-dimensional data is effectively governed by its first- and second-order statistics (mean and covariance) and that differences in these quantities dominate the behaviour of distances and overlaps between distributions. Hence, the Gaussian assumption underlying the Feldman–Hájek theorem remains a valid guide for understanding separability and equivalence of high-dimensional or averaged data, even when the underlying distributions deviate from strict normality.
The overview of the experiments is presented in Table 1. Experiments are indicated with N1, N2, N3, and N4 for noise classification experiments, with S1, S2, and S3 for synthetic spectra classification experiments, and with Ra1 to Rb5 for real data classification experiments.
| ID | Experiment | Experiment Details |
|---|---|---|
| N1 | Gaussian noise: Δσ sweep (QDA) | Classify white noise from two multivariate Gaussians with equal means (μ1 = μ2 = 1) and different isotropic and Toeplitz (parameter ρ = 0.95) covariances; we varied the variance gap Δσ := |σ2 − σ1| from 0 to 2 while keeping σ1 = 1 fixed. We tested n = {5, 10, 50, 500} and measured classification accuracy |
| N2 | Gaussian noise: Bayes “oracle” boundary | Classify white noise from two multivariate Gaussians with equal means (μ1 = μ2 = 1) and different isotropic covariances; we varied the variance gap Δσ := |σ2 − σ1| from 0 to 2 while keeping σ1 = 1 fixed. We tested n = {30, 100, 500, 1000, 5000}. We classify white noise using the ideal (Bayes) threshold on the squared norm ∥x − μ∥² |
| N3 | Gaussian noise: accuracy vs. dimension n | Classify white noise from two multivariate Gaussians with equal means (μ1 = μ2 = 0) and different isotropic covariances; we varied the variance gap Δσ := |σ2 − σ1| ∈ {0.1, 0.3, 0.6, 0.9, 1.2, 1.5, 2.0} while keeping σ1 = 1 fixed. We tested n = {1, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000}. We calculate the accuracy of QDA for various values of n. |
| N4 | Skew-normal noise: 2D parameter sweeps | Classify two classes of white noise from multivariate skew-normal.19 For this experiment, we used the parameters n = 50, μ1 = μ2 = 10, σ1 = 1, γ1 = 0.5. We then varied Δσ/σ from 0 to 2, Δμ/μ from 0 to 0.15, and Δγ1/γ1 from 0 to 8 |
| S1 | Synthetic spectra: truly identical classes | Classify two classes of spectra, each composed of one Lorentzian (centres chosen according to a normal distribution N(μc = 50, σc = 10)) with FWHM ξ = 7. We generated N = 500 spectra for each class and chose n = {5, 10, 50, 100, 1000, 2000, 5000, 10 000}. These two sets of spectra are not distinguishable, as there are no differences in the data distributions |
| S2 | Synthetic spectra: FWHM difference | Classify two classes of spectra, each composed of one Lorentzian (centres chosen according to a normal distribution N(μc = 50, σc = 10)) with two different FWHMs, ξ1 = 7 and ξ2 = 9. We generated N = 500 spectra for each class and chose n = {5, 10, 50, 100, 1000, 2000, 5000, 10 000}. We study the classification accuracy for various values of n |
| S3 | Synthetic spectra with additive noise offset | Classify two classes of spectra, each composed of one Lorentzian (centres chosen according to a normal distribution N(μc = 50, σc = 10)). We chose n = {5, 10, 50, 100, 1000, 2000, 5000, 10 000} and an FWHM ξ = 7. We generated N = 500 spectra for each class. We then added i.i.d. Gaussian noise with a tiny class-specific mean offset (0 vs. 0.01) and the same standard deviation of 0.01 (namely, the noise was drawn from the two distributions N(0, 0.01²) and N(0.01, 0.01²)) |
| Ra1/Rb1 | Global pixel permutation | A single, consistent random shuffle is applied to all pixels across the entire dataset. This destroys physical contiguity (peaks and baselines) while preserving the global covariance structure. This tests if the model relies on spectroscopic shapes or high-dimensional statistical geometry |
| Ra2/Rb2 | Independent row permutation | Every spectrum is shuffled using a unique random seed, destroying both physical contiguity and inter-pixel covariance. This serves as a control to demonstrate that model success vanishes when the high-dimensional statistical structure is eliminated. |
| Ra3/Rb3 | Pixel count sweep | Classify EVOO vs. LOO (Ra3) and EVOO vs. VOO (Rb3) using an increasing number of randomly chosen pixels k ∈ [2, 35] from the first 50 pixels (region ρ1, noise only). For each k, 20 independent random subsets were tested to evaluate the climb in accuracy within chemically empty regions |
| Ra4/Rb4 | Feature importance: window sweep | Classify oils using non-overlapping moving windows of increasing widths W ∈ {20, 50, 200, 400} across the entire detector. This experiment tests whether near-perfect separability persists in regions lacking physical signals (0–400 px) compared to peak regions (600–800 px) |
| Ra5/Rb5 | Feature importance: SHAP | Generate mean absolute SHAP attribution maps for both experiments (A and B) across different window sizes. This experiment identifies whether the model's “important” features correlate with chemical peaks or are distributed across the high-dimensional noise floor. |
In experiments N1–N3 we generated arrays of dimension n from the multivariate Gaussians N(μ1, σ1²In) for class 1 and N(μ2, σ2²In) for class 2. These experiments aim to test the accuracy of classifiers in distinguishing the two classes for various values of the standard deviations σ1 and σ2 and varying dimensionality n, with different approaches: QDA (experiments N1 and N3) and the ideal (“oracle”) decision boundary (N2). In N1 and N2 we have μ1 = μ2 = 1, while in N3 we have μ1 = μ2 = 0. Choosing the covariance matrix as σ²In is equivalent to saying that each intensity in the spectrum (at each wavelength) is completely independent of the others. This is clearly not true. Although the Feldman–Hájek theorem is valid for a generic covariance (under certain assumptions, discussed in the appendices), it is an interesting question what the effect of correlation between intensities at different wavelengths is on the effect of high dimensionality. To study this, we performed the same experiments also with a Toeplitz geometric covariance. The latter is a matrix whose entries depend only on the absolute difference between their indices. This type of covariance is modelled with a parameter ρ ∈ (−1, 1). For an n-vector x = (x1, …, xn) we write
with entries | Σij = cov(xi, xj) = σ²ρ^|i−j|, 1 ≤ i, j ≤ n | (1) |
The details of the experiments with the complete ranges of parameters tested are contained in Table 1.
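As an illustration of how such an experiment can be set up, the sketch below builds the Toeplitz covariance of eqn (1) and runs a regularised QDA (reg_param = 0.4, as in N1) on the isotropic and correlated cases; scikit-learn is assumed, and the dimension, sample sizes, and the gap Δσ = 0.5 are scaled down from Table 1 to keep the example fast.

```python
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

def toeplitz_cov(n, sigma, rho):
    """Geometric Toeplitz covariance: Sigma_ij = sigma^2 * rho^|i-j| (eqn (1))."""
    idx = np.arange(n)
    return sigma ** 2 * rho ** np.abs(idx[:, None] - idx[None, :])

def qda_accuracy(n=100, sigma1=1.0, dsigma=0.5, rho=0.0, N=400):
    """Test accuracy of regularised QDA on two equal-mean Gaussian classes."""
    mu = np.ones(n)
    X1 = rng.multivariate_normal(mu, toeplitz_cov(n, sigma1, rho), N)
    X2 = rng.multivariate_normal(mu, toeplitz_cov(n, sigma1 + dsigma, rho), N)
    X, y = np.vstack([X1, X2]), np.r_[np.zeros(N), np.ones(N)]
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)
    return QuadraticDiscriminantAnalysis(reg_param=0.4).fit(Xtr, ytr).score(Xte, yte)

acc_iso = qda_accuracy(rho=0.0)    # independent intensities
acc_toe = qda_accuracy(rho=0.95)   # strongly correlated intensities
```

With ρ = 0.95 the accuracy for the same (n, Δσ) is typically lower than in the isotropic case, reproducing the slowing effect of correlation seen in Fig. 4, but it remains well above chance.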
For this experiment, we generated two classes of random noise arrays of dimensions n, analogously to the method described in the previous section, from the skewed normal distributions (SND) defined by Azzalini and Dalla Valle,19 described below.
Let us give the mathematical form of the SND used. In its univariate version the probability density function (PDF) of a SND is given by
| f(x) = 2φ(x)Φ(αx) | (2) |
| φ(x) = (2π)^(−1/2) e^(−x²/2), Φ(x) = ∫_{−∞}^{x} φ(t)dt | (3) |
The multivariate skew-normal distribution (MSN) introduced by Azzalini and Dalla Valle19 is defined in its general form as follows. A random vector X ∈ R^p has an MSN distribution with location parameter μ ∈ R^p, symmetric positive-definite scale parameter Ω ∈ R^(p×p), and skewness parameter γ ∈ R^p, if its multivariate PDF is
| fX(x) = 2φp(x; μ, Ω)Φ{γᵀω^(−1)(x − μ)}, x ∈ R^p | (4) |
where φp(·; μ, Ω) denotes the p-dimensional normal density with mean μ and scale matrix Ω, and ω = diag(Ω11^(1/2), …, Ωpp^(1/2)) is the diagonal matrix of scale factors.
Note that the matrix Ω is a dispersion matrix and equals the covariance of X only for γ = 0.
For our tests, we used μ = μ·1p, γ = γ1·1p, and Ω = σ²Ip. We then evaluated the separability of the classes with synthetic experiments. For each model, we generated two classes of N = 100 samples in n = 50 dimensions with coordinates i.i.d. from skew-normal distributions: a fixed base class (μ = 10, σ = 1, γ1 = 0.5) and a perturbed class obtained by shifting parameters by Δμ, Δσ, and Δγ1 over uniform grids. We performed three two-dimensional sweeps, (Δμ, Δσ), (Δμ, Δγ1), and (Δσ, Δγ1), and, at each grid point, trained a specific model and measured out-of-sample accuracy via 5-fold cross-validation. The mean cross-validated accuracies define accuracy surfaces that quantify how separability changes with differences in mean, variance, and skewness. For reproducibility, the two classes were generated with independent random seeds, and we also report results on the relative axes (Δμ/μ, Δσ/σ, Δγ1/γ1).
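A reduced version of this sweep can be sketched as follows. We use SciPy's skewnorm, whose shape parameter a plays the role of the skewness control (it is not numerically identical to the moment skewness γ1 used above), and we show only one grid point per sweep axis rather than the full grids.

```python
import numpy as np
from scipy.stats import skewnorm
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def skew_class(N, n, loc, scale, shape, seed):
    """N i.i.d. skew-normal vectors of dimension n (independent coordinates)."""
    return skewnorm.rvs(shape, loc=loc, scale=scale, size=(N, n),
                        random_state=np.random.default_rng(seed))

def sweep_accuracy(dmu=0.0, dsigma=0.0, dshape=0.0, N=100, n=50):
    """5-fold CV accuracy of logistic regression: base vs. perturbed class."""
    X0 = skew_class(N, n, 10.0, 1.0, 0.5, seed=0)
    X1 = skew_class(N, n, 10.0 + dmu, 1.0 + dsigma, 0.5 + dshape, seed=1)
    X, y = np.vstack([X0, X1]), np.r_[np.zeros(N), np.ones(N)]
    return cross_val_score(LogisticRegression(max_iter=3000), X, y, cv=5).mean()

acc_null = sweep_accuracy()        # identical parameters: chance level
acc_mu = sweep_accuracy(dmu=0.5)   # 5% relative mean shift per coordinate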
To show this, we simulated two classes of one-peak spectra on a discrete axis x = (x1, …, xn) (representing detector pixels, wavenumbers, or wavelengths, depending on the spectroscopy type). Each spectrum is a Lorentzian profile with a randomly jittered centre and a fixed full width at half maximum (FWHM), equal for the two classes. For class k ∈ {1, 2} we draw, independently for every spectrum j, a peak centre cj ∼ N(μc, σc²) (with values μc = 50, σc = 10 and FWHM = ξ = 7 for the numerical simulations). We used the unit-height Lorentzian L(x) = [1 + ((x − cj)/(ξ/2))²]^(−1).
We generate N = 1000 spectra per class with n = 100.
In a second experiment, S2, we then studied the classification of spectra constituted by a single Lorentzian peak with different FWHMs. For a given number of dimensions n ∈ {5, 10, 50, 100, 1000, 2000, 5000, 10 000} we generate N = 500 spectra per class. Each spectrum is a Lorentzian profile with a centre randomly chosen from a normal distribution with parameters (μc, σc) = (50, 10) and a different FWHM for the two classes: ξ1 = 7 and ξ2 = 9. For the spectroscopist, in Fig. 2, 10 examples of each class are plotted together, to show how visually it is impossible to distinguish the two classes.
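The spectrum generator for S1/S2 can be sketched directly from the definitions above (unit-height Lorentzian with jittered centre); the pixel-grid construction is our own minimal choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def lorentzian_spectra(N, n, fwhm, mu_c=50.0, sigma_c=10.0):
    """N unit-height Lorentzian spectra on an n-pixel axis.

    Centres are jittered as c ~ N(mu_c, sigma_c^2); the profile is
    L(x) = 1 / (1 + ((x - c) / (fwhm / 2))^2)."""
    x = np.arange(n, dtype=float)
    centres = rng.normal(mu_c, sigma_c, size=(N, 1))
    return 1.0 / (1.0 + ((x - centres) / (fwhm / 2.0)) ** 2)

class_a = lorentzian_spectra(500, 100, fwhm=7.0)  # S1/S2 base class
class_b = lorentzian_spectra(500, 100, fwhm=9.0)  # S2: slightly wider peaks
```

Plotting a few rows of each class reproduces the situation of Fig. 2: the two classes are visually indistinguishable, yet the width difference is statistically present across the pixels.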
In the third experiment, S3, we studied the effect of dimensionality on classification in the presence of noise. In spectroscopy, for a multitude of reasons (detector dark current, electronics, etc.), noise is always present. We know from the experiments described above that it is possible to perfectly classify pure noise. It is an interesting question what happens when noise is added, for example, to a set of indistinguishable spectra (experiment S1).
For this experiment, we added to the spectra of experiment S1 independent Gaussian noise that differs only in its mean between classes: for class 0 we add noise drawn from N(0, 0.01²), and for class 1 from N(0.01, 0.01²). We consider N = 500 spectra per class and n ∈ {5, 10, 50, 100, 1000, 2000, 5000}.
For each n we create a balanced dataset and benchmark four standard classifiers: logistic regression (max_iter = 3000), k-nearest neighbours, a decision tree (max_depth = 5), and a random forest (100 trees, fixed seed). Performance is estimated using a 5-fold stratified cross-validation. We report the mean and standard deviation of validation accuracy across folds. All random draws use fixed pseudorandom seeds to ensure reproducibility. Because peak shape and centre statistics are identical across classes, the only class-dependent signal arises from a tiny global shift in the additive noise mean. This setting reflects common practice where models ingest spectra without explicit peak annotations; the experiment investigates how classifiers exploit minute distributional offsets when presented with high-dimensional spectral vectors.
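A compressed version of this benchmark (two of the four classifiers, and smaller N and n than in the paper, to keep the sketch fast) could look as follows; scikit-learn is assumed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

def s3_dataset(N=300, n=500, offset=0.01, sd=0.01):
    """Identical Lorentzian classes plus Gaussian noise whose mean differs
    by `offset` between the classes (the only class-dependent signal)."""
    x = np.arange(n, dtype=float)

    def spectra(noise_mean):
        centres = rng.normal(50.0, 10.0, size=(N, 1))
        base = 1.0 / (1.0 + ((x - centres) / 3.5) ** 2)   # FWHM = 7
        return base + rng.normal(noise_mean, sd, size=(N, n))

    X = np.vstack([spectra(0.0), spectra(offset)])
    y = np.r_[np.zeros(N), np.ones(N)]
    return X, y

X, y = s3_dataset()
acc_logreg = cross_val_score(LogisticRegression(max_iter=3000), X, y, cv=5).mean()
acc_rf = cross_val_score(RandomForestClassifier(n_estimators=100, random_state=0),
                         X, y, cv=5).mean()
```

Even though no single pixel carries more than a minute offset relative to the noise, the aggregate over hundreds of pixels makes both models highly accurate.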
| ID | Experiment | Key finding | Reference to results |
|---|---|---|---|
| N1 | Gaussian noise: Δσ sweep (QDA) | Accuracy climbs monotonically with Δσ and with dimension n; it reaches almost 1 already with modest gaps for n high enough. In high n white noise is easily and perfectly classifiable. We have tested n ∈ {5, 10, 50, 500} and Δσ from 0 to 2 for a homogeneous covariance and for a Toeplitz one with ρ = 0.95 | Fig. 4; Methods in section 3.1; Results in section 4.1 |
| N2 | Gaussian noise: Bayes “oracle” boundary | White noise from the two distributions is almost perfectly classifiable in high enough n; accuracy goes to 1 quickly as Δσ grows. We have tested n ∈ {30, 100, 500, 1000, 5000} and we have varied Δσ from 0 to 1.0 | Fig. 5; Methods in section 3.1; Results in section 4.1 |
| N3 | Gaussian noise: accuracy vs. dimension n | Even small Δσ becomes highly separable as n increases (dimensionality amplifies tiny distributional gaps). We have tested n from 1 to 5000 (but visualised only up to 100, to make the more steeply growing curves visible) and tested Δσ ∈ {0.1, 0.3, 0.6, 0.9, 1.2, 1.5, 2.0} | Fig. 6; Methods in section 3.1; Results in section 4.1 |
| N4 | Skew–normal noise: 2D parameter sweeps | Tiny shifts in mean, variance or skewness give near-perfect accuracy for most models; random forest saturates the fastest | Fig. 7; Methods in section 3.2; Results in section 4.2; Results in section 4.3 |
| S1 | Synthetic spectra: truly identical classes | No classifier exceeds chance-level accuracy (0.5), confirming that without distributional differences classification is impossible | Table 3; examples in Fig. |
| S2 | Synthetic spectra: width difference | Validation accuracy rises with n as expected; linear and ensemble models approach 1.0 for large n despite the visual similarity of the spectra | Fig. 8; Results in section 4.3 |
| S3 | Synthetic spectra with additive noise offset | High n allows a minute noise distributional difference to enable near-perfect separability; random forest reaches approximately 1.0 with very small values of n | Fig. 9; Results in section 4.3 |
| Ra1/Rb1 | Global pixel permutation | Accuracy remains high (∼82%) despite the total destruction of spectral shapes. This empirically proves the model relies on global covariance structures rather than physical spectroscopic peaks | Results in Sec. 4.4 |
| Ra2/Rb2 | Independent row permutation | Model performance collapses to the majority-class baseline. This confirms that the success in Ra1/Rb1 was due to high-dimensional statistical structure, which is destroyed by independent shuffling | Results in Sec. 4.4 |
| Ra3/Rb3 | Pixel count sweep | Accuracy reaches > 85% using only 15–20 randomly selected pixels from the noise region (ρ1). This confirms that non-contiguous, chemically empty data provide sufficient statistical separation in high dimensions | Fig. 10; Results in Sec. 4.4 |
| Ra4/Rb4 | Feature importance: window sweep | High classification accuracy (∼80%) is maintained across all windows, including those in the signal-free region (0–400 px). Larger windows (W = 400) create an accuracy plateau independent of spectral features | Fig. 12; Results in Sec. 4.4 |
| Ra5/Rb5 | Feature importance: SHAP | SHAP attribution is distributed across the entire spectrum, often assigning higher “importance” to noise regions than to chemical peaks. This highlights the “interpretability paradox” in high dimensions | Fig. 13; Results in Sec. 4.4 |
In the first experiment (N1), for each value of Δσ we generated two classes of arrays from the isotropic Gaussians N(μ1, σ1²In) and N(μ2, σ2²In), and, for the correlated case, from N(μ1, Σ1) and N(μ2, Σ2) with Toeplitz covariances as in eqn (1), respectively. The results of the quadratic discriminant analysis (QDA) with a regularisation parameter equal to 0.4 are shown in Fig. 4. In this simulation, we fix μ1 = μ2 and sweep the standard-deviation gap Δσ := |σ2 − σ1| over [0, 2], while varying the dimensionality n (points per array). For each Δσ we formed an 80/20 train-test split and reported the test accuracy. The latter is close to chance (≈0.5) when Δσ is close to zero and increases with both Δσ and n; higher dimensions reach approximately 1.0 rather quickly with much smaller gaps (e.g., hundreds of points per array achieve near-perfect accuracy for modest Δσ), whereas very small n require larger gaps to exceed 0.9. Panel (A) in Fig. 4 shows the results for a Toeplitz covariance with ρ = 0.9 and panel (B) for a homogeneous one. Notably, when considering Toeplitz matrices, the presence of correlation slows down the effect of high dimensionality (effectively it takes larger gaps Δσ for the same n value to reach the same accuracy), but it does not stop it (as is expected, since the Feldman–Hájek theorem is valid for generic covariances).
In the second experiment (N2), for each variance-gap value Δσ, we generated two classes of N = 1000 arrays for various values of n from isotropic Gaussians with equal mean μ and standard deviations σ1 and σ2 = σ1 + Δσ. We evaluated the ideal (oracle) decision boundary T for this setting
| T = nσ1²σ2² ln(σ2²/σ1²)/(σ2² − σ1²), classifying x to class 2 when ∥x − μ∥² > T | (5) |
In the third experiment (N3) we generated two classes of N = 1000 arrays from isotropic Gaussians with equal mean μ = 0 and standard deviations σ1 = 1 and σ2 = σ1 + Δσ. The results are shown in Fig. 6, for each standard-deviation gap Δσ ∈ {0.1, 0.3, 0.6, 0.9, 1.2, 1.5, 2.0} and dimension n ∈ {1, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000}. We used QDA and calculated the test accuracy on a 20% hold-out split for each (n, Δσ). The curves show that accuracy increases monotonically with n and with the gap Δσ: larger gaps reach near-perfect accuracy at much smaller n, while very small gaps require higher dimensions to move far above chance.
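The oracle construction used in N2 can be sketched with a few lines of NumPy: the threshold below is the Bayes-optimal cut on ∥x − μ∥² for two equal-mean isotropic Gaussians (cf. eqn (5)), and the values of n and Δσ are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(4)

def oracle_threshold(n, s1, s2):
    """Bayes-optimal threshold on ||x - mu||^2 for N(mu, s1^2 I_n) vs
    N(mu, s2^2 I_n): classify to class 2 when the squared norm exceeds T."""
    return n * (s1 * s2) ** 2 * np.log(s2 ** 2 / s1 ** 2) / (s2 ** 2 - s1 ** 2)

def oracle_accuracy(n, s1=1.0, dsigma=0.3, N=1000):
    s2 = s1 + dsigma
    r1 = (rng.normal(0.0, s1, (N, n)) ** 2).sum(axis=1)  # ||x - mu||^2, class 1
    r2 = (rng.normal(0.0, s2, (N, n)) ** 2).sum(axis=1)  # ||x - mu||^2, class 2
    T = oracle_threshold(n, s1, s2)
    return 0.5 * ((r1 < T).mean() + (r2 > T).mean())

acc_low = oracle_accuracy(n=30)     # modest dimension: imperfect separation
acc_high = oracle_accuracy(n=5000)  # high dimension: near-perfect separation
```

The same variance gap that leaves substantial overlap at n = 30 becomes essentially perfectly separable at n = 5000, with no learning involved at all.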
Fig. 7 shows that logistic regression achieves near-perfect performance in most parameter ranges. A small mean shift (Δμ) already drives the accuracy to ≈1.0, and combinations of variance and skewness differences (Δσ, Δγ1) also reach 100% accuracy very quickly. Random forest shows very strong and robust results across the three plots, with broad yellow (1.0) regions, and it saturates quickly at perfect accuracy. kNN requires larger separations to leave chance accuracy and shows the steepest transition bands. This is consistent with the curse of dimensionality: local neighbourhoods are less informative in n = 50 dimensions, so kNN needs larger (Δμ, Δσ, Δγ1) to separate the classes. The decision tree shows intermediate performance with more granular contours. Random forest exhibits the highest accuracy even for very small differences in the parameters.
To summarise, already in n = 50 dimensions, which is low compared to typical spectrum dimensions (of the order of 10³), tiny discrepancies in mean, spread, or skewness make the classes almost perfectly distinguishable. Fig. 7 shows that all four models approach 100% accuracy in wide regions of the parameter grids.
| Model | Accuracy (mean ± SD) |
|---|---|
| Logistic regression | 0.54 ± 0.03 |
| K-Neighbors classifier | 0.50 ± 0.04 |
| Decision tree classifier | 0.53 ± 0.04 |
| Random forest classifier | 0.51 ± 0.01 |
In experiment S2, we observe that increasing the dimensionality enables even simple models such as logistic regression to achieve near-perfect accuracy. As shown in Fig. 8, classification performance improves steadily with the number of intensity points (or dimensions) in all models tested. These results were obtained using 5-fold stratified cross-validation and averaged over folds, with all random seeds fixed for reproducibility. These findings illustrate how, in typical spectroscopic ML analysis, where spectra are treated as high-dimensional vectors, classifiers may exploit subtle distributional differences unrelated to the actual chemical structure.
In all models, validation accuracy increases with the size of the spectrum n, reflecting the fact that the differences in FWHM become easier to detect. Linear and ensemble methods benefit the most from increasing n, with performance approaching unity for large arrays, whereas shallow trees saturate earlier. Training accuracies follow the same trend, indicating low variance for the ensemble and stable generalisation once n is sufficiently large. This controlled study isolates a single physical change (peak width) under substantial positional jitter and shows how dimensionality alone can convert a weak per-pixel signal into a highly separable representation, providing a transparent explanation for model behaviour on spectroscopic data.
The results of experiment S3 are shown in Fig. 9. The two classes differ only by a small class-specific offset in the additive noise (mean 0 vs. 0.01, with SD 0.01); the underlying Lorentzian signal (centre distribution and width) is identical. The figure illustrates a general phenomenon in spectroscopy: when models receive spectra as high-dimensional vectors, even subtle distributional shifts (here, a 0.01 mean offset in the noise) can become highly separable as n increases. The ensemble and nearest-neighbour methods aggregate this diffuse evidence quickly; linear models eventually catch up as the √n gain from averaging overwhelms the noise; shallow single trees remain bias-limited.
• Region ρ1: 337 nm–380 nm: this region contains only noise and no chemical fingerprint.
• Region ρ2: 380 nm–420 nm: this region contains the Rayleigh scattering peak. As explained, this region has been removed from the spectra, to remove an easy route to high accuracy for the models.
• Region ρ3: 420 nm–630 nm: this region contains very weak fluorescence signals, and as such chemical information, mainly due to the oxidation products in olive oil.
• Region ρ4: 630 nm–775 nm: this region contains the strongest fluorescence signals (due to chlorophylls) and corresponds to the strongest chemical information.
• Region ρ5: 775 nm–800 nm: this region, similar to region ρ3, contains only weak chemical information, since only the tails of the main peak are present in this region.
To study the effect of noise and dimensionality, we will use mainly regions ρ1 and ρ3. Region ρ4 is less interesting for the discussion in this paper.
Remarkably, a random forest classifier trained on these “scrambled” spectra achieved an accuracy of 82% for EVOO vs. LOO and 81% for EVOO vs. VOO. Since a contiguous peak cannot exist in a shuffled vector, this result serves as evidence that the model is not “reading” the spectra in any chemical sense. Instead, it is exploiting the concentration of measure in high-dimensional space. This confirms that in 10³ dimensions, class-specific noise patterns and instrumental offsets become separable regardless of their physical meaning. Formulated more cautiously: the noise (understood as the non-fluorescence signal) provides a higher degree of statistical separability in high-dimensional space.
This global pixel permutation experiment suggests a more cautious but powerful formulation of the high-dimensionality paradox: statistical artefacts are often ‘easier’ for models to exploit than chemical signals. Because instrumental noise and baseline offsets provide a consistent, high-dimensional footprint, flexible models (like random forests) can reach high accuracy by following the path of maximum statistical separation, even when the physical ‘structure’ of the data has been completely destroyed.
Unlike the global shuffle (which yielded ∼82% accuracy), this independent shuffle caused the model performance to collapse to the baseline of the majority-class (∼60%). This contrast provides a definitive multi-stage proof: the observed high accuracies in spectroscopic ML are primarily driven by the covariance structure of non-chemical artefacts in high-dimensional space. When this structure is preserved (global shuffle), the model succeeds without chemistry; when it is destroyed (independent shuffle), the model's “infinite-dimensional” advantage vanishes. This confirms that the Feldman–Hájek effect, governed by class-specific covariance differences, is the functional root cause of the reported high performances.
It is important to distinguish between the preservation of statistical correlation and the preservation of chemical information. A global permutation destroys all physical contiguity and spectral shapes (the “chemistry”), yet it leaves the underlying covariance matrix Σ intact (although reindexed). The fact that accuracy remains at 82% after a global shuffle proves that the model is performing a geometric separation based on these re-indexed statistical correlations rather than any recognisable spectroscopic features.
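The contrast between the two shuffles can be reproduced on synthetic data. The sketch below is illustrative only: all parameters are assumptions, and we use a linear classifier rather than the paper's random forest for robustness. A fixed, zero-mean per-pixel “fingerprint” survives a global permutation (which merely reindexes the covariance structure) but is destroyed by independent per-spectrum permutations:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_pix, n_per_class = 500, 200

# Fixed, zero-mean "instrumental fingerprint": tiny per-pixel offsets for class 1
delta = rng.normal(0, 0.02, n_pix)
delta -= delta.mean()

X0 = rng.normal(0, 0.05, (n_per_class, n_pix))          # class 0: pure noise
X1 = delta + rng.normal(0, 0.05, (n_per_class, n_pix))  # class 1: noise + fingerprint
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

def cv_acc(Xa):
    return cross_val_score(LogisticRegression(max_iter=2000), Xa, y, cv=5).mean()

# 1) Original pixel order
acc_orig = cv_acc(X)

# 2) Global shuffle: ONE permutation applied to every spectrum
#    (pixel identity re-labelled, covariance structure preserved up to reindexing)
perm = rng.permutation(n_pix)
acc_global = cv_acc(X[:, perm])

# 3) Independent shuffle: a fresh permutation per spectrum (pixel identity destroyed)
X_indep = np.array([row[rng.permutation(n_pix)] for row in X])
acc_indep = cv_acc(X_indep)

print(f"original: {acc_orig:.2f}  global: {acc_global:.2f}  independent: {acc_indep:.2f}")
```

Accuracy is essentially unchanged under the global shuffle but collapses to chance under the independent shuffle, mirroring the multi-stage argument made in the text.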
The results, shown in Fig. 10, provide evidence for our thesis. As the number of randomly selected pixels increases, the classification accuracy climbs steeply, reaching very high values already with ca. 15–20 pixels. This occurs despite the fact that:
1. The pixels are chosen from a region lacking any known physico-chemical signals.
2. The pixels in each subset are not necessarily contiguous, destroying any potential “hidden” spectral shapes or features.
3. The LOO-CV validation ensures that the model is generalising to unseen oil samples, rather than overfitting specific measurements.
These findings suggest that the high accuracy often reported in the application of ML to spectroscopy is not necessarily a result of the model identifying complex chemical patterns. Instead, it highlights a fundamental property of high-dimensional spaces: as the dimensionality n (here represented by k) increases, even infinitesimal distributional differences in noise or instrumental offsets between classes become almost surely separable. This experiment reinforces our cautious formulation that non-chemical noise is statistically “easier” for models to exploit than the subtle chemical signatures sought by researchers.
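The pixel-subset experiment can be sketched synthetically as follows (a hedged illustration with assumed parameters, not the paper's data: the diffuse per-pixel offsets stand in for a hypothetical instrumental fingerprint in a chemically empty region):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n_pix, n_per_class = 400, 200

# A chemically "empty" region: pure noise plus a tiny, diffuse class-specific offset
delta = rng.normal(0, 0.03, n_pix)                       # hypothetical instrumental fingerprint
X0 = rng.normal(0, 0.05, (n_per_class, n_pix))
X1 = delta + rng.normal(0, 0.05, (n_per_class, n_pix))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

def acc_with_k_random_pixels(k):
    """Train a random forest on k randomly chosen, non-contiguous pixels."""
    cols = rng.choice(n_pix, size=k, replace=False)
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(rf, X[:, cols], y, cv=5).mean()

for k in [1, 5, 20, 50]:
    print(f"k = {k:3d}  CV accuracy = {acc_with_k_random_pixels(k):.3f}")
```

Even though no single pixel is strongly informative and the subsets destroy any contiguity, accuracy climbs steeply with k, reproducing the qualitative behaviour of Fig. 10.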
Fig. 10 clearly shows how a flexible model (such as a random forest) is able to use statistical differences in the data to achieve very good accuracy, even when it should not, from a chemical point of view, be able to. This is naturally due to the fact that the different classes have different covariance matrices, as can be seen, for example, for EVOO and LOO in Fig. 11. Note that the large red areas in the lower right part of the covariances are related to the main peak, while the light red regions are due to stray light from the excitation LED. Note that for all the tests, we removed the Rayleigh peak from the spectra, but we wanted to include it in the covariance matrices for completeness.
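A class-specific covariance difference of the kind discussed above can arise even when the per-pixel means and variances of two classes are identical. A minimal numerical illustration (assumed parameters; the per-spectrum drift is a stand-in for baseline or stray-light variation):

```python
import numpy as np

rng = np.random.default_rng(5)
n_pix, n_spec = 200, 300

# Class 0: independent pixel noise with variance 0.05^2.
# Class 1: same per-pixel variance (0.03^2 + 0.04^2 = 0.05^2), but with a
# per-spectrum scalar drift shared across all pixels (baseline-like artefact).
X0 = rng.normal(0, 0.05, (n_spec, n_pix))
drift = rng.normal(0, 0.03, (n_spec, 1))
X1 = drift + rng.normal(0, 0.04, (n_spec, n_pix))

C0 = np.cov(X0, rowvar=False)
C1 = np.cov(X1, rowvar=False)

# Mean absolute off-diagonal covariance: near zero for class 0, ~0.03^2 for class 1
def off(C):
    return np.abs(C - np.diag(np.diag(C))).sum() / (n_pix * (n_pix - 1))

print(f"class 0 off-diagonal: {off(C0):.2e}   class 1 off-diagonal: {off(C1):.2e}")
```

Although every individual pixel is distributed identically in both classes, the off-diagonal covariance differs strongly, which is exactly the kind of structure a flexible classifier can exploit without any chemical information.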
• Universality across tasks: comparing experiment EVOO vs. LOO and experiment EVOO vs. VOO reveals a striking commonality. Despite the differing chemical complexities of these tasks, the classification accuracy in the chemically empty region (ρ1, pixels 0–400) remains consistently high (>80%) in both cases (Panels G and H). This suggests that the model is exploiting a universal instrumental covariance structure rather than task-specific chemical markers.
• The dimensionality plateau: the transition from W = 20 to W = 400 illustrates the impact of dimensionality on separability. In the 400-pixel windows, the accuracy reaches a stable plateau that is indifferent to the underlying spectral profile.
These results confirm that high importance in an ML model can often be an artefact of high-dimensional geometry. When a model can achieve 80% accuracy using only randomised noise pixels (as shown in experiment Ra1/Rb1) or empty spectral windows, the standard interpretation of model weights as “chemical signatures” becomes invalid. We conclude that in 10³-dimensional space, the most stable discriminant is frequently the global statistical fingerprint of the background, creating a deceptive “path of least resistance” that bypasses the intended chemical analysis.
However, this identifies a statistical shortcut rather than a chemical signature. In high-dimensional spaces, such shortcuts are more robust, and they allow the model to minimise training loss more easily than the complex, non-linear signals of chemical peaks do. Thus, a high SHAP value in a noise-dominated region is not a sign of a hidden chemical feature; it is an empirical confirmation that the model has successfully exploited the geometric separability of the instrumental background.
To explain the model's decisions, we employed SHAP. For each window size W ∈ {20, 50, 200, 400}, we trained a random forest classifier on the localised spectral segment Xstart:start+W. SHAP values were calculated using the TreeExplainer algorithm. To quantify the importance of a spectral window, we calculated the global mean absolute SHAP value. This metric allows us to map which spectral regions were used as primary discriminants by the model. The results can be found in Fig. 13. The feature attribution analysis was performed on both the classification of EVOO vs. LOO and EVOO vs. VOO. As shown in Fig. 13, the attribution profiles reveal a significant decoupling between the spectral regions identified as important by the model and the physico-chemical signal. Across all window sizes (W = 20 to W = 400 px), the model assigns high importance to spectral regions where the chemical signal is low or entirely absent. Notably, in the 400-pixel window regime (Panels G and H), the importance assigned to the noise-only region (pixels 0–400) is comparable to, or even exceeds, the importance assigned to the primary fluorescence peaks (pixels 600–800). This indicates that the model is not relying on specific chemical markers, but is instead utilising the global high-dimensional background as a primary discriminant.
When feature selection or wavelength band selection is applied to spectroscopy data, the high dimensionality of spectra can produce misleading results. Since classifiers can exploit minute distributional differences, even in spectral regions that contain no physico-chemically meaningful information, commonly used importance approaches often highlight bands that are merely correlated with noise patterns or instrument artefacts. For instance, a random forest might consistently assign high importance to regions far from characteristic peaks, not because those wavelengths encode chemical signatures but because small statistical fluctuations in those regions suffice to separate classes in high-dimensional space. This effect can lead spectroscopists to misinterpret the outcome of ML models. A feature ranking that emphasises noise-driven regions can be taken as evidence of a new “hidden” marker, while in reality the model is simply exploiting spurious differences in baseline or detector noise. As a result, band-selection workflows risk reinforcing artefacts rather than guiding the discovery of meaningful chemical or physical features. This danger is especially acute when spectra are normalised or preprocessed, since those steps may amplify or redistribute noise in ways that make certain bands appear systematically discriminative.
Therefore, great caution is required when interpreting the output of band-importance methods. Any highlighted region should be cross-validated against established chemical knowledge or verified with independent measurements. Without this step, spectroscopists risk drawing incorrect conclusions, such as attributing predictive power to wavelength regions that carry no true spectroscopic signal. The findings of this study suggest that feature selection in spectroscopy, if performed without domain knowledge, can easily mislead and produce models that generalise poorly across instruments, conditions, or sample sets.
Dark signal and stray light must be mentioned in this context. In fact, they act as structured “noise” that can differ by instrument, session, or acquisition order and can alone enable near-perfect separation in high dimension and mislead band-importance analyses (e.g., highlighting off-peak regions with no chemical content). Models trained under such conditions may fail to generalise across instruments or setups, despite excellent internal validation. Practically, this calls for rigorous controls: randomise acquisitions across classes, replicate across instruments/sessions, evaluate with leave-instrument/session-out validation, and verify that accuracy collapses when noise statistics are equalised (e.g., per-scan mean/variance standardisation or explicit dark/stray-light correction). Only signals that remain discriminative under these checks should be interpreted as chemically meaningful.
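The recommended control of equalising noise statistics can be illustrated with a small synthetic sketch (assumed parameters; a constant per-class baseline offset plays the role of a dark-signal artefact). Per-scan mean/variance standardisation removes the artefact exactly, and accuracy collapses to chance, showing that the original separation was non-chemical:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n_pix, n_per_class = 300, 150

# Class 1 carries a constant dark-signal-like baseline offset; no chemical difference at all
X0 = rng.normal(0, 0.05, (n_per_class, n_pix))
X1 = 0.03 + rng.normal(0, 0.05, (n_per_class, n_pix))
X = np.vstack([X0, X1])
y = np.array([0] * n_per_class + [1] * n_per_class)

def cv_acc(Xa):
    return cross_val_score(LogisticRegression(max_iter=2000), Xa, y, cv=5).mean()

acc_raw = cv_acc(X)

# Per-scan standardisation: subtract each spectrum's mean, divide by its SD
X_std = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
acc_std = cv_acc(X_std)

print(f"raw: {acc_raw:.2f}   per-scan standardised: {acc_std:.2f}")
```

A genuine chemical discriminant (a class-specific spectral shape) would survive this standardisation, which is why the collapse of accuracy is a useful diagnostic for artefact-driven models.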
Ultimately, this work should not be interpreted as a general refutation of machine learning in spectroscopy, but rather as a call for a more rigorous, evidence-based framework for model validation; we propose that high classification accuracy must be accompanied by regional sensitivity audits—such as the windowed SHAP analysis and global shuffle tests presented here—to ensure that model success is derived from verifiable chemical signatures rather than high-dimensional statistical shortcuts.
When applying machine learning to spectroscopy, it is essential to check whether models are separating classes based on chemically meaningful information or on trivial artefacts. A useful diagnostic is to test performance on wavelength regions that should be indistinguishable and contain no chemical signal; if the model still performs above chance, then separability is likely driven by noise or measurement artefacts.
Preprocessing choices also play a critical role. Steps such as baseline subtraction or normalisation can unintentionally amplify or suppress noise patterns, creating the illusion of meaningful separation. Similarly, spectral band importance methods (e.g. feature maps from random forests, SVMs, or SHAP values) may highlight regions that correspond to noise rather than true peaks, and therefore these results should be interpreted with great caution.
In conclusion, spectroscopists must remain aware that models trained on data from one instrument or measurement setup may not generalise to another. Retraining or re-validation is essential when changing experimental conditions. The safest approach is to combine machine learning with domain knowledge of peak positions, line shapes, and chemical constraints, and to begin with synthetic or well-characterised spectra where the discriminative features are known. This provides a baseline to ensure that models are learning physically relevant information rather than statistical quirks of the dataset.
• Overfitting typically occurs when the complexity of a model (number of parameters) is too high relative to the number of samples N. In this state, the model “memorises” specific noise fluctuations in the training set that do not exist in the population.
• High-dimensional separability (the Feldman–Hájek effect) is a geometric property where two distributions become mutually singular as the number of dimensions n increases. In this case, the model is not necessarily “memorising” noise; rather, it is correctly identifying that in 10³ dimensions, the classes occupy disjoint regions of space due to minute differences in their global covariance or mean.
A key diagnostic to distinguish the two is the rate of convergence to perfect accuracy. In classical overfitting, apparent (training) accuracy usually improves as the number of samples N decreases, since memorisation becomes easier. In contrast, high-dimensional separability is driven by the number of pixels n. As shown in our experiments (Fig. 6 and 10), even with a fixed or increasing sample size, accuracy increases steadily as more spectral points are added. Furthermore, our “shuffle” experiments demonstrate that the model is exploiting global statistical distributions, which are properties of the population, rather than just local pixel-wise noise.
It is also important to note that, even in high dimensions and with clear statistical differences, specific model classes may fail to reach high accuracy. The fact that two classes are, in principle, perfectly separable does not mean that every model can separate them, or that doing so is easy. The decision boundary may be too complex for particular model classes to capture, and thus even very flexible models might achieve low classification accuracy, despite the high dimensionality.
We hope that our findings serve as a useful framework for the field: we propose that the standard for “model success” must be elevated from simple cross-validation accuracy to a rigorous regional sensitivity audit. The windowed SHAP importance maps, the global shuffle tests, and the physical feature-removal protocols developed here provide a possible blueprint for a new generation of “physically-aware” machine learning. By adopting these stress tests, the spectroscopy community can safeguard against the publication of non-replicable “phantom” models and ensure that the power of artificial intelligence is harnessed to uncover genuine molecular insights rather than high-dimensional geometric artefacts.
While this study utilises fluorescence spectra characterised by broad features and significant instrumental background, the impact of high-dimensional statistical shortcuts may vary in datasets containing denser chemical information, such as the infrared fingerprint regions of biological tissues, where the higher ‘chemical contrast’ might offer more robust physical discriminants.
Footnote
† The authors do not want to imply that horses are not intelligent animals, only that they cannot classify spectra accurately.
This journal is © The Royal Society of Chemistry 2026