Kaiqiang Que,†ab Xiaoyong He,†*b Tingrui Liang,b Zhenman Gaob and Xi Wu*a
aGuangxi Key Laboratory of Information Functional Materials and Intelligent Information Processing,School of Physics & Electronics, Nanning Normal University, Nanning 530001, China. E-mail: wuxiak@126.com
bSchool of Telecommunications Engineering & Intelligentization, Dongguan University of Technology, Dongguan 523808, China. E-mail: hxy@dgut.edu.cn
First published on 28th October 2025
Building on the spectral stability and sensitivity of femtosecond laser-ablation spark induced breakdown spectroscopy (fs-LA-SIBS), and the pattern recognition capacity of deep learning, a DeepRNN spectral-statistical fusion framework is introduced for high-precision classification of industrial-grade steel alloys. The framework fuses two-channel raw spectra with statistical descriptors, and employs configurable bidirectional recurrent encoders (GRU, LSTM, and vanilla RNN) to capture temporal dependencies and shape a robust decision space under end-to-end training. Under a unified evaluation protocol, the DeepRNN framework models are benchmarked against CNN and Transformer, and against traditional machine learning methods including RF, SVM, and PLS-DA; wavelength contribution analysis is performed to identify discriminative regions and interpretable importance profiles. Under the DeepRNN framework, the three encoders consistently outperform CNN, Transformer, and traditional machine learning baselines on core metrics including accuracy, cross-split consistency, and perturbation robustness, with average accuracy improved by approximately 2.35 to 3.50 percentage points compared to CNN and Transformer, and by 6.58 to 15.40 percentage points relative to traditional baselines. They also achieve favorable trade-offs among accuracy, efficiency, and deployability, with wavelength importance aligning with physically meaningful line structures. This sensor-intelligent system enables scenario-oriented deployment: vanilla RNN is chosen when accuracy is paramount; GRU is suitable for low-latency, energy-constrained online monitoring; and LSTM is preferred for the most conservative optimization trajectory and robustness under complex conditions, providing a scalable pathway for real-time industrial alloy identification and quality control.
However, unlocking LIBS's full potential for robust, high-precision classification of complex alloys necessitates overcoming challenges in spectral stability and the intricate interpretation of high-dimensional data. Machine learning (ML), particularly deep learning (DL), has become pivotal in addressing these hurdles.13 Moving beyond foundational methods (PLS-DA,14,15 SVM,16,17 RF18,19), sophisticated DL architectures are setting new benchmarks. Optimized artificial neural networks (ANN) have achieved remarkable accuracy (>95%) in ceramic provenance studies,20 while deep belief networks (DBN) significantly outperformed traditional ANN in classifying diverse steel types.21 Convolutional neural networks (CNNs) excel at extracting spatial features from raw spectra, demonstrably reducing prediction errors in soil analysis.22 Recurrent structures and attention mechanisms, exemplified by iTransformer-BiLSTM models, effectively decode complex temporal correlations and non-linearities, achieving exceptional quantification performance (R2 > 0.98, RMSE <0.1 wt%) for rare earth elements.23 Furthermore, innovative multimodal fusion strategies, such as combining LIBS with plasma acoustic signals, have substantially boosted classification accuracy for steels.24 Collectively, these advances underscore the critical role of advanced model architectures and intelligent feature/data utilization in elevating LIBS-based classification.
Femtosecond laser-induced breakdown spectroscopy (fs-LIBS) employs ultrashort pulses to minimize thermal effects and plasma interference, substantially advancing analytical fidelity. The femtosecond laser's extreme peak intensity and narrow pulse width enable rapid energy deposition, promoting immediate ionization and dissociation with negligible plasma shielding. This suppresses blackbody radiation and accelerates plasma cooling, shortening plasma lifetime. Consequently, fs-LIBS offers enhanced spectral line discernibility, richer emission features, and a markedly reduced continuous background compared to ns-LIBS – resulting in significantly improved signal intensity and signal-to-background ratio. The ultrashort interaction also diminishes background noise, allowing near-instantaneous detection. Furthermore, direct vaporization without melting minimizes sample damage and material deposition. These attributes collectively enable superior reproducibility,25 peak power,26 and spatial resolution27 over ns-LIBS. This translates to enhanced resilience against complex matrices, as evidenced by significantly lower prediction errors (<20% vs. 35% for ns-LIBS) in plant analysis.28 Building decisively on this, femtosecond laser ablation spark-induced breakdown spectroscopy (fs-LA-SIBS)29–31 synergizes femtosecond ablation with spark discharge re-excitation, dramatically amplifying emission intensity and analytical sensitivity. Critically, He et al. demonstrated that fs-LA-SIBS achieved 4–12-fold lower detection limits for trace elements in steel alloys compared to fs-LIBS alone, alongside >14-fold increases in calibration slopes, enabling high spatial resolution and simultaneous multi-element analysis at kHz repetition rates32 – positioning it as an exceptionally powerful technique for industrial high-sensitivity detection.
Capitalizing on the spectral stability and sensitivity of fs-LA-SIBS, along with deep learning's strengths in pattern recognition, this study presents a DeepRNN spectral-statistical fusion framework for high-precision identification of industrial-grade steel alloys. The framework integrates two-channel raw spectra with statistical descriptors including peak count, mean peak intensity, global mean, and global standard deviation, and employs configurable bidirectional recurrent encoders (GRU, LSTM, and vanilla RNN) to capture temporal dependencies, shaping a robust decision space under end-to-end training. Under a unified evaluation protocol, the framework is benchmarked against CNN, Transformer, and traditional machine learning methods including RF, SVM, PLS-DA, with wavelength contribution analysis used to identify discriminative regions and interpretable importance profiles. Under the DeepRNN framework, all three encoders consistently outperform CNN, Transformer, and traditional baselines across core metrics like accuracy, cross-split consistency, and perturbation robustness, while balancing accuracy, efficiency, and deployability. Wavelength importance aligns with physically meaningful line structures, supporting scenario-specific deployment: vanilla RNN for accuracy priority, GRU for low-latency, energy-constrained online monitoring, and LSTM for conservative optimization trajectories and robustness under complex conditions, thus providing a scalable technical approach for real-time industrial alloy inspection and quality management.
Fig. 1 Schematic diagram of the femtosecond laser ablation spark-induced breakdown spectroscopy (fs-LA-SIBS) experimental setup.
The synchronized spark discharge unit, triggered by plasma generation, comprises a DC high-voltage source (10 kV, 0.2 A), a current-limiting resistor (R = 100 kΩ), a 10 nF capacitor bank, and a tungsten needle anode tilted at 45° (2 mm from the sample). Plasma emission is collected and collimated by two plano-convex quartz lenses (L2: f = 100 mm; L3: f = 100 mm), then guided through a fiber-optic probe into a triple-channel spectrometer. The spectrometer covers a wavelength range of 200–500 nm, divided into three sub-ranges (200–317 nm, 317–415 nm, and 415–500 nm), each equipped with a 2048-pixel CCD detector (spectral resolution < 0.1 nm) for precise atomic line detection.
The dataset comprised 804 high-resolution continuous spectral samples across the 200–500 nm range. The nine classes demonstrated near-perfect balance (disparity ratio: 0.8%), and post-calibration class proportions were constrained to 10.8–11.6%, eliminating bias and ensuring data representativeness. A stratified 5 : 1 : 4 split allocated 50%, 10%, and 40% of the data to training, validation, and test sets, respectively. To rigorously evaluate model robustness, the test set was further partitioned into four 10%-sized subsets. This design enabled multi-scenario validation of model stability and generalization, offering granular insights for performance optimization while maintaining statistical reliability.
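The stratified 5 : 1 : 4 protocol described above can be sketched as follows (a NumPy-only sketch; the function name, seed, and per-class counts are ours and purely illustrative):

```python
import numpy as np

def stratified_split(labels, seed=0):
    """Per-class 50/10/40 split; the 40% test portion is further cut
    into four equal 10% subsets for multi-scenario robustness checks.
    Returns index arrays: train, validation, and four test folds."""
    rng = np.random.default_rng(seed)
    train, val, tests = [], [], [[], [], [], []]
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_tr, n_va = int(0.5 * len(idx)), int(0.1 * len(idx))
        train.extend(idx[:n_tr])
        val.extend(idx[n_tr:n_tr + n_va])
        # remaining 40% of this class, spread over the four test folds
        for k, fold in enumerate(np.array_split(idx[n_tr + n_va:], 4)):
            tests[k].extend(fold)
    return np.array(train), np.array(val), [np.array(t) for t in tests]

labels = np.repeat(np.arange(9), 90)  # toy stand-in: 9 balanced classes
tr, va, folds = stratified_split(labels)
```

Splitting within each class before pooling is what keeps the 10.8–11.6% class proportions intact in every subset.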
| No. | Sample ID | Al | Cr | Cu | Mn | Mo | Ni | V |
|---|---|---|---|---|---|---|---|---|
| 1 | ZBG021a | 0.872 | 1.510 | 0.029 | 0.443 | 0.175 | 0.025 | — |
| 2 | ZBG214a | 0.017 | 1.040 | 0.029 | 0.387 | 0.016 | 0.025 | — |
| 3 | ZBG216a | — | 0.034 | 0.036 | 0.858 | — | 0.0086 | 0.002 |
| 4 | ZBG217a | — | 4.020 | 0.092 | 0.232 | 0.013 | 0.076 | 1.180 |
| 5 | ZBG235 | — | 4.100 | 0.059 | 0.184 | 1.080 | 0.068 | 4.050 |
| 6 | ZBG247 | 0.032 | 0.126 | 0.056 | 1.160 | 0.006 | 0.017 | 0.096 |
| 7 | ZBG251 | 0.011 | 0.034 | 0.017 | 0.568 | — | 0.012 | 0.003 |
| 8 | ZBG224 | — | 0.151 | 0.018 | 1.210 | — | 0.018 | — |
| 9 | ZBG234a | — | 4.210 | 0.114 | 0.284 | 3.080 | 0.146 | 1.520 |
| λnorm = (λ − λmin)/(λmax − λmin) | (1) |
| Inorm = (I − Imin)/(Imax − Imin) | (2) |
| T = 0.8 × max(Inorm) | (3) |
Key metrics of the peak set P were calculated:
| Npeaks = |P| | (4) |
| µpeaks = (1/Npeaks) Σp∈P Inorm(p) | (5) |
| µglobal = (1/N) Σj=1…N Inorm(j) | (6) |
| σglobal = √[(1/N) Σj=1…N (Inorm(j) − µglobal)²] | (7) |
These equations define global statistics where N is the total number of data points in the spectral sequence. Inorm(j) represents the normalized intensity value at position j. µglobal is the arithmetic mean of all normalized intensity values, while σglobal is their standard deviation, quantifying the dispersion around the mean.
| Xspectral = stack([λnorm, Inorm]) ∈ RN×2 | (8) |
| Xstats = [Npeaks, µpeaks, µglobal, σglobal] ∈ R4 | (9) |
The resulting spectral feature matrix Xspectral preserves the sequential information in an N × 2 format, while the statistical feature vector Xstats encapsulates the four extracted characteristics: peak count, average peak intensity, global mean, and global standard deviation.
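The feature-construction pipeline defined by eqns (1)–(9) can be sketched in a few lines (NumPy only; the function name is ours):

```python
import numpy as np

def extract_features(wavelength, intensity):
    """Builds the two-branch inputs: min-max normalization (eqns 1-2),
    relative-threshold peak picking (eqn 3), the four statistics
    (eqns 4-7), and the stacked outputs of eqns 8-9."""
    lam = (wavelength - wavelength.min()) / (wavelength.max() - wavelength.min())
    inorm = (intensity - intensity.min()) / (intensity.max() - intensity.min())
    T = 0.8 * inorm.max()                        # eqn (3): peak threshold
    peaks = inorm[inorm > T]                     # peak set P
    n_peaks = len(peaks)                         # eqn (4): Npeaks = |P|
    mu_peaks = peaks.mean() if n_peaks else 0.0  # eqn (5)
    mu_g, sd_g = inorm.mean(), inorm.std()       # eqns (6)-(7)
    x_spec = np.stack([lam, inorm], axis=1)      # eqn (8): N x 2 sequence
    x_stat = np.array([n_peaks, mu_peaks, mu_g, sd_g])  # eqn (9)
    return x_spec, x_stat

wl = np.linspace(200.0, 500.0, 2048)                  # one CCD channel
spec = np.exp(-0.5 * ((wl - 324.75) / 0.05) ** 2)     # toy Cu-like line
x_spec, x_stat = extract_features(wl, spec)
```

Because the threshold in eqn (3) is relative to the normalized maximum, the peak set adapts to shot-to-shot intensity drift without retuning.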
(1) Input layer: the model ingests two complementary inputs: (i) a two-channel spectral sequence [λ, I], where wavelength and intensity are min–max normalized to [0, 1]; and (ii) a four-dimensional statistical descriptor comprising peak count, mean peak intensity, global mean intensity, and global intensity standard deviation. The sequence captures local line shapes and continuum trends, whereas the statistics summarize global distributional properties.
(2) Recurrent layer: to capture sequential dependencies and spectral context, configurable recurrent encoders (GRU, LSTM, and vanilla RNN) are employed with bidirectional processing and stacked depth (default hidden size 256). Inter-layer dropout (rate = 0.2) is used when depth > 1. The final time-step output from both directions is extracted as a global temporal representation that aggregates pre-peak and post-peak contextual information.
(3) Feature fusion layer: the temporal encoding is concatenated with the four-dimensional statistical vector to form a comprehensive representation, enabling the classifier to learn, end-to-end, how to weight dynamic sequence patterns versus global distributional cues.
(4) Fully connected layers: a hierarchical MLP (512 → 256 → 128 → 64) with ReLU and dropout (rate = 0.2) progressively condenses the fused features into a highly discriminative subspace. A final linear layer projects to the number of target classes.
(5) Output layer: the network outputs class logits. During training, a standard multi-class cross-entropy is typically employed (softmax applied implicitly to logits); probabilities at inference can be obtained by explicitly applying softmax. Parameters are optimized end-to-end via backpropagation.
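The five layers above can be illustrated with a forward-only sketch (NumPy, random toy weights, our own names; the real model uses hidden size 256, stacked bidirectional recurrent layers, dropout, and the 512 → 256 → 128 → 64 MLP head, all condensed here into a single linear head for brevity):

```python
import numpy as np

def rnn_pass(x, Wx, Wh, b):
    """Single-direction vanilla RNN; returns the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for t in range(x.shape[0]):
        h = np.tanh(Wx @ x[t] + Wh @ h + b)
    return h

def deep_rnn_forward(x_spec, x_stat, params):
    """Steps (1)-(5): bidirectional encoding of the N x 2 sequence,
    concatenation with the 4-d statistics (fusion layer), then a
    linear head producing class logits."""
    h_fw = rnn_pass(x_spec, *params["fw"])        # forward direction
    h_bw = rnn_pass(x_spec[::-1], *params["bw"])  # backward direction
    fused = np.concatenate([h_fw, h_bw, x_stat])  # step (3): fusion
    return params["W_out"] @ fused                # step (5): logits

rng = np.random.default_rng(0)
H, C = 16, 9  # toy hidden size; nine steel grades
params = {
    "fw": (rng.normal(0, 0.1, (H, 2)), rng.normal(0, 0.1, (H, H)), np.zeros(H)),
    "bw": (rng.normal(0, 0.1, (H, 2)), rng.normal(0, 0.1, (H, H)), np.zeros(H)),
    "W_out": rng.normal(0, 0.1, (C, 2 * H + 4)),
}
logits = deep_rnn_forward(rng.random((64, 2)), rng.random(4), params)
```

Concatenating the two final hidden states gives the classifier both pre-peak and post-peak context, which is the stated motivation for the bidirectional design.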
| Significant element | Main spectral lines (nm) |
|---|---|
| Al | 237.312, 260.089, 308.215, 309.271, 394.401, 396.152 |
| Cr | 357.868, 425.433, 427.480, 428.973 |
| Cu | 324.754, 327.396 |
| Mn | 257.610, 279.482, 403.076, 403.307, 403.449, 404.136 |
| Mo | 202.030, 313.259, 379.825, 386.410 |
| Ni | 341.476, 345.846, 346.165, 346.346, 349.296, 352.454 |
| V | 292.402, 437.923, 438.998, 440.851 |
Fig. 4 Training dynamics of three RNN architectures under the DeepRNN framework: (a) GRU, (b) LSTM, and (c) vanilla RNN.
In conjunction with Table 3 and architectural factors, the three DeepRNN framework models show a clear profile of performance and efficiency under an identical protocol. The vanilla RNN uses ungated linear recurrence that fits strongly and plateaus early at moderate computational cost, reaching an accuracy of 99.00% with a training time of 5911.60 s. The GRU employs lightweight update and reset gates to balance temporal modeling against parameter count and computation, achieving an accuracy of 97.85% with the fastest training time of 1604.83 s, offering the best efficiency. The LSTM, with input, forget, and output gates plus a cell state, captures long-range dependencies robustly and trains more smoothly and conservatively, reaching an accuracy of 97.99% with a training time of 10 529.67 s. Accordingly, choose the vanilla RNN when accuracy is paramount, the GRU when efficiency and real-time operation are critical, and the LSTM when robust long-range dependency modeling and a smoother optimization trajectory are desired.
| Model | Average accuracy | Average F1-score | Training time (s) |
|---|---|---|---|
| a GRU, LSTM and vanilla RNN under the DeepRNN framework were evaluated under identical hyperparameters: epochs = 300, dropout = 0.2, learning rate = 5 × 10−4, batch size = 512; these models used the same preprocessing, split, random seed, optimizer, and classification head to guarantee a fair comparison. | |||
| GRU | 97.85% ± 0.34% | 97.86% ± 0.34% | 1604.83 |
| LSTM | 97.99% ± 0.34% | 97.99% ± 0.34% | 10 529.67 |
| RNN | 99.00% ± 0.13% | 99.01% ± 0.13% | 5911.60 |
Under the DeepRNN framework, all three models exhibit strong diagonal dominance, and the micro-averaged and macro-averaged ROC curves are nearly unity. Vanilla RNN shows the sparsest off-diagonal entries; the most salient mistakes are class 0 predicted as class 7 (40 cases) and class 6 predicted as class 5 (34 cases), with most others being single-digit counts. In the low-FPR inset, its class curves cluster closest to the top-left corner with the smallest dispersion; the legend reports Micro-avg AUC = 0.9999 and Macro-avg AUC = 0.9999. GRU presents more visible off-diagonal mass, with representative errors including class 2 predicted as class 7 (41 cases), class 5 as class 6 (37 cases), and class 0 as class 1 (34 cases); its low-FPR curves are more spread out, with Micro-avg AUC = 0.9998 and Macro-avg AUC = 0.9997. LSTM lies between the two: key errors are class 6 predicted as class 5 (38 cases), class 0 as class 7 (27 cases), and class 5 as class 1 (31 cases); its low-FPR stability is better than GRU but slightly below vanilla RNN, with Micro-avg AUC = 0.9997 and Macro-avg AUC = 0.9996 (Fig. 5). Overall, while all three models achieve near-perfect ROC performance, ranking by off-diagonal sparsity and low-FPR stability places the vanilla RNN first, followed by LSTM and then GRU. This difference is primarily attributable to the task's sequence length and quasi-stationary dynamics: the plain recurrent update provides a stronger inductive bias and smoother decision boundaries, yielding more robust low-FPR calibration, whereas LSTM/GRU gating introduces mild over-suppression and variance in this setting, leading to slightly more near-boundary errors.
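The micro- and macro-averaged one-vs-rest AUCs quoted above can be reproduced with a short rank-based routine (a NumPy-only sketch; the function names are ours):

```python
import numpy as np

def auc_rank(y, s):
    """Binary ROC AUC via the rank-sum (Mann-Whitney) statistic."""
    order = np.argsort(s)
    ranks = np.empty(len(s))
    ranks[order] = np.arange(1, len(s) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def ovr_auc(y_true, scores):
    """Micro- and macro-averaged one-vs-rest AUC from an
    (n_samples, n_classes) score matrix and integer labels."""
    onehot = np.eye(scores.shape[1])[y_true]
    micro = auc_rank(onehot.ravel(), scores.ravel())  # pool all pairs
    macro = np.mean([auc_rank(onehot[:, c], scores[:, c])
                     for c in range(scores.shape[1])])  # per-class mean
    return micro, macro
```

Micro-averaging pools every (sample, class) decision, so it is dominated by overall ranking quality; macro-averaging weights each class equally, exposing any single weak class.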
Fig. 5 Confusion matrices and ROC curves for three RNN variants under the DeepRNN framework: (a) GRU; (b) LSTM; (c) vanilla RNN.
t-SNE was employed to project the fused penultimate representations from each DeepRNN framework model, with the aim of evaluating class separability, intra-class compactness, and boundary overlap under realistic spectral variability. Fig. 7 demonstrates that all models effectively structure the feature space overall. The vanilla RNN variant forms tighter, better-separated clusters with clearer boundaries, whereas GRU and LSTM exhibit broader dispersion and greater overlap between adjacent grades – consistent with stronger gating mechanisms that smooth fine spectral contrasts and soften nearby decision boundaries. Local mixing near cluster edges aligns with the chemical similarity of adjacent grades rather than model instability. The dual-branch design, which concatenates the sequence encoder state with explicit statistical descriptors, yields structured manifolds, highlighting the complementary contributions of global temporal context and statistical features.
Fig. 7 Comparative t-SNE analysis under the DeepRNN framework: (a) GRU, (b) LSTM, and (c) vanilla RNN.
As shown in Fig. 8, the ablation study indicates that explicit statistical features are crucial to model discriminative power. When explicit statistics are concatenated with the encoder state, GRU achieves 97.85% accuracy and 97.86% average F1-score, but drops to 11.65% and 2.32% without them. LSTM remains relatively resilient at 88.63% and 88.66% without statistics and rises to 97.99% and 97.99% with them, reflecting stronger modeling of long-range dependencies and noise suppression. Vanilla RNN falls from 99.00% and 99.01% to 11.54% and 2.30%, indicating heavier reliance on statistical priors. Architecturally, LSTM's input, forget, and output gates, together with its near-linear cell state, enable additive updates and stable gradient flow, preserving long-term context and dampening high-frequency perturbations. In industrial spectral applications, these explicit statistics supply stable global profiles and key peak cues that complement temporal representations, markedly improving inter-class separability and class balance, so that accuracy and average F1-score increase substantially for all three architectures.
As shown in Fig. 9, we estimate relative contributions using a gradient-based attribution: we measure the output sensitivity to each of the four statistical features, average over the validation set, and normalize to percentages. The descriptors display a consistent hierarchy across models: peak count dominates in GRU, LSTM, and vanilla RNN with 81.98%, 74.15%, and 85.34%, respectively; peak intensity mean contributes 12.88%, 18.77%, and 9.74%; mean intensity and std. intensity are minor at 3.49%, 4.74%, 3.16% and 1.65%, 2.33%, 1.76%. These patterns align with the ablation findings: peak structure and counting provide the primary discriminative prior, while amplitude statistics further tighten decision boundaries; the explicit statistics branch markedly enhances all architectures. Specifically, vanilla RNN shows the strongest dependence and suffers the largest drop when the statistics are removed; LSTM remains comparatively robust without statistics yet gains substantially when peak count and amplitude cues are fused, evidencing complementarity between gated temporal encoding and explicit statistics; additionally, GRU's update and reset gates favor modeling short-to-medium dependencies, placing its no-statistics performance between LSTM and vanilla RNN, while fusing peak count and peak intensity mean noticeably sharpens its decision boundaries, yielding consistent improvements with lower parameter and compute costs.
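The attribution procedure described above can be sketched with a finite-difference stand-in for the gradient (NumPy only; the toy model, its weights, and the function names are illustrative, not the paper's implementation):

```python
import numpy as np

def stat_attribution(model, x_stats, eps=1e-3):
    """Perturb each of the four statistical features, average the
    absolute change in the model output over the evaluation set,
    and normalize the sensitivities to percentages."""
    sens = np.zeros(x_stats.shape[1])
    for j in range(x_stats.shape[1]):
        xp = x_stats.copy()
        xp[:, j] += eps                       # nudge one feature
        sens[j] = np.mean(np.abs(model(xp) - model(x_stats))) / eps
    return 100 * sens / sens.sum()            # contribution in percent

# toy linear "model" whose weights fix the expected sensitivities
w = np.array([8.0, 1.0, 0.5, 0.5])
model = lambda x: x @ w
shares = stat_attribution(model, np.random.default_rng(0).random((32, 4)))
```

For a linear model the finite difference recovers the weights exactly, so the shares come out proportional to |w|; for the trained networks the same averaging over the validation set yields the percentage hierarchies reported in Fig. 9.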
| Model | Average accuracy | Average F1-score |
|---|---|---|
| CNN | 94.91% ± 0.15% | 94.94% ± 0.15% |
| Transformer | 95.50% ± 0.23% | 95.48% ± 0.23% |
Fig. 10a and b show strong diagonal dominance in both confusion matrices, with only a few systematic errors among neighboring classes. For the CNN, the most salient errors are class 5 predicted as class 6 (276 cases) and class 7 predicted as class 0 (153 cases); errors are concentrated on a few class pairs with large counts, forming dominant off-diagonal entries. The Transformer reduces these dominant errors without introducing many new ones, yielding overall sparser off-diagonal patterns with only a few scattered low-count entries; its two most frequent errors are class 8 predicted as class 3 (86 cases) and class 6 predicted as class 5 (63 cases). The multi-class ROC results are consistent: the CNN attains a micro-average AUC of 0.9987 and a macro-average AUC of 0.9994, whereas the Transformer reaches 0.9979 and 0.9984. Both the main plots and the low-FPR inset indicate high TPR when FPR < 0.05; however, the inset shows the CNN curves lying closer to the top-left corner with smaller class-to-class dispersion, suggesting slightly better stability in the low-FPR regime than the Transformer. Regarding the causes, the concentrated off-diagonal entries indicate class pairs with closely spaced decision boundaries; the CNN's local-convolutional inductive bias is robust when local evidence suffices but is more prone to confusion where stronger global context is required. By contrast, the Transformer's self-attention integrates long-range spectral cues and yields more optimal point decisions around the chosen threshold – thus achieving higher accuracy – whereas AUC evaluates probability ranking and robustness across all thresholds, so it can be slightly lower even when accuracy is higher.
Under the same training protocol, the DeepRNN framework models improve average accuracy and F1 by about 2–4 percentage points over CNN and Transformer. Their advantage stems from an inductive bias toward sequential/spectral continuity and gating that preserves long-range dependencies, which stabilizes discrimination near close class boundaries and mitigates overfitting with limited data. Overall, considering the trade-off among performance, efficiency, and stability, the DeepRNN framework is preferred in this study, with CNN/Transformer used as enhancements or ensemble options in specific scenarios.
| Model | Average accuracy | Average F1-score |
|---|---|---|
| SVM | 86.92% ± 0.49% | 87.21% ± 0.49% |
| PLS-DA | 84.60% ± 0.34% | 84.60% ± 0.35% |
| RF | 91.27% ± 0.40% | 91.27% ± 0.40% |
As shown in Fig. 11, compared with the deep learning models, the traditional methods (SVM, PLS-DA, and RF) display more dispersed off-diagonal entries in their confusion matrices, with more prevalent confusions among neighboring classes. In the low-FPR insets, class curves are less tightly clustered near the top-left corner, indicating weaker stability; this is most pronounced for PLS-DA, while SVM and RF partly mitigate it but still lag behind. Consistently, their macro-average AUCs are lower than those of the deep learning models, pointing to inferior cross-class consistency and probability-ranking robustness. The gap primarily reflects representational capacity and inductive bias: PLS-DA is constrained by linear projections and struggles with complex or overlapping boundaries; SVM enhances nonlinearity through kernels yet remains limited in modeling long-range dependencies and contextual composition for sequential/spectral data; RF captures nonlinearity via tree ensembles but tends to have weaker probability calibration and limited resolution near fine decision boundaries. In contrast, deep learning models (CNN/Transformer/DeepRNN) encode multi-scale local-to-global cues and temporal dependencies more effectively, yielding stronger diagonal concentration, better low-FPR stability, and superior macro/micro ROC performance.
Fig. 11 Confusion matrices and ROC curves for traditional machine learning models: (a) SVM; (b) PLS-DA; (c) RF.
Footnote
† These authors contributed equally and are joint first authors.
This journal is © The Royal Society of Chemistry 2026