Yong Ju Lee a, Chang Woo Jeong b, Hong Taek Kim c, Tai-Ju Lee a and Hyoung Jin Kim *a
a Department of Forest Products and Biotechnology, Kookmin University, 77 Jeongneung-ro, Seongbuk-gu, Seoul 02707, Republic of Korea. E-mail: hyjikim@kookmin.ac.kr
b Graduate School of Scientific Criminal Investigation, Chungnam National University, Daejeon 34134, Korea
c Department of Electrical Engineering, Korea University, Seongbuk-gu, Seoul, Republic of Korea
First published on 18th March 2025
Forensics relies on the differentiation and classification of document papers, particularly in cases involving document forgery and fraud. In this study, document papers are classified by integrating Raman spectroscopy with machine learning models, namely, random forest (RF), support vector machines (SVMs), and feed-forward neural networks (FNNs). Among the machine learning models, the RF model effectively calculated the feature importance and identified the critical spectral region contributing to classification, enhancing the transparency and interpretability of the result. Spectral preprocessing with the first derivative significantly improved the classification performance. The spectral range 200–1650 cm−1 was identified as a highly informative region for differentiation, reducing the number of input variables from 756 to 360 while enhancing the model accuracy. The FNN model outperformed the RF and SVM models, with an F1 score of 0.968. The results underscore the potential of combining Raman spectroscopy with machine learning for forensic document examination, offering an interpretable, computationally efficient, and robust approach for paper classification.
Conventional methods such as fiber identification, filler composition analysis, and fluorescence analysis have been widely used in document examination.9 However, these methods often require large sample sizes or destructive testing, which limit their applicability. Advanced non-destructive analytical techniques, including X-ray diffraction,10,11 elemental analysis,1–3 infrared spectroscopy,6,12,13 image analysis,5 and pyrolysis gas chromatography,14 have expanded the toolbox of forensic document examination. Despite their effectiveness, these methods generate large datasets that are error-prone and time-consuming when processed manually.
These limitations have been overcome by various chemometric approaches for handling complex datasets.15 Lee et al.16 demonstrated the forensic value of integrating chemometric techniques with spectroscopic data. The classification and regression tree (CART) method, which combines attenuated total reflectance-Fourier transform infrared spectroscopy with principal component analysis (PCA), distinguished white copy paper with a prediction accuracy of nearly 90%. Similarly, diffuse reflectance ultraviolet–visible–near infrared spectroscopy combined with PCA discriminated among writing, office, and photocopy papers with an accuracy of up to 99.7%.17
Raman spectroscopy is a vibrational spectroscopic technique that analyzes molecular interactions through scattering rather than absorption. In this respect, Raman spectroscopy differs from infrared (IR) spectroscopy. While Raman spectroscopy primarily detects the vibrations of homonuclear bonds such as C=C and S–S, IR spectroscopy is more sensitive to polar functional groups such as C=O and C–O–C.18,19 In chemometric approaches for forensic document examination, IR spectroscopy has been extensively studied while Raman spectroscopy remains underutilized. Few studies have integrated Raman spectroscopy with machine learning techniques for forensic applications.6,12,16,20–22
The present study tests the abilities of different machine learning models, namely random forest (RF), support vector machine (SVM), and feed-forward neural network (FNN), to classify document-paper manufacturers from Raman spectral data. Among these models, the RF model can calculate the feature importance, identifying the critical spectral regions contributing to classification and enhancing the transparency and interpretability of the results. By incorporating feature importance into the model development process, this study achieves robust classification performance while minimizing the computational costs. The findings demonstrate that Raman spectroscopy combined with machine learning effectively analyzes forensic documents, offering an efficient and interpretable approach for paper classification.
| No. | Sample | Country | Manufacturer | Grammage (g m−2) |
|---|---|---|---|---|
| 1 | KOR1 | Korea | A | 80 |
| 2 | KOR2 | Korea | A | 80 |
| 3 | KOR3 | Korea | A | 80 |
| 4 | KOR4 | Korea | A | 80 |
| 5 | IDN1 | Indonesia | B | 80 |
| 6 | IDN2 | Indonesia | C | 80 |
| 7 | CHN1 | China | D | 80 |
| 8 | CHN2 | China | E | 80 |
| 9 | THA | Thailand | F | 80 |
| 10 | BRA | Brazil | G | 80 |
![]() | (1) |
:
3. The training and test sets were used for model construction and validation, respectively. The ratio of each data class was preserved with a stratified sampling method. Threefold cross-validation was also performed to avoid overfitting and enhance the predictive performance of the models.
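The stratified split and threefold cross-validation described above can be sketched as follows. This is a minimal illustration in plain Python (the study itself used R). The helper names, the number of spectra per class, and the `test_fraction` value are hypothetical placeholders and are not taken from the paper.

```python
import random
from collections import Counter, defaultdict


def _indices_by_class(labels):
    """Group sample indices by their class label."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    return groups


def stratified_split(labels, test_fraction, seed=0):
    """Split indices into train/test so each class keeps the same proportion."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in _indices_by_class(labels).values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, k=3, seed=0):
    """Deal the samples of each class round-robin into k folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for idx in _indices_by_class(labels).values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Hypothetical data: 40 spectra for each of three sample classes.
labels = ["KOR1"] * 40 + ["IDN1"] * 40 + ["THA"] * 40
train_idx, test_idx = stratified_split(labels, test_fraction=0.25)  # placeholder fraction
train_labels = [labels[i] for i in train_idx]
folds = stratified_folds(train_labels, k=3)  # fold entries index into train_labels
test_counts = Counter(labels[i] for i in test_idx)
```

Each fold serves once as the validation set while the other two folds are used for fitting, so every training spectrum is validated exactly once.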
To increase diversity among the DTs, the RF algorithm employs random subsampling, which prevents the simultaneous use of all input variables and fosters the development of independent trees. The subsampling method of RF is bootstrap sampling, in which data points are randomly selected with replacement from the training dataset. The DTs are trained on approximately two-thirds of the data, known as in-bag samples; the remaining data, referred to as out-of-bag (OOB) samples, are reserved for validating the performance of the tree models.28
The probability of a data point being excluded from a set of m samples during random sampling with replacement is (m − 1)/m. When this process is repeated m times, the likelihood of a sample being excluded from all iterations converges to approximately 36.8%, as expressed in eqn (2):
| $\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m} = \frac{1}{e} \approx 0.368$ | (2) |
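The convergence towards 36.8% can be checked numerically. The sketch below, written in Python for illustration, evaluates ((m − 1)/m)^m for growing m and compares it with a simulated bootstrap draw. The function names and the random seed are assumptions made for this example.

```python
import math
import random


def oob_probability(m):
    """Probability that one sample is never picked in m draws with replacement."""
    return ((m - 1) / m) ** m


def simulated_oob_fraction(m, seed=0):
    """Draw one bootstrap sample of size m and return the fraction left out."""
    rng = random.Random(seed)
    drawn = {rng.randrange(m) for _ in range(m)}
    return 1 - len(drawn) / m


limit = math.exp(-1)  # 1/e, the limiting out-of-bag fraction
for m in (10, 100, 1000, 10000):
    print(m, round(oob_probability(m), 4), round(simulated_oob_fraction(m), 4))
```

The analytical value approaches 1/e from below as m grows, which is why roughly one-third of the training data is available as OOB samples for each tree.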
The RF model was developed by training multiple DTs on in-bag samples. The predictions of all trees were averaged to obtain the final classification of new data. Following the CART approach,27 all DTs in the RF model were independently constructed without pruning. This study trialed different numbers of input variables for tree generation (n_feature), namely the square root (sqrt), binary logarithm (log2), and one-third (1/3) of the total spectral points, and different numbers of trees (n_tree; 10 to 500). The values of n_feature and n_tree were optimized by minimizing the OOB errors through a grid search approach.29
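As a rough illustration of this search space, the Python sketch below converts the three n_feature rules into variable counts for the 756 spectral points reported in this paper and selects the setting with the lowest OOB error. Rounding down to an integer and stepping n_tree through every integer from 10 to 500 are assumptions. The OOB errors are the first derivative values reported in Table 2.

```python
import math
from itertools import product


def n_feature_candidates(n_points):
    """Number of variables tried per split under each rule (rounded down)."""
    return {
        "sqrt": int(math.sqrt(n_points)),
        "log2": int(math.log2(n_points)),
        "1/3": n_points // 3,
    }


candidates = n_feature_candidates(756)

# Full grid of (rule, n_tree) pairs; the step size of 1 is an assumption.
grid = list(product(candidates, range(10, 501)))

# Best OOB error per rule for the first derivative spectra (Table 2).
oob_errors = {("sqrt", 86): 0.271, ("log2", 48): 0.285, ("1/3", 30): 0.243}
best = min(oob_errors, key=oob_errors.get)
print(candidates, len(grid), best)
```

In a full search, every pair in `grid` would be fitted and scored by its OOB error, which avoids the need for a separate validation set during tuning.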
| $I(n_j) = w_j C_j - w_{Lj} C_{Lj} - w_{Rj} C_{Rj}$, | (3) |
![]() | (4) |
![]() | (5) |
Subsequently, the final importance of the variable in the RF model is averaged over all DTs as follows:
![]() | (6) |
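A small worked example may help to read eqn (3) and the averaging step. The Python sketch below assumes that C denotes the Gini impurity of a node and w the fraction of training samples reaching it. These definitions are not stated in this excerpt, so they are assumptions, as are the function names and the toy numbers.

```python
def gini(counts):
    """Gini impurity of a node from its per-class sample counts (assumed form of C)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def node_importance(w_j, c_j, w_left, c_left, w_right, c_right):
    """Eqn (3): weighted impurity of node j minus that of its two children."""
    return w_j * c_j - w_left * c_left - w_right * c_right


def average_importance(per_tree):
    """Average each variable's importance over all trees in the forest."""
    n_trees = len(per_tree)
    n_vars = len(per_tree[0])
    return [sum(tree[v] for tree in per_tree) / n_trees for v in range(n_vars)]


# Toy split: a root holding 10 + 10 samples is separated perfectly.
root = node_importance(
    w_j=1.0, c_j=gini([10, 10]),
    w_left=0.5, c_left=gini([10, 0]),
    w_right=0.5, c_right=gini([0, 10]),
)

# Hypothetical importances of three spectral variables in two trees.
forest_importance = average_importance([[0.5, 0.0, 0.5], [0.25, 0.25, 0.5]])
print(root, forest_importance)
```

A split that separates the classes perfectly removes all impurity, so the node importance equals the weighted impurity of the parent node.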
The F1-score is a key performance metric that effectively balances the precision–recall tradeoff.33 Precision measures the proportion of correctly identified positive cases among all predicted positive cases, and recall quantifies the model's ability to identify positive cases among all actual positive cases. Calculated as the harmonic mean of precision and recall, the F1-score more robustly determines the classification performance than accuracy, which may not adequately reflect the model's ability to handle FP and FN. The precision, recall, and F1-score metrics are, respectively, calculated as follows:
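| $\text{Precision} = \dfrac{TP}{TP + FP}$ | (7) |
| $\text{Recall} = \dfrac{TP}{TP + FN}$ | (8) |
| $\text{F1-score} = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$ | (9) |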
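The three metrics can be computed per class in a one-vs-rest manner and then averaged. The Python sketch below is illustrative only. The use of macro-averaging is an assumption, because this excerpt does not state how the multi-class F1 score was aggregated, and the label vectors are invented.

```python
def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed aggregation)."""
    classes = sorted(set(y_true))
    return sum(per_class_scores(y_true, y_pred, c)[2] for c in classes) / len(classes)


# Invented predictions for three of the sample classes.
y_true = ["KOR4", "KOR4", "KOR4", "IDN2", "IDN2", "THA", "THA", "THA"]
y_pred = ["KOR4", "KOR4", "IDN2", "IDN2", "IDN2", "THA", "THA", "KOR4"]
kor4 = per_class_scores(y_true, y_pred, "KOR4")
score = macro_f1(y_true, y_pred)
print(kor4, round(score, 4))
```

In this toy case KOR4 has two true positives, one false positive and one false negative, so its precision, recall and F1 are all 2/3.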
All data processing and classification modeling were conducted using R statistical software (R Core Team, ver. 4.4.1, Auckland, New Zealand).
Additional peaks at 380 and 436 cm−1 correspond to torsional and flexural vibrations of the pyran ring and to bending and expansion vibrations of the CCO framework within the pyran ring, respectively.36,37 The peaks at 508 and 1117 cm−1 are attributed to the C–O–C glycosidic linkages in cellulose,38,39 and those at 1337 and 1380 cm−1 are associated with HCC, HCO, and HOC bending and with CH and CH2 stretching in the carbohydrate components (cellulose and hemicellulose).40,41
The peak at 1602 cm−1 corresponds to aromatic ring stretching of lignin, while the peak at 1660 cm−1 is attributed to ring-conjugated C–C stretching of coniferyl alcohol and C=O stretching of coniferyl aldehyde, both occurring in lignin.22,42 Finally, the peak at 2895 cm−1 represents CH and CH2 stretching vibrations in cellulose.40
Fig. 3 PCA score plots of the original Raman spectra (a) and first derivative spectra (b). Percentages in parentheses are the explained variance of each PC.
Fig. 4a shows the Raman spectra of the KOR4, IDN2 and THA products and Fig. 4b presents the PC1 and PC2 loadings of the first derivative Raman spectra of the document paper. The peaks at 280 cm−1 and 1084 cm−1 (Fig. 4b) correspond to CaCO3 from inorganic fillers, which explains their positioning along the PC1 axis, reflecting differences between KOR4 and THA. The variation in ash content, in itself, serves as evidence supporting these distinctions.6,11 The peak at 1380 cm−1 was attributed to HCC, HCO, and HOC bending, as well as CH and CH2 stretching in the carbohydrate components (cellulose and hemicellulose), which partially explains the differences in positioning along the PC2 axis as shown in Fig. 3b. The residual carbohydrates were influenced by the alkali charge, temperature, and processing time during cooking and bleaching, resulting in variations in the xylan and glucomannan yield of the final wood pulp—i.e., the raw material for document paper.43 The crystallinity of cellulose pulp is also affected by these processes. The cellulose crystallinity of printing paper depends on the cooking methods and processing conditions.12,44 As document papers are kraft pulp-based, their cellulose crystallinity is likely influenced by additional factors. For instance, recycled pulp may have been used in the manufacturing process,12,45 and the products may plausibly contain bleached chemo-thermomechanical pulp (BCTMP), a high-yield pulp that retains water-soluble components, particularly acetylated galactoglucomannan.13 Recycled pulp and BCTMP, commonly employed as cost-saving measures in paper manufacturing, further impact the composition and properties of the final product.
Fig. 4 Raman spectra of the KOR4, IDN2 and THA samples (a) and loadings of the first two PCs of the first derivative Raman spectra of the document paper (b).
| Raman spectra | n_feature | n_tree | OOB error | F1 score (train) | F1 score (test) |
|---|---|---|---|---|---|
| Original spectra | sqrt | 123 | 0.474 | 1.000 | 0.711 |
| | log2 | 284 | 0.500 | 1.000 | 0.669 |
| | 1/3 | 380 | 0.497 | 1.000 | 0.664 |
| First derivative spectra | sqrt | 86 | 0.271 | 1.000 | 0.843 |
| | log2 | 48 | 0.285 | 1.000 | 0.838 |
| | 1/3 | 30 | 0.243 | 1.000 | 0.875 |
Table 2 compares the performances of the RF models trained on the original and first derivative Raman spectra across various hyperparameter settings. After training on the original spectra, the OOB errors remained relatively high (0.474–0.500), indicating limited predictive accuracy of the RF methods. The test F1 scores ranged between 0.664 and 0.711, reflecting low classification performance.
The OOB errors were clearly lower (0.243–0.285) after training on the first derivative spectra, suggesting an enhanced predictive reliability of the RF models. Furthermore, the test F1 scores improved markedly to 0.838–0.875, confirming that spectral preprocessing with the first derivative enhances the classification performance of the RF models.
Among the tested hyperparameter settings, the “1/3” configuration for the first derivative spectra minimized the OOB error (0.243) and maximized the F1 score (0.875). Therefore, “1/3” was deemed the optimal configuration for this dataset. As highlighted by these findings, preprocessing steps such as taking the spectral derivative substantially improve the accuracy and generalizability of machine learning models on Raman spectral data, emphasizing the necessity of proper spectral preprocessing for robust and reliable analytical applications.22,23,46–48
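One reason a first derivative helps is that it removes any constant baseline offset between spectra. The Python sketch below uses simple finite differences to show this. The derivative algorithm actually used in the study is not stated in this excerpt, so this scheme, the function name, and the toy intensities are assumptions.

```python
def first_derivative(wavenumbers, intensities):
    """Finite-difference derivative: central inside, one-sided at both ends."""
    n = len(intensities)
    derivative = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        slope = (intensities[hi] - intensities[lo]) / (wavenumbers[hi] - wavenumbers[lo])
        derivative.append(slope)
    return derivative


# Toy spectrum around a single peak, and the same spectrum on a raised baseline.
wavenumbers = [200, 204, 208, 212, 216]
spectrum = [1, 2, 4, 2, 1]
shifted = [value + 10 for value in spectrum]

d_original = first_derivative(wavenumbers, spectrum)
d_shifted = first_derivative(wavenumbers, shifted)
print(d_original)
print(d_original == d_shifted)
```

Both spectra give the same derivative, so a baseline offset unrelated to paper composition no longer separates otherwise identical samples.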
Fig. 6a presents the feature importance results for the KOR4, IDN2, and THA products based on the first derivative spectra. In Fig. 6, the red-colored regions indicate high contributions to the differentiation of data classes. The classification of KOR4 was notably influenced by lignin at 798 cm−1,49 lattice vibrations associated with calcium carbonate at 1084 cm−1, and HCC, HCO, and HOC bending, as well as CH and CH2 stretching in carbohydrate components (cellulose and hemicellulose) at 1380 cm−1. For IDN2, the region at 1602 cm−1 was identified as a significant contributor, corresponding to the aromatic ring stretching of lignin. The THA product exhibited a similar pattern. It is well known that a substantial amount of lignin must be removed during the chemical pulping and bleaching processes. However, in East Asian countries such as Korea and China, manufacturers rely on imported wood pulp as a raw material. Due to this dependence, they often select BCTMP as a cost-effective alternative.13 BCTMP is produced by mechanically refining wood chips with a small amount of sodium sulfite, which facilitates sulfonation (Fig. 7). The sulfonation process removes resin components from the wood under mildly alkaline conditions while also providing a slight brightening effect.50 Even after bleaching, a substantial amount of lignin residues remains, which affects the quality of the paper such as brightness and yellowness.
Fig. 6b illustrates the spectral feature importance derived from the entire dataset. In addition to the previously discussed features, 280 cm−1 and 436 cm−1 were identified as highly important Raman spectral variables, corresponding to the presence of calcium carbonate and the pyran ring, respectively.
Overall, the decision-making process of the RF model was primarily influenced by the presence of calcium carbonate and the type of wood pulp used in paper production, making these factors key discriminators of paper products. Moreover, the 200–1650 cm−1 spectral range in Raman spectroscopy appears to be a crucial region for tracing and classifying document papers, and it will be utilized as a selective variable for modeling.
This analysis highlights the effectiveness of spectral feature selection in improving classification performance by emphasizing the contributions of specific fillers and wood pulp types to the distinct spectral characteristics of document papers. Additionally, it underscores the advantage of reducing computational costs when constructing machine learning models.13,29,51
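Restricting the model input to the 200–1650 cm−1 region amounts to masking the wavenumber axis. The Python sketch below shows the idea on a hypothetical, evenly spaced axis. The 50 cm−1 spacing, the function name, and the placeholder intensities are assumptions, so the point counts differ from the 756 and 360 variables reported in this paper.

```python
def select_range(wavenumbers, spectrum, low=200.0, high=1650.0):
    """Keep only the spectral points whose wavenumber lies in [low, high]."""
    keep = [i for i, w in enumerate(wavenumbers) if low <= w <= high]
    return [wavenumbers[i] for i in keep], [spectrum[i] for i in keep]


# Hypothetical axis from 200 to 3000 cm-1 in steps of 50 cm-1.
wavenumbers = list(range(200, 3001, 50))
spectrum = [0.0] * len(wavenumbers)  # placeholder intensities

selected_w, selected_s = select_range(wavenumbers, spectrum)
print(len(wavenumbers), len(selected_w), selected_w[0], selected_w[-1])
```

The same mask would be applied to every spectrum before model training, so that all models receive identical input variables.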
| Model | Spectral range (cm−1) | Hyperparameters | F1 score |
|---|---|---|---|
| SVM | 200–3000 | gamma = 10⁻³, C = 2³ | 0.732 |
| | 200–1650 | gamma = 10⁻⁴, C = 2⁵ | 0.935 |
| FNN | 200–3000 | hl_size = (32), lr = 0.001, optimizer = Adam | 0.901 |
| | 200–1650 | hl_size = (32), lr = 0.1, optimizer = SGD | 0.968 |
| RF | 200–3000 | n_feature = 1/3, n_tree = 30 | 0.875 |
| | 200–1650 | n_feature = log, n_tree = 144 | 0.903 |

hl_size, hidden layer size; lr, learning rate; Adam, adaptive-moment estimation; SGD, stochastic gradient descent.
The F1 score of the SVM model improved from 0.732 to 0.935 when the input was restricted from the entire spectral range to the selected range, demonstrating that excluding irrelevant variables enhances the robustness of the model. Similarly, the F1 score of the FNN model increased from 0.901 to 0.968 on narrowing the spectral region from 200–3000 to 200–1650 cm−1, and the smaller input also increased the computational efficiency. The RF model improved as well, with the F1 score increasing from 0.875 to 0.903. Notably, the selected range reduced the number of input variables from 756 (in the 200–3000 cm−1 range) to 360 (in the 200–1650 cm−1 range). The reduced number of input variables not only enhances the robustness of the model by focusing on the most relevant spectral features but also significantly reduces computational costs.12,13 Therefore, the classification models effectively utilize the critical spectral features within the 200–1650 cm−1 range, which correspond to calcium carbonate, cellulose, and lignin. FNN, which achieved the highest F1 score (0.968), appears to be a promising tool. However, Occam's razor52 suggests that when accuracy is similar, the simplest model should be preferred, so SVM can also serve as an effective alternative to FNN. Nevertheless, both FNN and SVM lack transparency in their decision-making processes, which limits the ability to interpret or justify their predictions. For this reason, the authors suggest that each classification model offers distinct advantages, and no single model can be considered a complete replacement for another.
As clarified by the above results, narrowing the spectral range to the most relevant region enhances the model's robustness and reduces the computational complexity. This finding underscores the potential of feature selection in developing efficient and scalable models for document-paper classification in practical applications. Identifying the relevant range (200–1650 cm−1) is a focused, computationally efficient approach for analyzing document papers. Therefore, our work can make valuable contributions to forensic investigations and material classification tasks.
Spectral preprocessing with the first derivative boosted the classification performance of all models, with the FNN model benefiting most clearly. The FNN model outperformed the RF and SVM classifiers, achieving the highest F1 score of 0.968. These findings highlight the accuracy and computational efficiency of variable selection based on feature importance measures with first derivative Raman spectra, confirming that this approach is a suitable choice for forensic document examination. This work advances the use of Raman spectroscopy and machine learning in forensic science, offering a scalable, interpretable, and efficient solution for document-paper classification in real-world scenarios.
However, this study has certain limitations. Contamination or aging can significantly alter the Raman spectral characteristics of paper, potentially reducing the applicability of the proposed approach. Methods that mitigate the spectral distortion caused by contamination or degradation will be incorporated in future work. In addition, the research scope will be expanded to larger datasets and a broader range of paper products. Advanced methods such as deep learning are expected to further enhance the classification performance and ensure scalability of the proposed framework to diverse forensic applications.
This journal is © The Royal Society of Chemistry 2025