Zhiyong Zhang,†a Wennan Nie,†a Guangpu Fang,†a Jiahe Qian,a Hongxia Gan,c Jingchao Chenc and Wenlong Li*ab
aCollege of Pharmaceutical Engineering of Traditional Chinese Medicine, Tianjin University of Traditional Chinese Medicine, Tianjin, 301617, People's Republic of China. E-mail: zzytjutcm@163.com; 736243451@qq.com; 1248505630@qq.com; qjh0101@qq.com
bHaihe Laboratory of Modern Chinese Medicine, Tianjin, 301617, People's Republic of China. E-mail: wshlwl@tjutcm.edu.cn; Fax: +86 22 59596814; Tel: +86 22 59596814
cChengdu Kanghong Pharmaceutical Co. Ltd., Chengdu, 610036, People's Republic of China. E-mail: 024675@cnkh.com; 019357@cnkh.com
First published on 4th November 2025
Elemental analysis is crucial for determining the geographical origin of medicinal plants. This study employs laser-induced breakdown spectroscopy (LIBS) and machine learning to trace the origins and predict the elemental content of Hypericum perforatum L. (HPL). LIBS data were collected from 269 HPL samples across various regions, while inductively coupled plasma mass spectrometry quantified 15 elements. Eleven preprocessing methods were evaluated for their impact on model performance. We developed four origin traceability models and multi-element quantification models within a multi-output regression framework. Optimal spectral intervals were identified through a moving window algorithm, and bootstrap aggregating enhanced model accuracy. Spectral intervals 4–8 effectively distinguished samples from different origins, with the XGBoost model performing particularly well in identifying July-harvested HPL from Xinjiang. All models achieved R2 values exceeding 0.8574, and paired t-tests showed no significant differences between actual and predicted values, confirming their effectiveness in origin identification and elemental content prediction.
Elemental analysis is a crucial method for identifying the geographical origins of medicinal plants, offering valuable insights into environments and soil conditions.6 By analyzing the types and concentrations of trace elements in medicinal plants, unique fingerprint profiles of herbal sources can be generated, enabling precise traceability of the medicinal plants.7 Inductively coupled plasma mass spectrometry (ICP-MS) plays an invaluable role in elemental analysis due to its high sensitivity and accuracy. ICP-MS not only quantifies various trace elements but also simultaneously detects heavy metals and other toxic elements, thereby ensuring the safety and quality of medicinal plants.8–10
Laser-induced breakdown spectroscopy (LIBS) is a rapid, non-contact elemental analysis technique that has gained widespread application in agriculture and food sectors in recent years.11,12 LIBS enables the analysis of elemental composition within microseconds, requiring minimal sample preparation and significantly enhancing analytical efficiency.13 When combined with machine learning (ML) methods, LIBS can develop more accurate and reliable classification and prediction models. For instance, processing LIBS data with ML algorithms such as random forest (RF) and support vector machine (SVM) can effectively distinguish herbs based on their geographical origins and assess their quality grades.14–16 Furthermore, after determining the elemental content in samples using standard methods, LIBS data can be integrated with regression models like partial least squares regression (PLSR) and support vector regression (SVR) for the rapid prediction of elemental content in unknown samples.17–19
In recent years, LIBS combined with machine learning (LIBS-ML) has progressed rapidly across industries, driven by successive instrument upgrades and continuous advances in computing. With LIBS-ML, soil heavy metal information can be analyzed quickly through imaging and distribution analysis models.20 The technology is also gradually spreading into fields such as agricultural production and plant or herbal testing. However, existing studies on LIBS-ML for herbal analysis still have two critical limitations. First, most studies focus on either origin traceability or quantification of a few elements independently.21–23 Second, many studies have not systematically combined different preprocessing methods with feature selection methods when analyzing LIBS spectra, which may introduce uncertainty into the resulting models; in fact, these steps are the foundation for obtaining a well-adapted model. Moreover, multi-output model frameworks have been increasingly validated for evaluating complex systems in recent years, offering distinct advantages for their analysis.24–26 We therefore developed this work to verify the idea that machine-learning-driven LIBS spectra can resolve information in complex systems. The objectives of this work are twofold: (i) to classify HPL samples based on their geographical origin and harvest time using LIBS coupled with machine learning models, and (ii) to predict the concentrations of multiple elements in these samples. By achieving these aims, we demonstrate the dual-purpose potential of LIBS for both authentication and multi-element quantification in medicinal plants.
This study collected 269 batches of HPL samples from various sources and measured their LIBS data. Subsequently, ICP-MS was employed for the quantitative analysis of 15 elements in HPL. The impact of 11 different preprocessing methods on the LIBS data was evaluated, and multiple origin discrimination models and elemental content prediction models were developed based on various algorithms. Multi-output regression frameworks were utilized to simultaneously predict the content of multiple elements. A moving window algorithm was applied to determine the optimal spectral range for different elements, and bootstrap aggregating (bagging) was used to enhance prediction accuracy and reduce the risk of overfitting. Notably, a pronounced polarization in the concentrations of the 15 elements was observed: five elements were present at the ppm level (Al, Ca, Fe, K, and Mg), while another five were at the ppb level (Ba, Cr, Cu, Ga, and Mn). Due to this polarization, these 10 elements were modeled in two stages. Additionally, the concentrations of the five heavy metal elements were notably lower, with over 90% of samples showing undetectable levels of Hg, leading to its exclusion from modeling. For the four lower-concentration heavy metals (As, Cd, Pb, and V), a logarithmic transformation was applied to improve the prediction of their concentration levels.
A mixed standard solution containing the elements Mg, Al, K, Ca, V, Cr, Mn, Fe, Cu, Ga, As, Cd, Ba, Pb, and Hg at a concentration of 100 mg L−1, as well as internal standard reference solutions of Sc, Ge, In, and Bi at a concentration of 10 mg L−1 were purchased from Tianjin Yifang Technology Co., Ltd (Tianjin, China). LC-MS grade nitric acid (HNO3) was obtained from Thermo Fisher Scientific (Massachusetts, USA), and ultra-pure water was obtained from Wahaha Group Co., Ltd (Hangzhou, China).
000 rpm for 10 min. The supernatant was collected and stored for further analysis.
Accurately pipetted volumes (10 ml) of the mixed standard solution containing 15 elements were serially diluted to obtain working standards at various concentrations (10, 5, 1, 0.5, 0.1, 0.05, 0.01, 0.001, and 0.0001 mg L−1) for sample analysis. Similarly, 10 ml of the internal standard reference solution containing four elements was diluted to 1 mg L−1 and prepared for analysis.
The analytical conditions for ICP-MS (Shimadzu Corporation, Kyoto, Japan) were as follows: operating in collision reaction cell (CRC) mode; argon (Ar) as the carrier gas (purity 99.999%); helium (He) as the collision gas (purity 99.999%); operating power set to 1.19 kW; plasma gas flow rate at 9.0 L min−1; carrier gas flow rate at 0.70 L min−1; auxiliary gas flow rate at 1.10 L min−1; the nebulizer chamber temperature maintained at 5 °C; and the peristaltic pump speed adjusted to 20 rpm. Average values were obtained from three replicates for each sample.
The contents of 15 elements (24Mg, 27Al, 39K, 44Ca, 51V, 52Cr, 55Mn, 56Fe, 63Cu, 69Ga, 75As, 114Cd, 138Ba, 202Hg, and 208Pb) in HPL samples from various sources were determined using the standard curve method. For 24Mg, 27Al, 39K, 44Ca, 51V, 52Cr, 55Mn, and 56Fe, 45Sc served as the internal standard; for 63Cu, 69Ga, and 75As, 74Ge was used; 114Cd and 138Ba utilized 115In; and 202Hg and 208Pb used 209Bi. The standard curve was established by sequentially introducing the mixed standard solutions from low to high concentration. The sample tubes were then immersed in the test solution, with each sample measured in triplicate to obtain average values. Blank tests under the same analytical conditions were performed to account for blank interference. The establishment of the standard curve, blank subtraction, mean calculations, and element concentration determinations were conducted using LabSolutions™ ICPMS Ver. 2 software.
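As a minimal illustration of this internal-standard calibration workflow, the numpy sketch below fits a line to analyte/internal-standard intensity ratios and back-calculates a sample concentration from triplicate readings. All intensity counts, the six-point standard subset, and the helper name `predict_conc` are invented for demonstration; they are not measured values from this study.

```python
import numpy as np

# Hypothetical blank-corrected counts for one analyte across a standard series;
# the internal standard signal corrects for instrument drift and matrix effects.
std_conc = np.array([0.05, 0.1, 0.5, 1.0, 5.0, 10.0])               # mg L-1
analyte_counts = np.array([520., 1010., 5080., 10150., 50400., 101000.])
istd_counts = np.array([9800., 10100., 9900., 10050., 10000., 9950.])

# Calibration on the intensity ratio (analyte / internal standard)
ratio = analyte_counts / istd_counts
slope, intercept = np.polyfit(std_conc, ratio, 1)

def predict_conc(sample_counts, sample_istd, blank_ratio=0.0):
    """Convert a blank-corrected intensity ratio back to concentration (mg L-1)."""
    r = sample_counts / sample_istd - blank_ratio
    return (r - intercept) / slope

# Triplicate measurements of one sample -> mean concentration
replicates = [predict_conc(c, i) for c, i in zip([25200., 25500., 24900.],
                                                 [10000., 10100., 9900.])]
mean_conc = float(np.mean(replicates))
```

In practice the blank ratio would be measured under the same conditions and subtracted before back-calculation, as the text describes.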
To further improve the robustness and generalization performance of the model, the bagging ensemble strategy was introduced. Random sampling with replacement was performed on the training set to generate B different sub-training sets (with B optimized through cross-validation), and a base regression model was trained independently on each sub-training set. For a new input sample, the predictions of all B base models were averaged to obtain the final multi-output prediction. By averaging the predictions of multiple models, the bagging strategy effectively reduces overfitting, which is especially valuable for high-dimensional, small-sample LIBS data. All model training parameter settings are explained on the last page of the SI.
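The bagging procedure described above can be sketched as follows. A closed-form multi-output ridge model stands in for the paper's base regressors; the toy data, B = 25, and the regularization strength are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, Y, lam=1.0):
    """Closed-form multi-output ridge: W = (X'X + lam*I)^-1 X'Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ Y)

def bagging_fit_predict(X_train, Y_train, X_new, B=25, lam=1.0):
    """Draw B bootstrap resamples, fit one base model on each,
    and average the B predictions for the new samples."""
    n = X_train.shape[0]
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # sampling with replacement
        W = fit_ridge(X_train[idx], Y_train[idx], lam)
        preds.append(X_new @ W)
    return np.mean(preds, axis=0)              # final multi-output prediction

# Toy data: 40 "spectra" with 10 channels, 3 target "element concentrations"
X = rng.normal(size=(40, 10))
W_true = rng.normal(size=(10, 3))
Y = X @ W_true + 0.05 * rng.normal(size=(40, 3))
Y_hat = bagging_fit_predict(X, Y, X, B=25)
```

Averaging over resampled fits trades a little bias for a large reduction in variance, which is the mechanism behind the overfitting reduction noted in the text.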
First, the isolation forest (IF) algorithm was employed to identify outlier samples. Then, the K-nearest neighbors (KNN) model was used to assess sample recovery. Model performance was evaluated using cross-validation scores, accuracy, Kappa value, and area under the curve (AUC). Next, a stratified random sampling algorithm was applied to partition the entire sample set into three subsets in a 7 : 2 : 1 ratio according to sample categories. The coverage rate of the 95% confidence interval for the training and testing samples was utilized to assess the representativeness of the dataset.
Subsequently, the effects of 11 different spectral preprocessing methods on model performance were evaluated, including multiplicative scatter correction (MSC), standard normal variate transformation (SNV), Savitzky–Golay smoothing (SG), first derivative (1st Der), second derivative (2nd Der), wavelet transform (WT), mean centering (MC), standardization (STD), min–max normalization (MMN), vector normalization (VN), and moving average filter (MAF). Four classification models (KNN, SVM, RF, and extreme gradient boosting (XGB)) were used to identify samples. After determining the optimal base model, the moving window algorithm was employed to identify the best spectral range, and feature importance scores were used to select key variables. Finally, 27 batches of external validation samples were utilized to assess the generalization capability of the models, ensuring that they could reliably predict the origins of HPL samples not included in the initial training dataset.
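The moving-window step can be illustrated with a simplified sketch: the spectrum is cut into 10 contiguous windows and each window is scored separately. A nearest-centroid classifier stands in here for the KNN/SVM/RF/XGB models used in the study, and the toy spectra and informative channel range are fabricated for demonstration.

```python
import numpy as np

def moving_window_scores(X, y, n_windows=10):
    """Split the spectrum into contiguous windows and score each window with a
    simple nearest-centroid classifier (training accuracy here, just to rank
    windows; cross-validation would be used in practice)."""
    classes = np.unique(y)
    scores = []
    for block in np.array_split(np.arange(X.shape[1]), n_windows):
        Xw = X[:, block]
        centroids = np.stack([Xw[y == c].mean(axis=0) for c in classes])
        # assign each sample to the nearest class centroid
        d = ((Xw[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        pred = classes[np.argmin(d, axis=1)]
        scores.append((pred == y).mean())
    return np.array(scores)

# Toy spectra: class separation only lives in channels 40-60
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1, 2], 20)
X[:, 40:60] += y[:, None] * 2.0          # informative region
scores = moving_window_scores(X, y, n_windows=10)
best = int(np.argmax(scores))
```

Ranking windows this way localizes the origin-discriminating information before the heavier models are trained on the best interval.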
First, the IF algorithm was employed to identify outlier samples. PLSR was then used to assess sample recovery, with the root mean square error of cross-validation (RMSEcv) serving as the performance evaluation criterion. Subsequently, the representativeness of training sets obtained through three dataset partitioning methods (sample set partitioning based on joint X–Y distances (SPXY), Kennard–Stone (KS), and stratified random sampling) was compared on the basis of 95% confidence intervals. Next, the impact of 11 spectral preprocessing methods across four different types on model performance was evaluated. Four regression models (PLSR, LASSO, Ridge, and SVR) were utilized as base models, with multi-output regression employed as a meta-model to simultaneously predict the concentrations of the 14 elements in HPL samples. Owing to variations in concentration, the 14 elements were categorized into three classes for improved predictive accuracy: (1) major elements (Al, Ca, Fe, K, and Mg); (2) trace elements (Ba, Cr, Cu, Ga, and Mn); and (3) heavy metals (As, Cd, Pb, and V). The optimal spectral preprocessing method and corresponding base model were determined for each element, and the optimal spectral range was identified using a moving window algorithm. Finally, 10 batches of HPL samples were used to validate the generalizability of the models. This comprehensive approach ensures accurate and reliable quantification of elemental concentrations in HPL samples, enhancing the robustness and applicability of the developed models.
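The multi-output meta-model idea maps directly onto scikit-learn's `MultiOutputRegressor`, which fits one independent base regressor per target element. The sketch below uses SVR as the base model on fabricated data; the hyperparameters are illustrative, not the study's tuned values.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Toy stand-ins for preprocessed LIBS intensities (X) and ICP-MS concentrations (Y)
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 20))
coef = rng.normal(size=(20, 5))
Y = X @ coef + 0.1 * rng.normal(size=(80, 5))   # 5 "elements" per sample

# One independent SVR per target element, driven through a single wrapper
model = MultiOutputRegressor(SVR(kernel="linear", C=10.0, epsilon=0.1))
model.fit(X, Y)
Y_hat = model.predict(X)
```

The same wrapper accepts PLSR, LASSO, or Ridge as the base estimator, which is how one framework can serve all three element classes.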
After data cleaning, 16 033 variables were retained for further analysis. The cleaned LIBS spectra are shown in SI Fig. S1.
Despite the reduction to 16 033 variables after data cleaning, this number still presents computational and processing challenges. Considering that t-SNE requires substantial computational time and resources for high-dimensional data analysis, this study applied a moving window algorithm to segment the entire spectrum into 10 parts for individual modeling, with results shown in Fig. 2.
Dimensionality reduction based on intervals 4, 5, 6, and 8 classified the samples into three distinct categories, indicating significant differences in trace element content among HPL samples from various sources within these intervals and highlighting the variability in sample characteristics.
The results of LIBS outlier detection based on the isolation forest algorithm are shown in Fig. S2, with batch numbers 1, 77, and 118 identified as outliers. To evaluate whether removing these samples would negatively affect model performance, we constructed a KNN model using the spectral data as the independent variable X and the sample category as the dependent variable Y. The results are presented in Table 1. Ultimately, sample no. 1 was excluded to ensure the accuracy and reliability of the model.
| Code | CV score (training) | CV score (test) | Accuracy | Kappa value | AUC (macro) | AUC (weighted) |
|---|---|---|---|---|---|---|
| Total | 0.9812 | 1.0000 | 0.9815 | 0.9666 | 0.9739 | 0.9773 |
| Exclude no. 1 | 0.9812 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Exclude no. 77 | 0.9766 | 0.9766 | 0.9815 | 0.9598 | 0.9985 | 0.9994 |
| Exclude no. 118 | 0.9766 | 0.9766 | 0.9815 | 0.9598 | 0.9995 | 0.9998 |
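The isolation forest screening step can be sketched with scikit-learn's `IsolationForest`. The synthetic spectra, the planted outlier, and the contamination rate below are assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
spectra = rng.normal(size=(120, 50))        # stand-in for cleaned LIBS spectra
spectra[0] += 8.0                            # plant one obvious outlier batch

iso = IsolationForest(contamination=0.025, random_state=0)
flags = iso.fit_predict(spectra)            # -1 = outlier, 1 = inlier
outlier_idx = np.where(flags == -1)[0]
```

Candidate outliers flagged this way would then be re-checked with a downstream model (as Table 1 does with KNN) before any batch is actually discarded.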
The dataset was then partitioned in an 8 : 2 ratio, resulting in 192 samples designated for the training set and 49 samples for the validation set. The PCA distribution results after dataset partitioning are shown in Fig. S3. The results indicate that the 95% confidence interval of the training samples adequately covers the validation samples, demonstrating good consistency between the training and validation sets in feature space. This provides a strong foundation for subsequent model training and evaluation.
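A coverage check of this kind can be sketched with plain numpy: project the validation samples onto the training set's first two principal components and count how many fall inside the training 95% confidence ellipse (Mahalanobis distance compared against the chi-square 0.95 quantile for 2 degrees of freedom, 5.991). The data here are synthetic.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD; returns scores plus the mean and loadings for projection."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, X.mean(axis=0), Vt[:n_components]

def ci95_coverage(train_scores, val_scores):
    """Fraction of validation scores inside the training set's 95% confidence
    ellipse (Mahalanobis distance vs. chi2(0.95, df=2) = 5.991)."""
    mu = train_scores.mean(axis=0)
    cov = np.cov(train_scores, rowvar=False)
    inv = np.linalg.inv(cov)
    d2 = np.einsum('ij,jk,ik->i', val_scores - mu, inv, val_scores - mu)
    return float((d2 <= 5.991).mean())

rng = np.random.default_rng(4)
X = rng.normal(size=(241, 30))
train, val = X[:192], X[192:]
t_scores, mean_, comps = pca_scores(train)
v_scores = (val - mean_) @ comps.T          # project validation into train PCA space
coverage = ci95_coverage(t_scores, v_scores)
```

A coverage close to 0.95 indicates the training set spans the validation set's region of feature space, as Fig. S3 shows for the real data.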
Fig. 3 Model performance evaluation based on the combination of different modeling and preprocessing methods.
| Model | Pretreatment method | CV score (training) | CV score (test) | Accuracy | Kappa value | AUC (macro) | AUC (weighted) |
|---|---|---|---|---|---|---|---|
| KNN | SG | 1.0000 | 1.0000 | 0.9259 | 0.8594 | 0.9136 | 0.9409 |
| SVM | MMN | 1.0000 | 1.0000 | 0.9444 | 0.8962 | 0.9553 | 0.9752 |
| RF | VN | 1.0000 | 1.0000 | 0.9444 | 0.8962 | 0.9833 | 0.9907 |
| XGB | 2nd Der | 1.0000 | 1.0000 | 0.9444 | 0.8962 | 0.9962 | 0.9979 |
The classification accuracy results on the validation set are shown in Fig. S5A, with all samples being correctly identified. This indicates that the variable selection method effectively filters out irrelevant spectral information and extracts significant differential information related to origin and harvest season. The predicted probabilities and frequency distribution for the validation set are illustrated in Fig. S5B. The x-axis represents the predicted probabilities for different categories, while the y-axis denotes the number of samples from three different origins within specific probability ranges. Notably, at a predicted probability of 0.9, the “XJ-7” category reached its highest frequency with 30 occurrences, demonstrating distinctly recognizable characteristics that allow the model to accurately classify this category. In contrast, samples from Xinjiang collected in September and from other origins exhibited higher frequencies in the low probability range, indicating uncertainty in classification decisions for these samples. This may be due to insufficient feature learning by the model or significant compositional differences among these samples.
| Elements | Concentration (mg L−1) | RMSEcv (mg L−1), all samples | RMSEcv, exclude no. 1 | RMSEcv, exclude no. 77 | RMSEcv, exclude no. 118 |
|---|---|---|---|---|---|
| Ba | 0.0861 | 0.0142 | 0.0152 | 0.0170 | 0.0149 |
| Cr | 0.0756 | 0.0293 | 0.0295 | 0.0343 | 0.0268 |
| Cu | 0.0278 | 0.0106 | 0.0129 | 0.0109 | 0.0105 |
| Ga | 0.0574 | 0.0142 | 0.0131 | 0.0159 | 0.0145 |
| Mn | 0.5719 | 0.0873 | 0.0875 | 0.0906 | 0.0854 |
| Al | 6.9685 | 1.9810 | 1.2804 | 1.6307 | 1.6953 |
| Ca | 106.3246 | 15.8832 | 14.2926 | 15.7819 | 15.7580 |
| Fe | 5.5806 | 1.3811 | 0.9510 | 1.2911 | 1.2552 |
| K | 255.2621 | 28.7515 | 23.5119 | 29.5293 | 26.4463 |
| Mg | 28.9303 | 4.2338 | 4.4061 | 4.6790 | 4.1354 |
| As | 2.9441 | 0.9354 | 1.0309 | 0.8591 | 0.8930 |
| Cd | 2.8137 | 0.7903 | 0.3592 | 0.2444 | 0.7917 |
| Pb | 2.3884 | 0.1356 | 0.1612 | 0.1149 | 0.1275 |
| V | 2.9761 | 0.1018 | 0.1087 | 0.0827 | 0.0962 |
The dataset was partitioned in an 8 : 2 ratio, resulting in 126 samples for the training set and 31 samples for the internal validation set. The partitioning results are illustrated in Fig. S7. Based on the overlap of the 95% confidence intervals, the calibration set partitioned by the SPXY algorithm adequately covered the validation set data. Consequently, the SPXY method was employed for data partitioning in further research.
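A minimal sketch of SPXY selection, assuming the usual formulation in which Kennard–Stone-style selection runs on a joint distance that normalizes and sums the X-space and Y-space distances. The data shapes mirror the 126/31 split but the values are synthetic.

```python
import numpy as np

def spxy_split(X, Y, n_train):
    """SPXY calibration-set selection: Kennard-Stone on the joint distance
    d = dX/max(dX) + dY/max(dY), so both spectra and concentrations
    shape the coverage of the training set."""
    def pdist(A):
        diff = A[:, None, :] - A[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2))
    dX, dY = pdist(X), pdist(np.atleast_2d(Y.T).T)
    D = dX / dX.max() + dY / dY.max()

    # start from the two most distant samples, then always add the sample
    # whose minimum distance to the current selection is largest
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [i, j]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        mind = D[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(mind))))
    return np.array(selected), np.array(remaining)

rng = np.random.default_rng(5)
X = rng.normal(size=(157, 12))
Y = X @ rng.normal(size=12)
train_idx, val_idx = spxy_split(X, Y, n_train=126)
```

Including the Y-distance is what distinguishes SPXY from plain Kennard-Stone: the calibration set also covers the concentration range, not just the spectral space.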
It is noteworthy that the concentrations of these 15 elements exhibit a two-tier differentiation: five elements are at the ppm level (Al, Ca, Fe, K, and Mg), while the other five are at the ppb level (Ba, Cr, Cu, Ga, and Mn). Consequently, we opted to model these ten elements in two phases. Additionally, because of the low concentrations of the five heavy metals, with over 90% of samples showing no detectable mercury (Hg), this element was not modeled. For the remaining lower-concentration heavy metals (As, Cd, Pb, and V), a logarithmic transformation was applied to better handle their magnitudes.
The optimal model results for the 14 elements based on different preprocessing methods are presented in Fig. 5. For elements with higher concentrations, the SVR model demonstrated superior performance, whereas Ridge and LASSO regression performed better for elements with lower concentrations. Specifically, the optimal combinations were: Al, SVR-VN; Ba, LASSO-MSC; Ca, PLSR-MSC; Cr, LASSO-MSC; Cu, Ridge-MAF; Fe, SVR-MMN; Ga, LASSO-MSC; K, SVR-1st Der; Mg, Ridge-SNV; Mn, Ridge-STD; As, Ridge-SG; Cd, Ridge-2nd Der; V, PLSR-2nd Der; and Pb, Ridge-2nd Der. These results indicate variations in model performance across the different element predictions, providing a useful reference for further research and applications.
We subsequently used SHAP values to interpret the models. Because Ridge and PLS regression models are costly to explain on high-dimensional data, we reduced the sample size to accelerate the SHAP calculations while preserving the representativeness of the selected data. To this end, K-means clustering was used to determine an appropriate number of representative samples, with the centroid of each cluster representing that cluster's characteristics. The optimal number of clusters was chosen using the silhouette coefficient, which measures how similar a sample is to its own cluster compared with other clusters; it ranges from −1 to 1, and a higher average indicates better clustering quality. The optimal K values were: K = 11 for Mg; K = 12 for Mn; K = 10 for Ca; K = 16 for Cu; and K = 10 for Cr. Detailed clustering results are presented in SI Fig. S8. These results provide a stable sample foundation for the subsequent SHAP calculations, thereby enhancing the effectiveness of model interpretability.
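The silhouette-guided choice of K can be sketched with scikit-learn. The three synthetic clouds and the candidate range K = 2–7 are illustrative, not the study's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)
# Toy SHAP-background candidates: three well-separated clouds
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 5)) for c in (0.0, 3.0, 6.0)])

best_k, best_s = None, -1.0
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    s = silhouette_score(X, km.labels_)     # mean silhouette over all samples
    if s > best_s:
        best_k, best_s = k, s

# The cluster centroids then stand in for the full set in SHAP computations
centroids = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X).cluster_centers_
```

Replacing the full training set with a few dozen centroids is what makes the SHAP background tractable while still reflecting the data distribution.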
SHAP is an additive feature attribution method that interprets a model's predicted value as a linear sum of contributions from individual features. Positive SHAP values indicate a positive contribution of the feature to the current prediction, increasing the likelihood of the predicted output, while negative values indicate a negative contribution, reducing that likelihood. In this study, we extracted features with SHAP values greater than zero (SI Fig. S9). Within the selected intervals, features with higher SHAP values are clearly associated with their respective spectral peaks, providing crucial insight into the decision-making process of the model. To validate generalization capability, we selected 10 batches of samples for further testing (Table 4). All models achieved R2 values exceeding 0.8574, and paired t-test results indicated no significant differences between the actual and predicted values. These results demonstrate that the developed models perform well not only on the training data but also possess strong generalization ability, effectively predicting the characteristics of unseen samples.
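For linear base models such as Ridge, LASSO, and PLSR, SHAP values have a closed form: the attribution of feature j is w_j(x_j − E[x_j]) relative to a background set, and the attributions sum to the deviation of the prediction from the background mean. A numpy sketch, with fabricated weights and data:

```python
import numpy as np

def linear_shap(w, X, background):
    """Exact SHAP values for a linear model f(x) = w.x + b:
    phi_ij = w_j * (x_ij - mean(background_j)).  The attributions sum to
    f(x) - E[f(background)], the standard SHAP additivity property."""
    mu = background.mean(axis=0)
    return X * w - mu * w          # broadcasts to one attribution row per sample

rng = np.random.default_rng(7)
w = rng.normal(size=6)
background = rng.normal(size=(30, 6))    # e.g. the K-means centroids above
X = rng.normal(size=(5, 6))

phi = linear_shap(w, X, background)
# additivity check: contributions sum to the shift from the mean prediction
lhs = phi.sum(axis=1)
rhs = X @ w - background.mean(axis=0) @ w
```

The same additivity property is what lets features with consistently positive attributions be read off as origin- or concentration-driving spectral variables.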
| Elements | Model | Best interval | Parameters^a | Concentration (mg L−1) | RMSEc (mg L−1) | R²c | RMSEp (mg L−1) | R²p | t-Test |
|---|---|---|---|---|---|---|---|---|---|
| Al | SVR | 5 | 100, 0.4, poly | 6.9685 | 1.4325 | 0.8688 | 1.0563 | 0.9389 | 0.717 |
| Ba | LASSO | 6 | 0.1, False | 0.0861 | 0.0136 | 0.8514 | 0.0081 | 0.9062 | 0.530 |
| Ca | PLSR | 4 | 15 | 106.3246 | 5.7783 | 0.9565 | 5.1193 | 0.9291 | 0.401 |
| Cr | Ridge | 3 | 1.0, False, svd | 0.0756 | 0.0266 | 0.8564 | 0.0129 | 0.8574 | 0.249 |
| Cu | Ridge | 6 | 100, False, lsqr | 0.0278 | 0.0025 | 0.9777 | 0.0025 | 0.9698 | 0.109 |
| Fe | SVR | 5 | 100, 0.1, poly | 5.5806 | 0.9048 | 0.9171 | 0.7118 | 0.9563 | 0.995 |
| Ga | LASSO | 6 | 0.1, False | 0.0574 | 0.0115 | 0.8717 | 0.0092 | 0.8652 | 0.820 |
| K | SVR | 9 | 0.1, 0.4, linear | 255.2621 | 16.0164 | 0.9562 | 14.5871 | 0.9501 | 0.418 |
| Mg | Ridge | 5 | 0.1, True, auto | 28.9303 | 2.4770 | 0.9078 | 2.4300 | 0.9011 | 0.640 |
| Mn | Ridge | 8 | 100, True, lsqr | 0.5719 | 0.0356 | 0.9515 | 0.0251 | 0.9458 | 0.805 |
| As | Ridge | 7 | 100, True, lsqr | 2.9441 | 0.3932 | 0.9336 | 0.2130 | 0.9651 | 0.907 |
| Cd | Ridge | 10 | 100, True, auto | 2.8137 | 0.3876 | 0.9097 | 0.1316 | 0.9368 | 0.449 |
| Pb | Ridge | 8 | 10, True, lsqr | 2.3884 | 0.0732 | 0.9371 | 0.0563 | 0.9093 | 0.314 |
| V | PLSR | 6 | 6 | 2.9761 | 0.0183 | 0.9618 | 0.0211 | 0.9290 | 0.262 |

^a SVR: 'C', 'epsilon', 'kernel'; LASSO: 'alpha', 'fit intercept'; PLSR: 'number of components'; Ridge: 'alpha', 'fit intercept', 'solver'.
Footnotes
† Zhiyong Zhang, Wennan Nie and Guangpu Fang contributed equally to this work.
This journal is © The Royal Society of Chemistry 2026