Zhiyong Zhang,†a Wennan Nie,†a Guangpu Fang,†a Jiahe Qian,a Hongxia Gan,c Jingchao Chenc and Wenlong Li*ab
aCollege of Pharmaceutical Engineering of Traditional Chinese Medicine, Tianjin University of Traditional Chinese Medicine, Tianjin, 301617, People's Republic of China. E-mail: zzytjutcm@163.com; 736243451@qq.com; 1248505630@qq.com; qjh0101@qq.com
bHaihe Laboratory of Modern Chinese Medicine, Tianjin, 301617, People's Republic of China. E-mail: wshlwl@tjutcm.edu.cn; Fax: +86 22 59596814; Tel: +86 22 59596814
cChengdu Kanghong Pharmaceutical Co. Ltd., Chengdu, 610036, People's Republic of China. E-mail: 024675@cnkh.com; 019357@cnkh.com
First published on 4th November 2025
Elemental analysis is crucial for determining the geographical origin of medicinal plants. This study employs laser-induced breakdown spectroscopy (LIBS) and machine learning to trace the origins and predict the elemental content of Hypericum perforatum L. (HPL). LIBS data were collected from 269 HPL samples across various regions, while inductively coupled plasma mass spectrometry quantified 15 elements. Eleven preprocessing methods were evaluated for their impact on model performance. We developed four origin traceability models and multi-element quantification models within a multi-output regression framework. Optimal spectral intervals were identified through a moving window algorithm, and bootstrap aggregating enhanced model accuracy. Spectral intervals 4–8 effectively distinguished samples from different origins, with the XGBoost model performing particularly well in identifying July-harvested HPL from Xinjiang. All models achieved R2 values exceeding 0.8574, and paired t-tests showed no significant differences between actual and predicted values, confirming their effectiveness in origin identification and elemental content prediction.
Elemental analysis is a crucial method for identifying the geographical origins of medicinal plants, offering valuable insights into environments and soil conditions.6 By analyzing the types and concentrations of trace elements in medicinal plants, unique fingerprint profiles of herbal sources can be generated, enabling precise traceability of the medicinal plants.7 Inductively coupled plasma mass spectrometry (ICP-MS) plays an invaluable role in elemental analysis due to its high sensitivity and accuracy. ICP-MS not only quantifies various trace elements but also simultaneously detects heavy metals and other toxic elements, thereby ensuring the safety and quality of medicinal plants.8–10
Laser-induced breakdown spectroscopy (LIBS) is a rapid, non-contact elemental analysis technique that has gained widespread application in agriculture and food sectors in recent years.11,12 LIBS enables the analysis of elemental composition within microseconds, requiring minimal sample preparation and significantly enhancing analytical efficiency.13 When combined with machine learning (ML) methods, LIBS can develop more accurate and reliable classification and prediction models. For instance, processing LIBS data with ML algorithms such as random forest (RF) and support vector machine (SVM) can effectively distinguish herbs based on their geographical origins and assess their quality grades.14–16 Furthermore, after determining the elemental content in samples using standard methods, LIBS data can be integrated with regression models like partial least squares regression (PLSR) and support vector regression (SVR) for the rapid prediction of elemental content in unknown samples.17–19
In recent years, LIBS combined with machine learning (LIBS-ML) has progressed rapidly across industries, driven by successive instrument upgrades and continuous advances in computing. With LIBS-ML, soil heavy metal information can be analyzed quickly through imaging and distribution analysis models.20 The technology is also gradually spreading into fields such as agricultural production and plant or herbal testing. However, existing studies on LIBS-ML for herbal analysis still have two critical limitations. First, most studies focus on either origin traceability or quantification of a few elements independently.21–23 Second, many studies have not systematically combined different preprocessing methods with feature selection methods when analyzing LIBS spectra, which may introduce uncertainty into the resulting models; in fact, these steps are the foundation for obtaining a well-adapted model. Moreover, multi-output model frameworks have been increasingly validated for evaluating complex systems in recent years, offering distinct advantages for their analysis.24–26 We therefore developed this work to verify the idea that machine-learning-driven LIBS spectra can resolve information in complex systems. The objectives of this work are twofold: (i) to classify HPL samples based on their geographical origin and harvest time using LIBS coupled with machine learning models, and (ii) to predict the concentrations of multiple elements in these samples. By achieving these aims, we demonstrate the dual-purpose potential of LIBS for both authentication and multi-element quantification in medicinal plants.
This study collected 269 batches of HPL samples from various sources and measured their LIBS data. Subsequently, ICP-MS was employed for the quantitative analysis of 15 elements in HPL. The impact of 11 different preprocessing methods on the LIBS data was evaluated, and multiple origin discrimination models and elemental content prediction models were developed based on various algorithms. Multi-output regression frameworks were utilized to simultaneously predict the content of multiple elements. A moving window algorithm was applied to determine the optimal spectral range for different elements, and bootstrap aggregating (bagging) was used to enhance prediction accuracy and reduce the risk of overfitting. Notably, a pronounced polarization in the concentrations of the 15 elements was observed: five elements were present at the ppm level (Al, Ca, Fe, K, and Mg), while another five were at the ppb level (Ba, Cr, Cu, Ga, and Mn). Due to this polarization, these 10 elements were modeled in two stages. Additionally, the concentrations of the five heavy metal elements were notably lower, with over 90% of samples showing undetectable levels of Hg, leading to its exclusion from modeling. For the four lower-concentration heavy metals (As, Cd, Pb, and V), a logarithmic transformation was applied to improve the prediction of their concentration levels.
A mixed standard solution containing the elements Mg, Al, K, Ca, V, Cr, Mn, Fe, Cu, Ga, As, Cd, Ba, Pb, and Hg at a concentration of 100 mg L−1, as well as internal standard reference solutions of Sc, Ge, In, and Bi at a concentration of 10 mg L−1 were purchased from Tianjin Yifang Technology Co., Ltd (Tianjin, China). LC-MS grade nitric acid (HNO3) was obtained from Thermo Fisher Scientific (Massachusetts, USA), and ultra-pure water was obtained from Wahaha Group Co., Ltd (Hangzhou, China).
000 rpm for 10 min. The supernatant was collected and stored for further analysis.
Accurately pipetted volumes (10 ml) of the mixed standard solution containing 15 elements were serially diluted to obtain working standards at various concentrations (10, 5, 1, 0.5, 0.1, 0.05, 0.01, 0.001, and 0.0001 mg L−1) for sample analysis. Similarly, 10 ml of the internal standard reference solution containing four elements was diluted to 1 mg L−1 and prepared for analysis.
The analytical conditions for ICP-MS (Shimadzu Corporation, Kyoto, Japan) were as follows: operating in collision reaction cell (CRC) mode; argon (Ar) as the carrier gas (purity 99.999%); helium (He) as the collision gas (purity 99.999%); operating power set to 1.19 kW; plasma gas flow rate at 9.0 L min−1; carrier gas flow rate at 0.70 L min−1; auxiliary gas flow rate at 1.10 L min−1; the nebulizer chamber temperature maintained at 5 °C; and the peristaltic pump speed adjusted to 20 rpm. Average values were obtained from three replicates for each sample.
The contents of 15 elements (24Mg, 27Al, 39K, 44Ca, 51V, 52Cr, 55Mn, 56Fe, 63Cu, 69Ga, 75As, 114Cd, 138Ba, 202Hg, and 208Pb) in HPL samples from various sources were determined using the standard curve method. For 24Mg, 27Al, 39K, 44Ca, 51V, 52Cr, 55Mn, and 56Fe, 45Sc served as the internal standard; for 63Cu, 69Ga, and 75As, 74Ge was used; 114Cd and 138Ba utilized 115In; and 202Hg and 208Pb used 209Bi. The standard curve was established by sequentially introducing the mixed standard solutions from low to high concentration. The sample tubes were then immersed in the test solution, with each sample measured in triplicate to obtain average values. Blank tests under the same analytical conditions were performed to account for blank interference. The establishment of the standard curve, blank subtraction, mean calculations, and element concentration determinations were conducted using LabSolutions™ ICPMS Ver. 2 software.
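As a minimal illustration of this internal-standard calibration workflow, the numpy sketch below fits a line to analyte/internal-standard intensity ratios and back-calculates a sample concentration from triplicate readings. All intensity counts, the six-point standard subset, and the helper name `predict_conc` are invented for demonstration; they are not measured values from this study.

```python
import numpy as np

# Hypothetical blank-corrected counts for one analyte across a standard series;
# the internal standard signal corrects for instrument drift and matrix effects.
std_conc = np.array([0.05, 0.1, 0.5, 1.0, 5.0, 10.0])               # mg L-1
analyte_counts = np.array([520., 1010., 5080., 10150., 50400., 101000.])
istd_counts = np.array([9800., 10100., 9900., 10050., 10000., 9950.])

# Calibration on the intensity ratio (analyte / internal standard)
ratio = analyte_counts / istd_counts
slope, intercept = np.polyfit(std_conc, ratio, 1)

def predict_conc(sample_counts, sample_istd, blank_ratio=0.0):
    """Convert a blank-corrected intensity ratio back to concentration (mg L-1)."""
    r = sample_counts / sample_istd - blank_ratio
    return (r - intercept) / slope

# Triplicate measurements of one sample -> mean concentration
replicates = [predict_conc(c, i) for c, i in zip([25200., 25500., 24900.],
                                                 [10000., 10100., 9900.])]
mean_conc = float(np.mean(replicates))
```

In practice the blank ratio would be measured under the same conditions and subtracted before back-calculation, as the text describes.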
To further improve the robustness and generalization performance of the model, the bagging ensemble strategy was introduced. Random sampling with replacement was performed on the training set to generate B different sub-training sets (with B optimized through cross-validation), and a base regression model was trained independently on each sub-training set. For a new input sample, the predictions of all B base models were averaged to obtain the final multi-output prediction. By averaging the predictions of multiple models, the bagging strategy effectively reduces overfitting, which is especially valuable for high-dimensional, small-sample LIBS data. All model training parameter settings are explained on the last page of the SI.
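The bagging procedure described above can be sketched as follows. A closed-form multi-output ridge model stands in for the paper's base regressors; the toy data, B = 25, and the regularization strength are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, Y, lam=1.0):
    """Closed-form multi-output ridge: W = (X'X + lam*I)^-1 X'Y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ Y)

def bagging_fit_predict(X_train, Y_train, X_new, B=25, lam=1.0):
    """Draw B bootstrap resamples, fit one base model on each,
    and average the B predictions for the new samples."""
    n = X_train.shape[0]
    preds = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)       # sampling with replacement
        W = fit_ridge(X_train[idx], Y_train[idx], lam)
        preds.append(X_new @ W)
    return np.mean(preds, axis=0)              # final multi-output prediction

# Toy data: 40 "spectra" with 10 channels, 3 target "element concentrations"
X = rng.normal(size=(40, 10))
W_true = rng.normal(size=(10, 3))
Y = X @ W_true + 0.05 * rng.normal(size=(40, 3))
Y_hat = bagging_fit_predict(X, Y, X, B=25)
```

Averaging over resampled fits trades a little bias for a large reduction in variance, which is the mechanism behind the overfitting reduction noted in the text.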
First, the isolation forest (IF) algorithm was employed to identify outlier samples. Then, the K-nearest neighbors (KNN) model was used to assess sample recovery. Model performance was evaluated using cross-validation scores, accuracy, Kappa value, and area under the curve (AUC). Next, a stratified random sampling algorithm was applied to partition the entire sample set into three subsets in a 7 : 2 : 1 ratio according to sample categories. The coverage rate of the 95% confidence interval for the training and testing samples was utilized to assess the representativeness of the dataset.
Subsequently, the effects of 11 different spectral preprocessing methods on model performance were evaluated, including multiplicative scatter correction (MSC), standard normal variate transformation (SNV), Savitzky–Golay smoothing (SG), first derivative (1st Der), second derivative (2nd Der), wavelet transform (WT), mean centering (MC), standardization (STD), min–max normalization (MMN), vector normalization (VN), and moving average filter (MAF). Four classification models (KNN, SVM, RF, and extreme gradient boosting (XGB)) were used to identify samples. After determining the optimal base model, the moving window algorithm was employed to identify the best spectral range, and feature importance scores were used to select key variables. Finally, 27 batches of external validation samples were utilized to assess the generalization capability of the models, ensuring that they could reliably predict the origins of HPL samples not included in the initial training dataset.
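The moving-window step can be illustrated with a simplified sketch: the spectrum is cut into 10 contiguous windows and each window is scored separately. A nearest-centroid classifier stands in here for the KNN/SVM/RF/XGB models used in the study, and the toy spectra and informative channel range are fabricated for demonstration.

```python
import numpy as np

def moving_window_scores(X, y, n_windows=10):
    """Split the spectrum into contiguous windows and score each window with a
    simple nearest-centroid classifier (training accuracy here, just to rank
    windows; cross-validation would be used in practice)."""
    classes = np.unique(y)
    scores = []
    for block in np.array_split(np.arange(X.shape[1]), n_windows):
        Xw = X[:, block]
        centroids = np.stack([Xw[y == c].mean(axis=0) for c in classes])
        # assign each sample to the nearest class centroid
        d = ((Xw[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        pred = classes[np.argmin(d, axis=1)]
        scores.append((pred == y).mean())
    return np.array(scores)

# Toy spectra: class separation only lives in channels 40-60
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 100))
y = np.repeat([0, 1, 2], 20)
X[:, 40:60] += y[:, None] * 2.0          # informative region
scores = moving_window_scores(X, y, n_windows=10)
best = int(np.argmax(scores))
```

Ranking windows this way localizes the origin-discriminating information before the heavier models are trained on the best interval.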
First, the IF algorithm was employed to identify outlier samples. PLSR was then used to assess sample recovery, with the root mean square error of cross-validation (RMSEcv) serving as the performance evaluation criterion. Subsequently, the representativeness of training sets obtained through three dataset partitioning methods (sample set partitioning based on joint X–Y distances (SPXY), Kennard–Stone (KS), and stratified random sampling) was compared on the basis of 95% confidence intervals. Next, the impact of 11 spectral preprocessing methods across four different types on model performance was evaluated. Four regression models (PLSR, LASSO, Ridge, and SVR) were utilized as base models, with multi-output regression employed as a meta-model to simultaneously predict the concentrations of the 14 elements in HPL samples. Owing to variations in concentration, the 14 elements were categorized into three classes for improved predictive accuracy: (1) major elements (Al, Ca, Fe, K, and Mg); (2) trace elements (Ba, Cr, Cu, Ga, and Mn); and (3) heavy metals (As, Cd, Pb, and V). The optimal spectral preprocessing method and corresponding base model were determined for each element, and the optimal spectral range was identified using a moving window algorithm. Finally, 10 batches of HPL samples were used to validate the generalizability of the models. This comprehensive approach ensures accurate and reliable quantification of elemental concentrations in HPL samples, enhancing the robustness and applicability of the developed models.
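The multi-output meta-model idea maps directly onto scikit-learn's `MultiOutputRegressor`, which fits one independent base regressor per target element. The sketch below uses SVR as the base model on fabricated data; the hyperparameters are illustrative, not the study's tuned values.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR

# Toy stand-ins for preprocessed LIBS intensities (X) and ICP-MS concentrations (Y)
rng = np.random.default_rng(2)
X = rng.normal(size=(80, 20))
coef = rng.normal(size=(20, 5))
Y = X @ coef + 0.1 * rng.normal(size=(80, 5))   # 5 "elements" per sample

# One independent SVR per target element, driven through a single wrapper
model = MultiOutputRegressor(SVR(kernel="linear", C=10.0, epsilon=0.1))
model.fit(X, Y)
Y_hat = model.predict(X)
```

The same wrapper accepts PLSR, LASSO, or Ridge as the base estimator, which is how one framework can serve all three element classes.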
After data cleaning, 16 033 variables were retained for further analysis. The cleaned LIBS spectra are shown in SI Fig. S1.
Despite the reduction to 16 033 variables after data cleaning, this number still presents computational and processing challenges. Considering that t-SNE requires substantial computational time and resources for high-dimensional data analysis, this study applied a moving window algorithm to segment the entire spectrum into 10 parts for individual modeling, with results shown in Fig. 2.
Dimensionality reduction based on intervals 4, 5, 6, and 8 classified the samples into three distinct categories, indicating significant differences in trace element content among HPL samples from various sources within these intervals and highlighting the variability in sample characteristics.
The results of LIBS outlier detection based on the isolation forest algorithm are shown in Fig. S2, with batch numbers 1, 77, and 118 identified as outliers. To evaluate whether removing these samples would negatively affect model performance, we constructed a KNN model using the spectral data as the independent variable X and the sample category as the dependent variable Y. The results are presented in Table 1. Ultimately, sample no. 1 was excluded to ensure the accuracy and reliability of the model.
| Code | CV score (training) | CV score (test) | Accuracy | Kappa value | AUC (macro) | AUC (weighted) |
|---|---|---|---|---|---|---|
| Total | 0.9812 | 1.0000 | 0.9815 | 0.9666 | 0.9739 | 0.9773 |
| Exclude no. 1 | 0.9812 | 1.0000 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
| Exclude no. 77 | 0.9766 | 0.9766 | 0.9815 | 0.9598 | 0.9985 | 0.9994 |
| Exclude no. 118 | 0.9766 | 0.9766 | 0.9815 | 0.9598 | 0.9995 | 0.9998 |
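The isolation forest screening step can be sketched with scikit-learn's `IsolationForest`. The synthetic spectra, the planted outlier, and the contamination rate below are assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(3)
spectra = rng.normal(size=(120, 50))        # stand-in for cleaned LIBS spectra
spectra[0] += 8.0                            # plant one obvious outlier batch

iso = IsolationForest(contamination=0.025, random_state=0)
flags = iso.fit_predict(spectra)            # -1 = outlier, 1 = inlier
outlier_idx = np.where(flags == -1)[0]
```

Candidate outliers flagged this way would then be re-checked with a downstream model (as Table 1 does with KNN) before any batch is actually discarded.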
The dataset was then partitioned in an 8 : 2 ratio, resulting in 192 samples designated for the training set and 49 samples for the validation set. The PCA distribution results after dataset partitioning are shown in Fig. S3. The results indicate that the 95% confidence interval of the training samples adequately covers the validation samples, demonstrating good consistency between the training and validation sets in feature space. This provides a strong foundation for subsequent model training and evaluation.
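A coverage check of this kind can be sketched with plain numpy: project the validation samples onto the training set's first two principal components and count how many fall inside the training 95% confidence ellipse (Mahalanobis distance compared against the chi-square 0.95 quantile for 2 degrees of freedom, 5.991). The data here are synthetic.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """PCA via SVD; returns scores plus the mean and loadings for projection."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T, X.mean(axis=0), Vt[:n_components]

def ci95_coverage(train_scores, val_scores):
    """Fraction of validation scores inside the training set's 95% confidence
    ellipse (Mahalanobis distance vs. chi2(0.95, df=2) = 5.991)."""
    mu = train_scores.mean(axis=0)
    cov = np.cov(train_scores, rowvar=False)
    inv = np.linalg.inv(cov)
    d2 = np.einsum('ij,jk,ik->i', val_scores - mu, inv, val_scores - mu)
    return float((d2 <= 5.991).mean())

rng = np.random.default_rng(4)
X = rng.normal(size=(241, 30))
train, val = X[:192], X[192:]
t_scores, mean_, comps = pca_scores(train)
v_scores = (val - mean_) @ comps.T          # project validation into train PCA space
coverage = ci95_coverage(t_scores, v_scores)
```

A coverage close to 0.95 indicates the training set spans the validation set's region of feature space, as Fig. S3 shows for the real data.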
Fig. 3 Model performance evaluation based on the combination of different modeling and preprocessing methods.
| Model | Pretreatment method | CV score (training) | CV score (test) | Accuracy | Kappa value | AUC (macro) | AUC (weighted) |
|---|---|---|---|---|---|---|---|
| KNN | SG | 1.0000 | 1.0000 | 0.9259 | 0.8594 | 0.9136 | 0.9409 |
| SVM | MMN | 1.0000 | 1.0000 | 0.9444 | 0.8962 | 0.9553 | 0.9752 |
| RF | VN | 1.0000 | 1.0000 | 0.9444 | 0.8962 | 0.9833 | 0.9907 |
| XGB | 2nd Der | 1.0000 | 1.0000 | 0.9444 | 0.8962 | 0.9962 | 0.9979 |
The classification accuracy results on the validation set are shown in Fig. S5A, with all samples being correctly identified. This indicates that the variable selection method effectively filters out irrelevant spectral information and extracts significant differential information related to origin and harvest season. The predicted probabilities and frequency distribution for the validation set are illustrated in Fig. S5B. The x-axis represents the predicted probabilities for different categories, while the y-axis denotes the number of samples from three different origins within specific probability ranges. Notably, at a predicted probability of 0.9, the “XJ-7” category reached its highest frequency with 30 occurrences, demonstrating distinctly recognizable characteristics that allow the model to accurately classify this category. In contrast, samples from Xinjiang collected in September and from other origins exhibited higher frequencies in the low probability range, indicating uncertainty in classification decisions for these samples. This may be due to insufficient feature learning by the model or significant compositional differences among these samples.
| Elements | Concentration (mg L−1) | RMSEcv (mg L−1), all samples | RMSEcv, exclude no. 1 | RMSEcv, exclude no. 77 | RMSEcv, exclude no. 118 |
|---|---|---|---|---|---|
| Ba | 0.0861 | 0.0142 | 0.0152 | 0.0170 | 0.0149 |
| Cr | 0.0756 | 0.0293 | 0.0295 | 0.0343 | 0.0268 |
| Cu | 0.0278 | 0.0106 | 0.0129 | 0.0109 | 0.0105 |
| Ga | 0.0574 | 0.0142 | 0.0131 | 0.0159 | 0.0145 |
| Mn | 0.5719 | 0.0873 | 0.0875 | 0.0906 | 0.0854 |
| Al | 6.9685 | 1.9810 | 1.2804 | 1.6307 | 1.6953 |
| Ca | 106.3246 | 15.8832 | 14.2926 | 15.7819 | 15.7580 |
| Fe | 5.5806 | 1.3811 | 0.9510 | 1.2911 | 1.2552 |
| K | 255.2621 | 28.7515 | 23.5119 | 29.5293 | 26.4463 |
| Mg | 28.9303 | 4.2338 | 4.4061 | 4.6790 | 4.1354 |
| As | 2.9441 | 0.9354 | 1.0309 | 0.8591 | 0.8930 |
| Cd | 2.8137 | 0.7903 | 0.3592 | 0.2444 | 0.7917 |
| Pb | 2.3884 | 0.1356 | 0.1612 | 0.1149 | 0.1275 |
| V | 2.9761 | 0.1018 | 0.1087 | 0.0827 | 0.0962 |
The dataset was partitioned in an 8 : 2 ratio, resulting in 126 samples for the training set and 31 samples for the internal validation set. The partitioning results are illustrated in Fig. S7. Based on the overlap of the 95% confidence intervals, the calibration set partitioned by the SPXY algorithm adequately covered the validation set data. Consequently, the SPXY method was employed for data partitioning in further research.
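A minimal sketch of SPXY selection, assuming the usual formulation in which Kennard–Stone-style selection runs on a joint distance that normalizes and sums the X-space and Y-space distances. The data shapes mirror the 126/31 split but the values are synthetic.

```python
import numpy as np

def spxy_split(X, Y, n_train):
    """SPXY calibration-set selection: Kennard-Stone on the joint distance
    d = dX/max(dX) + dY/max(dY), so both spectra and concentrations
    shape the coverage of the training set."""
    def pdist(A):
        diff = A[:, None, :] - A[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=2))
    dX, dY = pdist(X), pdist(np.atleast_2d(Y.T).T)
    D = dX / dX.max() + dY / dY.max()

    # start from the two most distant samples, then always add the sample
    # whose minimum distance to the current selection is largest
    i, j = np.unravel_index(np.argmax(D), D.shape)
    selected = [i, j]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_train:
        mind = D[np.ix_(remaining, selected)].min(axis=1)
        selected.append(remaining.pop(int(np.argmax(mind))))
    return np.array(selected), np.array(remaining)

rng = np.random.default_rng(5)
X = rng.normal(size=(157, 12))
Y = X @ rng.normal(size=12)
train_idx, val_idx = spxy_split(X, Y, n_train=126)
```

Including the Y-distance is what distinguishes SPXY from plain Kennard-Stone: the calibration set also covers the concentration range, not just the spectral space.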
It is noteworthy that the concentrations of these 15 elements exhibit a two-tier differentiation: five elements are at the ppm level (Al, Ca, Fe, K, and Mg), while the other five are at the ppb level (Ba, Cr, Cu, Ga, and Mn). Consequently, we opted to model these ten elements in two phases. Additionally, because of the low concentrations of the five heavy metals, with over 90% of samples showing no detectable mercury (Hg), this element was not modeled. For the remaining lower-concentration heavy metals (As, Cd, Pb, and V), a logarithmic transformation was applied to better handle their magnitudes.
The optimal model results for the 14 elements based on different preprocessing methods are presented in Fig. 5. For elements with higher concentrations, the SVR model demonstrated superior performance, whereas Ridge and LASSO regression performed better for elements with lower concentrations. Specifically, the optimal combinations were: Al, SVR-VN; Ba, LASSO-MSC; Ca, PLSR-MSC; Cr, LASSO-MSC; Cu, Ridge-MAF; Fe, SVR-MMN; Ga, LASSO-MSC; K, SVR-1st Der; Mg, Ridge-SNV; Mn, Ridge-STD; As, Ridge-SG; Cd, Ridge-2nd Der; V, PLSR-2nd Der; and Pb, Ridge-2nd Der. These results indicate variations in model performance across the different element predictions, providing a useful reference for further research and applications.
We subsequently used SHAP values to interpret the models. Because Ridge and PLS regression models are costly to explain on high-dimensional data, we reduced the sample size to accelerate the SHAP calculations while preserving the representativeness of the selected data. To this end, K-means clustering was used to determine an appropriate number of representative samples, with the centroid of each cluster representing that cluster's characteristics. The optimal number of clusters was chosen using the silhouette coefficient, which measures how similar a sample is to its own cluster compared with other clusters; it ranges from −1 to 1, and a higher average indicates better clustering quality. The optimal K values were: K = 11 for Mg; K = 12 for Mn; K = 10 for Ca; K = 16 for Cu; and K = 10 for Cr. Detailed clustering results are presented in SI Fig. S8. These results provide a stable sample foundation for the subsequent SHAP calculations, thereby enhancing the effectiveness of model interpretability.
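The silhouette-guided choice of K can be sketched with scikit-learn. The three synthetic clouds and the candidate range K = 2–7 are illustrative, not the study's data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(6)
# Toy SHAP-background candidates: three well-separated clouds
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(40, 5)) for c in (0.0, 3.0, 6.0)])

best_k, best_s = None, -1.0
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    s = silhouette_score(X, km.labels_)     # mean silhouette over all samples
    if s > best_s:
        best_k, best_s = k, s

# The cluster centroids then stand in for the full set in SHAP computations
centroids = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X).cluster_centers_
```

Replacing the full training set with a few dozen centroids is what makes the SHAP background tractable while still reflecting the data distribution.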
SHAP is an additive feature attribution method that interprets a model's predicted value as a linear sum of contributions from individual features. Positive SHAP values indicate a positive contribution of the feature to the current prediction, increasing the likelihood of the predicted output, while negative values indicate a negative contribution, reducing that likelihood. In this study, we extracted features with SHAP values greater than zero (SI Fig. S9). Within the selected intervals, features with higher SHAP values are clearly associated with their respective spectral peaks, providing crucial insight into the decision-making process of the model. To validate generalization capability, we selected 10 batches of samples for further testing (Table 4). All models achieved R2 values exceeding 0.8574, and paired t-test results indicated no significant differences between the actual and predicted values. These results demonstrate that the developed models perform well not only on the training data but also possess strong generalization ability, effectively predicting the characteristics of unseen samples.
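For linear base models such as Ridge, LASSO, and PLSR, SHAP values have a closed form: the attribution of feature j is w_j(x_j − E[x_j]) relative to a background set, and the attributions sum to the deviation of the prediction from the background mean. A numpy sketch, with fabricated weights and data:

```python
import numpy as np

def linear_shap(w, X, background):
    """Exact SHAP values for a linear model f(x) = w.x + b:
    phi_ij = w_j * (x_ij - mean(background_j)).  The attributions sum to
    f(x) - E[f(background)], the standard SHAP additivity property."""
    mu = background.mean(axis=0)
    return X * w - mu * w          # broadcasts to one attribution row per sample

rng = np.random.default_rng(7)
w = rng.normal(size=6)
background = rng.normal(size=(30, 6))    # e.g. the K-means centroids above
X = rng.normal(size=(5, 6))

phi = linear_shap(w, X, background)
# additivity check: contributions sum to the shift from the mean prediction
lhs = phi.sum(axis=1)
rhs = X @ w - background.mean(axis=0) @ w
```

The same additivity property is what lets features with consistently positive attributions be read off as origin- or concentration-driving spectral variables.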
| Elements | Model | Best interval | Parameters^a | Concentration (mg L−1) | RMSEc (mg L−1) | R²c | RMSEp (mg L−1) | R²p | t-Test |
|---|---|---|---|---|---|---|---|---|---|
| Al | SVR | 5 | 100, 0.4, poly | 6.9685 | 1.4325 | 0.8688 | 1.0563 | 0.9389 | 0.717 |
| Ba | LASSO | 6 | 0.1, False | 0.0861 | 0.0136 | 0.8514 | 0.0081 | 0.9062 | 0.530 |
| Ca | PLSR | 4 | 15 | 106.3246 | 5.7783 | 0.9565 | 5.1193 | 0.9291 | 0.401 |
| Cr | Ridge | 3 | 1.0, False, svd | 0.0756 | 0.0266 | 0.8564 | 0.0129 | 0.8574 | 0.249 |
| Cu | Ridge | 6 | 100, False, lsqr | 0.0278 | 0.0025 | 0.9777 | 0.0025 | 0.9698 | 0.109 |
| Fe | SVR | 5 | 100, 0.1, poly | 5.5806 | 0.9048 | 0.9171 | 0.7118 | 0.9563 | 0.995 |
| Ga | LASSO | 6 | 0.1, False | 0.0574 | 0.0115 | 0.8717 | 0.0092 | 0.8652 | 0.820 |
| K | SVR | 9 | 0.1, 0.4, linear | 255.2621 | 16.0164 | 0.9562 | 14.5871 | 0.9501 | 0.418 |
| Mg | Ridge | 5 | 0.1, True, auto | 28.9303 | 2.4770 | 0.9078 | 2.4300 | 0.9011 | 0.640 |
| Mn | Ridge | 8 | 100, True, lsqr | 0.5719 | 0.0356 | 0.9515 | 0.0251 | 0.9458 | 0.805 |
| As | Ridge | 7 | 100, True, lsqr | 2.9441 | 0.3932 | 0.9336 | 0.2130 | 0.9651 | 0.907 |
| Cd | Ridge | 10 | 100, True, auto | 2.8137 | 0.3876 | 0.9097 | 0.1316 | 0.9368 | 0.449 |
| Pb | Ridge | 8 | 10, True, lsqr | 2.3884 | 0.0732 | 0.9371 | 0.0563 | 0.9093 | 0.314 |
| V | PLSR | 6 | 6 | 2.9761 | 0.0183 | 0.9618 | 0.0211 | 0.9290 | 0.262 |

^a SVR: 'C', 'epsilon', 'kernel'; LASSO: 'alpha', 'fit intercept'; PLSR: 'number of components'; Ridge: 'alpha', 'fit intercept', 'solver'.
Footnotes
† Zhiyong Zhang, Wennan Nie and Guangpu Fang contributed equally to this work.
This journal is © The Royal Society of Chemistry 2026