Open Access Article
Siyao Li,a Haiyan Qin,b Xi Wua and Zhirong Suo*a
aSchool of Materials and Chemistry, Southwest University of Science and Technology, Mianyang 621010, Sichuan, People's Republic of China. E-mail: suozhirong@163.com
bCollege of Life Sciences and Agri-Forest, Southwest University of Science and Technology, Mianyang 621010, Sichuan, People's Republic of China
First published on 25th March 2026
With the continuous development of industry, accidental or unauthorized discharges of wastewater from manufacturing enterprises have increasingly severe impacts on the environment. Rapid and accurate identification of industrial wastewater sources is of great significance for enhancing regulatory oversight and environmental protection. In this study, we propose a novel approach for discriminating the sources of industrial wastewater by integrating data standardization with ensemble learning. High-performance liquid chromatography (HPLC) is employed to analyze wastewater samples. To ensure data consistency and accuracy, a local stretching alignment method combined with Gaussian fitting is introduced for precise peak alignment in chromatographic data. We compare the modeling performance of two ensemble learning algorithms: Random Forest (RF) and Extreme Gradient Boosting (XGBoost). To further improve model accuracy, hyperparameter optimization is conducted using the Optuna framework. The models are systematically evaluated through five-fold cross-validation. Experimental results show that the optimized RF model achieves an average cross-validation accuracy of 97.87%, a test set accuracy of 98.28%, and an F1 score of 0.9799, while its accuracy on newly collected samples reaches 95.08%, demonstrating excellent overall performance and a well-balanced trade-off between precision and recall. This approach provides an efficient and reliable analytical tool for tracing the sources of industrial wastewater.
Currently, numerous methods exist for identifying water pollution. For instance, traceability methods that rely on conventional water quality indicators, such as chemical oxygen demand (COD), ammonia nitrogen (NH3-N), and total phosphorus (TP), are widely employed.7,8 While these methods can indicate whether water quality is abnormal, they often provide limited information, making it challenging to accurately distinguish and identify wastewater from different enterprises.9 Furthermore, analytical methods based on isotopes and trace elements have gained extensive application in river basin pollution source traceability.10 Sun et al.11 used the isotopes δ13C and δ15N to trace pollution sources and successfully identified the organic matter in the Xinhe Estuary of the Yongding River as originating from urban wastewater and terrestrial sources. However, the elemental composition of industrial wastewater is typically complex and highly similar, and this method cannot effectively distinguish among sources. Therefore, it is imperative to explore other, more efficient identification methods.
To improve identification accuracy, researchers have increasingly been developing methods based on advanced indicators. For example, spectroscopic analytical techniques such as ultraviolet-visible (UV-Vis),12 near-infrared (NIR),13 and fluorescence (FLD)14,15 are based on the principle that the absorption or emission of a substance at specific wavelengths is governed by its molecular structure.16 When pollutants produced by different enterprises exhibit similar molecular structures, the discriminatory capability of these spectral analysis methods becomes limited. Additionally, spectral analysis techniques are susceptible to external environmental interference during detection, and the resulting data often require complex processing and analysis methods.17 High-performance liquid chromatography (HPLC) has become an important technique for identifying pollutants in water due to its high speed, sensitivity, accuracy, and excellent capability for separating complex samples.18 For instance, József et al.19 utilized HPLC to identify four painkillers in Danube water, while Gure et al.20 integrated HPLC-diode array detection (HPLC-DAD) with ion pair-assisted liquid–liquid extraction (IPA-LLE) to identify six sulfonylureas and four organophosphorus pesticides from three different environmental water bodies in Ethiopia. Although HPLC can accurately and sensitively detect trace pollutants in effluent environments, due to the complex composition of industrial wastewater, it is difficult to achieve rapid identification solely with this method. 
Machine learning algorithms possess strong logical computing capabilities and demonstrate excellent performance in analyzing complex, nonlinear, and multidimensional data.21 Lu et al.22 analyzed the relative content of eight pigments in olive oil from five major producing regions in China using HPLC and combined three machine learning algorithms, random forest (RF), k-nearest neighbor (KNN), and decision tree (DT), to achieve rapid classification of olive oil producing origins, with the RF model achieving a classification accuracy of 96%. Zhong et al.23 established HPLC fingerprinting of the polysaccharides from Citri Reticulatae Pericarpium (Chenpi) and assessed the capabilities of nine machine learning algorithms to discriminate between different varieties of Chenpi. Among these, five models, namely linear discriminant analysis (LDA), support vector machine (SVM), artificial neural network (ANN), logistic regression (LR), and quadratic discriminant analysis (QDA), achieved accuracy, precision, recall, and F1-score values all greater than 0.888. However, to date, no reports have been published on the application of HPLC combined with machine learning algorithms for the identification of industrial wastewater.
In this paper, we present a method for identifying wastewater from four enterprises, Sichuan Dongcai New Material Co., Ltd (DC), Mianyang Heze Chemical Co., Ltd (HZ), Sichuan Jin'an Environmental Technology Co., Ltd (JA), and Sichuan Qisai Microelectronics Co., Ltd (QS), that discharge into a centralized wastewater treatment plant within an industrial park in Mianyang, China. HPLC is employed to capture the characteristic features of the wastewater, followed by a series of preprocessing steps applied to the raw data. The RF and extreme gradient boosting (XGBoost) algorithms are then implemented to identify these industrial wastewaters. By comparing the performance of the two algorithms, the RF model is selected as the more appropriate baseline model for this study. In addition, Savitzky–Golay (SG) smoothing is applied for data preprocessing, and the Optuna algorithm is used to optimize the RF model's parameters. The optimized RF model exhibits a notable improvement in accuracy. This method enables rapid identification of wastewater from different enterprises across the industrial park and provides reliable evidence for regulatory supervision.
| Enterprises | Sampling point |
|---|---|
| DC | Category I wastewater inlet, category II wastewater inlet, fine screen chamber, clear water tank, PV resin production wastewater collection tank, cleaning wastewater collection tank, specialty resin production wastewater collection tank, secondary sedimentation tank, wastewater outlet, final discharge outlet of the plant |
| HZ | General wastewater inlet, primary sedimentation tank, hydrolysis tank, aerobic tank, secondary sedimentation tank, final sedimentation tank, final wastewater discharge outlet |
| JA | General wastewater inlet, coagulation sedimentation tank, primary sedimentation tank, anoxic tank, aerobic tank, secondary sedimentation tank, final wastewater discharge outlet |
| QS | General wastewater inlet, primary PAM dosing tank, primary sedimentation tank, secondary PAM dosing tank, reconditioning tank, final wastewater discharge outlet |
A total of 290 samples were collected in this study, and an organic glass water sampler (Henan Xinchangyuan Experimental Equipment Co., Ltd, China) was used for on-site sampling. The collected water samples were stored in 500 mL high-borosilicate glass bottles (Chengdu Huangyu Experimental Co., Ltd, China), promptly transported to the laboratory, stored at 4 °C, and filtered through a 0.45 µm membrane filter (Tianjin Jinteng Experimental Equipment Co., Ltd, China). The filtrate was transferred into HPLC vials. HPLC-grade methanol was purchased from Hunan Tengma New Material Co., Ltd (Hunan, China), and ultrapure water was produced using a Milli-Q water purification system.
We calculated the similarity between the chromatographic data obtained at all sampling times. The Pearson correlation coefficient was used to assess the similarity among samples from the four enterprises. The similarity between samples from each enterprise exceeded 93%, indicating good stability in the wastewater composition, which is conducive to the subsequent identification of wastewater from the four enterprises.
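As a minimal sketch, the pairwise Pearson similarity between aligned chromatograms can be computed with NumPy; the traces below are synthetic stand-ins, not the study's data:

```python
import numpy as np

def pairwise_pearson(chromatograms):
    """Pearson correlation between every pair of aligned chromatograms.

    `chromatograms` is an (n_samples, n_points) array of absorbance
    traces sampled on a common retention-time grid.
    """
    return np.corrcoef(np.asarray(chromatograms, dtype=float))

# Two traces with the same profile (different scale/offset) and one
# dissimilar trace, for illustration.
t = np.linspace(0, 10, 500)
a = np.exp(-(t - 3) ** 2) + 0.5 * np.exp(-(t - 7) ** 2)
b = a * 1.1 + 0.01   # same profile, different concentration/baseline
c = np.exp(-(t - 5) ** 2)
sim = pairwise_pearson([a, b, c])
print(round(sim[0, 1], 4))  # linear scaling/offset do not affect r -> 1.0
```

Because Pearson r is invariant to linear scaling and offsets, concentration differences between samplings do not mask a shared chromatographic profile, which makes it a natural stability measure here.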
Data at 230 nm, 250 nm, 280 nm, and 310 nm were examined. The chromatograms at 230 nm, 250 nm, and 280 nm exhibited numerous peaks with identical retention times, resulting in similar overall chromatographic profiles and introducing interference, thereby reducing the model's classification accuracy. In contrast, at 310 nm, the overall chromatographic profiles showed distinct differences, and preliminary classification predictions yielded better results than at 230 nm and 280 nm. Therefore, 310 nm was selected as the study wavelength.
After data cleaning, the scipy.signal.find_peaks function is employed to extract the peaks from the samples. This method identifies significant chromatographic peaks based on specified parameters, such as peak height, adjacent peak spacing, and prominence. Following peak extraction, the retention times of each peak are calculated, and the frequency and distribution of these retention times are analyzed to provide a foundation for subsequent standard retention time setting. Based on the frequency and proportion of retention times, peaks with an occurrence frequency greater than 50% within each enterprise were selected as standard retention times, and the chromatographic peaks were aligned using a local stretching method. Concurrently, the peak area is corrected by Gaussian fitting to ensure consistency across samples at the standard retention time.
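The peak-extraction step can be sketched as follows; the chromatogram and the `height`, `distance`, and `prominence` values are illustrative assumptions, not the settings used in the study:

```python
import numpy as np
from scipy.signal import find_peaks

# Synthetic chromatogram: three Gaussian peaks on a 0-20 min grid,
# plus a little baseline noise.
t = np.linspace(0, 20, 2000)
signal = (1.0 * np.exp(-((t - 4.0) / 0.15) ** 2)
          + 0.6 * np.exp(-((t - 9.5) / 0.20) ** 2)
          + 0.3 * np.exp(-((t - 15.2) / 0.18) ** 2))
signal += np.random.default_rng(0).normal(0, 0.005, t.size)

# height filters out baseline noise, distance enforces a minimum spacing
# between detected peaks, and prominence rejects small shoulders.
peaks, props = find_peaks(signal, height=0.05, distance=50, prominence=0.05)
retention_times = t[peaks]
print(np.round(retention_times, 2))
```

The retention times of the detected peaks (`t[peaks]`) are what feed the frequency analysis used to choose the standard retention times.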
The specific procedure is illustrated in Fig. 2, which presents a schematic diagram of peak alignment. First, peak extraction is performed. The extracted peaks are then matched with the standard retention times: a match is considered successful if the difference between an extracted peak and a standard retention time is below a threshold of 0.13 minutes, and a greedy strategy ensures that each standard peak corresponds to a single extracted peak. For successfully matched peaks, local stretching alignment is applied to bring the peak center onto the standard retention time. Subsequently, a Gaussian fit is performed on each aligned peak to determine its area: the parameters of the Gaussian-plus-baseline model (peak height, width, and baseline, with the center fixed at the standard retention time) are obtained via nonlinear least-squares fitting, and the fitted peak area is then calculated by trapezoidal numerical integration (scipy.integrate.trapezoid). If the relative error between the fitted and original areas is less than 10%, the peak is considered effectively aligned.
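The Gaussian-fitting and area-check step can be sketched like this. This is a minimal illustration of the scheme described above, not the authors' implementation; the 0.5 min fitting window, the flat baseline model, and the helper name `fit_peak_area` are assumptions:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.integrate import trapezoid

def fit_peak_area(t, y, center, window=0.5):
    """Fit a Gaussian with fixed center plus a flat baseline to one peak.

    Returns (fitted_area, relative_error vs. the raw area). Only height,
    width, and baseline are free parameters; `center` is the standard
    retention time. The 0.5 min window is an assumed default.
    """
    m = (t > center - window) & (t < center + window)
    tt, yy = t[m], y[m]

    def model(x, h, w, b):  # Gaussian with fixed center, flat baseline b
        return h * np.exp(-((x - center) / w) ** 2) + b

    p0 = [yy.max() - yy.min(), 0.1, yy.min()]
    (h, w, b), _ = curve_fit(model, tt, yy, p0=p0)

    fitted = model(tt, h, w, b) - b               # baseline-subtracted fit
    area_fit = trapezoid(fitted, tt)              # scipy.integrate.trapezoid
    area_raw = trapezoid(yy - yy.min(), tt)
    rel_err = abs(area_fit - area_raw) / area_raw
    return area_fit, rel_err

# Synthetic peak located exactly at the standard retention time (5.0 min).
t = np.linspace(0, 10, 2000)
y = 0.8 * np.exp(-((t - 5.0) / 0.12) ** 2) + 0.02
area, err = fit_peak_area(t, y, center=5.0)
print(err < 0.10)  # alignment accepted when relative error < 10%
```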
To eliminate scale differences between samples and enhance data comparability, the chromatographic features are also standardized, ensuring that the feature scales of all samples are aligned. Subsequently, the data are preprocessed using the first derivative (D1st), second derivative (D2nd), baseline correction (BC), and SG smoothing, and the effects of these different preprocessing methods on the model are compared.
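The standardization step can be sketched with scikit-learn; the text does not specify the exact scaler, so z-score standardization (`StandardScaler`) is shown here as one common choice, on made-up feature values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Each row is one sample's feature vector (e.g., aligned peak areas);
# the numbers are synthetic, for illustration only.
X = np.array([[10.0, 200.0, 3.0],
              [12.0, 180.0, 2.5],
              [ 9.0, 220.0, 3.5]])

# z-score standardization: each feature ends up with zero mean and unit
# variance, removing scale differences between features.
scaler = StandardScaler()
Xz = scaler.fit_transform(X)
print(np.allclose(Xz.mean(axis=0), 0), np.allclose(Xz.std(axis=0), 1))
```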
Through these steps, the accuracy and consistency of the data are ensured after removing experimental errors and standardizing the measurements, thereby providing a reliable foundation for subsequent modeling analyses.
The gradient boosting mechanism employed by XGBoost effectively captures complex patterns in data, particularly in datasets with a large number of features and elevated levels of noise. This capability can significantly enhance classification accuracy. The built-in L1 and L2 regularization mechanisms in XGBoost effectively mitigate overfitting and improve the model's generalization.27 Furthermore, the model accommodates class imbalance in the data and can optimize classification performance by adjusting class weights.
In the classification task involving data from four enterprises, the initial focus should be on the processing of classification labels. In multi-class classification tasks, these labels serve to distinguish between different categories. The labels are abbreviations for the four enterprises: DC, HZ, JA, and QS.
XGBoost's software implementation requires numeric labels. When using the XGBoost model for a four-class classification problem, the categorical labels DC, HZ, JA, and QS must therefore be converted to numeric labels. This conversion is achieved through label encoding, where each categorical label is mapped to a unique integer: DC is encoded as 0, HZ as 1, JA as 2, and QS as 3. The labels must be numerically encoded starting from 0, which imposes an additional label-processing requirement. Furthermore, XGBoost offers a broader range of hyperparameters, and parameter tuning for multi-class classification problems may require more time and computational resources.
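The label-encoding step maps the enterprise abbreviations to consecutive integers starting at 0, for example with scikit-learn's `LabelEncoder` (one standard way to do this; the study does not name a specific tool):

```python
from sklearn.preprocessing import LabelEncoder

labels = ["DC", "HZ", "JA", "QS", "DC", "JA"]

# LabelEncoder assigns integers 0..n_classes-1 in sorted class order:
# DC -> 0, HZ -> 1, JA -> 2, QS -> 3, matching XGBoost's requirement.
enc = LabelEncoder()
y = enc.fit_transform(labels)
print(list(enc.classes_), list(y))  # ['DC', 'HZ', 'JA', 'QS'] [0, 1, 2, 3, 0, 2]
```

`enc.inverse_transform` recovers the original string labels from model predictions, so the integer codes never need to leak into reported results.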
Unlike XGBoost, RF supports categorical labels in string format directly, handling them via an internal encoding mechanism. This characteristic renders it particularly suitable for rapid classification tasks: regardless of whether the label is DC, HZ, JA, or QS, the RF model can seamlessly accept it as a classification label without requiring numerical conversion, which enhances convenience and facilitates swift modeling. Furthermore, RF demonstrates robust tolerance to noise and missing values in the dataset, effectively mitigating the risk of overfitting, especially when dealing with imperfect training data. Because the decision trees in the RF are trained independently, the algorithm benefits from parallel computing, thereby accelerating training and making it well-suited for large-scale datasets.
Although RF can assess feature importance, the overall model integrates multiple decision trees, making it challenging to directly interpret the decision-making process. Moreover, while increasing the number of trees can improve accuracy, it may not yield substantial performance gains in certain complex classification tasks.
Accuracy = TP/TS (1)

Precision = TP/(TP + FP) (2)

Recall = TP/(TP + FN) (3)

F1 = (2 × Precision × Recall)/(Precision + Recall) (4)
TP (True Positive): the number of samples correctly predicted as belonging to a given category (i.e., the value on the diagonal of the confusion matrix). When computing overall accuracy, TP is summed over all categories.
TS (Total Samples): the total number of all samples.
FN (False Negative): the number of samples with true labels in this category but predicted as belonging to another category (i.e., the sum of non-diagonal elements in this row).
FP (False Positive): the number of samples with true labels of other categories but incorrectly predicted as belonging to this category (i.e., the sum of off-diagonal elements in this column).
In this study, model performance is primarily evaluated using two indicators: accuracy and F1-score. These metrics provide crucial quantitative evidence of the model's classification capability, particularly in multi-class classification tasks, where they comprehensively reflect its performance.
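Both metrics follow directly from the confusion-matrix quantities defined above. As a small worked example (synthetic predictions, macro-averaged F1 shown as one common multi-class choice; the paper does not state its averaging scheme):

```python
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_true = ["DC", "DC", "HZ", "HZ", "JA", "JA", "QS", "QS"]
y_pred = ["DC", "DC", "HZ", "JA", "JA", "JA", "QS", "HZ"]

acc = accuracy_score(y_true, y_pred)              # diagonal sum / total samples
f1 = f1_score(y_true, y_pred, average="macro")    # per-class F1, averaged
cm = confusion_matrix(y_true, y_pred, labels=["DC", "HZ", "JA", "QS"])
print(acc)  # 6 of 8 correct -> 0.75
```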
Grid search is a hyperparameter optimization method that identifies the optimal hyperparameter combination by exhaustively evaluating all possible combinations within a specified hyperparameter space.31 This technique involves establishing a series of predefined candidate values for each hyperparameter of the model (such as learning rate, tree depth, etc.), subsequently training all possible combinations of these hyperparameters, evaluating the model's performance, and ultimately selecting the combination that yields the best performance. When the hyperparameter space is limited and computational resources permit, grid search proves to be an effective optimization method. It is particularly suitable for tasks characterized by relatively low computational costs and rapid model training.
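A grid search over an RF classifier can be sketched with scikit-learn's `GridSearchCV`; the synthetic data and the candidate values in `grid` are illustrative, not the study's settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic 4-class problem standing in for the chromatographic data.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=0)

# Exhaustive search over a small predefined grid, scored by 5-fold CV
# accuracy: every combination is trained and evaluated.
grid = {"n_estimators": [50, 100], "max_depth": [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid,
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_)
```

The cost grows multiplicatively with each added hyperparameter, which is why grid search suits the small, cheap-to-train spaces described above.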
Optuna is an efficient framework for automated hyperparameter optimization. Grounded in the principles of Bayesian optimization, Optuna aims to enhance the performance of machine learning models by intelligently searching for and optimizing their hyperparameters. Its core principle is to use historical experimental results to inform hyperparameter selection and achieve optimal convergence within the hyperparameter space through a systematic optimization process.32,33 Specifically, Optuna constructs a surrogate model, known as a probability model, based on the Tree-structured Parzen Estimator (TPE) algorithm. It adjusts the hyperparameter search strategy based on feedback from historical experiments. Each time new hyperparameters are explored, Optuna evaluates their impact on model performance and updates the search space accordingly, thereby increasing the search efficiency. Through this approach, the TPE algorithm can accurately estimate the optimal hyperparameter configuration with fewer trials, ultimately converging on the global optimal solution.34
| Enterprises | Total sample number | Training set | Testing set | Weight |
|---|---|---|---|---|
| DC | 99 | 79 | 20 | 1/79 |
| HZ | 68 | 54 | 14 | 1/54 |
| JA | 55 | 44 | 11 | 1/44 |
| QS | 68 | 55 | 13 | 1/55 |
To enhance the model's generalization and ensure stability across datasets, five-fold cross-validation is used as the primary evaluation method. The five-fold cross-validation divides the training set into five subsets and performs the following steps:
(1) Data partitioning: first, the dataset is randomly divided into five non-overlapping subsets. Each subset is of equal size and preserves the class distribution of the original dataset, thereby mitigating the potential effects of class imbalance.
(2) Training and validation: in each round, four subsets form the training set while the remaining subset serves as the validation set. Consequently, the model is trained and evaluated five times, using different training and validation sets each time.
(3) Performance evaluation: after each training round, the performance indicators (such as accuracy, F1-score, etc.) of the model on the validation set are calculated. The overall performance of the model is represented by the average results of these five evaluations, which ensures the stability and reliability of the model assessment.
The five-fold cross-validation method effectively mitigates the risk of model overfitting. Each sample is evaluated across multiple training and validation sets, thereby enhancing the model's generalization. The model's overall performance is derived by averaging the evaluation results across folds, making it a more reliable approach than a single training-test set partition.
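The three steps above correspond to stratified five-fold cross-validation, which can be sketched with scikit-learn; the synthetic data only mimics the training-set size (232 samples) and rough class balance of Table 2:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 4-class data approximating the 232-sample training set.
X, y = make_classification(n_samples=232, n_features=20, n_informative=8,
                           n_classes=4, weights=[0.34, 0.23, 0.19, 0.24],
                           random_state=0)

# StratifiedKFold preserves the class proportions of the full training
# set in every fold, as described in step (1).
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print(len(scores), round(scores.mean(), 4))  # 5 folds, averaged per step (3)
```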
Fig. 3 (a) Raw chromatograms of the four companies, (b) standardized chromatograms of the four companies.
Using the Optuna algorithm, we obtained the optimal values for the model's key hyperparameters: “n_estimators = 469”, “max_depth = 40”, “min_samples_split = 1”, “min_samples_leaf = 1”, and “max_features = sqrt”. Increasing n_estimators reduces model variance by adding more trees. The max_depth parameter limits the maximum depth of each tree, preventing the model from becoming overly complex and overfitting. The min_samples_split parameter prevents over-splitting by setting the minimum number of samples required to split a node, and min_samples_leaf further enhances generalization by setting the minimum number of samples per leaf node. Finally, max_features increases tree diversity by limiting the number of features considered at each split, reducing computational cost while suppressing overfitting. Together, these hyperparameters control model complexity and improve generalization.
After importing the optimal parameters, the average accuracy achieved through five-fold cross-validation is 86.19%. In addition, as shown in Fig. 4, which presents the optimized confusion matrix, the test set accuracy increased from 82.76% to 84.48%.
With the optimized parameters in place, we next examined and compared different preprocessing methods: D1st, D2nd, BC, and SG smoothing. Using the average accuracy of five-fold cross-validation as the criterion, the results of the different preprocessing methods are summarized in Table 3.
| Method | Accuracycva | F1cvb | AccuracyPc | F1Pd |
|---|---|---|---|---|
| a Accuracycv denotes the average accuracy of five-fold cross-validation.b F1cv denotes the average F1 score of 5-fold cross-validation.c AccuracyP denotes the accuracy of the test set.d F1P denotes the F1 score of the test set. | ||||
| Original | 86.20% | 0.8548 | 84.48% | 0.8244 |
| SG | 95.29% | 0.9490 | 96.55% | 0.9594 |
| BC | 85.36% | 0.8492 | 75.86% | 0.7468 |
| D2nd | 57.30% | 0.5391 | 68.97% | 0.6523 |
| D1st | 53.07% | 0.4855 | 60.34% | 0.5887 |
According to Table 3, the SG smoothing method achieved the greatest improvement in accuracy during five-fold cross-validation. The window size and polynomial order for SG smoothing were optimized with Optuna, using a grid search to enumerate all parameter settings, and the optimal configuration was identified as a window size of 11 and a polynomial order of 3. Following this optimization, the average accuracy in five-fold cross-validation was 97.85%, suggesting that the model remains stable across different data subsets. After training the model on the complete training set, Fig. 5 shows the identification results on the test set after SG smoothing: the model achieved 98.28% accuracy, demonstrating outstanding predictive performance. The accuracy of both the internal five-fold cross-validation and the external test set validation exceeded 97%, indicating that the model did not overfit. Furthermore, the model's F1-score was 0.9799, indicating a commendable balance between precision and recall and demonstrating its effectiveness in identifying samples.
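SG smoothing with the configuration found optimal here (window length 11, polynomial order 3) is available as `scipy.signal.savgol_filter`; the noisy trace below is synthetic:

```python
import numpy as np
from scipy.signal import savgol_filter

# Noisy synthetic chromatogram segment.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 1000)
clean = np.exp(-((t - 5) / 0.3) ** 2)
noisy = clean + rng.normal(0, 0.03, t.size)

# SG smoothing with the study's optimal configuration: an 11-point
# window and a 3rd-order polynomial.
smoothed = savgol_filter(noisy, window_length=11, polyorder=3)

# Smoothing should reduce the RMS deviation from the clean signal
# without noticeably distorting the (much wider) peak.
rms_before = np.sqrt(np.mean((noisy - clean) ** 2))
rms_after = np.sqrt(np.mean((smoothed - clean) ** 2))
print(rms_after < rms_before)
```

A short window relative to the peak width is what lets the filter suppress noise while leaving peak heights and areas essentially intact.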
In March 2026, two newly collected batches of samples from four enterprises were used to validate the model's performance. The HPLC data from the newly collected samples were preprocessed using standardization and SG smoothing, without the need to recalculate standard retention times. The RF model achieved an overall accuracy of 95.08% and an F1 score of 0.9475, indicating excellent model performance in identifying industrial wastewater from different companies.
Pairwise cross-mixing of the final wastewater discharge outlet samples from the newly collected first batch of the four enterprises was conducted at a 1:1 volume ratio to investigate the impact of mixed samples on the model. The identification results for the different mixtures are presented in Table 4; for each mixed sample, the matching degrees across the four enterprises sum to 1. Among the mixed samples, the matching degree for QS was consistently low, indicating that the identification performance for QS in mixtures was not ideal. Therefore, the RF classification model currently has limitations in identifying mixed samples.
| Mixed samples (1 : 1, v/v) | DC | HZ | JA | QS |
|---|---|---|---|---|
| DC : HZ | 0.2768 | 0.5071 | 0.1121 | 0.1040 |
| DC : JA | 0.5636 | 0.0684 | 0.3382 | 0.0622 |
| DC : QS | 0.5694 | 0.0501 | 0.3447 | 0.0358 |
| HZ : JA | 0.2580 | 0.4234 | 0.2750 | 0.0508 |
| HZ : QS | 0.1531 | 0.5305 | 0.2538 | 0.0626 |
| JA : QS | 0.1940 | 0.0369 | 0.5695 | 0.1996 |
Despite these encouraging results, opportunities for further improvement remain. The model exhibited unsatisfactory identification performance for QS in mixed samples, revealing certain limitations. The current dataset is relatively small and exhibits a slight class imbalance. Future work will focus on expanding the sample pool, diversifying the range of industrial categories, and conducting research on cross-source mixed samples to enhance the model's generalization. Additionally, the framework will be extended beyond classification to quantitative analysis by leveraging chromatographic peak areas, thereby broadening its practical utility. The success of this study, achieved through basic preprocessing and parameter tuning alone, highlights the substantial potential of machine learning in chromatographic water sample analysis. Further gains are expected by integrating advanced preprocessing pipelines, feature engineering strategies, and ensemble techniques to handle increasingly complex real-world datasets.
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d6ra00080k.
| This journal is © The Royal Society of Chemistry 2026 |