Research on a small-concentration chemical oxygen demand prediction algorithm based on an enhanced parrot optimizer–BPNN and ultraviolet-visible spectroscopy

Hongmei Wang , Qiaoling Du * and Xin Wang
State Key Laboratory on Integrated Optoelectronics, College of Electronic Science and Engineering, Jilin University, Changchun 130012, China. E-mail: 1056205694@qq.com; duql@jlu.edu.cn; 1426210775@qq.com

Received 9th September 2025 , Accepted 23rd November 2025

First published on 16th December 2025


Abstract

Purpose: determining small concentrations of chemical oxygen demand (COD) is crucial for domestic drinking water safety. Ultraviolet-visible spectroscopy (UV-vis spectroscopy) is important for COD determination, but the multi-wavelength method has low accuracy and stability for small-concentration COD because of turbidity interference. This paper presents an enhanced parrot optimizer (EPO) algorithm that optimizes the parameters of a back propagation neural network (BPNN) to improve the accuracy and stability of small-concentration COD prediction. Results: firstly, the EPO algorithm uses the LHS population initialization strategy, which generates the initial population by Latin hypercube sampling (LHS) and improves population diversity from the outset; secondly, the EPO algorithm adopts the persistence-random-boundary (PRB) position update strategy, which improves the position update formula of the staying phase and integrates the idea of simulated annealing to dynamically adjust the search step length, balancing global exploration and local exploitation; finally, this article proposes the contraction and whirl (CAW) individual elimination strategy, which combines the elite retention logic of the whale optimization algorithm (WOA) to periodically eliminate inferior individuals, avoid premature convergence, and strengthen the evolutionary momentum of the population. Together, these strategies optimize the weights and thresholds of the BPNN and yield a small-concentration COD prediction model that is resistant to low-turbidity interference. 
The core logic of the model's resistance to turbidity interference is that the BPNN simultaneously learns the mapping relationship of “COD concentration – turbidity concentration – spectrum” and automatically identifies and deducts the contribution of turbidity to the spectrum when predicting COD, thereby offsetting its nonlinear interference and ultimately achieving accurate prediction of low-concentration COD. Conclusions: the EPO–BPNN model is outstanding in convergence speed and accuracy. On the standard drinking water quality simulation data set, the coefficient of determination (R²) reached 0.9976, the root mean square error (RMSE) was as low as 0.3930 mg L−1, the mean absolute percentage error (MAPE) was only 3.47%, the percentage bias (PBIAS) was −0.081%, and the maximum relative standard deviation (RSD) was 2.26% (<3%). On the reservoir monitoring data, which are subject to interference from multiple substances, the standard deviation (SD) of the COD concentration values predicted by the model was 0.2876 and 0.3437, respectively; these fluctuations were 81.88% and 79.61% lower than those of the traditional model.



Water impact

This paper proposes the EPO–BPNN model to improve the detection accuracy and anti-interference ability of low-concentration COD, providing a reliable technical solution for drinking water safety monitoring and helping to deepen the scientific understanding of water quality control and its related water environment impacts.

1. Introduction

The detection of small concentrations of chemical oxygen demand (COD) is of great significance to guarantee the safety of domestic drinking water. Song et al. simulated the summer and winter environment in the same region and found that the COD concentration under high temperature conditions was lower than that under low temperature conditions,1 which directly reflects the impact of seasons and climate on the water environment. The study by Chen on flood disasters2 also showed that the COD concentration during disasters skyrocketed from 15 mg L−1 to 40 mg L−1, further confirming the significant role of climate change in COD concentration in water bodies. It can be seen that the prediction research of COD concentration under different climatic conditions is of great significance. This study mainly focuses on the COD concentration during the normal water period of drinking water reservoirs. The ultraviolet-visible spectroscopy (UV-vis spectroscopy) method has the advantages of fast detection speed, real-time, low cost, no pollution, etc.,3 and has been widely used in recent years for the determination of small COD concentration values in drinking water.4,5 However, the spectral range of UV-vis spectrometry acquisition is 200–800 nm. The total amount of spectral information is huge.6 Therefore, how to select effective wavelengths to construct the regression model is a key issue in improving the accuracy of the COD prediction algorithm at small concentrations. 
The small concentration COD prediction models based on UV-vis spectroscopy mainly include five types: single-wavelength, dual-wavelength, multi-wavelength, sub-interval, and full spectrum.7–11 Among them, the multi-wavelength method can reflect the characteristics of organic matter more comprehensively and has the advantages of less computation and faster prediction.12 However, in practical applications, the presence of turbidity and other interfering substances in the water leads to a complex non-linear relationship between COD concentration values and UV-visible absorbance values;13 moreover, according to the Chinese Environmental Quality Standards for Surface Water (GB3838-2002) and Groundwater Quality Standards (GB/T 14848-2017) for the water quality parameters, the upper limits of COD and scattered turbidity in drinking water are 15 mg L−1 and 3 NTU, respectively, which are small values and difficult to measure accurately. The above reasons affect the accuracy and stability of the detection of small concentration COD values in drinking water.

BPNNs can deal with nonlinear relationships in data, are adaptable to small sample data sets, and can achieve high prediction performance using a small number of feature wavelengths.14 This method can be used to solve the problem of poor accuracy and stability in the detection of small-concentration COD values in drinking water with turbidity interference. In the field of water quality concentration prediction, prediction methods that are accurate and oriented towards practical application demands have attracted much attention. Existing research has explored the innovation and integration of multiple types of prediction frameworks, such as integrating multimodal data through basic models and optimizing cross-domain generalization capabilities to enhance the reliability of medical scenario predictions,15 integrating traditional probabilistic models with machine learning methods and constructing profit-oriented prediction frameworks to strengthen the practical decision-making value of prediction results,16 or using swarm intelligence algorithms to optimize the network model, focusing on the improvement of predictive capabilities in specific environments.17–20 Drawing on the prediction research approaches of “method fusion optimization” and “focusing on actual needs”, this paper adopts the optimized PO algorithm to improve the BPNN model.

The parrot optimizer (PO) is an emerging swarm intelligence optimization algorithm proposed in 2024, which demonstrates superior performance on the classic CEC2022 test set, highlighting the strong ability of the parrot optimizer to deal with nonlinear relationships.21 PO mimics the intelligent behaviors of parrots in foraging and socializing, and it has a strong global search capacity. It can better balance the exploration of new regions and the use of existing advantageous solutions in limited samples and complex feature data, avoiding prematurely falling into local optima, so as to more accurately search for the appropriate combination of weights and thresholds and improve the performance of the BPNN algorithm. In this paper, a training set of 90 samples is utilized to detect small concentration COD values, which is a small number of samples. Predicting COD concentration using 20 features in the training set, the optimized weights and threshold dimensions are relatively high. The PO algorithm shows a unique advantage in this case. Therefore, this paper investigates the prediction method of the PO–BPNN for small-concentration COD values to improve the accuracy of small-concentration COD prediction algorithms. However, the PO algorithm also has some drawbacks. At the start of the algorithm, the conventional population initialization method may result in a lack of initial population diversity, which makes it impossible to adequately sample each region at the beginning of the solution space exploration. This leads to the possibility that some potentially high-quality solution regions may be overlooked, which in turn affects the efficiency of the algorithm in utilizing the entire solution space. As the search advances, the limitations of the PO algorithm's search range gradually emerge. 
When encountering a more complex parameter space with a discrete distribution of solutions, it is difficult to effectively expand to a wider region to explore possible better solutions, resulting in a relatively limited search path that is unable to fully explore all possible combinations of high-quality weights and thresholds. At the later stage of iteration, the convergence of the PO algorithm slows down significantly. In data optimization tasks that require timeliness, this slow convergence delays the overall optimization process and reduces efficiency. To solve the above problems, this paper proposes an improved EPO algorithm to optimize the BPNN and construct a quantitative prediction algorithm for small-concentration COD, improving the detection accuracy and stability of small-concentration COD values in mixed solutions. Firstly, the EPO algorithm is proposed, which introduces the LHS population initialization strategy, the PRB position update strategy, and the CAW individual elimination strategy to improve the performance of the PO algorithm. Secondly, this paper constructs a prediction algorithm for small-concentration COD values based on the EPO–BPNN. Compared with other methods, the EPO–BPNN algorithm improves the detection accuracy of small-concentration COD values in mixed solutions and has high stability.

2. Study area

Research area. The Xinlicheng Reservoir is located in the southern suburbs of Changchun City, Jilin Province, China. It is the most important and largest centralized drinking water source in the city. It is located in the middle and upper reaches of the Yitong River, a tributary of the Yinma River in the Songhua River system, with a total storage capacity of 592 million cubic meters. It is a large reservoir mainly used to ensure the supply of urban drinking water, while also considering comprehensive utilization functions such as aquaculture. The water quality of the reservoir is excellent and stable throughout the year, meeting the national drinking water source class III standards. This study selected two representative measurement stations, namely the water inlet and ski resort (Fig. 1). The geographical location of the water inlet is 43.701360 north latitude and 125.346786 east longitude, with an average elevation of 234.94 meters; the ski resort is located at 43.683613 N latitude and 125.330914 E longitude, with an average elevation of 232.98 meters. The area belongs to a temperate continental monsoon climate with distinct four seasons, with cold and dry winters and warm and humid summers. Due to the influence of this climate, the water level and quality of the reservoir vary significantly with the seasons, divided into “glacial period, dry season, and normal season”. The sampling data were collected during the normal water period on June 24, 2025, with a sampling environment temperature of 32 °C.
Fig. 1 Study area depicting two selected stations.

The data sources for this study are divided into three parts: elevation and base map data, simulation experiment data, and reservoir sampling data.

Elevation data. The elevation data are sourced from the Geospatial Data Cloud and use the SRTMDEMUTM 90 m resolution digital elevation data product; part of the base map data comes from China's basic data, and the water system data of the study area are extracted through ArcGIS.
Simulated experimental data. This study simulates the COD environment of drinking water in the laboratory: the COD standard solution was prepared according to the National Environmental Protection Standard of the People's Republic of China (HJ 828-2017), with a total of 10 portions and a concentration range of 0–20 mg L−1, decreasing in a gradient of 2 mg L−1. The turbidity solution samples were prepared by diluting a 400 NTU formazin standard solution according to the GBW (E) 084169 standard, with a concentration range of 0–5 mg L−1 and a gradient of 0.5 mg L−1, in a total of 10 portions. After mixing the COD solutions with the turbidity solutions, 100 sets of mixed samples were obtained. The spectra were measured with a Shimadzu UV-1900i UV-vis spectrophotometer (Japan). The instrument uses a deuterium–halogen lamp as the light source, covering the entire UV-vis spectral range. The detection spectral range was 210–700 nm, with a scanning step size of 1 nm and deionized water as the background reference solution. The experiment was conducted in a dark room at a temperature of 20 °C (±0.5 °C) and a humidity of 35% (±5%). The built-in smoothing filter of the spectrometer system software was used, with relevant parameters selected, to eliminate background noise. The spectral data of the mixed solutions constitute the simulated experimental data.
Reservoir sampling data. The sampling points of each reservoir are located in the deep-water area. A water sampler was used to collect 1500 mL of water from the upper layer of the water body at each sampling point. After sealing, the samples were immediately sent back to the laboratory for testing of water quality indicators. On site, an LS310 water quality detector was used to measure and record the COD concentration, and the Shimadzu UV-1900i UV-vis spectrophotometer was used to measure the spectral data of the water samples.

3. Materials and methods

3.1. BPNN

In 1986, Rumelhart proposed the BPNN,22 which is a multilayer feedforward neural network. It usually consists of an input layer, a hidden layer, and an output layer. Its core idea is to predict the output of the network by forward propagation through the initialized values of weights and thresholds, and then calculate the error layer by layer by backpropagation, correcting the weights and threshold matrices with the help of the gradient values until the loss function meets the termination conditions.

This paper selects the BPNN as the basic prediction model. The core basis lies in its adaptability to the detection scenarios of low-concentration COD and its efficient synergy with the EPO optimization strategy, specifically as follows:

1. Scene task adaptability: this study requires the inversion of low-concentration COD from ultraviolet spectral data with turbidity interference, which has the characteristics of low signal and high interference. Essentially, it is a nonlinear mapping from high-dimensional spectra to low concentration values. The BPNN, through multi-layer nonlinear transformation, can effectively capture the weak correlation between spectra and COD, can meet the amplification requirements of low-concentration signals, and does not require presetting mapping relationships. It can adaptively learn the difference patterns between turbidity interference and COD signals.

2. Advantages of collaborative optimization with the EPO: the performance of the BPNN depends on the initial weights and thresholds. The traditional gradient descent is prone to fall into the local optimum, resulting in prediction deviations for low-concentration COD. The global optimization feature of the EPO can specifically solve this problem. Meanwhile, the BPNN has a simple structure and fewer parameters compared to deep learning models, matching the optimization efficiency of the EPO: it can not only precisely optimize key parameters through the EPO but also avoid a sharp increase in optimization costs due to excessive parameters, ensuring efficient convergence to a stable solution with small samples.

The BPNN can solve the problem of nonlinear data processing, but there are problems such as slow convergence speed and ease of falling into local optimal solutions. The main cause of these problems is the initial values of weights and thresholds. In the prediction of COD concentration, the spectral information dimension is high, and the initial weights and thresholds of the BPNN are mostly taken randomly, which will lead to a slow convergence speed of the model, and it is difficult to jump out of the local optimal solution, affecting the prediction effect. Therefore, to improve the model prediction ability and quickly find the global optimal solution, this paper adopts the enhanced PO algorithm to optimize the parameters of the initial values of weights and thresholds.

In this study, a standard three-layer BPNN was used to construct the model. The number of neuron nodes in the input layer was set to 20. The characteristic wavelengths are shown in Table 1. The number of neuron nodes in the hidden layer was determined to be 7 with the help of empirical formulae. The number of neuron nodes in the output layer was 2. The transformed outputs corresponded to the predicted values of turbidity and COD concentration.

Table 1 Characteristic wavelengths
Num Wavelength Num Wavelength Num Wavelength Num Wavelength
1 200 nm 6 224 nm 11 306 nm 16 311 nm
2 201 nm 7 225 nm 12 307 nm 17 312 nm
3 202 nm 8 226 nm 13 308 nm 18 313 nm
4 222 nm 9 227 nm 14 309 nm 19 314 nm
5 223 nm 10 228 nm 15 310 nm 20 316 nm


The BPNN is fully connected between adjacent layers, with no connections between neurons in the same layer. The hidden layer activation function was the hyperbolic tangent function (tanh), and the output layer activation function was the rectified linear unit (ReLU). The data set was divided into a training set and a test set at a ratio of 9:1. The experimental code was written in Python 3.7. For the hardware environment, the CPU was an Intel(R) Core(TM) i7-9750H CPU @ 2.60 GHz 2.59 GHz. The Python environment was configured with third-party libraries such as pandas, NumPy, scikit-learn, and pyDOE.
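To make the size of the optimization problem concrete, the following minimal sketch builds the 20–7–2 network described above with a tanh hidden layer and a ReLU output layer, and counts the weights and thresholds a swarm optimizer would have to search (20 × 7 + 7 + 7 × 2 + 2 = 163). The function names and the packing order of the flat parameter vector are assumptions made for illustration; they are not taken from the authors' code.

```python
import numpy as np

N_IN, N_HID, N_OUT = 20, 7, 2  # layer sizes stated in the text

# Number of weights and thresholds in the network: 140 + 7 + 14 + 2 = 163.
N_PARAMS = N_IN * N_HID + N_HID + N_HID * N_OUT + N_OUT


def decode(theta):
    """Split a flat parameter vector into weight matrices and threshold vectors.

    The packing order (W1, b1, W2, b2) is an assumption of this sketch.
    """
    theta = np.asarray(theta, dtype=float)
    if theta.size != N_PARAMS:
        raise ValueError(f"expected {N_PARAMS} parameters, got {theta.size}")
    sizes = [N_IN * N_HID, N_HID, N_HID * N_OUT]
    w1, b1, w2, b2 = np.split(theta, np.cumsum(sizes))
    return w1.reshape(N_IN, N_HID), b1, w2.reshape(N_HID, N_OUT), b2


def forward(theta, x):
    """Forward pass: tanh hidden layer followed by a ReLU output layer."""
    w1, b1, w2, b2 = decode(theta)
    hidden = np.tanh(np.asarray(x, dtype=float) @ w1 + b1)
    return np.maximum(hidden @ w2 + b2, 0.0)
```

With this layout, each candidate solution of the optimizer is one vector of length 163, and `forward` returns one row of two non-negative outputs per input spectrum.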

3.2. PO algorithm

In 2024, Junbo Lian et al. proposed the PO algorithm,21 inspired by parrots' habits of foraging, staying, communicating, and fear of the unfamiliar. The algorithm uses the cooperative and competitive relationships of individual parrots in a group to find an optimal solution to the problem. The specific process is as follows:
(1) Population initialization. During the initialization of the parrot population, it is necessary to ensure that the initial position of the parrot population is within reasonable upper and lower boundary limits. The initial position is shown in eqn (1).
 
$x_i^0 = lb + \mathrm{rand}(0, 1) \times (ub - lb)$ (1)

In eqn (1), $x_i^0$ denotes the initial position of the i-th individual in the parrot population, ub and lb denote the upper and lower bounds of the search space, and rand(0, 1) denotes a random number between 0 and 1.
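As an illustration of eqn (1), the sketch below draws a whole population at once with NumPy. The function name `random_init` and the choice to draw an independent random number for every individual and every dimension are assumptions of this example.

```python
import numpy as np


def random_init(n, dim, lb, ub, rng=None):
    """Uniform random initialization: lb + rand(0, 1) * (ub - lb).

    lb and ub may be scalars or arrays of length dim.
    """
    rng = np.random.default_rng() if rng is None else rng
    lb = np.broadcast_to(np.asarray(lb, dtype=float), (dim,))
    ub = np.broadcast_to(np.asarray(ub, dtype=float), (dim,))
    # One independent draw per individual and per dimension.
    return lb + rng.random((n, dim)) * (ub - lb)
```

Because every coordinate is drawn independently, nothing prevents several individuals from landing in the same part of the search space, which is the diversity issue discussed later in the text.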

(2) Foraging behavior. Foraging behavior is the process of determining the approximate location of the food by observing the position of the owner and then adjusting the position of the individual during the foraging process of the parrot. The position update is shown in eqn (2). Levy(dim) is used to describe the flight process of the parrot during the foraging process as shown in eqn (3).
 
image file: d5ew00882d-t1.tif(2)
 
image file: d5ew00882d-t2.tif(3)

$x_i^t$ denotes the current position of the i-th individual, and $x_i^{t+1}$ denotes its position at the next iteration. $x_{best}$ denotes the best position found so far, which also represents the position of the parrot's owner. t and Max_iter denote the current iteration and the maximum number of iterations. dim denotes the dimensionality of the problem under study. γ is set to 1.5.

 
image file: d5ew00882d-t3.tif(4)

$x_{mean}^t$ denotes the average position of the current population, as shown in eqn (4). N is the number of individuals in the parrot population, and k indexes the k-th individual in the population.

(3) Staying behavior. Staying behavior simulates the real-life process of a parrot flying towards its owner and staying at a random position on the owner's body.
 
$x_i^{t+1} = x_i^t + x_{best} \times \mathrm{Levy}(dim) + \mathrm{rand}(0, 1) \times \mathrm{ones}(dim)$ (5)

In eqn (5), $x_{best} \times \mathrm{Levy}(dim)$ denotes flying to the current best position, i.e., the owner's position. ones(dim) denotes an all-ones vector of dimension dim. rand(0, 1) × ones(dim) denotes staying at a random position on the owner's body.
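The staying update of eqn (5) can be sketched as follows. The form of Levy(dim) in eqn (3) is not reproduced here, so this example assumes the commonly used Mantegna construction of a Lévy step with exponent γ = 1.5; that choice and the function names `levy` and `staying_update` are assumptions of the sketch, not the authors' implementation.

```python
import math

import numpy as np

GAMMA = 1.5  # exponent value stated in the text


def levy(dim, rng, gamma=GAMMA):
    """Levy step of length dim (Mantegna's construction, assumed here)."""
    sigma = (
        math.gamma(1 + gamma) * math.sin(math.pi * gamma / 2)
        / (math.gamma((1 + gamma) / 2) * gamma * 2 ** ((gamma - 1) / 2))
    ) ** (1 / gamma)
    u = rng.normal(0.0, sigma, dim)
    v = rng.normal(0.0, 1.0, dim)
    return u / np.abs(v) ** (1 / gamma)


def staying_update(x_i, x_best, rng):
    """Eqn (5): x_i + x_best * Levy(dim) + rand(0, 1) * ones(dim)."""
    dim = x_i.shape[0]
    # A single random scalar shifts every coordinate by the same amount.
    return x_i + x_best * levy(dim, rng) + rng.random() * np.ones(dim)
```

The last term adds the same random offset to every dimension, which corresponds to the description of staying at a random position on the owner's body.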

(4) Communication behavior. Communicative behavior mimics the socialization behavior of parrots, i.e. the choice to fly towards or away from the flock when communicating within the flock, with both behaviors occurring with equal probability. The communication behavior was simulated by generating random numbers as shown in eqn (6).
 
image file: d5ew00882d-t4.tif(6)
(5) Fearful behavior towards strangers. Fearful behavior towards strangers reflects the natural habit of parrots to seek a safe environment away from unfamiliar environments as shown in eqn (7).
 
image file: d5ew00882d-t5.tif(7)

The PO algorithm demonstrates superior performance on the classic CEC2022 test set and is well suited to limited samples with complex feature data. Therefore, the PO can be combined with the BPNN to optimize the performance of the neural network. However, the conventional population initialization method used at start-up may result in a lack of initial population diversity, so that potentially high-quality solution regions may be overlooked. As the search progresses, the limitations of the search range gradually appear: when the parameter space is complex and the solutions are discretely distributed, it is difficult for the algorithm to expand to a wider region in search of better solutions, so the search path remains relatively limited and cannot fully explore all possible combinations of high-quality weights and thresholds. In late iterations, the convergence of the algorithm slows down significantly, which delays the overall optimization process and reduces efficiency in time-sensitive data optimization tasks. Therefore, appropriate improvements to the PO algorithm are needed, which are described in detail in section 3.3 of this paper.

The pseudocode of the PO algorithm is as follows:

Algorithm 1: PO algorithm
Initialize the PO parameters
Initialize the solutions' positions randomly
For i = 1:Max_iter do
    Calculate the fitness function
    Find the best position
    For j = 1:N do
        St = randi([1, 4])
        If St == 1 then (Behavior 1: the foraging behavior)
            Update position by eqn (2)
        Else if St == 2 then (Behavior 2: the staying behavior)
            Update position by eqn (5)
        Else if St == 3 then (Behavior 3: the communicating behavior)
            Update position by eqn (6)
        Else if St == 4 then (Behavior 4: fearful behavior towards strangers)
            Update position by eqn (7)
        End
    End
End
Return the best solution

3.3. EPO algorithm

To address the PO algorithm's insufficient initial population diversity, limited search range, and slow convergence, the EPO algorithm adopts three corresponding improvement strategies: the LHS population initialization strategy, the PRB position update strategy, and the CAW individual elimination strategy.
3.3.1. Strategy I: LHS population initialization strategy. McKay et al. proposed Latin hypercube sampling (LHS)23 in 1979. The LHS initialization strategy can generate uniformly distributed population individuals in multi-dimensional space, thus covering the initial solution space more comprehensively, effectively solving the problem of under-exploration of the initial solution space, and greatly improving the initial diversity of the population, so that the algorithm can explore the solution space more extensively at the beginning stage. In this paper, LHS initialization is adopted instead of random initialization. The specific steps of this sampling strategy are as follows:

(1) Determine the population size N and dimensions dim of the problem.

(2) Determine the upper ub and lower limits lb for each dimension.

(3) Divide each dimension [lb, ub] interval into N equal parts.

(4) Randomly sample a point in a sub-interval image file: d5ew00882d-t6.tif of each dimension.

(5) Combine the sampled points of each dimension to form the initial population.
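The five steps above can be sketched in a few lines of NumPy. The paper's environment lists pyDOE, which offers its own Latin hypercube routine; the version below avoids that dependency and is only an illustration, with the function name `lhs_init` and the random pairing of sub-intervals across dimensions being assumptions of this sketch.

```python
import numpy as np


def lhs_init(n, dim, lb, ub, rng=None):
    """Latin hypercube initialization of n individuals in dim dimensions."""
    rng = np.random.default_rng() if rng is None else rng
    lb = np.broadcast_to(np.asarray(lb, dtype=float), (dim,))
    ub = np.broadcast_to(np.asarray(ub, dtype=float), (dim,))
    # Steps 3-4: one uniform draw inside each of the n equal sub-intervals
    # of [0, 1), separately for every dimension.
    unit = (np.arange(n)[:, None] + rng.random((n, dim))) / n
    # Step 5: shuffle each column independently so that sub-intervals are
    # paired at random across dimensions when individuals are assembled.
    for j in range(dim):
        unit[:, j] = rng.permutation(unit[:, j])
    # Map the unit hypercube onto [lb, ub].
    return lb + unit * (ub - lb)
```

By construction, every one of the N sub-intervals of each dimension contains exactly one individual, which is what distinguishes this initialization from independent uniform sampling.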

3.3.2. Strategy II: PRB position update strategy. The PRB position update strategy is mainly used to solve the problem of insufficient search range in the PO algorithm. It contains three parts: modifying the position update formula in the staying phase, random perturbation, and boundary neighborhood update.
(1) Modifying the position update formula in the staying phase. Based on the idea of simulated annealing, rand(0, 1) in eqn (5) is modified to image file: d5ew00882d-t7.tif to obtain eqn (8).
 
image file: d5ew00882d-t8.tif(8)

Using eqn (8), it is possible to improve the convergence accuracy of the algorithm. At the beginning of the iterations, this modification allows the algorithm to apply the staying strategy with a large random perturbation, to better explore the solution space, and to exhibit a strong global search capability. As the number of iterations increases, the magnitude of the random search gradually decreases, which enhances the local search capability and allows the algorithm to focus more on fine search near the optimal solution. A balance between global and local search is achieved, effectively expanding the population search range.


(2) Random perturbation. The perturbation strategy is used to improve the four position update strategies of the PO algorithm for foraging, communication, staying, and fear. The introduction of the random perturbation strategy increases the search range of population individuals and helps to explore more of the solution space, as shown in eqn (9).
 
image file: d5ew00882d-t9.tif(9)
 
image file: d5ew00882d-t10.tif(10)

(3) Boundary neighborhood update. The boundary neighborhood update strategy is introduced to replace the boundary update strategy of the PO algorithm after the individual position update, as shown in eqn (11). This strategy can deal with the boundary situation more effectively and prevent the individual positions from concentrating too much on the boundary, further enhancing the search range of the population.
 
image file: d5ew00882d-t11.tif(11)

The PRB position update strategy significantly improves the search range of the algorithm and enhances the exploration of the complex parameter space by improving the above three aspects. At the same time, the PRB position update strategy regulates the randomness and certainty in the search process through a reasonable mechanism, so that the algorithm can search methodically when facing the complex parameter space, which in turn ensures the stability of the algorithm.

3.3.3. Strategy III: CAW individual elimination strategy. The whale optimization algorithm (WOA) has the advantage of fast convergence. Compared with some complex optimization algorithms, it is relatively simple to implement and easy to operate, requiring neither excessive parameter adjustment nor complex mathematical models. The pseudocode of the WOA is given in Algorithm 2. Building on the WOA, this paper improves the original PO algorithm. Firstly, this paper draws on the shrink–wrap and spiral position update strategies of the WOA. Then, an elimination mechanism is added on this basis to form the CAW individual elimination strategy, which further improves the convergence speed of the EPO algorithm. The specific process is as follows:

(1) Select N/2 individuals in a population of parrots to implement a culling mechanism.

(2) For the selected individuals, judge the values of A and P.

When P < 0.5 and A ≤ 1, use the shrink–wrap mechanism to update the position of the individual at the next moment, as shown in eqn (13).

When P ≥ 0.5 and A ≤ 1, the spiral update mechanism is used to update the position of the individual at the next moment, as shown in eqn (13). In eqn (13), b is a constant that defines the shape of the logarithmic spiral. l is a random number between [−1, 1].

When A > 1, the best replacement mechanism is used to update the next moment position of an individual. That is, for the parrot individual selected to implement the elimination mechanism, its next moment position is updated as the best position of the current population.

(3) Calculate the value of the objective function corresponding to the individual after the position update.

 
image file: d5ew00882d-t12.tif(12)
 
image file: d5ew00882d-t13.tif(13)
 
image file: d5ew00882d-t14.tif(14)

Algorithm 2: WOA algorithm
Initialize the WOA parameters
Initialize the solutions' positions randomly
For i = 1:Max_iter do
    Generate random numbers and calculate the convergence factor a
    Update individual positions based on the shrink–wrap and spiral position update strategies
    Check boundaries, preserve better solutions, and update the global optimal individual
End
Return the best solution

As shown in eqn (14), if the objective function value of an individual decreases after the position update, its position and objective function value are updated to the new values. Otherwise, its position is set to the position of the optimal individual, and its objective function value is set to the optimal objective function value. Eqn (14) thus eliminates some of the current non-optimal solutions and replaces them with the current optimal solution. This operation concentrates the search in the local region around the current optimal solution, which accelerates convergence and improves both the accuracy and the speed with which the algorithm finds the global optimal solution.
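The elimination procedure described in this subsection can be sketched as below. Eqns (12)–(14) are not reproduced here, so the sketch substitutes the standard WOA forms for the coefficients A and C, the shrink–wrap step, and the logarithmic spiral step; it also compares |A| with 1 and picks the N/2 individuals at random. All of these choices, together with the function name `caw_elimination`, are assumptions for illustration rather than the authors' exact formulas.

```python
import numpy as np


def caw_elimination(pop, fit, objective, t, max_iter, rng, b=1.0):
    """One CAW pass over N/2 randomly chosen individuals (minimization).

    pop is an (N, dim) float array and fit a length-N float array; both are
    modified in place and returned.
    """
    n = pop.shape[0]
    best = int(np.argmin(fit))
    x_best, f_best = pop[best].copy(), float(fit[best])
    a = 2.0 * (1.0 - t / max_iter)  # standard WOA convergence factor (assumed)
    for i in rng.choice(n, size=n // 2, replace=False):
        A = 2.0 * a * rng.random() - a  # standard WOA coefficients (assumed)
        C = 2.0 * rng.random()
        p = rng.random()
        if abs(A) > 1.0:
            # Best replacement: move straight to the current best position.
            cand = x_best.copy()
        elif p < 0.5:
            # Shrink-wrap step around the current best position.
            cand = x_best - A * np.abs(C * x_best - pop[i])
        else:
            # Logarithmic spiral step; l is uniform in [-1, 1].
            l = rng.uniform(-1.0, 1.0)
            dist = np.abs(x_best - pop[i])
            cand = dist * np.exp(b * l) * np.cos(2.0 * np.pi * l) + x_best
        f_cand = objective(cand)
        if f_cand < fit[i]:
            pop[i], fit[i] = cand, f_cand  # keep the improved candidate
        else:
            pop[i], fit[i] = x_best, f_best  # eliminate: replace with the best
    return pop, fit
```

Under these assumptions no selected individual can end the pass with a worse objective value than it started with, since it either improves or is overwritten by the current best solution.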

The pseudocode of the EPO algorithm is as follows:

Algorithm 3: EPO algorithm
1: Initialize the EPO parameter
2: Initialize the solutions' positions by the LHS initialization strategy
3: For i = 1: Max_iter do
4: Calculate the fitness function
5: Find the best position
6: For j = 1:N do
7: St= randi([1, 4])
8: Behavior 1: the foraging behavior
9: If St= =1 then
10: Update position by eqn (2)
11: Behavior 2: the staying behavior
12: Else if St= =2 then
13: Update position by eqn (8)
14: Behavior 3: the communicating behavior
15: Else if St= =3 then
16: Update position by eqn (6)
17: Behavior 4: fearful behavior towards strangers
18: Else if St= =4 then
19: Update position by eqn (7)
20: Random perturbation by eqn (9)
21: Boundary correction by eqn (11)
22: END
23: END
24: CAW elimination: eliminate inferior solutions periodically by eqn (13)
25: Return the best solution
26: END
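Step 2 of Algorithm 3 relies on Latin hypercube sampling. A minimal pure-Python sketch of such an initialization is given below; the function name `lhs_population`, the bounds, and the population size are illustrative assumptions, and the authors' exact sampling details may differ.

```python
import random


def lhs_population(n, lower, upper, seed=None):
    """Latin hypercube sample of n points inside the box [lower, upper].

    Each dimension is cut into n equal strata, and every stratum is used
    exactly once, so the initial population covers the range of every
    decision variable evenly.
    """
    rng = random.Random(seed)
    dim = len(lower)
    population = [[0.0] * dim for _ in range(n)]
    for d in range(dim):
        strata = list(range(n))
        rng.shuffle(strata)  # assign one stratum per individual
        width = (upper[d] - lower[d]) / n
        for i, s in enumerate(strata):
            # Uniform draw inside the assigned stratum.
            population[i][d] = lower[d] + (s + rng.random()) * width
    return population


# Hypothetical 3-dimensional search box with 20 individuals.
pop = lhs_population(20, [-10.0, -10.0, -10.0], [10.0, 10.0, 10.0], seed=1)
```

Compared with plain uniform random initialization, no two individuals share a stratum in any single dimension, which is what "improves the population diversity from the source" amounts to in practice.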

The flowchart of the EPO algorithm is shown in Fig. 2.


image file: d5ew00882d-f2.tif
Fig. 2 Flowchart of the EPO.

3.4. Hybrid EPO–BPNN algorithm

In this section, a COD prediction algorithm based on the EPO–BPNN model is constructed to improve the accuracy and stability of predicting small concentrations of COD in mixed solutions. The EPO algorithm is used to optimize the BPNN and obtain its initial weights and thresholds, which makes the BPNN more suitable for detecting small concentrations of COD in mixed solutions. The key steps of the EPO–BPNN model are shown in Fig. 3.
image file: d5ew00882d-f3.tif
Fig. 3 The research flowchart of this study.
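To make concrete what "optimizing the weights and thresholds of the BPNN" means for a population-based optimizer, the sketch below decodes a flat parameter vector into a single-hidden-layer network and scores it by RMSE. The layer sizes, the tanh hidden activation, the linear output, and all names are assumptions made for illustration; the network architecture actually used in this work is not specified in this section.

```python
import math


def unpack(vector, n_in, n_hidden):
    """Split a flat vector into hidden weights, hidden thresholds,
    output weights, and the output threshold."""
    k = n_hidden * n_in
    w1 = [vector[h * n_in:(h + 1) * n_in] for h in range(n_hidden)]
    b1 = vector[k:k + n_hidden]
    w2 = vector[k + n_hidden:k + 2 * n_hidden]
    b2 = vector[k + 2 * n_hidden]
    return w1, b1, w2, b2


def predict(vector, x, n_hidden):
    """Forward pass: tanh hidden layer, linear output (assumed)."""
    w1, b1, w2, b2 = unpack(vector, len(x), n_hidden)
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def rmse_fitness(vector, X, y, n_hidden):
    """Fitness of one candidate solution: RMSE over the training samples."""
    sq = [(predict(vector, x, n_hidden) - t) ** 2 for x, t in zip(X, y)]
    return math.sqrt(sum(sq) / len(sq))


n_in, n_hidden = 2, 3  # hypothetical sizes
n_params = n_hidden * n_in + n_hidden + n_hidden + 1  # search-space dimension
zero_fit = rmse_fitness([0.0] * n_params, [[1.0, 2.0], [3.0, 4.0]], [3.0, 4.0], n_hidden)
```

Each individual in the optimizer is one such vector of length `n_params`, and the best vector found would then seed the BPNN before gradient-based training.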

1. Accumulate spectral data of simulated drinking water mixed solutions and reservoir solutions.

2. For the EPO–BPNN prediction model, determine the input and output variables. In this study, the COD concentration was taken as the output variable, and the UV-vis absorption spectrum data of the mixed solution and the solution concentration were selected as input variables.

3. Before applying the proposed model, perform feature extraction and min–max normalization on the data.

4. Divide the simulated data into two parts, with 90% of the data used for training and the remaining 10% used for testing.

5. Use the EPO–BPNN algorithm and other machine learning methods to determine the optimal weights and thresholds for the BPNN.

6. To minimize the error, use the root mean square error (RMSE) metric given in formula (17) as the fitness function.

7. Use the optimal parameters found by the EPO–BPNN to predict the COD concentration value in drinking water, and perform inverse normalization on the predicted results.

8. To validate the predictive results of the EPO–BPNN model and evaluate its predictive performance, compare the proposed model numerically with the benchmark models (i.e. BPNN, PSO–BPNN, WOA–BPNN, RIME–BPNN, and PO–BPNN) using the evaluation metrics given in formulas (16)–(18).

9. For visual inspection, use five graphical representations to compare the predictive performance of the proposed EPO–BPNN model, including performance analysis of the EPO algorithm, error convergence plots, final convergence values, radar chart and PBIAS chart.
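Steps 3, 4, and 7 above (min–max normalization, the 90/10 split, and inverse normalization of the predictions) can be sketched as follows. The helper names, the random seed, and the example concentrations are assumptions for illustration only.

```python
import random


def minmax_fit(values):
    """Return the (min, max) pair that defines the scaling."""
    return min(values), max(values)


def minmax_apply(values, lo, hi):
    """Scale values to [0, 1]."""
    return [(v - lo) / (hi - lo) for v in values]


def minmax_invert(scaled, lo, hi):
    """Map scaled predictions back to physical units (step 7)."""
    return [s * (hi - lo) + lo for s in scaled]


def split_indices(n, test_fraction=0.1, seed=0):
    """Random split into training and test indices (step 4)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]


cod = [0.0, 2.5, 5.0, 10.0, 20.0]  # hypothetical COD values, mg/L
lo, hi = minmax_fit(cod)
scaled = minmax_apply(cod, lo, hi)
restored = minmax_invert(scaled, lo, hi)
train_idx, test_idx = split_indices(100)
```

The same `(lo, hi)` pair must be kept and reused for the inverse transform, otherwise the predicted concentrations are returned on the wrong scale.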

3.5. Evaluation index

As mentioned earlier, this study simulated 100 sets of drinking water environments in the laboratory, and the spectral data of their mixed solutions are shown in Fig. 4. The 210–310 nm range was selected as the research interval for the simulation data. To improve the quality of the data input to the BPNN, the data were first min–max normalized3 to eliminate the interference of dimensional differences on model training; subsequently, the SelectKBest feature selection function and the f_regression scoring function of the Python scikit-learn library were used to extract the 20 feature wavelengths most strongly correlated with COD concentration. Table 1 shows the selected characteristic wavelengths. These preprocessing steps reduce redundant information, highlight key features, and help improve the learning efficiency and prediction accuracy of the BPNN.
image file: d5ew00882d-f4.tif
Fig. 4 The absorption spectra of the solutions.
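The wavelength selection described above scores each wavelength by a univariate regression F-statistic and keeps the top-scoring ones. The dependency-free sketch below is a stand-in for the scikit-learn call, not a copy of it: it computes F = r²/(1 − r²) · (n − 2) from the Pearson correlation r between one wavelength's absorbance and the COD concentration, and the wavelength labels and absorbance values are invented for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between one feature column and the target."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def f_score(x, y):
    """Univariate regression F-statistic derived from r."""
    r2 = pearson_r(x, y) ** 2
    return r2 / (1.0 - r2) * (len(x) - 2)


def select_k_best(columns, y, k):
    """Return the k wavelengths with the highest F-score, sorted by wavelength."""
    ranked = sorted(columns, key=lambda w: f_score(columns[w], y), reverse=True)
    return sorted(ranked[:k])


# Hypothetical absorbances at three wavelengths (nm) for five samples.
columns = {
    220: [0.11, 0.19, 0.32, 0.41, 0.49],
    254: [0.05, 0.12, 0.14, 0.22, 0.24],
    300: [0.30, 0.10, 0.40, 0.10, 0.50],
}
cod = [1.0, 2.0, 3.0, 4.0, 5.0]
selected = select_k_best(columns, cod, k=2)
```

With scikit-learn itself, the equivalent call would be `SelectKBest(score_func=f_regression, k=20)` fitted on the normalized spectra and the COD labels. Because the F-score grows monotonically with r², the ranking is the same as ranking by squared correlation.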

To verify the predictive effect of the EPO–BPNN model on COD concentration in drinking water, this article analyzes it from three dimensions: algorithm theory analysis, simulation data test set validation, and reservoir measurement data validation.

Firstly, the algorithm theory analysis examines the inherent characteristics of the algorithm and compares it with commonly used models to clarify its theoretical advantages. Secondly, for the simulated data test set, the model prediction accuracy is measured by R2, RMSE, and MAPE, the deviation of COD concentration prediction is measured by PBIAS, and the repeatability of the model prediction results is measured by RSD. Finally, in the validation of reservoir measurement data, RMSE, which intuitively reflects the actual error amplitude, and SD, which reflects the degree of deviation between the data and their average, are selected as the core measures to evaluate the actual prediction performance of the proposed model on real water samples. The formulas for the evaluation indicators are given in eqn (15)–(20).3,4,24–27

 
image file: d5ew00882d-t15.tif(15)
 
image file: d5ew00882d-t16.tif(16)
 
image file: d5ew00882d-t17.tif(17)
 
image file: d5ew00882d-t18.tif(18)
 
image file: d5ew00882d-t19.tif(19)
 
image file: d5ew00882d-t20.tif(20)
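Since eqn (15)–(20) are available here only as images, the sketch below uses the standard textbook definitions of RMSE, MAPE, and R2; these are assumed, not confirmed, to match the authors' formulas. The observed and predicted values are invented for illustration.

```python
import math


def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Mean absolute percentage error as a fraction (0.05 means 5%).

    Undefined when an observed value is 0, so blank samples need
    separate handling.
    """
    return sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)


def r2(obs, pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


obs = [2.0, 4.0, 6.0, 8.0]    # hypothetical true COD, mg/L
pred = [2.5, 3.5, 6.5, 7.5]   # hypothetical predictions, mg/L
scores = {"RMSE": rmse(obs, pred), "MAPE": mape(obs, pred), "R2": r2(obs, pred)}
```

RMSE weights every sample by its absolute error, whereas MAPE weights small concentrations more heavily, which is why both are reported for low-concentration COD.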

4. Results and discussion

The experimental results are divided into three parts: the theoretical analysis results of the EPO algorithm, the verification results of the simulation data test set of the EPO–BPNN model, and the verification results of the reservoir measurement data of the EPO–BPNN model.

4.1. Algorithm theory analysis

This section verifies the feasibility of the EPO algorithm from a theoretical perspective in three respects: performance analysis of the algorithm itself, comparison with other algorithms, and time complexity analysis. Firstly, regarding the performance of the EPO algorithm itself, the parameter space graph, search history graph, first dimension trajectory graph, diversity curve, and exploration and exploitation graph are obtained through benchmark function testing to systematically characterize its core operating characteristics. Secondly, the comparison with other algorithms focuses on convergence performance: on the one hand, error convergence graphs visually present the convergence speed and accuracy of the algorithms at a macro level; on the other hand, the final convergence value graphs (including the worst, best, mean, median, and standard deviation of the convergence results of the different algorithms) allow the convergence effect to be compared quantitatively from a statistical perspective. Finally, the time complexity analysis quantifies the differences in computational efficiency among the algorithms as a function of population size, problem dimension, and number of iterations.
4.1.1. Analyze the inherent characteristics of the algorithm. To observe the search performance of the EPO algorithm, this paper studies its inherent characteristics. The Zakharov function, expanded Schaffer's function, Levy function, high conditioned elliptic function, and Katsuura function were used as benchmark functions to test the performance of the EPO. The parameter space graph, search history graph, first dimension trajectory graph, diversity curve, and exploration and exploitation graph of the EPO are shown in Fig. 5. In the parameter space graph (Fig. 5(a)), the EPO algorithm outlines the contour of the search space, and the distribution of individuals across the multimodal structure presents the topological relationship between the global optimal solution and the local optimal solutions. This mapping of the ravines, peaks, and valleys of the solution space amounts to building a “cognitive model” of the solution space through fitness calculation and position update, which provides topological prior information for subsequent searches and helps to quickly identify high-quality solution areas. In the search history graph (Fig. 5(b)), individuals in the population tend to approach and aggregate around the global optimum during iteration. From a dynamic perspective, the EPO algorithm drives the population towards the optimal solution through information exchange mechanisms such as position guidance and fitness competition between individuals. This aggregation reflects how efficiently the convergence mechanism exploits the gradient of the solution space and its strong driving force in guiding the search process. The first dimension trajectory graph (Fig. 5(c)) shows strong fluctuations in the early iterations followed by precise convergence later, corresponding to the exploration and exploitation phases of the algorithm. The initial fluctuations stem from the algorithm's random sampling of the solution space, which escapes local optima with high exploration efficiency and essentially performs a “wide area scan” of the solution space. The later precise convergence is a refined search of the neighborhood of the optimal solution during exploitation. By iteratively optimizing the weight update formula, potential optimal combinations are discovered, completing the transition from “extensive exploration” to “precise mining”. The diversity curve (Fig. 5(d)) shows high diversity in the early stage of iteration and an adaptive reduction in the later stage, which reflects the adaptive mechanism of the EPO algorithm. The initial diversity is maintained through the random initialization of individual positions and the global search strategy, which ensure coverage of the solution space. The later adjustment is driven by fitness feedback, which dynamically compresses the search space. This adaptivity results from the algorithm's dynamic allocation of search resources, which balances the needs of global exploration and local convergence. In the exploration and exploitation graph (Fig. 5(e)), the EPO algorithm is characterized by high exploration at the initial stage and high exploitation at the later stage. From the perspective of computing resource allocation, the algorithm dynamically adjusts the “computing power ratio” according to the search phase: at the initial stage, more resources are used for global traversal, and at the later stage, the search focuses on local optimization, so that computing resources are used efficiently. By balancing global and local search in this way, weight and threshold optimization is completed, laying a solid foundation for improving the performance of the BPNN. From the perspective of the algorithm mechanism, this explains how the EPO adapts to complex optimization tasks.
image file: d5ew00882d-f5.tif
Fig. 5 Performance analysis of the EPO algorithm: (a) parameter space graph, (b) search history graph, (c) first dimension trajectory graph, (d) diversity curve and (e) exploration and exploitation graph.
4.1.2. Comparison with other models. To analyze the performance of the EPO algorithm, this paper compared it with the PO, WOA, PSO, and the more recent RIME algorithm. The test functions cover the classic CEC2005 benchmark functions as well as the CEC2022 benchmark functions, which have been widely used in recent years. CEC2005 contains the high-dimensional F1–F13 benchmark test functions, which are representative for evaluating an algorithm's ability to deal with complex and high-dimensional data, while the CEC2022 test functions used here comprise nine functions in total, including the single-peak function (F1), basis functions (F2–F5), hybrid functions (F7–F8), and combined functions (F9–F10). Together, these functions allow the algorithm performance to be assessed across different distributions of optimal solutions.

Simulation experimental conditions: set the number of individuals in the population to 20 and the maximum number of iterations to 50. For each algorithm, run it independently 100 times. The average of the results of the 100 runs is taken as the final evaluation index.
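The protocol above (20 individuals, 50 iterations, 100 independent runs, then summary statistics of the final values) can be sketched as below. The optimizer is a plain random search used purely as a stand-in, not the EPO or any of the compared algorithms; the Zakharov function is written in its usual form, and the dimension, bounds, seeds, and the use of the sample standard deviation are assumptions.

```python
import random
import statistics


def zakharov(x):
    """Zakharov benchmark function; global minimum 0 at the origin."""
    s1 = sum(v * v for v in x)
    s2 = sum(0.5 * (i + 1) * v for i, v in enumerate(x))
    return s1 + s2 ** 2 + s2 ** 4


def random_search(objective, dim, pop_size, max_iter, rng, lower=-5.0, upper=10.0):
    """Stand-in optimizer (NOT the EPO): best value over random samples."""
    best = float("inf")
    for _ in range(max_iter):
        for _ in range(pop_size):
            x = [rng.uniform(lower, upper) for _ in range(dim)]
            best = min(best, objective(x))
    return best


def summarize(finals):
    """Worst, Best, Mean, Median, and Std of the final convergence values."""
    return {
        "Worst": max(finals),
        "Best": min(finals),
        "Mean": statistics.mean(finals),
        "Median": statistics.median(finals),
        "Std": statistics.stdev(finals),
    }


finals = [
    random_search(zakharov, dim=5, pop_size=20, max_iter=50, rng=random.Random(run))
    for run in range(100)
]
stats = summarize(finals)
```

Seeding each run separately keeps the 100 runs independent yet reproducible, so the five statistics can be regenerated exactly when algorithms are compared.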

The error convergence plots of the EPO, PO, WOA, PSO, and RIME algorithms on the CEC2005 benchmark test functions are shown in Fig. 6. Compared to the other algorithms, the EPO shows superior performance on the CEC2005 benchmark functions. Within the same number of iterations, the EPO algorithm converges significantly faster in the early stage. This is attributed to the LHS and PRB strategies, which improve population diversity and expand the search range at initialization and during the search, and to the CAW strategy, which is intended to remedy the slow convergence of the BPNN and keep the network from falling into local optima, so that the algorithm approaches the optimal solution more quickly while maintaining a higher degree of accuracy. This advantage gives the EPO greater efficiency and accuracy when dealing with complex problems.


image file: d5ew00882d-f6.tif
Fig. 6 Error convergence plots of the five algorithms on the CEC2005 benchmark test function.

The error convergence plots of the EPO, PO, WOA, PSO, and RIME algorithms on the CEC2022 benchmark test functions are shown in Fig. 7. Although the convergence of the EPO is slightly worse than that of some of the other algorithms on some basis functions (e.g., F3) and hybrid functions (e.g., F9), it still demonstrates faster convergence and lower average error on most benchmark test functions. The results on the single-peak benchmark test function show that the EPO algorithm converges faster during the search, mainly because the PRB position update strategy accelerates the search for the global optimal solution during iteration. On most of the multi-peak, hybrid, and combined benchmark test functions, the EPO converges extremely quickly in the initial stage, because the LHS population initialization strategy enriches the population diversity at the outset, expands the search range, and lays the foundation for finding the global optimal solution. In addition, the EPO reaches high convergence accuracy within a small number of iterations, which is attributed to the CAW individual elimination strategy and the resulting strong ability to search for the global optimal solution.


image file: d5ew00882d-f7.tif
Fig. 7 Error convergence plots of the five algorithms on the CEC2022 benchmark test function.

The worst (Worst), best (Best), mean (Mean), median (Median), and standard deviation (Std) of the final convergence values of the different algorithms on the CEC2005 and CEC2022 benchmark test functions are plotted as bar charts in Fig. 8 and 9.


image file: d5ew00882d-f8.tif
Fig. 8 Final convergence values of the five algorithms on the CEC2005 benchmark function: (a) Worst convergence value (b) Best convergence value (c) Mean of convergence value (d) Median of convergence value (e) Standard deviation of convergence value.

image file: d5ew00882d-f9.tif
Fig. 9 Final convergence values of the five algorithms on the CEC2022 benchmark function: (a) Worst convergence value (b) Best convergence value (c) Mean of convergence value (d) Median of convergence value (e) Standard deviation of convergence value.

As can be seen from Fig. 8, on the first 13 benchmark test functions of CEC2005, the EPO shows the best results among the compared algorithms in the five metrics of Worst, Best, Mean, Median, and Std. This indicates that the EPO algorithm converges with relatively high accuracy and consistency. Specifically, the closer the best value is to the standard value, the higher the convergence accuracy, i.e., the EPO can find solutions that are close to the global optimal solution. The closer the median and mean values are to the standard values, the better the stability of the algorithm, i.e., it performs consistently over multiple runs. The smaller the worst value and the standard deviation, the less the convergence results fluctuate across repeated experiments. These indicators show that the EPO algorithm not only reduces the variation of the solution in the optimization problem but also keeps fluctuations small over multiple runs, thus improving the consistency and accuracy of the overall performance. As can be seen from Fig. 9, on the first 10 benchmark test functions of CEC2022, the EPO shows good results on the vast majority of functions in the five indicators of Worst, Best, Mean, Median, and Std.

Due to the large volume of convergence value data of different algorithms within the corresponding number of iterations, we have uploaded the data related to Fig. 6–9 to https://doi.org/10.6084/m9.figshare.30589553 to facilitate quantitative comparison.

4.1.3. Analysis of algorithm time complexity. This paper analyzes the time complexity of the PSO, WOA, RIME, PO, and EPO, mainly from three aspects: the initialization stage of the algorithm, the single core iteration process, and the overall computational cost. Among the involved variables, N represents the population size, dim indicates the dimension of the decision variable, F is the computational complexity of the objective function, Max_iter is the maximum number of iterations, and log N reflects the computational overhead caused by the sorting operation. As can be seen from Table 2, the time complexity of the EPO is slightly higher than that of comparison algorithms such as PSO, because the introduction of sorting operations and diversified search strategies increases additional computational overhead. However, according to the convergence results in Fig. 6 and 7, the EPO has higher convergence accuracy and faster speed for high-dimensional nonlinear functions under the same number of iterations. The moderate increase in its computational complexity has led to stronger adaptability to complex optimization requirements and a substantial improvement in solution quality.
Table 2 Comparison of the time complexity of different algorithms
Method | Initialization phase | Single core iteration | Total complexity
PSO | O(N × dim + N × F) | O(N × dim + N × F) | O(Max_iter × N × (dim + F))
WOA | O(N × dim + N × F) | O(N × dim + N × F) | O(Max_iter × N × (dim + F))
RIME | O(N × dim + N × F) | O(N × dim + N × F) | O(Max_iter × N × (dim + F))
PO | O(N × dim + N × F + N log N) | O(N × dim + N × F + N log N) | O(Max_iter × N × (dim + F + log N))
EPO | O(N × dim + N × F + N log N) | O(N × dim + N × F + N log N) | O(Max_iter × N × (dim + F + log N))


4.2. Simulation data test set validation

Artificial intelligence methods such as the BPNN, PSO–BPNN, WOA–BPNN, RIME–BPNN, PO–BPNN, and EPO–BPNN have been implemented in predicting COD concentration in drinking water. All models can be used to predict COD content in drinking water. Based on R2, RMSE, and MAPE accuracy indicators, it was found that the EPO–BPNN model has the highest accuracy in predicting COD concentration in drinking water. Based on the PBIAS index, the EPO–BPNN model has the smallest systematic deviation in predicting COD concentration in drinking water. Based on the RSD index, the EPO–BPNN model meets the strict requirement of less than 3% RSD specified by the national standard for predicting COD concentration in drinking water. The prediction model has good repeatability and high stability. These comprehensive evaluation measures collectively emphasize the ability of the EPO–BPNN model to accurately predict COD concentration in drinking water.
4.2.1. Prediction accuracy of the EPO–BPNN model. The experimental data were 100 sets of UV-vis absorbance data for COD solutions ranging from 0 to 20 mg L−1 and turbidity mixture solutions ranging from 0 to 5 NTU. The data set was divided into a training set and a test set at a ratio of 9:1. On the test set, the COD concentration was predicted using the BPNN, PSO–BPNN, RIME–BPNN, WOA–BPNN, PO–BPNN, and EPO–BPNN algorithms. The results are shown in Fig. 10, and the specific data are shown in Table 3. Fig. 10(a) presents the accuracy metrics (R2, RMSE, and MAPE) of the test set under the different models. Among them, the BPNN model without an optimization algorithm performed the worst: its RMSE and MAPE were the highest among all the models, while its R2 was the lowest. To observe in more detail the differences in accuracy indicators among the five optimized models, Fig. 10(b) was plotted without the BPNN, because the performance gap between the BPNN and these five models is large. From Fig. 10(b), it can be seen that the R2 of the EPO–BPNN model is the highest among all the models, while its RMSE and MAPE are the lowest. This indicates that the EPO optimization algorithm can effectively improve the predictive performance of the BPNN. By enhancing the model's ability to fit data patterns (higher R2) and reducing both the absolute magnitude (lower RMSE) and the relative proportion (lower MAPE) of the prediction errors, the EPO–BPNN achieves better prediction accuracy and stability on the test set than the comparison models.
image file: d5ew00882d-f10.tif
Fig. 10 Precision graphs of different models.
Table 3 Precision comparison of different prediction algorithms
Method | TUR RMSE (mg L−1) | TUR MAPE | TUR R2 | COD RMSE (mg L−1) | COD MAPE | COD R2
BPNN | 1.2642 | 0.7200 | 0.3467 | 3.6129 | 0.6039 | 0.6436
PSO–BPNN | 0.5891 | 0.3511 | 0.9369 | 0.4896 | 0.0426 | 0.9952
RIME–BPNN | 0.6024 | 0.3581 | 0.9301 | 0.4256 | 0.0410 | 0.9963
WOA–BPNN | 0.7050 | 0.3744 | 0.9126 | 0.4003 | 0.0357 | 0.9970
PO–BPNN | 0.5813 | 0.3505 | 0.9308 | 0.4265 | 0.0376 | 0.9974
EPO–BPNN | 0.4932 | 0.3405 | 0.9396 | 0.3930 | 0.0347 | 0.9976


As can be seen from Table 3, the EPO–BPNN prediction algorithm has the highest accuracy in predicting turbidity and COD concentrations compared to the other algorithms. In predicting small concentrations of COD, the RMSE of the algorithm is 0.3930 mg L−1, which is 89.12%, 19.73%, 7.66%, 1.82%, and 7.86% lower than those of the BPNN, PSO–BPNN, RIME–BPNN, WOA–BPNN, and PO–BPNN algorithms, respectively. The MAPE of the algorithm is 3.47%, which is 94.25%, 18.54%, 15.37%, 2.80%, and 7.71% lower than those of the BPNN, PSO–BPNN, RIME–BPNN, WOA–BPNN, and PO–BPNN algorithms, respectively. Its R2 reaches 0.9976, the highest among the compared methods. The experimental results show that the small concentration COD prediction algorithm based on the EPO–BPNN improves the prediction accuracy of small concentration COD values.
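The percentage reductions quoted above follow from the COD columns of Table 3 as (baseline − EPO–BPNN)/baseline × 100. The short sketch below recomputes them from the rounded table entries; the function and variable names are illustrative, and values recomputed from rounded entries may differ from the quoted ones in the last digit.

```python
def reduction_pct(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100.0


# COD RMSE (mg/L) and MAPE (fraction) of the comparison models, from Table 3.
cod_rmse = {"BPNN": 3.6129, "PSO-BPNN": 0.4896, "RIME-BPNN": 0.4256,
            "WOA-BPNN": 0.4003, "PO-BPNN": 0.4265}
cod_mape = {"BPNN": 0.6039, "PSO-BPNN": 0.0426, "RIME-BPNN": 0.0410,
            "WOA-BPNN": 0.0357, "PO-BPNN": 0.0376}
epo_rmse, epo_mape = 0.3930, 0.0347

rmse_reductions = {m: round(reduction_pct(v, epo_rmse), 2) for m, v in cod_rmse.items()}
mape_reductions = {m: round(reduction_pct(v, epo_mape), 2) for m, v in cod_mape.items()}
```

The large reduction relative to the plain BPNN and the much smaller one relative to the WOA–BPNN show that most of the gain comes from using any optimizer at all, with the EPO adding a further, narrower improvement over the other optimizers.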

4.2.2. Systematic deviation analysis of model prediction. PBIAS is a key indicator for measuring systematic bias in model predictions. Its sign and magnitude carry different meanings: when PBIAS > 0, the model overestimates the actual value as a whole; when PBIAS < 0, the model underestimates the actual value as a whole; the ideal state is PBIAS = 0, representing no systematic bias.

From Fig. 11, it can be seen that the PBIAS of the BPNN model is as high as 38.585%, far greater than 0, indicating a systematic overestimation of the actual value; the PBIAS of the PSO–BPNN is −1.473%, that of the RIME–BPNN is 0.898%, that of the WOA–BPNN is −0.623%, and that of the PO–BPNN is 0.449%. Although these values are close to 0, slight overestimation or underestimation remains. The PBIAS of the EPO–BPNN is −0.081%, which is approximately 0, reflecting its advantage in controlling prediction bias. This indicates that EPO optimization effectively corrects the systematic error of the BPNN, bringing the model prediction closer to the actual value and demonstrating better stability and accuracy in terms of bias control.
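A minimal sketch of a PBIAS computation consistent with the sign convention stated above (positive means overestimation) is given below. The exact formula used in this work is not reproduced in this section, so the form 100 × Σ(predicted − observed)/Σ(observed) is an assumption, as are the example values.

```python
def pbias(obs, pred):
    """Percent bias; positive values mean the model overestimates on average."""
    return 100.0 * sum(p - o for o, p in zip(obs, pred)) / sum(obs)


obs = [5.0, 10.0, 15.0]             # hypothetical true COD, mg/L
over = pbias(obs, [5.5, 10.5, 15.5])   # every prediction 0.5 mg/L too high
under = pbias(obs, [4.5, 9.5, 14.5])   # every prediction 0.5 mg/L too low
mixed = pbias(obs, [5.5, 10.0, 14.5])  # errors cancel
```

The last case shows why PBIAS complements RMSE and does not replace it: errors of opposite sign cancel, so a PBIAS near 0 indicates the absence of systematic bias, not the absence of error.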


image file: d5ew00882d-f11.tif
Fig. 11 PBIAS values of COD predicted by different models.
4.2.3. Repeatability verification of the EPO–BPNN algorithm. To verify the repeatability of the EPO–BPNN method, with reference to the method stability testing specifications in the relevant national standards, two samples were selected at random from the 10 test-set samples of this study for repeatability testing. The test results are shown in Table 4.
Table 4 Concentration information of measured water samples
Random test sample | Concentration value | Number of experiments/concentration value: 1 | 2 | 3 | 4 | 5 | 6 | RSD | Standard
Sample 1 | 10 mg L−1 | 10.44 | 10.63 | 10.56 | 10.23 | 10.25 | 10.48 | 1.56% | <3%
Sample 2 | 18 mg L−1 | 18.11 | 18.32 | 17.46 | 17.45 | 18.36 | 17.89 | 2.26% | <3%


As can be seen from Table 4, the RSD values of COD concentration are at a low level and meet the stringent requirements of less than 3% as stipulated in the national standards. The experimental results show that the small concentration COD prediction algorithm based on the EPO–BPNN is excellent in terms of repeatability and has high stability.
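The RSD values in Table 4 can be recomputed from the six repeated predictions of each sample. In the sketch below, RSD is taken as the sample standard deviation (n − 1 in the denominator) divided by the mean; this choice is an assumption, made because it reproduces the tabulated 1.56% and 2.26%.

```python
import statistics


def rsd_percent(values):
    """Relative standard deviation in percent, using the sample (n - 1) SD."""
    return statistics.stdev(values) / statistics.mean(values) * 100.0


# Six repeated predictions per sample, from Table 4 (mg/L).
sample_1 = [10.44, 10.63, 10.56, 10.23, 10.25, 10.48]  # nominal 10 mg/L
sample_2 = [18.11, 18.32, 17.46, 17.45, 18.36, 17.89]  # nominal 18 mg/L

rsd_1 = rsd_percent(sample_1)  # about 1.56
rsd_2 = rsd_percent(sample_2)  # about 2.26
meets_standard = rsd_1 < 3.0 and rsd_2 < 3.0
```

Because RSD is normalized by the mean, it measures repeatability only; the offset of the mean prediction from the nominal concentration is captured by the accuracy metrics instead.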

4.3. Reservoir measurement data validation

To verify the performance of the model in actual water sample prediction, the UV-vis spectroscopy data of two water samples from the reservoir, taken at the water inlet (SAMPLE1) and the ski resort (SAMPLE2), were selected. The spectral data were input into the different models, and each prediction was repeated 6 times. The RMSE index, which reflects the average error amplitude between the predicted value and the true value, and the SD index, which reflects the dispersion of the repeated prediction results, were used as evaluation indicators. According to Table 5, in the SAMPLE1 prediction, the RMSE of the EPO–BPNN is 3.1318 mg L−1, the lowest among the models, and the SD is also the smallest at 0.2876 mg L−1, indicating that its prediction error is small and the results are stable; in the SAMPLE2 prediction, the RMSE of the EPO–BPNN is 2.7109 mg L−1 and the SD is 0.3437 mg L−1, again the lowest values among the models. By comparison, the EPO–BPNN outperforms the BPNN, PSO–BPNN, WOA–BPNN, RIME–BPNN, and PO–BPNN in both absolute error control and stability of repeated predictions on actual water samples, demonstrating more reliable practical application performance.
Table 5 Comparison of prediction performance of multiple models for reservoir water samples
Sample | Method | Repeat 1 | Repeat 2 | Repeat 3 | Repeat 4 | Repeat 5 | Repeat 6 | RMSE (mg L−1) | SD (mg L−1)
SAMPLE1 | BPNN | 9.269810 | 8.372337 | 5.298523 | 6.421083 | 10.38374 | 5.041649 | 5.7451 | 2.0106
SAMPLE1 | PSO–BPNN | 8.295607 | 7.102257 | 6.818201 | 9.539236 | 5.873040 | 5.608294 | 5.6483 | 1.3615
SAMPLE1 | WOA–BPNN | 6.628190 | 9.098687 | 6.628190 | 8.352554 | 8.519474 | 8.555843 | 4.9044 | 0.9718
SAMPLE1 | RIME–BPNN | 4.590708 | 9.282539 | 4.887689 | 8.767553 | 7.473460 | 8.132808 | 5.8651 | 1.8209
SAMPLE1 | PO–BPNN | 8.437826 | 10.40252 | 10.8165 | 8.248908 | 11.46336 | 9.578685 | 3.2035 | 1.1881
SAMPLE1 | EPO–BPNN | 10.10278 | 9.642706 | 9.786047 | 9.394301 | 9.204108 | 9.509941 | 3.1318 | 0.2876
SAMPLE2 | BPNN | 9.673257 | 8.38944 | 5.129623 | 6.725695 | 10.90253 | 5.142403 | 6.9756 | 2.1876
SAMPLE2 | PSO–BPNN | 9.258589 | 8.453915 | 7.572746 | 10.97453 | 6.314971 | 5.747759 | 6.4864 | 1.7661
SAMPLE2 | WOA–BPNN | 8.100370 | 11.68670 | 8.002072 | 10.80380 | 10.63405 | 10.81767 | 4.6247 | 1.4237
SAMPLE2 | RIME–BPNN | 5.402566 | 12.00706 | 6.174425 | 10.63089 | 9.523114 | 10.29072 | 5.9221 | 2.4002
SAMPLE2 | PO–BPNN | 10.10042 | 12.40033 | 9.122895 | 9.351872 | 13.90045 | 11.46351 | 3.7501 | 1.7136
SAMPLE2 | EPO–BPNN | 12.17917 | 11.70645 | 11.88201 | 11.23247 | 11.20955 | 11.70868 | 2.7109 | 0.3437
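The SD column of Table 5 can be checked against the six repeated predictions. Recomputing the EPO–BPNN rows suggests that the population form of the standard deviation (dividing by n) reproduces the tabulated values; this is an inference from the numbers, not something the text states. The RMSE column is not recomputed here because the reference COD values of the reservoir samples are not listed in this section.

```python
import statistics

# Six repeated EPO-BPNN predictions per reservoir sample, from Table 5 (mg/L).
epo_sample1 = [10.10278, 9.642706, 9.786047, 9.394301, 9.204108, 9.509941]
epo_sample2 = [12.17917, 11.70645, 11.88201, 11.23247, 11.20955, 11.70868]

# Population standard deviation (n in the denominator), assumed from the data.
sd_1 = statistics.pstdev(epo_sample1)  # about 0.2876
sd_2 = statistics.pstdev(epo_sample2)  # about 0.3437

# Mean prediction of each sample, for reference.
mean_1 = statistics.mean(epo_sample1)
mean_2 = statistics.mean(epo_sample2)
```

A small SD with a comparatively large RMSE, as seen in these rows, means the repeated predictions agree closely with each other while sharing a common offset from the reference value.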


4.4. Discussion

The precise detection of low concentration chemical oxygen demand (COD) in drinking water is a key challenge in the field of water quality monitoring. Among traditional methods of water quality parameter detection, the measurement of biochemical oxygen demand (BOD) takes up to 5 days, which makes it difficult to meet the needs of rapid monitoring; the potassium permanganate index method, commonly used to reflect the organic matter content in water bodies, is prone to bias in detecting low concentrations of COD due to factors such as oxidation conditions and types of organic matter, making it difficult to accurately reflect the total amount of organic matter in water bodies. Artificial neural networks (ANNs), especially the BPNN, have shown potential in water quality parameter prediction due to their strong nonlinear fitting ability. However, the BPNN has inherent drawbacks such as being prone to local optima and slow convergence speed, and conventional applications require a large amount of training data, which limits its adaptability in low concentration COD detection scenarios in drinking water. Research on hybrid models combining swarm intelligence optimization algorithms with the BPNN has therefore become an important direction for overcoming this dilemma, and in recent years such hybrid models have been widely applied in the field of water quality prediction. Fu et al. proposed the EEMD–MCS–BPNN model,17 which reduces the weight threshold MAPE by 13.49% through ensemble empirical mode decomposition (EEMD) denoising and improved cuckoo algorithm (MCS) optimization. Zhang et al. optimized the BPNN using an improved microbial foraging algorithm,18 resulting in a MAPE as low as 0.0136. Luo et al. used logistic tent composite mapping to improve the sparrow search algorithm (LT-SSA) to optimize the BPNN,28 reducing the MAE, MSE, and RMSE by 36.17%, 17.55%, and 51.75%, respectively. Xu et al. optimized the BPNN by combining the restricted Boltzmann machine (RBM) and genetic simulated annealing algorithm (GASA),19 resulting in a 6.12% improvement in prediction accuracy compared to the traditional BPNN. The SWOA–BPNN model proposed by Dai et al.20 outperforms the PSO–BPNN and WOA–BPNN in both convergence speed and prediction accuracy. These studies all indicate that swarm intelligence optimization algorithms can effectively improve the performance of the BPNN.

However, the high-dimensional and strongly coupled characteristics of BPNN parameters cause classic swarm intelligence optimization algorithms such as PSO, WOA, and RIME to face successive bottlenecks during optimization, such as insufficient initial exploration, an imbalanced search, and attenuation of evolutionary momentum, and these bottlenecks cannot be solved by a single strategy. The three strategies of LHS, PRB and CAW included in the EPO proposed in this study are not simply superimposed, but form a closed-loop collaborative mechanism of source control – process regulation – result reinforcement. Their interactions and novelty can be seen from the way they resolve these bottlenecks in turn.

Firstly, the LHS initialization strategy provides a high-quality starting point for the PRB search strategy. Without the “full coverage and low overlap” initial population that LHS achieves in high-dimensional space through Latin hypercube sampling, the initial population clusters in local regions; the PRB strategy then misreads this clustering as a convergence signal, enters local fine search prematurely, and cannot contribute to global exploration. Secondly, the dynamic search mechanism of the PRB strategy sustains the initial advantage created by LHS. Even when LHS generates a high-quality initial population, its diversity still decreases rapidly if the search method of PSO (which updates particle positions through dynamic velocity but tends to lose diversity because particles aggregate rapidly towards the global optimum) or the random perturbation search of RIME is adopted. The PRB strategy expands the search range through the combined effect of three specific mechanisms, avoids premature convergence of the population, retains the population diversity established by LHS, and provides a foundation for subsequent evolution. Thirdly, the CAW elimination strategy reinforces and corrects the effects of the first two strategies. The synergy of the LHS and PRB strategies enhances the overall quality of the population, but some low-fitness individuals are still generated. If RIME's indiscriminate elimination were adopted, individuals carrying information about a potential optimal solution might be deleted by mistake. In contrast, the CAW strategy extracts information from the elite individuals screened out by the first two strategies and generates new individuals in a targeted manner, which prevents inferior individuals from occupying iterative resources and accelerates convergence to the global optimal solution.
In this way, the optimization gains of the first two strategies are further transformed into directed evolution. Meanwhile, the high-quality new individuals generated by CAW feed back into the state perception mechanism of PRB, making signals such as population distribution density and fitness differences clearer and helping PRB to judge the search status of the population more accurately.
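
The LHS initialization idea discussed above can be illustrated with a minimal, dependency-free sketch of Latin hypercube sampling. This is not the authors' implementation: the function name `latin_hypercube`, the population size, the parameter bounds and the seed are assumptions chosen only for illustration. The sketch splits each dimension into as many equal strata as there are individuals and uses every stratum exactly once, which is what gives the "full coverage and low overlap" property.

```python
import random


def latin_hypercube(n_points, bounds, seed=None):
    """Draw n_points samples inside `bounds` by Latin hypercube sampling.

    Each dimension is split into n_points equal strata, and every stratum
    is used exactly once per dimension.
    (Hypothetical helper for illustration, not the paper's code.)
    """
    rng = random.Random(seed)
    dim = len(bounds)
    columns = []
    for low, high in bounds:
        # Independent random ordering of the strata for this dimension.
        strata = list(range(n_points))
        rng.shuffle(strata)
        width = (high - low) / n_points
        # One uniformly distributed point inside each stratum.
        columns.append([low + (s + rng.random()) * width for s in strata])
    # Transpose the per-dimension columns into per-individual vectors.
    return [[columns[d][i] for d in range(dim)] for i in range(n_points)]


# Assumed example: 30 individuals, 5 network parameters in [-1, 1].
population = latin_hypercube(n_points=30, bounds=[(-1.0, 1.0)] * 5, seed=0)
```

In an optimizer–BPNN setting, each row of `population` would be one candidate vector of weights and thresholds; the dimension count and bounds used here are placeholders.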

4.5. Limitations and future scope

Although the EPO–BPNN produced encouraging results for COD prediction, this study still has certain limitations. The key drawbacks of this work, and suggestions for addressing them, are listed below.

1. The current research simulates the drinking water environment using COD concentration and turbidity as model input parameters, so its robustness against other common interfering substances has not been fully verified. Typical interfering substances in actual drinking water, such as dissolved organic matter, nitrates and various ions, as well as parameters such as temperature and conductivity, can also interfere with COD prediction. In the future, the water sample set will be expanded and diversified. The characteristic spectral bands of target substances will be screened through competitive adaptive reweighted sampling, and an interference compensation mechanism will be constructed at the same time. For instance, based on a spectral library of typical interferents, a residual network could learn the coupling between the target signal and the interferent signal and dynamically generate a compensation factor to correct the original spectrum, thereby enhancing the robustness of the model in complex water.

2. The model proposed in this study was evaluated only for COD concentration prediction during normal water periods, whereas the water quality of the reservoir changes dynamically across hydrological periods such as glacial periods and dry seasons. Further work can therefore validate the model comprehensively in scenarios such as glacial periods and dry seasons, in order to improve its applicability under different hydrological conditions.

3. Currently, the validation data for the model include only 100 simulated datasets and 2 real datasets, so sample coverage is limited. In particular, owing to on-site sampling conditions and the limited accessibility of open drinking water reservoirs, this study collected actual water samples from only two reservoirs in Jilin Province that met the screening criteria, and the coverage and diversity of the samples were seriously insufficient. With such data, it is difficult for a concentration prediction model built on spectral data to learn the general relationship between spectra and concentrations. A risk of overfitting exists, and the generalization ability and reliability of the model on actual water samples have not been fully verified. In the future, it is planned to expand the sampling scope to over 60 open drinking water reservoirs of different types, covering various river basins, climate zones and hydrological cycles. K-fold cross-validation will be carried out on the expanded dataset: by randomly dividing the data into training and validation sets multiple times, the fluctuation of model performance across data subsets can be evaluated to verify its stability. A third-party measurement dataset from open drinking water reservoirs will also be introduced for independent testing, to evaluate the model's prediction accuracy in new scenarios and verify its generalization ability.

4. To expand the practical application value of the EPO–BPNN model, it can be extended to simulate and predict other hydrological variables such as total nitrogen concentration and total phosphorus concentration, assisting in the comprehensive evaluation of water quality levels and further testing the potential of the model in the field of hydrological environment monitoring.
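
The K-fold cross-validation planned in limitation 3 could be organized along the following lines. This is a minimal sketch under stated assumptions: the helpers `k_fold_indices` and `cross_validate` are hypothetical names, the data are synthetic, and the "model" is a trivial mean predictor standing in for the EPO–BPNN, which is not reproduced here.

```python
import random


def k_fold_indices(n_samples, k, seed=None):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(x, y, k, fit, score, seed=None):
    """Return one validation score per fold; `fit` and `score` are caller-supplied."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(x), k, seed):
        model = fit([x[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [x[i] for i in val_idx], [y[i] for i in val_idx]))
    return scores


# Toy stand-in model (assumption): always predict the training mean.
def fit_mean(x_train, y_train):
    return sum(y_train) / len(y_train)


def rmse_of_mean(model, x_val, y_val):
    return (sum((model - t) ** 2 for t in y_val) / len(y_val)) ** 0.5


# Synthetic data, not from the paper.
x = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
fold_scores = cross_validate(x, y, k=5, fit=fit_mean, score=rmse_of_mean, seed=0)
mean_score = sum(fold_scores) / len(fold_scores)
# Sample standard deviation across folds, as a simple stability indicator.
spread = (sum((s - mean_score) ** 2 for s in fold_scores) / (len(fold_scores) - 1)) ** 0.5
```

The spread of the per-fold scores is one simple way to quantify the "performance fluctuations" mentioned above; a small spread relative to the mean would suggest stable behaviour across data subsets.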

5. Conclusion

To address the difficulty of accurately determining COD concentration in drinking water under turbidity interference, this paper proposes a small-concentration COD prediction algorithm based on an EPO–BPNN. Firstly, feature wavelength selection is completed using the SelectKBest selector with the f_regression scoring function from the scikit-learn library. Secondly, the EPO algorithm is proposed. Based on the PO algorithm, the EPO improves the accuracy and stability of predicting small and medium concentration COD values in drinking water through the synergy of three strategies: LHS, PRB and CAW. Finally, the experimental results verified the effectiveness of the algorithm from three aspects. Firstly, the theoretical analysis of the EPO algorithm showed that, compared with optimization algorithms such as PO, WOA, PSO and RIME, it performed best in convergence speed and accuracy. Secondly, the results on the simulated data test set show that the predictive performance of the EPO–BPNN model is significantly better than that of comparative models such as the BPNN and PO–BPNN, with an R² of 0.9976, an RMSE of 0.3930 mg L⁻¹, a MAPE of 3.47%, and a PBIAS of −0.081%. The maximum relative standard deviation (RSD) of the algorithm was 2.26%, which is less than 3% and meets the national standard for method reproducibility according to the Chinese Pharmacopoeia. Thirdly, the results on the measured reservoir data from the normal water period show that the model has the lowest SD and RMSE and the best detection precision.
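
For reference, the evaluation indices quoted above (R², RMSE, MAPE, PBIAS and RSD) can be computed as in the following minimal sketch. The function names, the argument names `obs` and `pred`, and the sample values are assumptions made for illustration and are not taken from the paper. In particular, the sign convention for PBIAS varies between sources; the sketch assumes observed minus predicted, and this section does not state which convention the reported value follows.

```python
import math


def r2(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rmse(obs, pred):
    """Root mean square error, in the units of the observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def mape(obs, pred):
    """Mean absolute percentage error (%); observations must be non-zero."""
    return 100.0 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)


def pbias(obs, pred):
    """Percent bias (%). ASSUMED convention: sum(obs - pred) / sum(obs)."""
    return 100.0 * sum(o - p for o, p in zip(obs, pred)) / sum(obs)


def rsd(values):
    """Relative standard deviation (%) using the sample standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return 100.0 * sd / mean


# Made-up COD values in mg/L, for illustration only.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.1, 3.9, 6.2, 7.8]
summary = {
    "R2": r2(observed, predicted),
    "RMSE": rmse(observed, predicted),
    "MAPE": mape(observed, predicted),
    "PBIAS": pbias(observed, predicted),
}
```

RSD is computed on repeated predictions of the same sample rather than on observed and predicted pairs, which is why `rsd` takes a single list.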

In this paper, we studied small-concentration COD detection in drinking water and improved its accuracy and stability in the presence of turbidity interference. The paper provides a prediction algorithm for drinking water quality testing that is expected to be widely used in practical drinking water quality monitoring, to promote the development of more accurate drinking water quality testing technology, and to help protect the safety of residents' drinking water.

Author contributions

All authors contributed to the study conception and design. Hongmei Wang: conceptualization, methodology, software, validation, visualization, formal analysis and writing – original draft. Qiaoling Du: supervision, writing – review and editing, funding acquisition, project administration and resources. Xin Wang: data curation, formal analysis and investigation.

Conflicts of interest

This work has not been carried out previously nor submitted to any other journal for its publication. All the authors of this manuscript declare no conflicts of interest after its publication.

Data availability

Data will be made available on request. We have made the original data used in the experiment available at https://doi.org/10.6084/m9.figshare.30589553.

Acknowledgements

This work was supported by the Jilin Provincial Scientific and Technological Development Program [20240302038GX]. The authors acknowledge support from the Jilin Provincial Scientific and Technological Development Program.

References

  1. X. Song, S. Sun and J. Wang, et al., Simulation of effect of reclaimed water as water source compensation on water environment in Huashan Lake, Jingshui Jishu, 2025, 44(1), 144–154 CAS.
  2. N. Chen, Impact of Floods on Ecological and Hydrological Environment of Watersheds under Background of Climate Change, Huanjing Kexue Yu Guanli, 2025, 50(1), 39–42 CAS.
  3. J. Zhao, Research on Portable COD Detector Based on UV-Vis Spectroscopy Technology, Master's Thesis, Jilin University, 2023 Search PubMed.
  4. G. Zhang, Research on Water Quality Detector Based on Ultraviolet-Visible Spectroscopy Technology, Master's Thesis, Jilin University, 2021 Search PubMed.
  5. H. Chen, Research on Multi-parameter Detection System of Drinking Water Based on Ultraviolet-Visible Absorption Spectrum, Master's Thesis, Jilin University, 2020 Search PubMed.
  6. P. Zheng, W. Zhao and J. Wang, et al., Detection of COD UV absorption spectra based on PSO-PLS hybrid algorithm, Spectrosc. Spectral Anal., 2021, 41(1), 136–140 CAS.
  7. Z. He, Relationships between UV Absorbance at 254 nm and COD∼M∼n and COD∼C∼r in Water, Guangdong Weiliang Yuansu Kexue, 2005, 12(3), 60–62,  DOI:10.3969/j.issn.1006-446X.2005.03.015.
  8. R. Jiang, X. Chai and C. Zhang, et al., A dual-wavelength spectroscopic method for the low chemical oxygen demand determination, Spectrosc. Spectral Anal., 2011, 31(7), 2007–2010,  DOI:10.3964/j.issn.1000-0593(2011)07-2007-04.
  9. W. Qi, P. Feng and B. Wei, et al., Feature wavelength optimization algorithm for water quality COD detection based on embedded particle swarm optimization-genetic algorithm, Spectrosc. Spectral Anal., 2021, 41(1), 194–200 CAS.
  10. Q. Lu, J. Zou and Y. Ye, et al., Research on the chemical oxygen demand spectral inversion model in water based on IPLS-GAN-SVM hybrid algorithm, PLoS One, 2024, 19(4), e0301902,  DOI:10.1371/journal.pone.0301902.
  11. B. Ye, B. Ye and X. Cao, et al., Water chemical oxygen demand prediction model based on the CNN and ultraviolet-visible spectroscopy, Front. Environ. Sci., 2022, 10, 1027693 CrossRef.
  12. S. Jin, Research and application of COD prediction in water quality based on UV-VIS spectrum, Master's Thesis, Nanjing University of Information Science and Technology, 2022 Search PubMed.
  13. J. Qu, Study on method for on-line COD monitoring in the wffluent of rural sewage treatment by UV-Vis continuous spectroscopy, Master's Thesis, Shanghai Jiao Tong University, 2020 Search PubMed.
  14. X. Liao, Application of BP neural network model optimization based on particle swarm optimization in second-hand housing value evaluation, Master's Thesis, Chongqing University of Technology, 2024 Search PubMed.
  15. W. Hou, J. Wang and Q. Lin, et al., Improving Clinical Foundation Models with Multi-modal Learning and Domain Adaptation for Chronic Disease Prediction, IEEE J. Biomed. Health Inform., 2025, 1–14,  DOI:10.1109/JBHI.2025.3595140.
  16. X. Fei, S. Chen and Z. Chen, et al., An interpretable data-driven approach for customer purchase prediction using cost-sensitive learning, Eng. Appl. Artif. Intell., 2024, 138(Part A) DOI:10.1016/j.engappai.2024.109344.
  17. T. Fu and M. Yang, A hybrid wind speed forecasting model based on EEMD and MCS algorithm optimize the BPNN, Qufu Shifan Daxue Xuebao, Ziran Kexueban, 2021, 47(2), 27–34 Search PubMed.
  18. C. Zhang, T. Han and B. Qian, et al., Prediction model for tap water coagulation dosing based on BPNN optimized with improved BFO, China Environ. Sci., 2021, 41(10), 4616–4623,  DOI:10.3969/j.issn.1000-6923.2021.10.017.
  19. T. Xu, Z. Liu and M. Lu, Potential High Value Passenger Forecast Based on RBM-GASA-BPNN, J. Transp. Syst. Eng. Inf. Technol., 2019, 19(4), 108–114 Search PubMed.
  20. B. Dai, T. Hu and J. Tan, et al., Power Generation Prediction of Photovoltaic Station Based on SWOA Optimized BPNN, Hubei Minzu Daxue Xuebao, Ziran Kexueban, 2021, 39(3), 321–325331,  DOI:10.13501/j.cnki.42-1908/n.2021.09.014.
  21. J. Lian, G. Hui and L. Ma, et al., Parrot optimizer: Algorithm and applications to medical problems, Comput. Biol. Med., 2024, 172 DOI:10.1016/j.compbiomed.2024.108064.
  22. D. E. Rumelhart and J. L. Mcclelland, T. P. Group, Parallel Distributed Processing, Encyclopedia of Database Systems, 1987, vol. 1, pp. 45–76 Search PubMed.
  23. M. D. McKay, R. J. Beckman and W. J. Conover, A Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code, Technometrics, 1979, 21(2), 239–245,  DOI:10.2307/1268522.
  24. X. Gong, S. Liu and W. Ye, et al., Decoupling of industrial water consumption and economic expansion in the Yangtze River Economic Belt: a comparative analysis across three Five-Year plans, Sci. Rep., 2025, 15, 21186,  DOI:10.1038/s41598-025-06042-5.
  25. S. Samantaray, A. Sahoo and Z. M. Yaseen, et al., River discharge prediction based multivariate climatological variables using hybridized long short-term memory with nature inspired algorithm, J. Hydrol., 2025, 649, 132453,  DOI:10.1016/j.jhydrol.2024.132453.
  26. B. Ritushree, S. Panda and A. Sahoo, et al., Prediction of groundwater level and potential zone identification in Keonjhar, Odisha based on machine learning and GIS techniques, Franklin Open, 2025, 11 DOI:10.1016/j.fraope.2025.100250.
  27. S. Samantaray and A. Sahoo, Groundwater level prediction using an improved ELM model integrated with hybrid particle swarm optimisation and grey wolf optimisation, Groundwater Sustain. Dev., 2024, 26 DOI:10.1016/j.gsd.2024.101178.
  28. Z. Luo, J. Dong and J. Hu, Optimization of Resistance Spot Welding Quality Prediction Based on Improved Sparrow Search Algorithm for BPNN, Tianjin Daxue Xuebao, Ziran Kexue Yu Gongcheng Jishuban, 2024, 57(5), 445–451 Search PubMed.

This journal is © The Royal Society of Chemistry 2026