Dongsheng Wang a, Ao Li a, Yicong Yuan a, Tingjun Zhang a, Liang Yu *a and Chaoqun Tan *b
aCollege of Automation & College of Artificial Intelligence, Nanjing University of Posts and Telecommunications, Nanjing 210023, China. E-mail: Liang.yu@njupt.edu.cn
bSchool of Civil Engineering, Southeast University, Nanjing 210096, China. E-mail: tancq@seu.edu.cn
First published on 6th March 2025
Urban water treatment plants are among the largest energy consumers in municipal infrastructure, imposing significant economic burdens on their operators. This study employs a data-driven personalized federated learning-based multi-agent attention deep reinforcement learning (PFL-MAADRL) algorithm to address the intake scheduling problem of three water intake pumping stations in urban water treatment plants. Personalized federated learning (PFL) is combined with long short-term memory (LSTM) modeling to create environment models for water plants, focusing on energy consumption, reservoir levels, and mainline pressure. The average accuracies of PFL-based LSTM (PFL-LSTM) models are 0.012, 0.002, and 0.002 higher than those of the LSTM model in the three water plants. Evaluation metrics were established to quantify the effectiveness of each pumping station's energy-efficient scheduling, considering constraints such as reservoir water levels and mainline pressure. The results indicate that the proposed algorithm performs robustly under uncertainties, achieving a maximum energy consumption reduction of 10.6% compared to other benchmark methods.
Water impact: Urban water treatment plants are among the largest energy consumers within municipal infrastructure, imposing significant economic burdens on water treatment plant operators. In this study, an algorithm based on personalized federated learning and multi-agent attention deep reinforcement learning (PFL-MAADRL) is employed to address the intake scheduling problem of multiple water intake pumping stations (MWIPSs) in urban water treatment plants. The results indicate that the proposed algorithm demonstrates robust performance against uncertainties and achieves a maximum energy consumption reduction of 10.6% compared to other benchmark methods.
Numerous methods have been developed to optimize WIPSs in water treatment plants, addressing the significant energy consumption associated with these systems, such as harmony search (HS),6 bi-objective optimization (BOO),7 particle swarm optimization (PSO),8 ant colony optimization (ACO),8 and enhanced cooperative distributed model predictive control (EC-DMPC).9 For instance, the HS method regulated the opening of pressure-reducing valves in the WDN to reduce overpressure at network nodes, achieving a leakage recovery rate of approximately 45%, although energy savings were not investigated.6 The BOO method, by contrast, designed real-time pressure control regulators for distribution networks that balance performance and cost.7 Combining the PSO and ACO algorithms for WDNs led to optimal operation scheduling of pressure valves, resulting in a 32.6% improvement in the average reliability index and over a 31% reduction in the average leakage rate.8 Additionally, an EC-DMPC strategy maintained the water supply pressure uniformly near its lower limit, ensuring consistent customer pressure despite changes in water demand and thereby reducing both leakage and energy consumption.9 Despite this significant progress, some limitations remain. Firstly, when applied to complex problems spanning multiple waterworks, existing optimization algorithms for controlling pipe network pressure, reservoir level, and energy consumption in a WDN often fall into local optima, resulting in suboptimal solutions.6–9 Secondly, these algorithms are computationally demanding and often consume large amounts of computational resources.6–8 Thirdly, the inconsistent quality of the solutions they produce can lead to less robust results, further limiting their effectiveness in practical applications.9 Finally, these methods rely on centralized control, which cannot effectively coordinate the operations of multiple pump stations, resulting in poor overall optimization, particularly in terms of information sharing and collaboration. Moreover, the varying data structures, quality, and types across different pump stations and water plants make it difficult for traditional methods to handle such heterogeneous data, limiting their application across different devices and environments.6–9
Compared with traditional research methods, learning-based optimization methods for WDNs offer distinct advantages by eliminating the need for uncertain parameters or explicit system models. Reinforcement learning (RL)10,11 and deep reinforcement learning (DRL)12–14 are prominent examples. In particular, DRL has attracted researchers due to its superior representational capacity and its ability to make informed decisions under uncertainty.15,16 For instance, a model-free RL-based approach has been used to control pressure in the water supply network, effectively reducing pressure within the WDN.15 However, this approach did not address WDN withdrawals and reservoir levels, nor did it study energy savings. A knowledge-assisted, approximate-strategy-based optimization method was used to optimize pumping unit scheduling, satisfying pressure constraints under time-varying water demand.16 Although the optimal policy derived from RL maps the current network state to pump actions without future water demand information, it does not consider the storage function of reservoir levels in the operational optimization of WDNs.15,16 Moreover, the spatiotemporal coupling of rewards over high-dimensional discrete actions limits the applicability of these methods for efficient scheduling of MWIPSs in water treatment plants. Additionally, the data structures, quality, and types often vary between pump stations, making it difficult for RL algorithms to handle such heterogeneous data directly. When dealing with multi-agent collaboration tasks, RL methods also perform poorly and cannot effectively promote cooperation among multiple agents, leading to suboptimal decisions.15,16
The aim of this study is to optimize the energy consumption of MWIPSs under water supply uncertainty, considering constraints such as reservoir levels, mainline pressure, and pressure variation values. A data-driven personalized federated learning-based multi-agent attention deep reinforcement learning (PFL-MAADRL) control method is proposed. This approach utilizes the personalized federated learning (PFL) technique to facilitate the sharing and learning of information among different intake pumping stations, thereby enhancing the accuracy of the overall environment model. In the pump station scheduling problem, employing multi-agent attention deep reinforcement learning (MAADRL) on top of a learned environment model proves effective in tackling complex and dynamic scheduling scenarios: by leveraging adaptive learning among agents, the system can optimize scheduling strategies and thereby enhance overall efficiency. Traditional physical models, designed for simple and stable systems, face difficulties in managing complex interactions and dynamic changes. In contrast, the multi-agent model, through agent collaboration and self-learning, is capable of adapting to evolving conditions and optimizing multiple scheduling objectives, thus improving the system's flexibility and responsiveness to unforeseen events. Concurrently, by employing MAADRL, each intake pumping station acts as an agent that interacts with the environment, learns, and optimizes its scheduling strategy, ultimately minimizing the energy consumption of the entire system.
To facilitate performance comparisons, five benchmark methods for dynamically co-regulating the water intake Q_{i,t} (m³ h⁻¹) were included. The details are as follows. (1) Rules 1 and 2: two rule-based (RB) baseline methods. In the RB scheme, water withdrawals are adjusted by constraining the range of mainline pressures; specifically, if p_{i,t} ≥ p_i^{max} − ϕ_i, then Q_{i,t} = Q_{i,t−1} − ε_i; else if p_{i,t} ≤ p_i^{min} + ϕ_i, then Q_{i,t} = Q_{i,t−1} + ε_i; otherwise Q_{i,t} = Q_{i,t−1}, where ϕ_i = 0.03 MPa and 1 ≤ i ≤ N (a sketch of this rule is given below). The two distinct RB schemes considered are given in Table S4.† (2) MAAC: the multi-actor-attention-critic algorithm.17 (3) Greedy: this scheme makes decisions at each time slot based only on current information and optimizes the objective function for the current time slot while adhering to the constraints. (4) DQN: this scheme controls each WIPS independently using the Deep Q-Network method.18 (5) Manual scheduling: this scheme relies primarily on the experience and intuition of water plant engineers to make scheduling decisions.
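For illustration only, the RB rule above can be expressed as a short function. The step size ε_i, the default bounds, and the function name are placeholders (the values actually used per scheme are listed in Table S4†); this is a minimal sketch, not the exact implementation.

```python
def rb_adjust_intake(Q_prev, p, p_min, p_max, phi=0.03, eps=100.0,
                     Q_min=0.0, Q_max=float("inf")):
    """Rule-based adjustment of one WIPS's water intake Q (m3/h) from its
    mainline pressure p (MPa). eps (epsilon_i) and the bounds are placeholders."""
    if p >= p_max - phi:        # pressure near the upper limit -> withdraw less
        Q = Q_prev - eps
    elif p <= p_min + phi:      # pressure near the lower limit -> withdraw more
        Q = Q_prev + eps
    else:                       # pressure comfortably inside the band -> keep the intake
        Q = Q_prev
    return min(max(Q, Q_min), Q_max)   # keep the new intake within its physical range
```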
Four key performance metrics were defined to comprehensively evaluate the MWIPS scheduling algorithm's performance. The average energy consumption per slot for each WIPS (AEC in kW h), the average pressure violation per slot for each WIPS (APV in MPa), and the average pressure variation violation per slot for each WIPS (APVV in MPa) are defined in eqn (S10)–(S12).† Additionally, the average reservoir level violation per slot for each WIPS (ALV in m) is defined in eqn (S13).† These metrics provide a detailed assessment of the algorithm's effectiveness in managing water intake and maintaining system stability. To demonstrate the accuracy of the proposed personalized federated learning-based long short-term memory (PFL-LSTM) models, the mean relative error (MRE) was used as a performance metric, defined as MRE = (1/n) Σ_{i=1}^{n} |ŷ_i − y_i| / ŷ_i, where ŷ_i denotes the actual value, y_i denotes the predicted value and n denotes the number of samples.
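As a minimal illustration, the MRE above can be computed as follows (a NumPy-based sketch; the sample values in the comment are made up):

```python
import numpy as np

def mean_relative_error(y_actual, y_pred):
    """MRE = (1/n) * sum(|y_actual - y_pred| / y_actual), following the definition above."""
    y_actual = np.asarray(y_actual, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_actual - y_pred) / y_actual))

# e.g. mean_relative_error([3.2, 3.4, 3.1], [3.1, 3.5, 3.0]) ~= 0.031 (illustrative values)
```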
Q_i^{min} ≤ Q_{i,t} ≤ Q_i^{max}, 1 ≤ i ≤ N, ∀t, (1)
The reservoir acts as a buffer for the system, storing excess water when the water supply is insufficient or demand surges, helping to balance the system load and prevent water supply interruptions. During periods of lower demand, the reservoir water level rises, ensuring a reliable water supply during peak demand periods. To prevent water shortages or spills, it is necessary to maintain the reservoir's water level within a reasonable range. li,t (m) represents the reservoir level of WIPS i at slot t.
l_i^{min} ≤ l_{i,t} ≤ l_i^{max}, ∀t, (2)
In eqn (2), l_i^{max} (m) and l_i^{min} (m) denote the upper limit and lower limit of the reservoir level of WIPS i, respectively.
The water level in the reservoir at the end of each time slot for WIPS i is influenced by various factors. These factors include the water level in the reservoir at the end of the previous time slot for each WIPS and the supplied water volume at time slot t for WIPS i (denoted as wi,t (m3 h−1)). It should be noted that these factors are indirectly related to the mainline pressure, represented as pi,t (MPa). Therefore, li,t can be expressed as eqn (3).
l_{i,t} = Ω_t(l_{i,t−1}, Q_{i,t}, p_{i,t}, w_{i,t}), ∀t, (3)
In actual operation, the design of the mainline pressure for the WIPS is determined based on engineering requirements and equipment performance. Therefore, ensuring that the mainline pressure remains within a reasonable range is necessary. Furthermore, excessive fluctuations in mainline pressure can harm the mainline's service life, safety, and reliability. Thus, the mainline pressure p_{i,t} should be maintained within a reasonable range, and the change in mainline pressure between adjacent time slots should be limited by a value denoted as p_v^{max} (MPa) in eqn (5):
p_i^{min} ≤ p_{i,t} ≤ p_i^{max}, ∀t, (4)

|p_{i,t} − p_{i,t−1}| ≤ p_v^{max}, ∀t, (5)

p_{i,t} = Θ_t(p_{i,t−1}, Q_{i,t}, l_{i,t}), ∀t, (6)
(P1): min Σ_t Σ_{i=1}^{N} Φ_{i,t}(·) subject to constraints (1)–(6), (7)

where Φ_{i,t}(·) (kW h) denotes the energy consumption of WIPS i at time slot t.
However, there are several challenges in solving the optimization problem P1. Firstly, there are temporally-coupled constraints, including eqn (3), (5) and (6). For instance, in eqn (3), the reservoir level li,t at the end of time slot t depends on the reservoir level at the end of the previous time slot t − 1. Secondly, uncertainty in the supplied water demand wi,t complicates the problem. Thirdly, it is difficult to obtain accurate and explicit model parameters such as Ωt (·) (m), Θt (·) (MPa), and Φi,t (·) (kW h). Given these challenges, traditional model-based methods are inadequate for addressing them. Therefore, the problem was reformulated as a Markov game and an efficient data-driven algorithm was proposed to solve it.
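As an illustration of the data-driven route taken here, unknown transition functions such as Ω_t(·) or Θ_t(·) can be approximated by LSTM surrogates trained on historical operating data. The sketch below assumes PyTorch and uses illustrative feature and layer sizes; it is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class EnvLSTM(nn.Module):
    """Sketch of a learned surrogate for a transition function such as
    l_t = Omega_t(l_{t-1}, Q_t, p_t, w_t). Sizes are illustrative."""
    def __init__(self, n_features=4, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)      # predicts the next value of the target

    def forward(self, x):                          # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])            # prediction from the last hidden state

# One such model can be trained per target (reservoir level, mainline pressure,
# energy consumption) and per WIPS, and later refined with PFL (Algorithm S1).
```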
In this paper, each agent i (1 ≤ i ≤ N) represents a water withdrawal controller, and the environment encompasses all interactions with the agents. The objective of each agent is to maximize the expected sum of discounted future rewards, given the state s_t ∈ S and action a_t = (a_{1,t}, …, a_{N,t}), i.e., E[Σ_{k=0}^{∞} γ^k r_{t+k+1}], where γ ∈ [0, 1) is the discount factor. The components of the Markov game model proposed in this paper are defined as follows.
Environment state S: for WIPS i, agent i takes action ai,t based on local observations oi,t to satisfy the constraints of reservoir level li,t and mainline pressure pi,t. Additionally, there exists a relationship between the reservoir level li,t and the water supply wi,t. Hence, the water supply wi,t should be included as part of the global state st. Based on this analysis, the local observation for agent i at time slot t is designed as (t′, li,t, pi,t, wi,t), where t′ (h) denotes the time slot index within a day, i.e., when τ = 1 h, t′ = mod(t, 24). Considering the local observations of all agents at time slot t, then: ot = (o1,t, …, oN,t). For simplicity, the global state st is chosen to be ot.
Action: to facilitate the training of agents related to WIPS i, the action of agent i is defined as a_{i,t} = β_{i,t}, where β_{i,t} (m³ h⁻¹) denotes the water intake adjustment value of agent i and is chosen from the discrete set {−400, −300, −200, −100, 0, 100, 200, 300, 400}, 1 ≤ i ≤ N. Note that to ensure the new water intake is a valid value, the following rule is adopted: Q_{i,t} = min{max{Q_{i,t−1} + β_{i,t}, Q_i^{min}}, Q_i^{max}}, 1 ≤ i ≤ N. For simplicity, the joint action of all agents can be written as a_t = (β_{1,t}, …, β_{N,t}).
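For concreteness, the observation and action definitions above can be written as the following sketch (function names are illustrative, not from the paper):

```python
# Discrete set of intake adjustments beta_{i,t} in m3/h, as defined above.
BETAS = [-400, -300, -200, -100, 0, 100, 200, 300, 400]

def local_observation(t, level, pressure, supply):
    """o_{i,t} = (t', l_{i,t}, p_{i,t}, w_{i,t}) with t' = mod(t, 24) for tau = 1 h."""
    return (t % 24, level, pressure, supply)

def apply_action(Q_prev, action_idx, Q_min, Q_max):
    """Map agent i's discrete action index to a valid new intake Q_{i,t}."""
    return min(max(Q_prev + BETAS[action_idx], Q_min), Q_max)   # clip to [Q_min, Q_max]
```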
Reward function: when the environment state transitions from s_{t−1} to s_t due to the combined action a_{t−1}, a reward r_t is provided by the environment. Our objective is to minimize the total energy consumption of WIPSs while adhering to constraints related to reservoir levels and mainline pressures. The reward function comprises penalties for energy consumption, violations of reservoir level boundaries, violations of mainline pressure at time slot t, and deviations in mainline pressure differences, which are defined in eqn (S1)–(S4).† Taking these four parts into consideration, the reward of each agent i can be designed as in eqn (S5).† In eqn (S5),† α_{i,1} (in kW h m⁻¹), α_{i,2} (in kW h MPa⁻¹), and α_{i,3} (in kW h MPa⁻¹) are positive weight coefficients.
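The exact reward is given in eqn (S5) of the ESI;† the snippet below is only a structural sketch that combines the energy term with weighted penalties for level, pressure, and pressure-variation violations. The default weights are placeholders, not the values used in the paper.

```python
def agent_reward(energy_kwh, level, pressure, dp,
                 l_min, l_max, p_min, p_max, dp_max,
                 alpha1=1.0, alpha2=1.0, alpha3=1.0):
    """Structural sketch of the per-agent reward (see eqn (S5) for the exact form):
    negative energy consumption minus weighted constraint-violation penalties."""
    level_viol = max(l_min - level, 0.0) + max(level - l_max, 0.0)        # m
    press_viol = max(p_min - pressure, 0.0) + max(pressure - p_max, 0.0)  # MPa
    dp_viol = max(abs(dp) - dp_max, 0.0)                                  # MPa
    return -(energy_kwh + alpha1 * level_viol + alpha2 * press_viol + alpha3 * dp_viol)
```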
Data from different water plants may exhibit significant variations in value ranges and distributions. Specifically, WIPS 1 variable ranges may be larger, and the challenges it faces often include larger fluctuations, different water sources, and more complex environmental conditions, while WIPS 2 variable ranges are smaller, with more consistent water sources and more stable environmental changes. These differences in data consistency and distribution make it difficult for traditional centralized training methods to address the diverse needs of these agents, especially in terms of privacy protection and communication efficiency. Therefore, a method that uses PFL21 is proposed to facilitate information sharing and learning between different water intake pumping stations, thereby improving the accuracy of the overall environmental model, as shown in Algorithm S1.† As is shown in lines 1–3 of the algorithm, the inputs and outputs are first defined, and the global personalized model parameters are initialized. During each round of federated learning (FL), denoted as t, a subset St is selected from all the clients to participate in the current round of training. The function ClientUpdate, depicted in lines 11–19, is then invoked to obtain the local model parameter wlocal and the personalized model parameter wpersonalc in parallel for each client. Subsequently, the local model parameters within the subset are aggregated, and the global model parameters are updated. This is achieved using the function Aggregate, as depicted in lines 20–25, resulting in the updated global model parameters wglobal as shown in lines 3–10. Finally, the local fine-tuning process corresponds to the ClientUpdate function in Algorithm S1.†
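A minimal sketch of the Aggregate and ClientUpdate steps described above is given below, assuming PyTorch models. Client sampling, learning-rate schedules, and the exact personalization procedure follow Algorithm S1† and are simplified here; all function names are illustrative.

```python
import copy
import torch

def aggregate(local_state_dicts, weights=None):
    """Weighted average of the clients' local parameters (the Aggregate step)."""
    n = len(local_state_dicts)
    weights = weights or [1.0 / n] * n
    global_state = copy.deepcopy(local_state_dicts[0])
    for key in global_state:
        global_state[key] = sum(w * sd[key] for w, sd in zip(weights, local_state_dicts))
    return global_state

def client_update(model, global_state, loader, loss_fn, lr=1e-3,
                  local_epochs=1, ft_epochs=1):
    """Sketch of ClientUpdate: start from the global model, train on local data to get
    w_local (sent back for aggregation), then fine-tune further to obtain w_personal
    (kept by the client as its personalized environment model)."""
    def train(m, epochs):
        opt = torch.optim.Adam(m.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss_fn(m(x), y).backward()
                opt.step()

    model.load_state_dict(global_state)
    train(model, local_epochs)
    w_local = copy.deepcopy(model.state_dict())
    train(model, ft_epochs)                    # local fine-tuning (personalization)
    w_personal = copy.deepcopy(model.state_dict())
    return w_local, w_personal
```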
In the MWIPS system, the personalization models were used as the system's environment models. To efficiently train MWIPS DRL agents, the MAAC algorithm that merges an attention mechanism with a soft actor-critic was employed. The algorithm exhibits excellent performance compared to other MADRL algorithms. The paper's emphasis is on MWIPSs that necessitate coordination among several agents. This coordination facilitates task decomposition and boosts the scheduling algorithms' efficiency. The objective is to minimize energy consumption in the MWIPS system by controlling water withdrawal.
In order to realize cooperative actions among agents, an attention mechanism is introduced.22 This mechanism calculates the current agent's state-action value function by considering the contributions of other agents. In addition, to promote exploration during training, the state-action values are supplemented with entropy rewards when updating both the actor network and the critic network. Specifically, the critic network is updated by employing the joint loss function outlined in eqn (S6).†
In eqn (S6),† D denotes the experience replay buffer, in which past system transitions (i.e., tuples (o, a, õ, r)) are stored. y_i is given in eqn (S7).† In eqn (S7),† the parameters of the target actor network of each agent i are used, and the term −φ log π_i(ã_i|õ_i) is related to the entropy of π_i(ã_i|õ_i); the temperature parameter φ governs the balance between maximizing entropy and maximizing the reward. Then, the weight parameters of the actor network are updated according to the policy gradient method. Specifically, the policy gradient is calculated as shown in eqn (S8).† In eqn (S8),† the term ϱ_i(o_i, a_i) is given in eqn (S9).† In eqn (S9),† \i denotes the set of agents other than i, and the expected state-action value terms can be viewed as baselines, which indicate whether the current action will result in an increase in the expected return.
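To make the attention step concrete, the sketch below shows one way agent i's critic can weight embeddings of the other agents' observation–action pairs; dimensions, layer choices, and names are illustrative and do not reproduce the exact MAAC network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCritic(nn.Module):
    """Sketch of the attention step: agent i's Q-value is computed from its own
    (o_i, a_i) embedding plus an attention-weighted sum of the other agents'
    embeddings (sizes are illustrative)."""
    def __init__(self, obs_act_dim, embed_dim=32):
        super().__init__()
        self.embed = nn.Linear(obs_act_dim, embed_dim)
        self.query = nn.Linear(embed_dim, embed_dim, bias=False)
        self.key = nn.Linear(embed_dim, embed_dim, bias=False)
        self.value = nn.Linear(embed_dim, embed_dim, bias=False)
        self.q_head = nn.Linear(2 * embed_dim, 1)

    def forward(self, oa_i, oa_others):          # oa_i: (B, D); oa_others: (B, N-1, D)
        e_i = self.embed(oa_i)                   # (B, E)
        e_j = self.embed(oa_others)              # (B, N-1, E)
        logits = torch.einsum('be,bne->bn', self.query(e_i), self.key(e_j))
        attn = F.softmax(logits / e_i.shape[-1] ** 0.5, dim=-1)      # attention weights
        context = torch.einsum('bn,bne->be', attn, self.value(e_j))  # others' contribution
        return self.q_head(torch.cat([e_i, context], dim=-1))        # Q_i(o, a)
```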
The training process for MWIPS DRL agents is shown in Algorithm S2.† At each time slot t, each WIPS agent i interacts with the MWIPS scheduling environment to determine the optimal action ai,t. The algorithm's inputs and outputs are defined in lines 1 and 2, and the environment and parameters are initialized in lines 3–7. Before each episode Y starts, the environment is reset, and each WIPS agent i receives an initial observation state oi,1 (lines 8 and 9). During the interaction, each agent accumulates experience transitions (ot, at, ot+1, rt+1), which are stored in an experience replay buffer D following a first-in-first-out principle (lines 10–13). During training, a batch of experience data is randomly sampled from D to train the agent's neural network model (lines 14–18). The experience replay method accelerates training and improves learning by reusing stored data. The actor and critic networks are updated (lines 19 and 20), followed by updating the target networks' weight parameters (line 21).
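The interaction-and-update cycle of Algorithm S2† can be summarised as the following sketch, in which env (driven by the PFL-LSTM environment models), agents, and the three update functions are placeholders for components defined elsewhere.

```python
import random
from collections import deque

def train(env, agents, num_episodes, steps_per_episode, batch_size,
          update_critics, update_actors, update_targets):
    """Sketch of the training loop of Algorithm S2; env, agents, and the update
    functions are placeholders supplied by the caller."""
    buffer = deque(maxlen=100_000)                   # FIFO experience replay buffer D
    for _ in range(num_episodes):
        obs = env.reset()                            # each agent i receives o_{i,1}
        for _ in range(steps_per_episode):
            acts = [agent.act(o) for agent, o in zip(agents, obs)]
            next_obs, rewards = env.step(acts)
            buffer.append((obs, acts, next_obs, rewards))
            obs = next_obs
            if len(buffer) >= batch_size:
                batch = random.sample(buffer, batch_size)
                update_critics(agents, batch)        # joint critic loss, eqn (S6)
                update_actors(agents, batch)         # policy gradient, eqn (S8)
                update_targets(agents)               # soft-update target networks
```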
As outlined in Algorithm S3,† once training is complete, the learned policies can be tested in practice. The proposed algorithm facilitates real-time decision making based on the current state of the MWIPS system without requiring future water demand forecasts. Additionally, the algorithm's computational complexity is minimal, relying only on the forward propagation of deep neural networks.
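Correspondingly, execution (Algorithm S3†) reduces to one forward pass of each trained actor per time slot, as in this minimal sketch (the agent interface is a placeholder):

```python
def execute_step(agents, observations):
    """Decentralized execution: each trained actor maps its local observation
    (t', l_{i,t}, p_{i,t}, w_{i,t}) to an intake adjustment with a single forward pass;
    no demand forecast or inter-agent communication is required."""
    return [agent.act(obs) for agent, obs in zip(agents, observations)]
```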
MRE of the LSTM and PFL-LSTM environment models:

Schemes | WIPS 1 Energy | WIPS 1 Level | WIPS 1 Pressure | WIPS 2 Energy | WIPS 2 Level | WIPS 2 Pressure | WIPS 3 Energy | WIPS 3 Level
---|---|---|---|---|---|---|---|---
LSTM | 0.064 | 0.022 | 0.052 | 0.047 | 0.024 | 0.015 | 0.060 | 0.014
PFL-LSTM | 0.047 | 0.025 | 0.049 | 0.033 | 0.017 | 0.013 | 0.056 | 0.013
In this paper, each WIPS represents a client. Each client has its own private local dataset, and data scarcity limits its ability to train an effective local model. To overcome this limitation, FL is employed to obtain better-performing local models for the WIPSs. However, the global models produced by traditional federated learning may not perform well on an individual WIPS due to heterogeneous local data distributions. PFL addresses the challenges of data heterogeneity and personalized scenarios by customizing global models to meet the specific needs of each WIPS.23 PFL-LSTM enables multiple WIPSs to share model knowledge without exchanging raw data, effectively drawing on more diverse datasets for training to create models tailored to their unique requirements. This approach enhances the generalization ability and accuracy of the models within the MAADRL environment. Moreover, improving model accuracy assists the multi-agent system in learning more effective strategies, thereby boosting the performance of the MAADRL algorithm.
Schemes | WIPS 1 AEC (kW h) | WIPS 1 APV (MPa) | WIPS 1 APVV (MPa) | WIPS 1 ALV (m) | WIPS 2 AEC (kW h) | WIPS 2 APV (MPa) | WIPS 2 APVV (MPa) | WIPS 2 ALV (m) | WIPS 3 AEC (kW h) | WIPS 3 APV (MPa) | WIPS 3 APVV (MPa) | WIPS 3 ALV (m)
---|---|---|---|---|---|---|---|---|---|---|---|---
Manual scheduling | 969.6 | 0 | 0 | 2.8 × 10⁻³ | 1271.6 | 0 | 1.46 × 10⁻⁵ | 0 | 1296.8 | 0 | 0 | 0
Rule-1 | 892.4 | 0 | 0 | 0 | 1361.2 | 0 | 0 | 5.3 × 10⁻⁵ | 1304.0 | 1.0 × 10⁻⁴ | 9.5 × 10⁻⁴ | 0
Rule-2 | 886.2 | 0 | 0 | 0 | 1360.8 | 0 | 0 | 5.3 × 10⁻⁵ | 1302.4 | 6.3 × 10⁻⁴ | 4.6 × 10⁻⁴ | 0
MAAC | 666.7 | 0 | 0 | 0 | 1287.5 | 0 | 0 | 1.0 × 10⁻⁴ | 1299.0 | 1.3 × 10⁻³ | 8.2 × 10⁻⁵ | 0
Greedy | 654.8 | 0 | 0 | 2.6 × 10⁻² | 1207.9 | 0 | 0 | 1.1 × 10⁻⁶ | 1456.8 | 0 | 0 | 0
DQN | 665.6 | 1.8 × 10⁻⁴ | 3.4 × 10⁻⁴ | 0 | 1293.9 | 0 | 0 | 1.8 × 10⁻⁴ | 1402.1 | 3.9 × 10⁻² | 1.7 × 10⁻⁴ | 0
Proposed | 633.2 | 0 | 0 | 1.5 × 10⁻⁴ | 1259.2 | 0 | 0 | 1.2 × 10⁻⁴ | 1291.3 | 0 | 1.9 × 10⁻⁴ | 1.5 × 10⁻⁵
Compared with rule-1, rule-2, the MAAC algorithm, the greedy algorithm, the DQN algorithm, and manual scheduling, the proposed algorithm saves 10.4%, 10.6%, 2.2%, 4.1%, 5.3% and 10.0% of energy consumption, respectively. This is because the proposed algorithm can intelligently select the most energy-efficient water withdrawal methods under different operating conditions. Fig. 3 and 4(a)–(d) and S1(a)–(d)† describe the performance details among all schemes for MWIPSs. The proposed algorithm ensures compliance with mainline pressure and reservoir level constraints while reducing the overall energy consumption of the MWIPS system. Compared with rule-1, rule-2, the greedy algorithm, the DQN algorithm, and manual scheduling, the proposed algorithm utilizes the attention mechanism, enabling agents to focus on the most relevant information from both the environment and the actions of other agents. By selectively attending to critical data, the algorithm enhances coordination between agents, leading to better decision-making, reduced energy consumption, and better compliance with system constraints, thus improving the overall performance and robustness of the system in dynamic and uncertain environments. The main difference between the proposed algorithm and the MAAC method is the introduction of the PFL mechanism.23 Specifically, by integrating PFL with MAAC, the knowledge of WIPS agents can be shared without compromising user privacy or data security. This addresses the issue of poor model prediction accuracy due to data heterogeneity. More accurate models can help the MAADRL algorithm select the optimal policy, thereby improving the generalization ability and accuracy of the system.
Fig. 3 Performance details among all schemes for WIPS 1. (a) Energy consumption, (b) main pressure, (c) level, and (d) pressure difference of the main pipe.
Fig. 4 Performance details among all schemes for WIPS 2. (a) Energy consumption, (b) main pressure, (c) level, and (d) pressure difference of the main pipe.
The algorithm solves the cooperative competition problem in multi-agent systems through centralized training and decentralized execution.27 During centralized training, the algorithm uses global information to optimize the strategies of an individual agent, and enables each agent to selectively focus on the critical information of other agents through the attention mechanism,17 thereby enhancing the accuracy of value function estimation. In the execution phase, the agents operate independently. This feature provides the proposed algorithm with substantial advantages in practical applications.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ew00685b
This journal is © The Royal Society of Chemistry 2025 |