Dewei Wang *a, Jie Bao *a, Miguel A. Zamarripa-Perez b, Brandon Paul b, Yunxiang Chen a, Peiyuan Gao a, Tong Ma ac, Alexander A. Noring b, Arun K. S. Iyengar b, Daniel T. Schwartz d, Erica E. Eggleton d, Qizhi He ae, Andrew Liu a, Olga A. Marina a, Brian Koeppel a and Zhijie Xu a
aPacific Northwest National Laboratory, Richland, WA 99352, USA. E-mail: dewei.wang@pnnl.gov; jie.bao@pnnl.gov
bNational Energy Technology Laboratory, Pittsburgh, PA 15236, USA
cNortheastern University, Boston, MA 02115, USA
dUniversity of Washington, Seattle, WA 98195, USA
eUniversity of Minnesota, Minneapolis, MN 55455, USA
First published on 21st September 2023
Computer-aided process engineering and conceptual design have played a critical role in energy and chemical engineering for decades. Conventional computer-aided process and system design generally starts with process flowsheets developed through experience, which relies heavily on subject matter expertise. These widely applied approaches require significant human effort to provide an initial drafted flowsheet, alternative connections, or a set of well-defined heuristics. These requirements not only limit the flexibility of the flowsheet design process, but also make the system design highly reliant on the engineer's experience and expertise. In this study, a novel reinforcement learning (RL) based automated system for conceptual design is introduced and demonstrated on the Institute for the Design of Advanced Energy System (IDAES) Integrated Platform. IDAES is an open-source platform with extensible libraries of dynamic unit operations and thermophysical property models. It provides the capability of optimizing energy and chemical process flowsheets with state-of-the-art solvers and solution techniques. The RL approach provides a generic tool for identifying process configurations and significantly decreases the dependence on human intelligence for energy and chemical systems conceptual design. An artificial intelligence (AI) agent performs the conceptual design by automatically deciding which process-units are necessary for the desired system, picking the process-units from the candidate process-units pool, connecting them together, and optimizing the operation of the system for the user-defined system performance targets, acting solely according to the reward system; the reward system, in turn, can incorporate the user's experience and knowledge to advance the training process. The AI agent automatically interacts with the physics-based system-level modeling and simulation toolset IDAES to guarantee that the system design is physically consistent. This study showcases the application of the RL-IDAES framework through two demonstration cases. These cases demonstrate the framework's capability of designing and optimizing complicated systems with high flexibility at affordable computing costs. To illustrate, designing the hydrodealkylation of toluene system from 14 candidate process-units yielded 123 feasible designs within 20 hours on a standard PC. Moreover, the framework's versatility is demonstrated by the ability to transfer a trained RL model to different training cases, thus enhancing the overall performance of the reinforcement learning process.
Computer-aided system design without a prior deterministic flowsheet is much more challenging.8 Superstructure optimization is one popular procedure that does not need a specific flowsheet9 but requires a large set of process alternatives,10,11 and it is often used with simplified mathematical representations of chemical operations. Another popular approach for non-deterministic automated flowsheet design is the evolutionary modification method. It needs an initial drafted flowsheet, which it analyzes and modifies, changing one or more connections at a time, until no further improvement in the flowsheet can be made.12 The third widely used automated flowsheet design approach is systematic generation, which creates a flowsheet sequentially by adding units one by one according to heuristics constructed from prior knowledge.13 Hence, the widely applied approaches mentioned above still require significant human effort, whether an initial drafted flowsheet, alternative connections, or a well-defined set of heuristics. These requirements not only limit the flexibility of flowsheet design, but also make the system design highly reliant on the engineer's experience and expertise.
Machine learning (ML) has developed rapidly in the past decades and has shown promising advantages in applications such as reduced-order modeling,14,15 image and video processing,16–19 and natural language processing.20 Among the various ML algorithms, reinforcement learning shows great potential as an alternative to human intelligence and creativity in tasks such as game playing and autonomous driving.21,22 RL focuses on how intelligent agents should take actions in an environment to maximize the cumulative reward.21 A key advantage of RL is that it does not need an existing training data set, such as labeled input–output pairs and/or a database of sub-optimal actions.21 Additionally, RL can learn from existing labeled examples and then be combined with unsupervised self-learning to accelerate the accumulation of knowledge.23 In recent years, pioneering efforts have been made in applying RL to the conceptual design of energy and chemical systems. Khan et al. used hierarchical reinforcement learning to search for optimal processing routes for hydrogen production and steam methane reforming.24,25 Gottl et al. applied RL to sequentially build synthesis flowsheets for a specific process problem.26–28
In this study, an RL-based automated conceptual design approach is introduced and demonstrated by interacting with a general energy system modeling platform, the Institute for the Design of Advanced Energy System (IDAES) Integrated Platform.24 The RL approach significantly decreases the reliance on human input for energy and chemical system design. The user selects the candidate process-units pool (CPP) from the IDAES module library, which provides all the available energy and chemical operations (e.g., phase change, temperature change, pressure change) that are allowed in the system. An artificial intelligence agent can then automatically decide which process-units are necessary for the desired system, pick the units from the CPP, connect them together, actively interact with the environment, and optimize the system design for the user-defined system performance targets. IDAES serves as the environment to guarantee that the system designs are physically consistent. IDAES provides extensive equation-based models of unit operations and thermophysical properties, as well as capabilities for optimizing process flowsheets with state-of-the-art solvers. The interactive framework is named RL-IDAES in this study. The proposed RL-IDAES framework is designed to be generic and can be adapted to different energy and chemical engineering systems, and the user can specify the system complexity and optimization objectives. Two case studies are presented in this manuscript to demonstrate the capability of RL-IDAES and the transferability of the trained AI agent among different conceptual designs.
IDAES includes a large process-unit model library of typical unit operations such as feeds, products, mixers, splitters, flash drums, heat exchangers, and stoichiometric reactors (Fig. 2). The library is continually being updated and includes advanced energy models such as solid oxide fuel cells, auto-thermal reforming, air separation units, steam cycle, boiler models, and steam turbines. Each model provides a set of equations describing a given operation (e.g., phase change, temperature change, pressure change, chemical reactions), and every model and equation is editable and extensible. The user can add, remove or modify variables and constraints for these unit models. This level of flexibility allows the RL-IDAES to be applied to a wide range of conceptual design problems, enabling modelers to develop and customize the flowsheet to their needs.29
By integrating the extensible model library and advanced optimization-based approaches, the IDAES platform can be used for designing novel, large-scale, complex systems with dynamic optimization under uncertainty. The optimization objective can be a thermophysical property (e.g., flow rate, temperature, reaction rate, etc.) or any complex continuous quantification parameter (e.g., annual revenue, product purity, etc.).
Fig. 3 Physics constraints energy or chemical systems must follow (H, R, C, F, O in the plot denote heater, reactor, compressor, flash drum and any other unit).
This immediate-reward system enables the RL-IDAES to evaluate any incomplete flowsheet; therefore, the flowsheet-building process does not have to be sequential. Because the AI agent is not learning promising routes to build optimal flowsheets but learning essential connections, it can handle complex and large-scale systems with multiple system inlets and recycling loops. The immediate-reward system also has a mechanism to control the “system complexity”, i.e., how complicated the user wants the RL-found designs to be. This function is realized by issuing a “no action” penalty when the agent picks the “no action” option from the action space; a higher penalty encourages the agent to take an “active” option instead of “no action”. Please note that the AI agent is blind to the rules in the immediate-reward system and the fast pre-screen process; it can only receive and learn from the feedback reward scores.
The AI agent takes actions solely according to the reward system. Note that although the proposed RL-IDAES is an automated conceptual design framework, it can always incorporate engineers' knowledge and experience by adding them to the immediate-reward or finalized-reward systems to advance the training process.
Initially, there is no connection between any process-units. The agent decides an action according to the reward evaluation of the current observation, such as connecting the raw material feed #1 to the mixer, shown as “Action 1” in Fig. 4(a). After “Action 1”, the current observation is updated to a new observation, shown as “Observation 2” in Fig. 4(a). An immediate-reward system is used to evaluate the connections in “Observation 2” and assign a reward, which is then sent back to the agent, as shown by the green dashed line in Fig. 4(a). The immediate-reward system and the associated evaluation of rewards are discussed in Section 2.1. Following this procedure, the candidate process-units can be connected one by one until a fully connected system flowsheet is acquired. Note that it is unnecessary to use all the units in the CPP for a completed system flowsheet design; for example, the flash (FL) in Fig. 4(a) is excluded. The agent is allowed to do nothing for any step of action, as shown by “Action 4” and “Action 6” in Fig. 4(a). Also, the agent does not have to build a flowsheet sequentially from the system inlets to the system products/exhausts.
At the beginning of training, the agent cannot distinguish between correct and incorrect connections and makes random selections with no bias or information, which usually causes the connected flowsheet to be physically infeasible or far from the user's expected system performance. The agent can learn and improve its selection of connections according to the reward system. By tracking the potential directions of increasing the finalized reward, the agent becomes well trained and finds optimized system flowsheet designs. Based on this training scheme, a deep Q value network (DQN) is applied in the RL framework in this study.
The action space is a 1-dimensional (1D) vector, and the elements in this vector represent all the outlets, as shown in Fig. 4(b). The value in each element is the Q value of picking a certain outlet. The Q value is defined as
Q = R + γ·max(Qnext),  (1)
where R is the immediate reward for the action, γ is the decay factor, and Qnext denotes the Q values predicted for the new observation after the action.
Because there are no existing prior system designs and/or data available to train the DQN, the RL has to generate the draft designs all by itself and use these draft designs to train itself. The main procedures are shown in Fig. 6. Fig. 6 (part 1) records the raw attempts of connecting process-units. According to the “Observation” in part 1, which is a 2D array as discussed in Section 2.1.2, action #3 is selected. This action index can be randomly picked or predicted by the DQN. A greedy factor ε, whose value is between 0 and 1, is used to control whether an action is randomly picked or DQN predicted. Before every action, a random number δ between 0 and 1 is generated. If δ > ε, an action is randomly picked; if δ < ε, the action is decided by the DQN prediction. ε gradually increases from 0 to 1 as RL training proceeds. This means that, at the beginning of the training, the DQN has limited knowledge or experience for predicting the correct action, so a random action is preferred. As training progresses, the action has an increasingly high probability of being determined by the DQN prediction.
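As a concrete illustration of this exploration schedule, the sketch below shows one way the ε-greedy selection could be written in Python; the function names (epsilon_schedule, select_action), the linear annealing of ε, and the dqn_predict callable (anything that maps an observation to the 1D vector of outlet Q values described above) are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_schedule(episode, total_episodes, eps_start=0.0, eps_end=1.0):
    """Greedy factor that grows from eps_start toward eps_end as training proceeds."""
    return eps_start + (eps_end - eps_start) * min(episode / total_episodes, 1.0)

def select_action(dqn_predict, observation, n_actions, epsilon):
    """Epsilon-greedy selection: random action if delta > epsilon, DQN prediction otherwise."""
    delta = rng.random()                 # random number delta between 0 and 1
    if delta > epsilon:                  # early in training epsilon is small, so this branch dominates
        return int(rng.integers(n_actions))
    q_values = dqn_predict(observation)  # 1D vector with one Q value per outlet (the action space)
    return int(np.argmax(q_values))
```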
An example is shown in Fig. 6 (part 1): action #3 stands for the “mixer outlet”, and it is applied to the 4th row of the observation array, so the mixer outlet is connected to the “heater inlet”, yielding the “New observation (after action)”. The “New observation” is then evaluated by a series of physics constraints for fast pre-screens and, if the connected system passes all the fast pre-screens, by IDAES system-level modeling and optimization.29 The fast pre-screens and the IDAES simulation and optimization provide the rewards of this action. For this sketched demonstration, the reward is 35, as shown in Fig. 6 (part 1). The details of the fundamental physics constraints and the IDAES simulation and optimization are discussed in Section 2.1. Whether this action is the final step for connecting a system is also recorded. In this sketched demonstration, it is not the final step yet, as shown in Fig. 6 (part 1); a process-unit is still available in the 5th row of the observation array, which is the “reactor inlet”.
By repeating the steps in part 1, a large number of raw attempts are recorded into “memory”, as shown in part 2 of Fig. 6. The “Observation” and “New observation” 2D arrays are flattened into 1D arrays in the “memory”, so all the data associated with one action are recorded in one row. The recorded “memory” is not infinitely large and is usually limited to around 10000 to 100000 rows. New raw-attempt data overwrite the oldest data in the memory. Because the newer data usually focus on connection attempts that can earn higher rewards, it is unnecessary to keep the older data, which are usually more random and less meaningful. This limited “memory” size can effectively increase the DQN training efficiency.
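A fixed-size memory of this kind can be sketched as a circular buffer in which each row stores a flattened (observation, action, reward, final-step flag, new observation) record; the class and attribute names below are illustrative assumptions, not the RL-IDAES code.

```python
import numpy as np

class ReplayMemory:
    """Fixed-size memory; new raw attempts overwrite the oldest rows."""
    def __init__(self, capacity, obs_size, rng=None):
        # one row = flattened observation + action + reward + final flag + flattened new observation
        self.rows = np.zeros((capacity, 2 * obs_size + 3), dtype=np.float32)
        self.capacity, self.cursor, self.count = capacity, 0, 0
        self.rng = rng or np.random.default_rng()

    def store(self, obs, action, reward, done, new_obs):
        row = np.hstack([obs.ravel(), action, reward, float(done), new_obs.ravel()])
        self.rows[self.cursor] = row                       # overwrite the oldest record
        self.cursor = (self.cursor + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def sample(self, batch_size):
        idx = self.rng.integers(0, self.count, size=batch_size)  # random small batch
        return self.rows[idx]
```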
It is not necessary to use all the data in memory to train the DQN, which could be computationally expensive. After the RL attempts to construct 10 to 100 system flowsheets (a user-defined number called the training interval), the DQN is trained. A small batch of “memory” is randomly picked for training; the batch size is usually defined as 30 times the training interval. In the small batch, each row of data needs to be converted to Q values, following the steps shown in Fig. 6 (part 3). Both the “Observation” and “New observation” arrays are sent to the DQN and DQN* for predicting the Q values. The structures of the DQN and DQN* are the same, but the parameters in DQN* are not trained; they are copied directly from the DQN with a delay. This delay mainly serves to stabilize the RL framework. As shown in the sketched demonstration in Fig. 6 (part 3), the predicted Q values for the 5 action options based on the “Observation” are [5, 50, 17, 20, 5], and the Q values for the “New observation” are [40, 12, 10, 15, 5], which are defined as Qnext. The maximum value in Qnext is 40. In this recorded row, the action reward is 35. Using eqn (1) with γ = 0.5, the new Q value is 35 + 0.5 × 40 = 55. In this recorded row, the action is 3, so this updated Q value of 55 replaces only the value 17 in the 3rd element of the Q vector, leaving the other values unchanged, as marked by the blue block in Fig. 6 (part 3). The new Q vector then becomes [5, 50, 55, 20, 5]. If the element in the recorded row indicating the “Final step” is “Yes”, meaning no further action is needed, there is no Qnext and eqn (1) simplifies to Q = R. Please note that the recorded action number is not necessarily the same as the index of the maximum Q value. In the sketched example in Fig. 6 (part 3), the maximum Q value is 50 at the 2nd index, but the recorded action is 3. There are two reasons for this. First, the action may have been randomly picked when it was recorded. Second, the DQN has been updated since the action was recorded, so its prediction differs from that of the older DQN.
By implementing the calculations in Fig. 6 (part 3), all the data in the batch memory can be converted to the format of observation and Q array, as shown in Fig. 6 (part 4). The Root Mean Squared Propagation (RMSProp) algorithm34,35 is used to optimize the parameters in the DQN, with a learning rate α = 0.01. RMSProp is one of the most popular optimizers for neural network training and provides a good balance between robustness and convergence speed. The observation and the Q array, as the inputs and outputs respectively, are used to train the DQN and update its parameters. By looping through parts 1 to 4, the DQN is kept updated, and the recorded system connections get closer and closer to the targeted, optimized system design.
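A minimal sketch of parts 3 and 4 is given below, assuming a PyTorch implementation (the authors do not state the framework) and assuming dqn is a torch.nn.Module that maps observation tensors to a Q vector; names such as train_on_batch are illustrative.

```python
import copy
import torch

gamma, lr = 0.5, 0.01
dqn_star = copy.deepcopy(dqn)                      # DQN*: delayed copy, not trained directly
optimizer = torch.optim.RMSprop(dqn.parameters(), lr=lr)

def train_on_batch(batch_obs, batch_action, batch_reward, batch_done, batch_new_obs):
    """Convert a sampled batch to Q targets (Fig. 6 part 3) and fit the DQN (part 4)."""
    q_target = dqn(batch_obs).detach().clone()     # start from the current predictions
    q_next = dqn_star(batch_new_obs).detach()      # Q values of the new observation (Qnext)
    # eqn (1): Q = R + gamma * max(Qnext); final steps use Q = R only
    updated = batch_reward + gamma * q_next.max(dim=1).values * (1.0 - batch_done)
    rows = torch.arange(q_target.shape[0])
    q_target[rows, batch_action.long()] = updated  # replace only the taken action's entry
    loss = torch.nn.functional.mse_loss(dqn(batch_obs), q_target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# DQN* parameters are refreshed from the DQN with a delay, e.g. every few training intervals:
# dqn_star.load_state_dict(dqn.state_dict())
```

For the worked example above (reward 35, max Qnext 40, γ = 0.5, action 3), this routine would write 55 into the 3rd element of the Q vector, exactly as described.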
C6H5CH3 + H2 → C6H6 + CH4  (2)
No. of units | Name | Feed | Product | Exhaust | Mixer | Compressor | Heater | Reactor | Flash | Splitter | Cooler | Expander
---|---|---|---|---|---|---|---|---|---|---|---|---
14 | HDA-14 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1
15 | HDA-15 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | |
16 | HDA-16 | 2 | 1 | 2 | 3 | 2 | 2 | 1 | 2 | 1 | |
17 | HDA-17 | 2 | 1 | 2 | 3 | 2 | 2 | 1 | 2 | 2 | |
18 | HDA-18 | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | |
19 | HDA-19 | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 1 |
20 | HDA-20 | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1
22 (All) | | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2
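For illustration, each CPP in the table above can be encoded as a plain mapping from unit type to count, from which the size of the inlet-by-outlet observation matrix follows. The port counts in PORTS and the variable names below are assumptions for this sketch; the exact bookkeeping (e.g., the 21 × 21 matrix discussed in Section 3.3) depends on the implementation.

```python
import numpy as np

# HDA-14 candidate process-units pool, taken from the first row of the table above
hda_14 = {"feed": 2, "product": 1, "exhaust": 2, "mixer": 2, "compressor": 1,
          "heater": 1, "reactor": 1, "flash": 1, "splitter": 1, "cooler": 1, "expander": 1}

# assumed (inlets, outlets) per unit type, e.g. a flash drum with one inlet and
# vapor/liquid outlets; these counts are illustrative, not taken from RL-IDAES
PORTS = {"feed": (0, 1), "product": (1, 0), "exhaust": (1, 0), "mixer": (2, 1),
         "compressor": (1, 1), "heater": (1, 1), "reactor": (1, 1), "flash": (1, 2),
         "splitter": (1, 2), "cooler": (1, 1), "expander": (1, 1)}

n_inlets = sum(PORTS[u][0] * n for u, n in hda_14.items())
n_outlets = sum(PORTS[u][1] * n for u, n in hda_14.items())

# observation: rows are inlets, columns are outlets; a 1 marks a built connection
observation = np.zeros((n_inlets, n_outlets), dtype=np.int8)
```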
In each iteration or episode, the AI agent moves from the top row to the bottom row, connecting one inlet to one outlet at each step. There are five kinds of actions, as shown in Fig. 8, and different actions are assigned different rewards ranging from around −1000 to 5000. The reward values are carefully adjusted to satisfy two main requirements. The first is that the rewards corresponding to a potentially successful system design are large enough to be identified by the AI agent. The second is that the gradient of the rewards for different actions is smooth enough that the AI agent can easily track the direction of improvement. The five kinds of actions (Fig. 8) are explained below, followed by a minimal sketch of the corresponding reward logic:
Fig. 8 Five kinds of actions. Case 1: ‘inactive’ action; Case 2: ‘no action’ with ‘active’ options; Case 3: ‘no action’ without ‘active’ options; Case 4: ‘active’ action; Case 5: episode end action.
Case 1: the agent takes an “inactive” action. Anytime the AI agent steps on an “inactive” spot (the blue spot in row 11), the action will be assigned the minimum reward plus a penalty (e.g., −1000 plus −400).
Case 2: the agent takes a “no action” action when “active” spots are in the row. In this case, the action will inherit the reward of the last action (Rlast) plus a penalty (e.g., −400). Please note that the penalty value is adjustable and impacts the system design complexity.
Case 3: the agent takes a “no action” action if there is no other available “active” spot in this row. In this case, the action will inherit the reward of the last action with no penalty.
Case 4: the agent takes an “active” action. In this case, the incomplete flowsheet at Step 6 (shown by the red dotted line) is evaluated by the fast pre-screens. The reward is initialized at 500 at the beginning of the fast pre-screens. Because the incomplete flowsheet violates Constraints 3, 5, 8, and 10, the initialized reward (500) plus the accumulated penalty (e.g., −700) gives the final reward (e.g., −200).
Case 5: at the end of each iteration or episode, if the complete flowsheet satisfies all the physics-constraint pre-screens, it is sent to IDAES for re-evaluation. As shown by the black dotted line in Fig. 8, the flowsheet can be solved by IDAES and has a feasible solution with a benzene purity of 75% and a flow rate of 0.225 mol s−1. In this example, the reward will be 5000 plus an extra reward associated with the benzene purity and flow rate.
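The sketch below summarizes how Cases 1 to 4 could be mapped to immediate rewards; the numeric values mirror the examples above, while the function signature and parameter names are illustrative assumptions. Case 5 is handled separately after the IDAES evaluation of the completed flowsheet.

```python
def immediate_reward(case, r_last, prescreen_penalty=0,
                     penalty=-400, min_reward=-1000, init_reward=500):
    """Immediate reward for one action, following Cases 1-4 above.
    Case 5 (a completed flowsheet passing all pre-screens) is rewarded after
    the IDAES evaluation, e.g. 5000 plus purity/flow-rate bonuses."""
    if case == 1:                               # stepped on an "inactive" spot
        return min_reward + penalty             # e.g., -1000 + (-400)
    if case == 2:                               # "no action" while "active" spots remain
        return r_last + penalty                 # inherit last reward, pay the adjustable penalty
    if case == 3:                               # "no action" with no "active" spot left
        return r_last                           # inherit last reward, no penalty
    if case == 4:                               # "active" action: run the fast pre-screens
        return init_reward + prescreen_penalty  # e.g., 500 + (-700) = -200
    raise ValueError("unknown case")
```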
The deep Q network includes four convolutional layers between the observation and the fully connected layers. The convolutional neural network (CNN) performs structure compression and can be turned on or off. It has been observed that the CNN helps the AI agent find more feasible designs, especially for large CPPs, at a reasonable extra computing cost. For the subsequent studies, “CNN structure compression” is always turned on. “System complexity” is controlled by the “no action” penalty and can vary from level 1 to 4 with four discrete “no action” penalties. In this section, “system complexity” is set at the highest level because the authors want the RL to involve as many units in the design as possible.
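A hedged PyTorch sketch of such a network is given below; the four convolutional layers follow the description above, but the channel counts, kernel sizes, and hidden widths are illustrative assumptions, as is the use_cnn switch that toggles the structure compression.

```python
import torch
import torch.nn as nn

class DQN(nn.Module):
    """Sketch of a DQN with optional CNN structure compression.
    Channel counts, kernel sizes and hidden widths are illustrative assumptions."""
    def __init__(self, obs_side, n_actions, use_cnn=True):
        super().__init__()
        if use_cnn:
            self.compress = nn.Sequential(          # four convolutional layers
                nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU())
            flat = 32 * ((obs_side + 3) // 4) ** 2  # spatial size after two stride-2 layers
        else:
            self.compress = nn.Flatten()
            flat = obs_side * obs_side
        self.head = nn.Sequential(                  # fully connected layers -> Q vector
            nn.Flatten(), nn.Linear(flat, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, obs):                         # obs: (batch, 1, obs_side, obs_side)
        return self.head(self.compress(obs))
```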
Seven RL-HDA training cases have been implemented with seven different CPPs. For each CPP, the RL agent can train from “zero” without loading any pre-trained model, with all parameters in the DQN randomly initialized. The results are shown in Fig. 9; most of the cases take no more than 20 hours on a personal desktop. Fig. 9(a) shows that the RL found thousands of unique designs passing the pre-screen process, meaning they satisfy all the fundamental physics constraints. Deep blue bars (“start from 0”) denote RL results trained from “zero”, and light blue bars (“restore HDA-20”) represent RL results trained by loading a model pre-trained with 20 process-units in the CPP, which is discussed in Section 3.2.2. Fig. 9(b) shows the number of unique feasible flowsheets found with each CPP. Orange bars (“start from 0”) denote RL results trained from “zero”, and yellow bars (“restore HDA-20”) denote RL results trained by restoring the pre-trained model, which is discussed in Section 3.2.2.
Fig. 9 RL results for HDA system with different CPPs. (a) the number of flowsheets passing pre-screen; (b) the number of feasible flowsheets.
For example, with 14 units in the CPP, the “start from zero” RL found 373 unique flowsheets passing the pre-screen process, and 123 are feasible. The feasible/pass-pre-screen ratio is about 1/3. As the AI agent is highly encouraged to use all the units in the pool, certain large-size CPPs (e.g., 16 units in Fig. 9) may confuse the RL framework and may need more training to learn the strategies of building feasible flowsheets.
Taking HDA-20 as an example, the RL found 1239 pass-pre-screen flowsheets and 64 feasible ones. Fig. 10(a) shows the number of pass-pre-screen flowsheets found in the training process as the blue line and the number of feasible ones as the red line. In the first 200000 episodes, the RL struggled to pass the pre-screen process. After that, it gradually found more and more unique feasible designs. Most of the flowsheets were found to involve 18 or 19 units in the designs.
The design objective is to obtain high product-flow-rate and high product-purity HDA systems, and the performances of the 64 feasible flowsheets are shown in Fig. 10(b). With the system feed of 0.3 mol s−1 toluene, the product flow rate has a theoretical upper limit of 0.3 mol s−1. Among the 64 feasible system designs, as shown in Fig. 10(b), the highest purity is 99%, and the highest benzene flow rate that the RL designed is 0.288 mol s−1. The corresponding highest-purity and highest-flow-rate designs are shown in Fig. 11(a) and (b), respectively. The flowsheet in Fig. 11(a) keeps a high volume of benzene recirculating in the loops to increase the efficiency of the flash drum and stoichiometric reactor and obtain a high-purity product. The second design in Fig. 11(b) recycles the vapor outlet of “flash_1” to increase the utilization of the system feed and achieve a high product flow rate. Notably, the core process of the second design corresponds to the configuration reported in the literature as the best configuration.37 It is noticed that there are some unnecessary connections in the designs, and certain designs may be impractical. One reason is that the RL agent is encouraged to use as many process-units as possible; for HDA-20, certain units may be connected to the system without influencing its operation. Another reason is that in this HDA demonstration case only product flow rate and purity were considered in the system optimization target; other practical factors, such as capital cost and annualized revenue, were not included in the design objective. RL-IDAES allows the user to customize the design objective. As more practical considerations are added to the IDAES optimization and the action-reward system, the RL agent will behave closer to experienced engineers.
CO + 2H2 → CH3OH  (3)
No. of units | Name | Feed | Product | Exhaust | Mixer | Compressor | Heater | Reactor | Flash | Splitter | Cooler | Expander |
---|---|---|---|---|---|---|---|---|---|---|---|---|
14 | Meth-14 | 2 | 1 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1
15 | Meth-15 | 2 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1
16 | Meth-16 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 1
17 | Meth-17 | 2 | 1 | 2 | 2 | 2 | 2 | 1 | 1 | 2 | 1 | 1
18 | Meth-18 | 2 | 1 | 2 | 2 | 2 | 2 | 2 | 1 | 2 | 1 | 1
19 | Meth-19 | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 1 | 2 | 1 | 1
20 | Meth-20 | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 1 | 1
22 (All) | | 2 | 1 | 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2
The RL-IDAES framework has been used to design methanol systems with the CPPs in Table 2. The RL takes 1 million episodes for each CPP to build flowsheets, and in each episode the draft flowsheet is evaluated by the pre-screens and IDAES. The training setups are the same as those of the RL-HDA cases: the greedy factor ε increases from 0.0 to 0.9, the learning rate α is 0.01, the decay factor γ is 0.5, the memory size is 20000, and the batch size is 3200. For all the implemented training, “CNN structure compression” is activated and “system complexity” is set to the highest level to encourage the AI agent to connect as many process-units as possible.
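Collected in one place, these settings could be expressed as a simple configuration mapping; the key names below are illustrative, not taken from the RL-IDAES code.

```python
# Training setup shared by the RL-HDA and RL-methanol cases (key names are illustrative)
training_config = {
    "episodes": 1_000_000,     # flowsheet-building episodes per CPP
    "epsilon_start": 0.0,      # greedy factor, annealed upward during training
    "epsilon_end": 0.9,
    "learning_rate": 0.01,     # RMSProp learning rate alpha
    "gamma": 0.5,              # decay factor in eqn (1)
    "memory_size": 20_000,     # rows kept in the replay memory
    "batch_size": 3_200,       # rows sampled from memory per DQN update
    "cnn_compression": True,   # CNN structure compression switched on
    "system_complexity": 4,    # highest of the four "no action" penalty levels
}
```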
Seven RL-methanol training cases have been implemented with the pre-defined CPPs. All the training was completed within 15 hours on a personal computer; the results are summarized in Fig. 12. Fig. 12(a) shows the pass-pre-screen flowsheets found by the RL. Deep blue bars (“start from 0”) denote training the DQN without loading any pre-trained model, with all parameters in the DQN randomly initialized. Light blue bars (“restore HDA-20”) denote training results obtained by restoring a DQN model that was trained for the HDA system with 20 process-units in the CPP. Fig. 12(b) shows the feasible flowsheets with different CPPs; orange and yellow bars represent training results without or with loading the pre-trained DQN model, respectively. With 14 process-units, the “start from zero” RL found 363 pass-pre-screen and 90 feasible flowsheets; with 15 or 16 units, the RL can still find hundreds of pass-pre-screen flowsheets, and the feasible to pass-pre-screen ratios increase to 31% or 41%; with 17 or 18 units, the RL showed a boost in performance, with more than 700 flowsheets passing the pre-screen process and about half of them feasible. However, with more units in the pool, the RL can rarely find feasible flowsheets. This indicates that too many units do not fit such a simple reaction system.
Fig. 12 RL results for methanol system with different CPPs. (a) the number of flowsheets passing pre-screen; (b) the number of feasible flowsheets.
Taking Meth-18 as an example, the RL found 768 pass-pre-screen and 471 feasible flowsheets within 1 million episodes, as shown in Fig. 13(a). Most of the feasible designs involve 15 to 17 units in the system. Their performances are shown in Fig. 13(b), where the x-axis denotes the methanol unit cost and the y-axis the methanol flow rate. The costs account for the operating and amortized capital costs of certain kinds of units. As can be seen, the cost per mole of methanol ranges from 0.035 $ per mol to 0.072 $ per mol. Flowsheets with methanol flow rates lower than 0.55 mol s−1 have been filtered out as “infeasible” cases. The methanol flow rates of the feasible flowsheets vary from 0.586 to 0.979 mol s−1. The lowest unit-cost case is shown in Fig. 14(a), and the highest-flow-rate case in Fig. 14(b).
Restoring a pre-trained DQN model has been shown to improve the RL's performance while reducing computing costs. As shown by the light blue bars in Fig. 9(a) and the yellow bars in Fig. 9(b), by restoring the DQN model trained with the 20-unit CPP for 1 million episodes (denoted HDA-20), all seven training cases improved in performance. For the case with the 20-unit CPP, the RL found 16% more pass-pre-screen and 48% more feasible flowsheets. For the case with the 14-unit CPP, the RL found an 8.5-fold increase in pass-pre-screen and feasible flowsheets after restoring the HDA-20 model. The RL-methanol cases for the seven CPPs were retrained by restoring the same HDA-20 model as well; the results are shown in Fig. 12(a) and (b), compared to the “start from zero” results. With large CPPs (e.g., 17 or 18 units in the pool), restoring a pre-trained model makes a significant difference. For example, with 17 units in the pool, the RL found about 100% more pass-pre-screen and feasible designs. With the other CPPs, one can still see some improvement in the RL's performance.
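In a PyTorch-style implementation (an assumption, since the authors do not specify the framework), restoring the HDA-20 model amounts to loading the saved DQN weights before continuing training on the new case; the sketch below reuses the illustrative DQN class sketched earlier and uses a hypothetical checkpoint file name.

```python
import copy
import torch

n_actions = 21                                    # illustrative: one entry per outlet in the shared action space
dqn = DQN(obs_side=21, n_actions=n_actions)       # same observation template as the HDA cases
dqn.load_state_dict(torch.load("hda_20_dqn.pt"))  # hypothetical checkpoint of the HDA-20 state_dict
dqn_star = copy.deepcopy(dqn)                     # target network starts from the restored weights
# ...continue the usual training loop (Fig. 6, parts 1-4) on the methanol CPP...
```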
To explore the mechanism behind the improvement in training performance, the feasible flowsheets found by the RL are analyzed. As presented in Section 3.1, the HDA or methanol system is denoted by a 21 × 21 matrix, which contains 441 potential connections. By counting the appearances of each connection in the RL-found flowsheets, the probability distribution of each connection can be obtained. The probability distribution function (PDF) of connections for HDA-20 is shown in Fig. 16: the PDF for the feasible flowsheets found within the 3rd million episodes is shown in Fig. 16(a), the PDF for the feasible flowsheets found in the 4th million episodes is shown in Fig. 16(b), and their difference is shown in Fig. 16(c). As marked, Fig. 16(a) and (b) share several peaks, such as at Connection 66, which denotes “flash_0's vapor outlet connects to product_0's inlet”. This means the AI agent already considered this connection necessary for a feasible flowsheet early in the 3rd million episodes. Comparing Fig. 16(a) and (b), one can see that the peak at Connection 170 in Fig. 16(a) diminishes in Fig. 16(b), while new peaks at Connections 180, 236, and 260 appear in Fig. 16(b). This indicates that in the 4th million episodes of training, the RL learned to break the connection from mixer_2 to heater_1 and build a chain of connections “splitter_1 to heater_1, to StReactor_2, and to flash_1”. Learning these changes helps the AI agent find more feasible designs.
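Such a distribution can be computed by counting how often each of the 441 connections appears across the feasible flowsheets found in a given block of episodes; the sketch below assumes each flowsheet is stored as a 21 × 21 binary connection matrix, with illustrative function and variable names.

```python
import numpy as np

def connection_pdf(flowsheets):
    """Fraction of RL-found feasible flowsheets that contain each of the 441 connections.
    Each flowsheet is assumed to be a 21 x 21 binary connection matrix (inlets x outlets)."""
    counts = np.zeros(21 * 21)
    for fs in flowsheets:
        counts += fs.ravel()                  # add 1 for every connection present in this flowsheet
    return counts / max(len(flowsheets), 1)   # per-connection probability across the set

# pdf_3rd = connection_pdf(feasible_3rd_million)
# pdf_4th = connection_pdf(feasible_4th_million)
# difference = pdf_4th - pdf_3rd              # e.g., the change plotted in Fig. 16(c)
```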
The RL-methanol case experiences a significant improvement in the 5th million episodes. The probability distribution functions (PDFs) of connections for the RL-found flowsheets in the 4th million and 5th million episodes are shown in Fig. 17(a) and (b), respectively, and their difference is shown in Fig. 17(c). The RL learned to build several essential connections in the early stage, such as Connection 67, “flash_0's liquid outlet connecting to product_0's inlet”; both Fig. 17(a) and (b) peak at this connection. The PDF for the 5th million episodes in Fig. 17(b) has stronger peaks than that of the 4th million episodes in Fig. 17(a) at Connections 84 and 127. These two connections denote that mixer_0 integrates with mixer_1 and mixer_2 to form a four-inlet mixer, which ensures that recycling loops return to the upstream of the system. What makes a dramatic change in the 5th million episodes is that the RL breaks Connection 51 and builds Connection 9, i.e., it connects flash_1's liquid outlet to flash_0's inlet instead of exhaust_2's inlet. In other words, the methanol stream in flash_1's liquid outlet can go to flash_0 and then product_0 instead of leaving the system as exhaust.
Two demonstration case studies have been implemented. The RL-IDAES framework has been shown to design and optimize complicated hydrodealkylation-of-toluene and methanol synthesis systems with high flexibility at affordable computing costs. For example, 123 feasible designs for the HDA system with the 14-unit CPP were found within 20 hours on a PC. The RL framework can share a universal observation template among different CPPs or energy and chemical systems, so a trained DQN model can be transferred to other training cases. Restoring a pre-trained DQN model has been shown to improve the RL's performance; as demonstrated in Section 3.2.2, the DQN model trained for the HDA system can be used directly in methanol synthesis system design and finds more feasible designs within limited training episodes. In some cases, the RL framework may struggle to learn the strategies for building feasible flowsheets within the limited 1 million training episodes, especially for cases with relatively large CPPs. However, as demonstrated in Section 3.3, with adequate training episodes, the AI agent can eventually discover the key connections in the system and discard the distracting connections, significantly increasing the number of feasible designs.
1D | 1-Dimensional |
AI | Artificial intelligence |
C | Compressor |
CNN | Convolutional neural network |
CPP | Candidate process-units pool |
DNN | Deep neural network |
DOE | Department of Energy |
DQN | Deep Q value network |
F | Feed |
FL | Flash |
γ | Decay factor |
H | Heater |
HDA | Hydrodealkylation reaction |
IDAES | Institute for the design of advanced energy system |
M | Mixer |
ML | Machine learning |
NETL | National Energy Technology Laboratory
P | Product |
Q | Q Value |
Qnext | Future Q values
R | Reactor, reward |
RL | Reinforcement learning |
Rlast | Reward of the last action |
RL-IDAES | RL-Guided energy and chemical systems design framework |
RMSProp | Root mean squared propagation |