Open Access Article
Nishchaya Kumar Mishra a and Sameer Patel *abc
aDepartment of Civil Engineering, India. E-mail: sameer.patel@iitgn.ac.in
bDepartment of Chemical Engineering, India
cKiran C. Patel Centre of Sustainable Development, Indian Institute of Technology Gandhinagar, Palaj, Gandhinagar, Gujarat 382355, India
First published on 8th April 2026
Minimizing indoor pollutant exposure while conserving energy is essential for protecting human health and the environment. Deep reinforcement learning (DRL) has emerged as a promising approach for optimizing residential ventilation and air conditioning systems. While DRL deployment is simpler than fully physics-driven strategies like dynamic optimization (DynOpt), its generalizability across diverse buildings and ambient conditions remains challenging. Although researchers have studied transfer and imitation learning techniques to address these challenges, such techniques still require house characteristics and field measurements to adaptively train an agent. Therefore, the large-scale deployment of DRL agents remains challenging. This study assesses the performance of a trained DRL agent against DynOpt (the benchmark) when transferred to houses with varying characteristics and environmental conditions using digital twins. When varying house characteristics one at a time, the agent's performance remained comparable to DynOpt, with particulate matter (PM) exposure and energy ratios near unity (1.05 ± 0.03). Similarly, under simultaneous variations in house characteristics, the exposure (1.03 ± 0.07) and energy (1.09 ± 0.06) ratios remained close to one. However, the agent's performance declines in houses with high PM infiltration under high ambient temperature, RH, and PM levels. The results indicate that the agent can still be integrated into different houses under varying ambient conditions by restricting the infiltration of PM, as evidenced by lower exposure and energy ratios in houses with lower infiltration. Moving forward, uncertainty quantification and benchmarking of the agent's performance are critical for enhancing confidence in its predictions.
Environmental significance

Indoor environments govern occupants' health, comfort, and disease transmission, while also consuming a significant fraction of global energy, thereby necessitating optimization. Simultaneously, ensuring a healthy and comfortable indoor environment for the masses requires a robust, low-cost solution that can be deployed at scale. Therefore, it is critical to understand the capability of data-driven learning algorithms, such as reinforcement learning (RL), that could be a potential solution. This study aims to understand the scalability challenges associated with the adoption of a deep RL agent under varying household characteristics, different indoor pollutant emission scenarios, and diverse ambient weather conditions.
Studies have demonstrated that reducing pollutant exposure inside buildings is associated with increased energy consumption of HVAC systems to ensure thermal comfort.14,15 Therefore, a complex, interdependent relationship exists between thermal comfort and pollutant exposure in indoor environments. Multiple studies have proposed physics-based optimization and artificial intelligence algorithms to balance the trade-off and optimize the operation of the HVAC systems to reduce exposure and energy consumption while ensuring thermal comfort.14,16–22 For example, Mishra et al.23 designed a deep reinforcement learning (DRL) agent for optimizing particulate matter (PM) exposure, energy consumption, and thermal comfort in a house. The same study compared the performance of the DRL agent with a dynamic optimization strategy and demonstrated that the DRL agent performed on par with it. Similar studies have shown the advantages of such agents over rule-based and traditional physics-driven algorithms in controlling indoor environments.16,24,25 Moreover, reinforcement learning (RL) agents have the potential for wide-scale deployment owing to many advantages over their conventional counterparts, such as learning an optimal decision-making policy directly through environmental interaction, requiring no knowledge of the system's physics and building characteristics.26–28 However, the dissemination of these agents at the community scale is limited owing to multiple challenges, such as transferability across buildings, non-intuitive performance, performance mismatch between simulation and real building, and datasets needed for training.29–32
RL agents are often trained in virtual indoor environments/digital twins of buildings.27,29,33,34 Since the agent's training and testing are conducted offline using a digital twin, the performance of the trained agent in an actual building is susceptible to uncertainties and variations when deployed in the field.35,36 On the contrary, online training (in real buildings) results in longer training times and discomfort for occupants during initial training phases when the agent is still learning.37 Researchers have proposed various methods for HVAC control to overcome these challenges, where transfer learning,31,38–41 imitation learning,34,42,43 and multi-agent reinforcement learning44,45 are some of the recently studied alternatives. Chen et al.41 utilized transfer learning to predict indoor air temperature and relative humidity over a time horizon ranging from 10 minutes to 2 hours in a building. The same study demonstrated that the transferred model achieves higher accuracy in predicting indoor air temperature and RH with a mean square error lower than that of the benchmark model trained only on source or target data. Similarly, Deng et al.40 transferred the behavior knowledge of an RL agent in different office buildings to control the set temperature and clothing level. The transferred model predicted occupant behavior with a high correlation (>0.8) and a mean square error of less than 1.1 °C. Further, Liu et al.43 developed an imitation–interaction learning control method for multi-zone ventilation systems that accelerated RL training towards higher control performance and energy efficiency. Dey et al.34 also proposed an RL-based building control method harnessing imitation learning, which reduced the training time while preventing unstable early exploration behavior and improving an accepted rule-based policy. However, techniques such as transfer learning and imitation learning still pose many challenges associated with their adoption. 
For imitation learning, the existence of an expert is crucial, as the agent learns the optimal policy by observing the expert's decisions. Therefore, learning an optimal policy via imitation learning is challenging when expert demonstrations are limited or the system is highly dynamic and complex. Similarly, in transfer learning, an optimal policy learned in a building is transferred to another building after retraining on a smaller dataset, which further requires data collection, monitoring, and an understanding of the target building's characteristics. These challenges restrict the wide-scale deployability of RL agents.
For the transferability of an RL agent, there are two key aspects to account for: (i) changes in household characteristics such as inner surface-to-volume ratio, thermal permeability, and pollutant penetration rate, and (ii) variations in climatic conditions and ambient pollutant concentration, since the performance of an RL agent could vary under these conditions. Therefore, for wide-scale deployment, it is imperative to evaluate the performance of these agents under varying climatic conditions across buildings with differing characteristics. Fig. 1 outlines the input and output parameters of a DRL agent and physics-based dynamic optimization strategy. Physical modeling of the house and HVAC systems is needed for dynamic optimization, in addition to sensor inputs (temperature, RH, and pollutant concentration). However, the trained DRL agent does not require house characteristics and physical models as inputs, and observations from low-cost monitors can be fed directly to the agent to obtain control actions. Based on this knowledge, the current study trains a DRL agent to optimize PM2.5 (particles with a diameter of 2.5 microns or less; hereafter referred to as PM) exposure, thermal comfort, and energy consumption for a single house (the training house), which is then transferred to different houses (test houses). The agent transfer has been done under two conditions: (i) transferred to test houses (emulated through changing house characteristics) in the same neighborhood (same ambient conditions), and (ii) transferred to testing houses in different geographical locations (emulated by changing ambient conditions). Simulations have been performed to evaluate the performance of the transferred DRL agent compared to that of a dynamic optimization strategy (DynOpt) inside test houses under the defined conditions. Subsequently, alternatives are proposed to address the challenges associated with the transferability of the DRL agent.
Succinctly, this work contrasts with prior studies that validate RL-based controllers within a fixed building configuration or climatic setting; it rigorously examines the cross-building and cross-climate transferability of a DRL agent trained on a single house. Rather than limiting evaluation to isolated parametric perturbations, the current work systematically analyzes multidimensional variations in house characteristics and shifts in ambient conditions to identify robustness boundaries relative to a physics-based dynamic optimization benchmark. The study further quantifies conjugate interaction effects that emerge under extreme configurations, an aspect that is underexplored in the existing literature. By doing so, this work advances DRL-based indoor environmental control from case-specific demonstrations toward scalable, generalizable real-world deployment.
The measurements from this digital twin are fed to the DRL agent for training. The agent takes simulated indoor parameters (temperature, RH, PM concentration, and energy consumption), measured outdoor parameters (temperature and RH), and hourly ambient PM concentration, obtained from ref. 48 for the test house location and downscaled to 1-minute resolution, as inputs to predict the indoor–outdoor AER. Multiple parametric combinations have been utilized to vary the house characteristics and evaluate the performance of the trained DRL agent when transferred to different houses.
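As a sketch of the downscaling step, hourly ambient PM readings can be interpolated onto a 1-minute grid. The paper does not state its interpolation method, so the linear interpolation below is an assumption, and the example values are illustrative:

```python
import numpy as np

def downscale_hourly_to_minutely(hourly_pm):
    """Linearly interpolate hourly PM readings onto a 1-minute grid
    (assumed method; the paper only states the target resolution)."""
    hours = np.arange(len(hourly_pm))  # hourly time stamps: 0, 1, 2, ...
    # 60 one-minute steps per hour, endpoints included
    minutes = np.linspace(0, len(hourly_pm) - 1, (len(hourly_pm) - 1) * 60 + 1)
    return np.interp(minutes, hours, hourly_pm)

pm_hourly = [12.0, 18.0, 15.0]  # example ambient PM, ug/m3
pm_minutely = downscale_hourly_to_minutely(pm_hourly)
```

Each hourly value is preserved at the top of its hour, with the minutes in between filled by straight-line segments.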
The variations in the volume, PM deposition rate, and thermal permeability of the tested houses are in the range of ±60% of the training house. The PM penetration factor controls the infiltration rate of ambient PM into buildings and ranges from 0 to 1, where 0 represents 100% ambient air filtration, and 1 means no filtration. The penetration is governed by multiple factors, including house construction, cracks, gaps, openings, and transport through the ventilation system into the building envelope. Naturally, all houses allow penetration of a certain fraction of ambient PM. However, transport through the ventilation systems can be controlled by installing an air filter in the dedicated air supply system (DASS), which controls the indoor–outdoor AER. In this work, the penetration factor has been varied between 0.1 and 0.9, representing different PM infiltration scenarios.
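The role of the penetration factor can be illustrated with a standard well-mixed single-zone PM mass balance. The model form, parameter names, and example values below are illustrative assumptions for this sketch, not the paper's exact formulation:

```python
def pm_step(c_in, c_out, aer, p_dass, k_dep, dt_h, emission=0.0, volume=250.0):
    """One explicit-Euler step of dC/dt = p*AER*C_out - (AER + k_dep)*C + S/V.
    aer: indoor-outdoor air exchange rate (1/h); p_dass: PM penetration factor;
    k_dep: PM deposition rate (1/h); dt_h: time step (h); emission: source (ug/h)."""
    dcdt = p_dass * aer * c_out - (aer + k_dep) * c_in + emission / volume
    return c_in + dcdt * dt_h

# Example: low-infiltration house (p_dass = 0.1) under polluted ambient air
c = 10.0  # initial indoor PM, ug/m3
for _ in range(60):  # one hour at 1-minute steps
    c = pm_step(c, c_out=50.0, aer=2.0, p_dass=0.1, k_dep=1.6, dt_h=1 / 60)
```

With p_dass = 0.1, indoor PM settles toward p·AER·C_out/(AER + k_dep) ≈ 2.8 µg m−3 despite 50 µg m−3 outside; raising p_dass to 0.9 would multiply that steady state ninefold, which is the infiltration effect the penetration factor captures.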
The details of the training house and transferred house characteristics are shown in Table 1. Five cases (Case ID C1 to Case ID C5) have been defined where the impacts of different house characteristics on the DRL agent's performance have been assessed by varying one characteristic at a time. In other words, Case IDs C1–C5 were constructed as controlled univariate sensitivity analyses under identical ambient conditions. For each case, one key house characteristic (e.g., volume, PM deposition rate, thermal permeability of the building envelope, and PM penetration factor) was varied from its minimum to maximum bound while keeping all other parameters fixed at baseline values. The selection of the minimum and maximum bounds was performed heuristically. These heuristic bounds were used to test robustness across incremental variability rather than to identify extreme cases.
| Purpose | Case ID | Volume factor (Vtesting/Vtraining) (Vtraining = 250 m3) | PM deposition factor (λtesting/λtraining) (λtraining = 1.6 h−1) | Thermal permeability factor (αtesting/αtraining) (αtraining = 0.068 kJ s−1 °C−1) | PM penetration factor (pDASS) |
|---|---|---|---|---|---|
| a V: volume of a house, λ: PM deposition rate, α: thermal permeability of a house, pDASS: PM penetration factor of dedicated air supply system (DASS). ‘Testing’ and ‘training’ refer to the houses where the agent is being tested and trained. | |||||
| Training | C1 | 1 | 1 | 1 | 0.5 |
| Testing | C2 | [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6] | 1 | 1 | 0.5 |
| | C3 | 1 | [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6] | 1 | 0.5 |
| | C4 | 1 | 1 | [0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6] | 0.5 |
| | C5 | 1 | 1 | 1 | [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9] |
| | C6_1 | 0.4 | 0.4 | 0.4 | 0.1 |
| | C6_2 | 0.4 | 0.4 | 0.4 | 0.9 |
| | C6_3 | 0.4 | 0.4 | 1.6 | 0.1 |
| | C6_4 | 0.4 | 0.4 | 1.6 | 0.9 |
| | C6_5 | 0.4 | 1.6 | 0.4 | 0.1 |
| | C6_6 | 0.4 | 1.6 | 0.4 | 0.9 |
| | C6_7 | 0.4 | 1.6 | 1.6 | 0.1 |
| | C6_8 | 0.4 | 1.6 | 1.6 | 0.9 |
| | C6_9 | 1.6 | 0.4 | 0.4 | 0.1 |
| | C6_10 | 1.6 | 0.4 | 0.4 | 0.9 |
| | C6_11 | 1.6 | 0.4 | 1.6 | 0.1 |
| | C6_12 | 1.6 | 0.4 | 1.6 | 0.9 |
| | C6_13 | 1.6 | 1.6 | 0.4 | 0.1 |
| | C6_14 | 1.6 | 1.6 | 0.4 | 0.9 |
| | C6_15 | 1.6 | 1.6 | 1.6 | 0.1 |
| | C6_16 | 1.6 | 1.6 | 1.6 | 0.9 |
For Case ID C6, 16 parametric combinations have been simulated by adopting the minimum and maximum values of each house characteristic, with each combination assigned a House ID (C6_1 to C6_16; see Table 1). These configurations were created to assess the conjugate (interaction) effects among parameters rather than individual thresholds, following two methodological aims: stress-testing the DRL agent at the edges of the house-characteristic ranges, and analyzing cross-factor interactions that might not be observed under univariate perturbation. Instead of sampling at intermediate levels of house characteristics, this case investigates the extreme envelope of joint house characteristics, where generalization limits can be tested. It is acknowledged that these combinations may not describe all existing house characteristics in a community. Nevertheless, they provide critical insights into assessing the transferability of the DRL agent for indoor PM control and energy optimization.
| rt = −W1E − W2(max(0, C − Cmax))^a | (1) |
The first term in eqn (1) corresponds to energy consumption, and the second is a proxy for exposure. W1 and W2 are the user-assigned weights for the energy (E) and exposure terms. Cmax is the threshold PM concentration whose exceedances are penalized, and a represents the order of the penalty when the indoor PM concentration (C) exceeds Cmax. For all simulations and evaluation purposes, W1 is 1, W2 is 10, a is 2, and Cmax is 10 µg m−3. A detailed discussion on the selection of W1, W2, and a is presented in Mishra et al.,14 and discussed briefly in Section S4 of the SI.
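With the stated parameter values (W1 = 1, W2 = 10, a = 2, Cmax = 10 µg m−3), the reward of eqn (1) can be sketched directly; the function and argument names are ours:

```python
def reward(energy, c_in, w1=1.0, w2=10.0, a=2, c_max=10.0):
    """Reward of eqn (1): r_t = -W1*E - W2*(max(0, C - Cmax))^a.
    Defaults follow the paper (W1 = 1, W2 = 10, a = 2, Cmax = 10 ug/m3)."""
    return -w1 * energy - w2 * max(0.0, c_in - c_max) ** a

# Below the 10 ug/m3 threshold only the energy term penalises the agent;
# a 2 ug/m3 exceedance adds a quadratic penalty of 10 * 2^2 = 40.
r_clean = reward(energy=0.5, c_in=8.0)    # -0.5
r_dirty = reward(energy=0.5, c_in=12.0)   # -40.5
```

The quadratic exponent (a = 2) makes large exceedances disproportionately costly, which pushes the agent to keep indoor PM under the threshold rather than merely near it.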
The decision-making ability of the agent to take action is termed the policy, denoted by π(at|st), and the rewards of an action at a given state are determined using the value function (Qπ(s, a)), as shown in eqn (2).49
| Qπ(s, a) = Eπ[Σk γ^k rt+k ∣ st = s, at = a] | (2) |
| Q*(s, a) = E[rt + γ maxa′ Q*(st+1, a′) ∣ st = s, at = a] | (3) |
During training, the agent comprises two fully connected neural networks: (a) a behavior network, with weights wb, and (b) a target network, with weights wt. The behavior network makes the decision and communicates with the environment, and the target network is used to update the behavior network.49–51 A replay buffer is defined to store the agent's experience while interacting with the environment, which allows self-learning through past experiences.49,51 A random batch sampling performed over the experiences stored in the replay buffer ensures learning from past mistakes, avoids overfitting, and eliminates the correlation between the input data in each batch.52,53 The state, st, stores various indoor and outdoor parameters and serves as input to the behavior network at each time step (as shown in eqn (4)).
| st = [Cout(t), Tout(t), RHout(t), Cin(t), Tin(t), RHin(t), λDASS(t), E(t)] | (4) |
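The replay buffer described above can be sketched minimally as follows; the 20 000-transition capacity follows the hyperparameter table, while the class layout and names are illustrative:

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay-buffer sketch storing (s_t, a_t, r_t, s_{t+1}) tuples."""

    def __init__(self, capacity=20_000):
        # deque with maxlen silently evicts the oldest experiences when full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random sampling breaks temporal correlation within a batch
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer()
for t in range(1000):
    buf.push([t], t % 96, -1.0, [t + 1])  # dummy transitions
batch = buf.sample(512)                    # one training batch (paper: 512)
```

Sampling uniformly rather than taking the most recent transitions is what decorrelates consecutive states, as the text notes.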
The agent takes action, at, through the behavior network and the epsilon-greedy strategy. The epsilon-greedy policy refers to taking a random action with a probability of ε and the action corresponding to the maximum Q-value, Qb(st, a), with a probability of 1 − ε. The action space, A, consists of the indoor–outdoor AER (λDASS), a step function ranging from 0.5 ACH to 10 ACH at a step size of 0.1 ACH, meaning a total of 96 actions are possible. The state, st, the action, at, the reward, rt, and the next state, st+1, are stored in the replay buffer, and a batch of samples (sk, ak, rk, sk+1) is randomly selected to update the behavior network. The behavior network takes (sk, ak) as input and outputs Qb(sk, ak), while the target network takes sk+1 as input and outputs the maximum Q-value, max_{ak+1} Qt(sk+1, ak+1). The behavior network is then updated based on the loss function, Lk, estimated as shown in eqn (5).49 The parameters of the target network are updated every m timesteps by replacing them with those of the behavior network. This update frequency, m, is a tunable hyperparameter whose selection is discussed later.
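The discrete action space and epsilon-greedy selection can be sketched as follows (a minimal illustration; function and variable names are ours):

```python
import random

# AER from 0.5 to 10 ACH in 0.1 ACH steps gives 96 discrete actions
ACTIONS = [round(0.5 + 0.1 * i, 1) for i in range(96)]

def epsilon_greedy(q_values, epsilon):
    """Explore with probability epsilon, otherwise exploit the greedy action.
    q_values: Q-estimates from the behavior network, one per action."""
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))  # random exploratory action
    # index of the action with the maximum Q-value
    return max(range(len(ACTIONS)), key=lambda i: q_values[i])
```

During training ε decays (here, from 1.0 toward 0.01 per the hyperparameter table), so the agent shifts gradually from exploration to exploitation.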
| Lk = (rk + γ max_{ak+1} Qt(sk+1, ak+1) − Qb(sk, ak))^2 | (5) |
The agent aims to learn the policy that maximizes the total reward, where the reward function consists of energy and exposure terms, and the indoor set temperature is kept at 25 °C at all times to ensure the thermal comfort of the occupants.
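The squared temporal-difference loss of eqn (5) can be sketched numerically; this is a hedged illustration assuming the standard DQN target, with illustrative input values:

```python
import numpy as np

def td_loss(r_k, q_target_next, q_behavior_sa, gamma=0.99):
    """Squared TD error of eqn (5):
    L_k = (r_k + gamma * max_a' Q_t(s_{k+1}, a') - Q_b(s_k, a_k))^2.
    q_target_next: target-network Q-values over all actions at s_{k+1};
    q_behavior_sa: behavior-network estimate Q_b(s_k, a_k)."""
    target = r_k + gamma * np.max(q_target_next)  # bootstrapped target
    return (target - q_behavior_sa) ** 2

# toy numbers: reward -1.0, target net best next-Q 0.5, behavior Q -0.4
loss = td_loss(r_k=-1.0, q_target_next=np.array([0.2, 0.5, 0.1]),
               q_behavior_sa=-0.4)
```

Because the target uses the separate, periodically updated target network, the regression target stays fixed between updates, which stabilizes learning.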
The DynOpt strategy has been formulated using a cost function defined in eqn (6), with the indoor–outdoor AER and physics-based knowledge of the indoor environment dynamics as constraints. The cost function in eqn (6) is identical to the reward (shown in eqn (1)) and is a weighted combination of two terms: the first term corresponds to energy consumption, and the second is a measure of pollutant exposure.
Minimize
| obj = W1E + W2(max(0, C − Cmax))^a, subject to λmin ≤ λDASS ≤ λmax | (6) |
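A minimal sketch of one DynOpt step: a grid search over the bounded AER that minimizes the cost of eqn (6). The surrogate `energy` and `pm` functions here are toy stand-ins for the physics models the paper couples to the optimizer, and the grid-search formulation is our illustrative choice:

```python
def dynopt_step(energy, pm, lam_min=0.5, lam_max=10.0, step=0.1,
                w1=1.0, w2=10.0, a=2, c_max=10.0):
    """Pick the AER minimizing w1*E + w2*(max(0, C - Cmax))^a
    subject to lam_min <= lambda_DASS <= lam_max (eqn (6))."""
    best_aer, best_cost = None, float("inf")
    n = int(round((lam_max - lam_min) / step)) + 1
    for i in range(n):
        aer = lam_min + step * i
        cost = w1 * energy(aer) + w2 * max(0.0, pm(aer) - c_max) ** a
        if cost < best_cost:
            best_aer, best_cost = aer, cost
    return best_aer, best_cost

# toy surrogates: energy rises with AER, indoor PM falls with AER
aer, cost = dynopt_step(energy=lambda x: 0.2 * x, pm=lambda x: 30.0 / x)
```

With these toy surrogates the optimum lands at the smallest AER that just meets the 10 µg m−3 threshold (3.0 ACH), mirroring the exposure–energy trade-off the cost function encodes.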
| Hyperparameters | Values |
|---|---|
| a Fixed values were taken for these parameters. | |
| Hidden layers and nodes | {128, 256, 128}, {128, 256, 128, 64}, {128, 256, 256, 128, 64} |
| Learning rate | 0.001, 0.01 |
| Discount factor (γ)a | 0.99 |
| Batch size | 512, 1024 |
| εstart, εmin, εdecaya | 1.0, 0.01, 0.99 |
| Target network update frequency (m) | 100, 200 timesteps |
| Replay memory sizea | 20 000 |
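The ε schedule implied by the table (εstart = 1.0, εmin = 0.01, εdecay = 0.99) can be sketched as follows; whether the decay is applied per episode or per timestep is an assumption here:

```python
def epsilon_schedule(episode, eps_start=1.0, eps_min=0.01, eps_decay=0.99):
    """Multiplicative epsilon decay floored at eps_min
    (eps_start = 1.0, eps_min = 0.01, eps_decay = 0.99 per the table)."""
    return max(eps_min, eps_start * eps_decay ** episode)

eps_early = epsilon_schedule(0)     # fully exploratory at the start
eps_late = epsilon_schedule(1000)   # floored at eps_min late in training
```

The agent thus begins with purely random exploration and converges to mostly greedy behavior, retaining a 1% exploration floor.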
The state space (st) serves as the input layer, and the action space (at) serves as the output layer, with a learning rate of 0.001, batch size of 512, and target network update frequency of 200 timesteps. The selected network architecture of the DRL agent is a fully connected network with dimensions st × 128 × 256 × 128 × at. The replay memory is set to store 20 000 transitions, allowing the agent to learn from past experiences.
The subsequent section first demonstrates the agent's performance in houses with different characteristics (volume, deposition rate, penetration factor, and thermal permeability as outlined in Table 1) for moderate variations in ambient temperature (25 °C to 33 °C), RH (40% to 73%), and PM (<20 µg m−3) similar to that in the training dataset. Then, the performance of the agent is assessed across varying house characteristics with higher variations in ambient temperature (28 °C to 44 °C), RH (40% to 80%), and PM (up to 110 µg m−3). The indoor PM emission periods for three days with low and high emission activities have been adopted and kept the same for all houses because the emission rate for the same type of activity is independent of house characteristics.
Cumulative exposure (Exp) and energy consumption (Enr) ratios between the two control strategies (DRL agent and DynOpt) for the same house have been used to assess the performance of the DRL agent. A ratio of one for any parameter (exposure or energy) indicates equal values of that parameter under both control scenarios, and ratios below one indicate lower exposure or energy consumption for the DRL agent than for DynOpt, i.e., comparable or better DRL performance. Fig. 2 shows the effect of variations in the house characteristics (A: volume, B: thermal permeability, C: penetration factor, and D: deposition rate) on the performance of the DRL agent.
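The two performance metrics can be expressed directly; the numerical values below are illustrative, not from the paper:

```python
def performance_ratios(exp_drl, enr_drl, exp_dynopt, enr_dynopt):
    """Exp = cumulative exposure (DRL) / cumulative exposure (DynOpt);
    Enr = cumulative energy (DRL) / cumulative energy (DynOpt).
    Ratios near 1 mean the agent matches the physics-based benchmark;
    ratios below 1 mean the agent outperforms it on that metric."""
    return exp_drl / exp_dynopt, enr_drl / enr_dynopt

# illustrative cumulative values for one house over the simulation period
exp_ratio, enr_ratio = performance_ratios(exp_drl=210.0, enr_drl=5.5,
                                          exp_dynopt=200.0, enr_dynopt=5.0)
```

Here the agent incurs 5% more exposure and 10% more energy than DynOpt, i.e., ratios of 1.05 and 1.10, the same scale on which the paper's results are reported.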
The exposure and energy ratios (Fig. 2A–D) demonstrate that the differences between the two control strategies are minimal as the ratios range between 1.00 and 1.09, except for a few outliers. Both exposure and energy ratios remain in the range of 1.05 ± 0.03, indicating that the total exposure and energy, on average, are just 5% more for the DRL agent-based control than the DynOpt control. The largest difference of 14% in energy and exposure is observed for the house with the largest volume. This could be due to minor deviations in the indoor–outdoor AER, leading to a considerably increased cooling demand. However, the overall trend indicates that personal exposure and energy consumption are comparable across the two control strategies, regardless of changes in house characteristics. These findings suggest that the performance of the DRL agent is relatively independent of house characteristics under similar climatic conditions. Therefore, a DRL agent trained for a particular house can be deployed to other houses with little to no decline in performance.
The above discussion pertains to cases in which house characteristics were modified one at a time. The results indicate that, within realistic single-parameter perturbations, the trained DRL agent remains stable and near-optimal, and no distinct “turning point” for any individual parameter can be identified across C1–C5. Since isolated parameter variation did not produce significant degradation, it was hypothesized that performance limitation, if present, would emerge from conjugate (interaction) effects among parameters rather than from individual thresholds. Moreover, in the real world, multiple house characteristics are likely to change simultaneously. Therefore, C6 was designed to evaluate combinations at the minimum and maximum bounds of all four house characteristics, yielding a total of 16 combinations (C6_1 to C6_16; Table 1). Table 3 shows the ratios of exposure and energy for the DRL agent and DynOpt in these 16 houses.
[Table 3: exposure and energy ratios between the DRL agent and DynOpt for the 16 test houses (C6_1 to C6_16). V: volume of a house, λ: PM deposition rate, α: thermal permeability of a house, pDASS: PM penetration factor of the dedicated air supply system (DASS).]
In the cases where the house characteristics vary between two extremes, the exposure ratios remain within 0.92 and 1.07 for all houses, except for two cases (C6_14 and C6_16). Overall, the average ratio for exposure is slightly greater than one (1.03 ± 0.07), while a comparatively higher value (1.09 ± 0.06) is observed for the total energy ratios. Based on the observed standard deviation (0.07), the exposure ratios (1.19) for the two houses, C6_14 and C6_16, lie outside the central cluster of values, clearly separating them from the remaining distribution. These cases are therefore treated as outliers; both share three extreme characteristics: maximum house volume, maximum deposition rate, and maximum penetration factor. These represent the corner of the multidimensional house parameter space where infiltration is maximized due to a high penetration factor, a dilution effect arises due to high volume, and removal dynamics are altered at high deposition. The simultaneous presence of these three maxima creates a compounded condition that is not encountered in the single-parameter variations (C1–C5). Therefore, their deviation reflects a conjugate interaction effect rather than isolated parameter sensitivity. The energy consumption ratios demonstrated similar variations (1.00 to 1.20), again demonstrating a compounded effect of multiple house characteristics.
To further analyse this compound effect, a multivariate regression analysis is performed, linking variations in exposure ratios to changes in house characteristics. Fig. 3 shows the exposure-energy ratio behaviour obtained from the multivariate analysis. The exposure ratio demonstrates a strong linear association (R2 = 0.90) with changes in house characteristics. The individual coefficients for changes in volume (0.074), penetration factor (0.080), and deposition rate (0.070) are comparable, indicating that no single driver dominates and that the agent's performance depends on multiple house characteristics. Simultaneously, changes in the thermal permeability have the least impact on the exposure, with an individual coefficient of 0.001. These observations suggest that the exposure ratio exhibits a stable, well-defined linear dependence on the selected independent variables, with a distributed multi-factor influence. On the other hand, for energy performance, multivariate regression only explains a part of the variability. This indicates that the energy dynamics involve more complex non-linear interaction and control trade-offs.
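The multivariate regression can be sketched with ordinary least squares; the design matrix and response below are synthetic, reusing only the coefficient magnitudes reported above (volume 0.074, deposition 0.070, permeability 0.001, penetration 0.080) for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# 16 synthetic houses; columns: volume, deposition, permeability factors
# (0.4-1.6) and penetration factor (0.1-0.9), matching Table 1's ranges
X = rng.uniform([0.4, 0.4, 0.4, 0.1], [1.6, 1.6, 1.6, 0.9], size=(16, 4))
true_beta = np.array([0.074, 0.070, 0.001, 0.080])  # reported coefficients
y = 0.95 + X @ true_beta  # noiseless toy exposure-ratio response

A = np.column_stack([np.ones(len(X)), X])       # add intercept column
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

ss_res = np.sum((y - A @ beta_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
```

With a noiseless synthetic response the fit is exact; on the paper's actual data the same procedure yields R2 = 0.90 for exposure, while the energy ratios are only partly explained, consistent with the non-linear interactions noted above.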
In brief, the indoor dynamics of PM and the thermal environment are governed by the synergistic effects of various house parameters, and it is challenging to attribute the observations to a specific characteristic. These results indicate that the DRL agent, when transferred to different houses under similar ambient conditions to those of the training house, performs comparably to a fully physics-driven strategy (DynOpt), with some outliers in extreme cases. Therefore, the proposed DRL agent can be transferred to different houses under the same climatic conditions after sufficient training. The next challenge in transferring the DRL agent for indoor environment control is ascertaining its performance under varying ambient conditions that differ from those during training, and the subsequent section discusses the same.
In addition to the DRL agent initially trained in the training house (C1 in Table 1), three more DRL agents are trained in the same house under varying ambient conditions. These four agents are referred to as DRLOriginal, the original agent trained under normal ambient conditions; DRLExtTempRH, an agent trained under higher variations in ambient temperature and RH; DRLExtPMTempRH, an agent trained under wider ranges of ambient PM, temperature, and RH; and DRLExtPM, an agent trained under a wider range of ambient PM concentration. The ratios of exposure and energy, defined in Section 3.2, have been estimated for these agents under four ambient conditions (original ambient conditions; high ambient PM; high ambient temperature and RH; and high ambient PM, temperature, and RH) in 17 houses (C1 and C6_1 to C6_16 from Table 1). Fig. 4 demonstrates these ratios for all DRL agents under high ambient conditions.
In Fig. 4, exposure and energy ratios of one signify equal exposure and energy consumption for the DRL agent and the DynOpt. Since both ratios are depicted on the same vertical axis, the composite bar is defined as the sum of exposure and energy ratios for any DRL agent. This bar should remain within the limits marked by dashed lines (shown in Fig. 4). Exceedances of these limits indicate a higher value of exposure or energy for the DRL agent relative to the DynOpt. The upward or downward shifting of the composite bar demonstrates the trade-off between exposure reduction and energy penalty.
Under normal ambient conditions (Fig. 4A), all agents (DRLOriginal, DRLExtTempRH, DRLExtPMTempRH, DRLExtPM) demonstrate similar levels of exposure and energy compared to DynOpt, barring a few houses with high PM penetration factors (pDASS = 0.9). For DRLOriginal, the exposure and energy ratios vary in the range of 1.03 ± 0.07 and 1.09 ± 0.05, respectively (Fig. 4A). At the same time, DRLExtPM has the highest exposure compared to DynOpt, with the corresponding ratios in the range of 1.26 ± 0.04, and DRLExtTempRH has the maximum energy ratios (1.29 ± 0.17) under the normal ambient conditions. Apart from these two notable deviations, the exposure and energy ratios are comparable for other DRL agents in different houses. However, a slight drift of the composite bar in the upward or downward direction exists, representing the inherent trade-off between exposure reduction and energy consumption while ensuring thermal comfort in houses. These variations in energy and exposure ratios for DRL agents with different household characteristics demonstrate that DRLOriginal can perform reasonably well across a wide range of values of household characteristics. These findings indicate that well-trained DRL agents can be deployed in the field without training for each house.
Under varying ambient conditions (Fig. 4B–D), results can be categorized into two groups based on the PM penetration factor (pDASS). The left and right halves of the plots show results for houses with PM penetration factors of 0.1 (low infiltration) and 0.9 (high infiltration), respectively. The other house characteristics for odd-numbered and their immediate next even-numbered houses are the same. For example, houses C6_1 and C6_2 have the same characteristics (volume, deposition rate, and thermal permeability) except for the pDASS of 0.1 and 0.9, respectively. The original house, C1, is shown at the center and has a pDASS of 0.5.
Under different configurations of high ambient conditions (temperature, RH, and PM), the exposure and energy ratios for all DRL agents are significantly lower in the houses with a lower PM penetration factor (pDASS = 0.1). The average exposure ratios for pDASS of 0.1 are 1.23 ± 0.13, while the same ratio varies in the range of 1.57 ± 0.41 for houses with high PM infiltration (pDASS = 0.9). Energy ratios also demonstrate similar trends, wherein for pDASS of 0.1, the average energy ratios are 1.06 ± 0.15, and for pDASS of 0.9, the variations are in the range of 1.95 ± 0.71. Looking explicitly at the performance of DRLOriginal, the average ratios of exposure and energy ratios for low-infiltration houses are 1.14 ± 0.11 and 1.10 ± 0.14, respectively, i.e., lower than the average exposure ratio (1.23) and higher than the average energy ratio (1.06) for all agents. Also, a clear differentiation can be made between the houses with low and high PM infiltration in terms of energy and exposure ratios for DRLOriginal. This distinct difference between exposure and energy ratios between low and high PM infiltration houses demonstrates an exacerbated decline in the DRL agents' performance due to the high infiltration of ambient PM. Therefore, transferring DRLOriginal to houses with high PM infiltration may result in suboptimal control. However, the agent performed well in houses with low PM infiltration rates under varying ambient conditions. Variability in agents' performance in houses with high PM infiltration may arise from fluctuations in the predicted indoor–outdoor AER. For example, increased indoor–outdoor AERs have a lesser impact on exposure and energy consumption in houses with lower PM infiltration.
From the preceding discussion in Sections 3.2 and 3.3, the following critical observations can be made on the at-scale deployability potential of a DRL agent trained with a limited range of household characteristics:
(1) The performance of DRLOriginal, when transferred to houses with different characteristics under normal ambient conditions, is reasonably comparable to DynOpt, indicating that a sufficiently trained DRL agent can be transferred to other houses for optimal indoor environment control under normal ambient conditions.
(2) Under varying ambient conditions (shown in Fig. 4B–D), a proxy for different geographical locations, DRLOriginal performs better in houses with lower PM infiltration. The performance differences in houses with low and high infiltration demonstrate high variability under varying ambient conditions, highlighting the challenges associated with wide-scale deployability.
(3) The PM infiltration in a house depends on the ventilation mechanism. When the house is positively pressurized, the inflow of ambient air occurs via the DASS, and infiltration of ambient PM can be reduced by installing an air filter in the DASS. In this scenario, DRLOriginal can still be deployed, as demonstrated by its performance in houses with low infiltration. However, when the house is negatively pressurized, ambient air enters through cracks and openings, making it difficult to restrict the infiltration of pollutants. Therefore, in leaky houses (with more cracks and openings), it may be challenging to integrate the trained DRL agent.
The study results demonstrate that the trained agent (DRLOriginal) can be transferred to houses with varying characteristics under normal ambient conditions. The ratios of exposure (1.05 ± 0.03) and energy (1.05 ± 0.03) between the DRL agent and the dynamic optimization remain close to one, indicating acceptable performance. Under different ambient conditions, the original agent's performance is sub-optimal compared to the dynamic optimization control for houses with high PM infiltration. In contrast, for low PM infiltration, the agent performs comparably to the dynamic optimization strategy, with exposure and energy ratios of 1.14 ± 0.11 and 1.10 ± 0.14, respectively. These trends suggest that PM penetration affects the agent's performance at locations with different ambient conditions from the original locations. Therefore, manual intervention is needed in houses that allow high PM penetration, such as installing an air filter in the indoor–outdoor ventilation unit, enabling the transfer of the original DRL agent to other houses under high ambient conditions.
While this study shows the potential of DRL and similar agents to be transferred to different houses with minor or no additional intervention and training, the real-world integration and performance evaluation of these agents are imperative to assess the challenges associated with field deployment. Moreover, exposure in houses varies spatially, so the assumption of well-mixed indoor air does not remain valid in all conditions.54,55 Therefore, multi-agent control systems are needed to reduce personal exposure levels, considering a multi-zonal representation of a house. Furthermore, the performance of reinforcement learning agents needs to be assessed under ambient conditions throughout the year to develop a solution suitable for all weather conditions. Simultaneously, uncertainty quantification and extensive benchmarking of their performance are critical for enhancing confidence in the agents' predictions. Moving forward, a collaborative effort from multidisciplinary stakeholders is necessary to integrate these agents to improve health and safety in buildings. Advances in control algorithms, wide-scale field deployment, and performance benchmarking of these agents could promote their adoption at the community scale.
| ACH | Air changes per hour |
| AER | Air exchange rate |
| DASS | Dedicated air supply system |
| DQN | Deep Q network |
| DRL | Deep reinforcement learning |
| DynOpt | Dynamic optimization |
| HVAC | Heating, ventilation, and air conditioning |
| PM | Particulate matter |
| RH | Relative humidity |
Supplementary information (SI) is available. See DOI: https://doi.org/10.1039/d5va00438a.
| This journal is © The Royal Society of Chemistry 2026 |