Wisam Mohammed a, Adrian Adamescu a, Lucas Neil b, Nicole Shantz ab, Tom Townend c, Martin Lysy *d and Hind A. Al-Abadleh *a
a Department of Chemistry and Biochemistry, Wilfrid Laurier University, 75 University Ave West, Waterloo, ON N2L 3C5, Canada. E-mail: halabadleh@wlu.ca; Tel: +1 519-884-0710 ext. 2873
b Ausenco, 100–1016B Sutton Dr, Burlington, Ontario L7L 6B8, Canada
c AQMesh, Environmental Instruments Ltd, Unit 5, The Mansley Centre, Timothy's Bridge Road, Stratford-upon-Avon, CV37 9NQ, UK
d Department of Statistics and Actuarial Science, University of Waterloo, 200 University Ave West, Waterloo, ON N2L 3G1, Canada. E-mail: mlysy@uwaterloo.ca; Tel: +1 519-888-4567 ext. 45503
First published on 6th October 2022
Machine learning is used across many disciplines to identify complex relations between outcomes and numerous potential predictors. In the case of air quality research in heavily populated urban centers, such techniques were used to correlate the impacts of Traffic-Related Air Pollutants (TRAP) on vulnerable members of communities, future pollutant levels, and potential solutions that mitigate adverse effects of poor air quality. However, machine learning tools have not been used to assess the variables that influence measured pollutant levels in a suburban environment. The objective of this study is to apply a novel combination of Random Forest (RF) modeling, a machine learning algorithm, and statistical significance analysis to assess the impacts of anthropogenic and meteorological variables on observed pollutant levels in two separate datasets collected during and after the COVID-19 lockdowns in Kitchener, Ontario, Canada. The results highlight that TRAP levels studied here are linked to meteorology and traffic count/type, with relatively higher sensitivity to the former. Upon taking statistical significance into account when assessing relative importance of variables affecting pollutant levels, our study found that traffic variables had a more discernible influence than many meteorological variables. Additional studies with a larger dataset and spread throughout the year are needed to expand upon these initial findings. The proposed approach outlines a “blueprint” method of quantifying the importance of traffic in mid-size cities experiencing fast population growth and development.
Environmental significance
Assessing air quality at the neighborhood scale provides information on pollutant levels and sources that is often missed by regional stations. Since natural variability and human activities affect the pollutant levels used in communicating air quality to the public, quantifying the relative importance of these factors would guide strategic urban planning and regulations aimed at creating healthy air for all. Here we employ innovative mathematical tools that analyze data from a network of low-cost air quality stations in Kitchener, Ontario, Canada for correlation with meteorology and traffic counts/type during and after COVID-19 related lockdowns. Our findings show that pollutant levels are sensitive to traffic changes even when meteorology plays the dominant role in their levels.
The air pollutants that are routinely measured due to their known health impacts include nitric oxide (NO), nitrogen dioxide (NO2), ground-level ozone (O3), fine particulate matter (PM2.5), carbon monoxide (CO), and carbon dioxide (CO2). The combustion of fossil fuels in automotive vehicles is known to be a major contributor to levels of nitrogen oxides (NOx, x = 1, 2),8–10 CO,11 and volatile organic compounds (VOCs). The photochemical decomposition of NO2 in the troposphere gives rise to oxygen radicals, which react with atmospheric oxygen and VOCs to form ground-level ozone, a secondary pollutant.12,13 Primary sources of PM2.5 include wind-blown dust particles, biomass burning, industrial activity,14,15 and non-exhaust emissions stemming from the wear and tear of automotive brakes, tires, and roads.16 Secondary PM2.5 forms in the atmosphere from complex atmospheric multiphase reactions involving VOCs and other chemicals in the gas and condensed phases.8,14
Prolonged exposure to the aforementioned pollutants results in adverse health effects in local communities, with a larger impact on vulnerable members with pre-existing conditions such as heart disease and asthma.17 Long periods of exposure to these pollutants can worsen asthmatic symptoms, increase chances of genetic defects in unborn children,18 impact adolescent health,19 increase the risk of cardiovascular diseases, and cause organ failure.9,14,20 Recent studies on the impacts of TRAP on cardiovascular health highlighted that even at lower levels, TRAP is a significant contributor to pollution-induced diabetes mellitus,21 myocardial infarction,22 and cancer development in the respiratory tract.23 In 2016, the United Nations Children's Fund (UNICEF) reported 600 000 deaths in adolescents globally as a direct result of exposure to unfavorable air quality conditions.24 Hence, air quality continues to be a major concern among governmental bodies worldwide, particularly in urban communities, decades after pollution control regulations were first enacted.25
With the rapid expansion of urban communities comes an increase in industrialization and automotive use, both of which emit harmful pollutants that pose a hazard to both human health and the environment.26,27 Over the past several decades, the World Health Organization (WHO) has been gradually lowering the exposure limits of TRAP deemed “safe” to provide countries with realistic targets to reach over a specified time interval. The most recent air quality guidelines (AQG) released by the WHO in 2021 lowered the exposure limits once again for NO2, O3 and PM2.5 to 13 ppb (24 h), 50 ppb (8 h), and 15 μg m−3 (24 h), respectively.28
The first step in mitigating the negative impacts of air pollution is enhancing monitoring at multiple scales, from regional to hyperlocal, to better identify “hot-spots”. For example, in July 2018, the Breathe London Blueprint project was launched in London, UK with over 100 AQMesh air quality monitoring multisensor pods.29 Similar projects were also launched in Glasgow,30 San Francisco,31 Paris,32 and Mongolia.33 More recently, our research group launched a pilot project in Kitchener, Ontario (ON), Canada using five AQMesh multisensor pods distributed near different elementary schools to assess local air quality across different locations in the network relative to the provincial reference station located in a city park.34 Our first published study highlighted the difference in pollutant levels among different locations relative to the reference station and analyzed the effect of the wildfire season on local air quality.34 One major conclusion from that study was the need for additional measurements of traffic count, vehicle and fuel type, and local meteorology that account for the effect of the built environment on wind speed and direction, and temperature.
The objective of this study is to apply a combination of machine learning and statistical significance modeling to isolate the variables that influence pollutant levels collected using the AQMesh multisensor pods in Kitchener, ON during a two-week period in fall 2020 and 2021. This analysis allowed for the identification of the most probable traffic-related emission sources and the sensitivity of pollutant levels to meteorology. Our analysis also investigated the impact that lockdowns may have had on traffic-related emission sources of air pollutants.
Raw traffic data were obtained from the city of Kitchener. The data were collected on an hourly basis and contained traffic counts classified into thirteen separate categories in compliance with the Federal Highway Administration (FHWA) protocols.37 These data allowed for more detailed observations regarding the size and frequency of each classification type. Rigorous quality assurance protocols were also applied to the traffic counts to ensure the validity of the data. Significant outliers were investigated comprehensively through a combination of reviewing recent documents pertaining to construction-driven detours, communication with the staff in the city of Kitchener, and physical observations of traffic flow prior to incorporating manual adjustments.
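As a minimal sketch of how hourly counts in the thirteen FHWA categories could be collapsed into the four vehicle groups reported later (cars, vans/pickups, buses/trucks, motorcycles), the snippet below uses the standard FHWA class numbering. The grouping itself, the function name `aggregate_hourly_counts`, and the example numbers are assumptions made for illustration; the mapping actually used in the study is not stated here.

```python
from collections import Counter

# FHWA 13-category scheme: class 1 = motorcycles, 2 = passenger cars,
# 3 = other two-axle, four-tire vehicles (pickups, vans), 4 = buses,
# 5-13 = single-unit and combination trucks.
# ASSUMPTION: this grouping into four reporting groups is illustrative only.
FHWA_TO_GROUP = {1: "motorcycles", 2: "cars", 3: "vans_pickups"}
FHWA_TO_GROUP.update({cls: "buses_trucks" for cls in range(4, 14)})


def aggregate_hourly_counts(counts_by_class):
    """Collapse one hour of FHWA class counts {class: count} into four groups plus a total."""
    grouped = Counter({"cars": 0, "vans_pickups": 0, "buses_trucks": 0, "motorcycles": 0})
    for fhwa_class, count in counts_by_class.items():
        grouped[FHWA_TO_GROUP[fhwa_class]] += count
    result = dict(grouped)
    result["total_traffic"] = sum(grouped.values())
    return result


# Hypothetical hour of counts (made-up numbers, not data from the study)
example_hour = {1: 0, 2: 25, 3: 2, 4: 1, 5: 1, 9: 0}
print(aggregate_hourly_counts(example_hour))
```

Keeping the class-to-group mapping in a single dictionary makes it easy to change the grouping without touching the aggregation logic.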
The traffic count data were provided by the city of Kitchener for only two locations: Pod 1 and Pod 2. Furthermore, the data collected were limited to the following periods: (1) October 20–November 10, 2020, and (2) October 7–October 18, 2021. There were concerns that this small dataset (up to 477 and 287 observations for 2020 and 2021, respectively) would present challenges for the machine learning algorithm. This concern was found to be less of an issue for accurately predicting levels than for quantifying the statistical significance of the predictive importance of the meteorological and traffic variables, as discussed in Section 3.3.
One benefit of the RF model is that it can rank the importance of the feature variables in predicting pollutant levels. This ranking is done for each feature variable by calculating the percent increase in mean square error (MSE) between the RF model fit to the original dataset (MSEoriginal) and to a dataset with the values of the given feature variable randomly permuted (MSEpermute), relative to the variance of the pollutant levels (var(pollutant)) as shown in eqn (1):
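$$\%\mathrm{IncMSE} = \frac{\mathrm{MSE}_{\mathrm{permute}} - \mathrm{MSE}_{\mathrm{original}}}{\mathrm{var}(\mathrm{pollutant})} \times 100\% \tag{1}$$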
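The MSE is the average of the squared differences between the observed pollutant levels ($y_i$) and the levels predicted by the model ($\hat{y}_i$) over the $n$ observations:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{2}$$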
By considering the square of the differences, the MSE is very sensitive to drastic changes. Since the correlation between the permuted feature variable and pollutant levels is effectively zero, the higher the calculated %IncMSE for a given variable, the higher its importance when it comes to predicting pollutant concentration.
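The permutation idea behind %IncMSE can be sketched in a few lines. The snippet below is not the RF workflow used in the study: it evaluates a fixed stand-in predictor (a hypothetical `stand_in_model`) on synthetic data, permutes one feature at a time, and reports the increase in MSE relative to the variance of the target. All names and numbers are assumptions chosen for illustration.

```python
import random
from statistics import fmean, pvariance


def mse(model, rows, targets):
    """Mean of squared differences between observed and predicted values."""
    return fmean([(y - model(row)) ** 2 for row, y in zip(rows, targets)])


def percent_inc_mse(model, rows, targets, feature_index, rng):
    """Percent increase in MSE after permuting one feature, relative to var(target)."""
    mse_original = mse(model, rows, targets)
    shuffled = [row[feature_index] for row in rows]
    rng.shuffle(shuffled)  # breaks any link between this feature and the target
    permuted_rows = [
        row[:feature_index] + (value,) + row[feature_index + 1:]
        for row, value in zip(rows, shuffled)
    ]
    mse_permute = mse(model, permuted_rows, targets)
    return 100.0 * (mse_permute - mse_original) / pvariance(targets)


# Synthetic data (made up): the target depends on the first feature only.
rng = random.Random(0)
rows = [(rng.uniform(0, 30), rng.uniform(0, 50)) for _ in range(200)]
targets = [2.0 * temperature + rng.gauss(0, 1) for temperature, _ in rows]


def stand_in_model(row):
    """ASSUMPTION: a fixed predictor standing in for a fitted RF model."""
    return 2.0 * row[0]


for name, index in [("temperature", 0), ("traffic", 1)]:
    print(name, round(percent_inc_mse(stand_in_model, rows, targets, index, rng), 1))
```

Permuting a feature the predictor ignores leaves the MSE unchanged, so its importance is zero, while permuting an informative feature inflates the MSE and yields a large positive value.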
Fig. 1 compares the observed meteorological variables for the two-week period studied in 2020 and 2021 for Pods 1 and 2. Upon visual inspection, there are notable variations between the two datasets, indicating that the influence of meteorology on pollutant levels may vary between the two years, as shown later in Section 3.3. These visual observations were further validated by the method of statistical quantification used in our previous publication46 (see R code in ESI†), i.e., by calculating the p-value against the null hypothesis that the median of each variable in question is the same in 2020 and 2021. When the p-value is greater than 0.05, the difference in medians is not deemed to be statistically significant, i.e., it cannot be distinguished from natural day-to-day variations. In contrast, a p-value less than 0.05 indicates that there is a significant difference between the medians in 2020 and 2021, which could potentially account for the difference in pollutant levels. The p-values for the variables in question are reported in Table 1. All are close to zero, meaning that any of these variables could potentially account for the difference in pollutant levels between 2020 and 2021. Additional comparisons of precipitation, wind speed, and solar irradiance were also conducted using hourly data (ESI Fig. S2 and S3†). Statistical analyses for these variables highlighted no significant variations, indicating that the influence they had on pollutant levels should remain consistent over the two years compared.
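One way to obtain a p-value against the null hypothesis of equal medians is a permutation test, sketched below. This is an assumed stand-in, not necessarily the procedure in the authors' R code; the function name `median_permutation_p_value` and the synthetic temperatures (centred on the Pod 1 medians in Table 1, with a made-up spread) are illustrative only.

```python
import random
from statistics import median


def median_permutation_p_value(sample_a, sample_b, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the absolute difference in medians."""
    rng = random.Random(seed)
    observed = abs(median(sample_a) - median(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign year labels at random under the null
        diff = abs(median(pooled[:n_a]) - median(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return (extreme + 1) / (n_permutations + 1)


# Synthetic hourly temperatures in deg C (made up; spread of 3 is an assumption)
rng = random.Random(42)
temps_2020 = [rng.gauss(9.0, 3.0) for _ in range(100)]
temps_2021 = [rng.gauss(15.9, 3.0) for _ in range(100)]

p_value = median_permutation_p_value(temps_2020, temps_2021)
print("p-value:", p_value, "significant" if p_value < 0.05 else "not significant")
```

With the add-one correction the smallest attainable p-value is 1/(n_permutations + 1), so a reported value of 0.00 simply means that no permuted difference was as large as the observed one.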
Table 1
Variable | Median 2020 | Median 2021 | p-Value calculated a |
---|---|---|---|
Pod 1 | |||
Pressure (mbar) | 981.8 | 977.2 | 0.00 |
Temperature (°C) | 9.00 | 15.9 | 0.00 |
Relative humidity (RH%) | 69.8 | 93.0 | 0.00 |
Pod 2 | |||
Pressure (mbar) | 980.8 | 977.2 | 0.00 |
Temperature (°C) | 9.30 | 15.9 | 0.00 |
Relative humidity (RH%) | 65.9 | 93.0 | 0.00 |

a Calculated p-values >0.05 indicate no significant variations between the two compared years. Values of 0.00 are in fact very small numbers <5 × 10−6.
Table 2
Pollutant | Median 2020 | Median 2021 | p-Value calculated a |
---|---|---|---|
Pod 1 | |||
NO2 (ppb) | 4.03 | 4.99 | 0.00 |
O3 (ppb) | 25.1 | 20.3 | 0.00 |
PM2.5 (μg m−3) | 8.36 | 4.92 | 0.00 |
CO (ppb) | 319 | 282 | 0.00 |
Pod 2 | |||
NO2 (ppb) | 3.63 | 4.77 | 0.00 |
O3 (ppb) | 22.7 | 20.8 | 0.00 |
PM2.5 (μg m−3) | 6.72 | 4.72 | 0.00 |
CO (ppb) | 320 | 277 | 0.00 |

a Calculated p-values >0.05 indicate no significant variations between the two compared years. Values of 0.00 are in fact very small numbers <5 × 10−6.
Predictions in the less clustered areas of the plots in Fig. 3 reduced the model's overall predictive performance, most prominently in Fig. 3C. This is believed to be caused by two main factors. (1) The data presented are below the limit of confidence of the sensors (20 μg m−3 for PM2.5, 10 ppb for NO2 and O3). Data collected below these thresholds cannot be taken with 100% certainty, an uncertainty that the model does not account for when predicting values. (2) The model was trained with most of the data having very low concentrations of PM2.5 (<10 μg m−3). This resulted in greater accuracy when making predictions in the clusters, and less accuracy in the “scattered” sections (10 μg m−3 ≤ PM2.5 ≤ 20 μg m−3). Both factors can be addressed with a larger dataset that includes measurements above the sensors' limit of confidence and provides a reference for a wider range of pollutant levels.
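A quick way to gauge how much of a dataset falls under the limit of confidence is to compute the fraction of readings below each threshold quoted above. The sketch below uses those thresholds; the function name `fraction_below_limit` and the example readings are hypothetical.

```python
# Limits of confidence quoted in the text: ug/m3 for PM2.5, ppb for NO2 and O3
LIMIT_OF_CONFIDENCE = {"PM2.5": 20.0, "NO2": 10.0, "O3": 10.0}


def fraction_below_limit(values, pollutant):
    """Share of readings that fall below the sensor's limit of confidence."""
    limit = LIMIT_OF_CONFIDENCE[pollutant]
    below = sum(1 for value in values if value < limit)
    return below / len(values)


# Hypothetical hourly PM2.5 readings in ug/m3 (made-up numbers, not study data)
pm25_readings = [4.9, 8.4, 6.7, 12.0, 18.5, 22.1, 3.2, 25.0]
print(f"{fraction_below_limit(pm25_readings, 'PM2.5'):.0%} of readings below the limit")
```

A high fraction signals that most of the training data carry extra measurement uncertainty that a point-prediction model does not see.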
Table 3 lists the traffic counts provided by the city of Kitchener near Pods 1 and 2. Results indicate that traffic volume did not change significantly (p-value >0.05) between the two years at the two locations. That said, for Pod 1 there were cases when traffic counts were higher in 2021 (when lockdowns were relaxed) than in the 2020 period, and for Pod 2 (located near a main road), there were instances where the traffic counts were higher in 2020 than in 2021. While evidence from the p-values alone cannot distinguish these differences from normal day-to-day variation in traffic flow, upon further examination of the local context provided by the city of Kitchener, we found that a construction project had been launched near Pod 2 during the lockdown in 2020. Hence, the observed counts for 2020 in Table 3 originate from a combination of residents and construction vehicles passing through the roadside maintenance site. This was a long-term construction project that evolved over time. In fall 2021, the project expanded to nearby roads, resulting in certain routes being closed off, causing vehicles and buses to deviate from their regular routes via detours, ultimately resulting in missed traffic counts at that location.
Table 3
Variable | Median 2020 | Median 2021 | p-Value calculated a |
---|---|---|---|
Pod 1 | |||
Cars | 25 | 29 | 0.34 |
Vans/pickups | 2 | 2 | 1 |
Buses/trucks | 1 | 1 | 1 |
Motorcycles | 0 | 0 | 1 |
Total traffic | 28 | 32 | 0.36 |
Pod 2 | |||
Cars | 26 | 23 | 0.41 |
Vans/pickups | 3 | 2 | 0.65 |
Buses/trucks | 3 | 3 | 1 |
Motorcycles | 0 | 0 | 1 |
Total traffic | 32 | 29 | 0.44 |

a Calculated p-values >0.05 indicate no significant variations between the two compared years. Variations in the total counts are attributed to rounding medians to whole numbers.
The RF model was used to determine the importance of meteorological factors and traffic on pollutant concentrations. The model relayed this importance through the calculation of %IncMSE, where higher percentages indicate a greater importance attributed to the influence of the variable under study.39 The %IncMSE are shown for each meteorological and traffic variable for years 2020 and 2021 on each pollutant concentration for Pod 1 in Fig. 4 and Pod 2 in ESI Fig. S5.† Also displayed are 95% confidence intervals for %IncMSE.51 Both pod locations used in this study showed several similarities. For example, large values of %IncMSE often come with the largest 95% confidence intervals, and these error bars are usually skewed towards lower values. This is because %IncMSE is very sensitive to how well the RF model performs on the extreme observations it has most difficulty predicting. Therefore, large values of %IncMSE can often be due to a handful of extreme values. However, since the error bars are obtained via subsampling,51 these extreme values become much harder to predict, leading to a drop in %IncMSE. While the sample sizes obtained were sufficiently large to justify the calculation of error bars via subsampling,51 the asymmetry issue is mitigated with larger sample sizes, thus underscoring the need to obtain larger datasets in future studies. Accounting for only the statistically significant %IncMSE values (those with error bars above zero), Fig. 4 shows that the most important meteorological variables are temperature, relative humidity, pressure, and solar irradiance. This result seems to be in line with the literature review,52,53 where both temperature and pressure play a significant role in facilitating the formation of secondary pollutants.
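The contrast between a large importance value driven by a few extreme observations and a small but consistent one can be illustrated with a naive subsampling interval. The sketch below is a simplified percentile version written for illustration. It is an assumption, not the procedure of ref. 51, which may rescale the subsample estimates, and the per-observation "contributions" are made-up numbers standing in for %IncMSE.

```python
import random
from statistics import fmean


def subsample_interval(data, statistic, subsample_size, n_subsamples=500, level=0.95, seed=0):
    """Naive percentile interval of a statistic over subsamples drawn without replacement."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(rng.sample(data, subsample_size)) for _ in range(n_subsamples)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * (n_subsamples - 1))]
    upper = estimates[int(round((1.0 - tail) * (n_subsamples - 1)))]
    return lower, upper


def is_significant(interval):
    """Significance rule from the text: the whole error bar lies above zero."""
    return interval[0] > 0.0


# Made-up per-observation contributions to an importance metric
few_extremes = [0.0] * 95 + [50.0] * 5   # full-sample mean 2.5, driven by 5 points
small_consistent = [0.5] * 100           # full-sample mean 0.5, same everywhere

for label, data in [("few extremes", few_extremes), ("small but consistent", small_consistent)]:
    interval = subsample_interval(data, fmean, subsample_size=30)
    print(label, interval, "significant" if is_significant(interval) else "not significant")
```

Many subsamples miss the five extreme points entirely, so the interval for the first dataset reaches down to zero even though its full-sample mean is five times larger than that of the second.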
Fig. 4A and B show the calculated %IncMSE for NO2 in fall 2020 and 2021, respectively for Pod 1. There were significant variations in meteorological values between the two years as shown in Table 1, and hence they remain the dominant variables affecting NO2 levels. However, aside from temperature, the only statistically significant %IncMSE values are for total traffic and cars. While the %IncMSE value for these traffic variables is small compared to that of pressure, relative humidity, and wind speed, the fact that the error bars are also small and above zero indicates that the influence of these variables is not merely driven by predictions on a handful of extreme observations, as the large error bars on the aforementioned meteorological variables would suggest. Also, for the NO2 data, the %IncMSE calculated for solar radiation was very close to those calculated for traffic-related variables in Fig. 4 and ESI Fig. S5.† This observation is likely due to the influence of solar radiation on the photochemical decomposition of NO2 to form nitric oxide and free oxygen radicals,34,54 the latter of which goes on to form O3.55
Fig. 4C and D show the calculated %IncMSE for each meteorological and traffic-related variable to assess its importance for O3 levels recorded by Pod 1 in fall 2020 and 2021, respectively. It appears that O3 levels are mainly dependent on meteorology, namely relative humidity, solar radiation, and pressure. As mentioned earlier, the formation of ground level O3 is mainly driven by the photochemical decomposition and reaction of NO2 and VOCs.20,54,55 As for %IncMSE values for Pod 2 (Fig. S5†), the only significant meteorological variable for predicting O3 is solar radiation. For the traffic variables, small but statistically significant %IncMSE values for total traffic, cars, and buses/trucks are found in Pod 1, and buses/trucks in Pod 2, again suggesting that the importance metric for these variables is not merely driven by a handful of extreme observations.
Fig. 4E and F show the calculated %IncMSE for each meteorological and traffic-related variable to assess its importance in predicting PM2.5 levels recorded by Pod 1 in fall 2020 and 2021, respectively. In this case, the most significant meteorological variable by %IncMSE is temperature. However, this result does not mean that temperature is the only variable responsible for the elevated levels of PM2.5, but rather that it interacts with one or more other variables to have a synergistic effect on the measured data. PM2.5 has primary and secondary sources (see Introduction above), where vehicles emit PM2.5 from the incomplete combustion of fuel, and the precursors, namely VOCs, react in the atmosphere to form fine particulates. Furthermore, the chemical and physical properties of these particulates are influenced by relative humidity,56 which has a statistically significant value of %IncMSE at Pod 2 in 2021 (Fig. S5†). While total traffic is significant at Pod 1 in 2020 and cars are significant at Pod 2 in 2021, it remains unclear how influential traffic variables are on the measured values of PM2.5. Additional studies using different time periods are needed prior to drawing any definite conclusions.
Fig. 4G and H show the calculated %IncMSE for each meteorological and traffic-related variable to assess its importance in predicting CO levels recorded by Pod 1 in fall 2020 and 2021, respectively. Carbon monoxide is directly emitted from the combustion of fuel from vehicles much like NO, a precursor to NO2.57,58 This pollutant is relatively short-lived in the atmosphere and participates in atmospheric reactions to form CO2. The lower CO levels in fall 2021 relative to 2020 listed in Table 2 suggest additional sinks for CO becoming important such as reactions with hydroxyl radical,12,13 which in the presence of light and NOx can lead to O3 formation. An interesting observation in Fig. 4G and H is that %IncMSE for total traffic and cars is larger and more significant in 2021 than 2020. However, this is not due to a statistically significant increase in the corresponding traffic counts, according to the p-value calculations in Table 3. Rather, it is due to the fact that CO has far more extreme values in 2020 than 2021 (Fig. 2D). To explain this in more detail, ESI Fig. S6† plots the CO measurements against total traffic and car counts for 2020 and 2021. Also pictured in each plot is the LOESS curve of the RF model predictions against the given variable. In Fig. S6I and M† (Pod 1, 2020), there are several extreme values of CO (around 25 total traffic and 20 car counts, respectively) which are far above the LOESS curve. Those extreme values still exist in Fig. S6J and N† (Pod 1, 2021), but they are much closer to the LOESS curve. Thus, the predictions attributable to these traffic variables were better in 2021 than 2020, hence the larger %IncMSE. More interesting still is that the story in Pod 2 is essentially reversed. That is, cars drop significantly in %IncMSE from 2020 to 2021. At first, it might seem that it is due to changes in important meteorological variables between 2020 and 2021 such as temperature and wind speed, which affect the relative importance of cars.
A fuller explanation is offered by Fig. S6.† Once again, the extreme values of CO are much closer to the LOESS curves in 2021 than 2020, as shown when comparing Fig. S6L and P to K and O, respectively.† This explains the increase in %IncMSE for temperature. However, %IncMSE is a combined measure of distance from the mean trend and departure of the mean trend from zero (if the mean trend of a variable is zero, then it cannot have a predictive effect on the outcome). Thus, for cars and wind speed, it appears that the smaller distances from the LOESS curve in 2021 are offset by the larger variation in the curve in 2020, resulting in an overall decrease in %IncMSE. Additional studies during high pollution events where traffic counts are more varied between the periods of study are needed to better understand the significance of traffic on CO levels.
A recent study50 conducted in Los Angeles used a similar RF model to quantify the impacts that TRAP and meteorology have on pollutant levels during the early days of the COVID-19 lockdown period. This RF model was more comprehensive, and included higher specificity for vehicle types, fuel used, miles travelled, etc. The results highlighted that pollutant levels showed a significant decline during the studied period compared to previous years. Furthermore, “heavy-duty trucks”, or vehicles used to transport resources, were the biggest contributor to pollutant levels during the lockdown period. When comparing this study, which took place in a large city, to the findings presented here (a medium-sized city), there was some overlap in the findings: (1) traffic appears to play a statistically significant role in the levels of NO2, O3, and PM2.5, and (2) meteorology has the biggest influence on all pollutants, evident in the high ranking of importance for its variables in both studies. While direct comparisons of variable importance rankings are not possible since the Los Angeles study used a different importance metric from ours, the main methodological difference between our studies is that ours computes the statistical significance of the variable importance metric (via whether the error bars contain zero), whereas the Los Angeles study does not. This allowed us to conclude that even the relatively small %IncMSE values for several traffic variables were statistically significant, whereas much larger %IncMSE values for a number of meteorological variables were not. Upon taking statistical significance into account, our study found that traffic variables had a more discernible influence than many meteorological variables, whereas in the Los Angeles study, based on the magnitude of variable importance metrics alone, the meteorological variables were almost always the dominant factors for predicting pollutant levels.
Machine learning algorithms offer a new way to analyze air pollution data in relation to meteorology and traffic. Scaling up the project presented here would be feasible, as the computational power required to run the machine learning code and p-value calculations is fairly low. Future studies with traffic data that run over a longer time frame would be beneficial and could be used to predict pollution levels based on changes to meteorological and traffic data. Additionally, a larger dataset will not only improve the accuracy of the model but also allow for more definite conclusions to be made regarding pollutants and variable metrics that hover near the statistical significance threshold. The analysis presented here also examines the contrast between the variables that influence air quality in large and mid-size cities. This contrast suggests that local contexts matter in drafting bylaws and regulations to lower emissions and minimize exposure of citizens to air pollutants. Electrifying the transport system in mid-size cities experiencing population growth would ensure that TRAP would have lower importance than meteorology. Continuous monitoring is highly recommended for regular assessment of seasonal and human influences.
Footnote
† Electronic supplementary information (ESI) available: Detailed experimental procedures, and figures and tables showing data analysis. See DOI: https://doi.org/10.1039/d2ea00084a
This journal is © The Royal Society of Chemistry 2022