Leonardo Y. Kamigauti *ab, Gabriel M. P. Perez cd, Thomas C. M. Martin bd, Maria de Fatima Andrade b and Prashant Kumar ae
a Departamento de Ciências Atmosféricas, Universidade de São Paulo, Brazil. E-mail: leonardo.kamigauti@usp.br
b Global Centre for Clean Air Research (GCARE), School of Sustainability, Civil and Environmental Engineering, Faculty of Engineering & Physical Sciences, University of Surrey, Guildford GU2 7XH, Surrey, UK
c Department of Meteorology, University of Reading, UK
d MeteoIA, São Paulo, Brazil
e Institute for Sustainability, University of Surrey, Guildford GU2 7XH, Surrey, UK
First published on 25th January 2024
Ensuring environmental justice necessitates equitable access to air quality data, particularly for vulnerable communities. However, traditional air quality data from reference monitors can be costly and challenging to interpret without in-depth knowledge of local meteorology. Low-cost monitors present an opportunity to enhance data availability in developing countries and enable the establishment of local monitoring networks. While machine learning models have shown promise in atmospheric dispersion modelling, many existing approaches rely on complementary data sources that are inaccessible in low-income areas, such as smartphone tracking and real-time traffic monitoring. This study addresses these limitations by introducing deep learning-based models for particulate matter dispersion at the neighbourhood scale. The models utilize data from low-cost monitors and widely available free datasets, delivering root mean square errors (RMSE) below 2.9 μg m−3 for PM1, PM2.5, and PM10. The sensitivity analysis shows that the most important inputs to the models were the nearby monitors' PM concentrations, boundary layer dissipation and height, and precipitation variables. The models presented different sensitivities to each road type, and an RMSE below the regional differences, evidencing the learning of the spatial dependencies. This breakthrough paves the way for applications in various vulnerable localities, significantly improving air pollution data accessibility and contributing to environmental justice. Moreover, this work sets the stage for future research endeavours in refining the models and expanding data accessibility using alternative sources.
Environmental significance
In the field of air quality and machine learning, most research focuses on places with abundant data, often sidelining regions with limited resources like low-income countries and cities. This happens because better results are often achieved when using local-specific datasets. Our study aims to balance this by creating detailed maps of particle distribution in Woking, UK. We used deep learning and easily available datasets like ERA5's global reanalysis and local road data from Ordnance Survey, along with affordable Plantower PM sensors. Despite some limitations in how well these datasets match the location or how reliable they are, our model performed impressively, with an RMSE of less than 2.9 μg m−3. Our paper explains different strategies we used to handle data gaps, showing that powerful machine learning can work even when resources are limited.
In terms of data availability, machine learning (ML) solutions have been developed in recent years to spatially predict air pollution dispersion and other atmospheric properties. These ML models leverage reference pollutant monitoring networks along with supplementary datasets specific to each city, thereby reducing the reliance on a high number of monitors in a given location. Hu et al.5 compared a wide range of ML models for carbon monoxide spatial inference. The features used in their models included carbon monoxide concentration, geographic coordinates, hour of the day, day of the week, and season. Support Vector Regression exhibited the best overall performance in their evaluation. Similarly, Song et al.6 emphasized the importance of advanced feature engineering and utilized gradient-boosted decision trees with real-time traffic conditions and social media usage data. More recently, Martin et al.7 employed modern neural network models to downscale meteorological variables, demonstrating the applicability of their approach to air pollution variables. Their method involved principal component analysis (PCA) on atmospheric variable observations, enabling spatial and temporal predictions through the separation of loadings and scores. The best-performing ML model in their study was an artificial neural network (ANN) configured with two fully connected hidden layers, employing rectified linear units (ReLU) and a dropout layer for regularization.
By modelling the relationship between reference monitors and complementary datasets containing information about local pollution sources, meteorological conditions, and background pollution, ML approaches improve the spatial resolution and accuracy of air pollutant concentration estimates. Martin et al.7 showed that Artificial Neural Network (ANN) and Extreme Quantile Mapping (EQM) techniques significantly improve the prediction of extreme events. These methods have been particularly effective in capturing the variability associated with events like the formation of intense cold air pooling or heavy precipitation in valleys. However, existing ML air pollution models rely heavily on extensive city-specific data, such as smartphone data and real-time traffic information, which are often unavailable in low-income locations. This limitation is evident in the prominent ML models developed thus far.
In this article, we showcase the application of recent ML techniques using particulate matter (PM) measurements obtained from an LCM network managed by a local community in Woking, United Kingdom. We trained an ANN model to learn the relationships between nearby PM concentrations, meteorology and nearby roads using the LCM network, the ECMWF's ERA5 reanalysis dataset8 for meteorological variables and the UK government's local roads data. This allowed the calculation of PM concentrations at 200 points across Woking. Unlike previous air pollution studies employing ML, our proposed approach relies solely on widely available datasets, thereby facilitating the replication of our methodology in other locations and advancing the cause of data and environmental justice. While our study site is situated in a prosperous country, we selected Woking due to its status as a small town with identifiable emission sources, which offers an ideal setting for evaluating ML models. Importantly, our approach is not limited to LCM networks but can be readily applied to other networks comprising reference monitors.
(1) |
(2) |
The feature engineering process was applied to the training and test datasets separately to ensure that no information leaked between them, i.e., the features of the evaluation dataset contain no information about the training monitors, and vice versa. In total, 151 features from different sources, related to spatial and temporal information, were used as the input of the model (Table S2†). For the final PM distribution map over the study area, the map features were calculated for each point of a 10 × 20 grid of 600 m resolution.
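The 10 × 20 map grid described above can be sketched as follows. This is only an illustration: the function name `make_grid`, the origin at (0, 0), the use of projected coordinates in metres, and the choice of 20 columns by 10 rows (rather than the reverse) are assumptions not stated in the text.

```python
def make_grid(x0, y0, nx=20, ny=10, spacing_m=600.0):
    """Return the (x, y) points of a regular grid, row by row.

    x0, y0 are the coordinates (in metres, assumed projected) of the
    lower-left grid point; both are hypothetical inputs.
    """
    return [(x0 + i * spacing_m, y0 + j * spacing_m)
            for j in range(ny) for i in range(nx)]


# Toy origin; a real application would use the study area's corner.
grid = make_grid(0.0, 0.0)
print(len(grid))  # number of map points where features are computed
```

Each grid point would then receive the same feature set as a monitor location before being passed to the trained model.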
The temporal features were obtained at one-hour frequency and then resampled to one-day frequency using a 24-hour average. The temporal-only features were: (i) day of the week, (ii) month, (iii) ERA5 variables, and (iv) the average PM concentration of the monitors. The ERA5 data point nearest to the city centre was chosen to represent the local meteorology. The 8 data points surrounding the city in the cardinal and ordinal directions were also input to the model.
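A minimal sketch of the hourly-to-daily resampling and of the calendar features is given below, using only the Python standard library. The function names, the handling of missing values (skipped here) and the toy series are assumptions made for illustration; the original processing pipeline is not described at this level of detail.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def daily_mean(hourly):
    """Average (timestamp, value) pairs by calendar day.

    Missing values (None) are skipped; this is an assumed convention.
    """
    buckets = defaultdict(list)
    for ts, value in hourly:
        if value is not None:
            buckets[ts.date()].append(value)
    return {day: sum(v) / len(v) for day, v in sorted(buckets.items())}


def calendar_features(day):
    """Day of the week (Monday = 0) and month for a given date."""
    return {"day_of_week": day.weekday(), "month": day.month}


# Toy hourly series covering two days, with value equal to the hour of day.
start = datetime(2024, 1, 1)
hourly = [(start + timedelta(hours=h), float(h % 24)) for h in range(48)]
daily = daily_mean(hourly)
features = {day: calendar_features(day) for day in daily}
```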
The spatiotemporal features comprised the PM concentrations of the 3 nearest monitors (in all PM sizes; excluding the monitor itself in the training dataset), their distances, the angles between the point and the monitors, and a parameter of concordance between the wind direction and the monitor angle (0 if the monitor is downwind and 1 if it is upwind); the IDW interpolation of the 3 nearest monitors (weight = 2, with the same considerations as before); and the difference between the average PM concentration and the IDW estimate. The PM concentrations were used as the target variable to train and evaluate the models.
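The two spatiotemporal features above can be illustrated with a short sketch. It assumes that "weight = 2" denotes the exponent of the inverse distance weights, that coordinates are planar, and that the wind direction follows the meteorological convention (the direction the wind blows from). The smooth cosine form of the concordance parameter is also an assumption; the text only fixes its end values (0 downwind, 1 upwind).

```python
import math


def idw(point, monitors, k=3, power=2.0):
    """Inverse distance weighted estimate from the k nearest monitors.

    monitors is a list of ((x, y), concentration); a monitor located
    exactly at `point` is excluded, mirroring the training setup.
    """
    ranked = sorted(
        (math.dist(point, xy), conc) for xy, conc in monitors if xy != point
    )[:k]
    weights = [1.0 / d ** power for d, _ in ranked]
    return sum(w * c for w, (_, c) in zip(weights, ranked)) / sum(weights)


def wind_concordance(point, monitor_xy, wind_from_deg):
    """1 when the monitor lies upwind of the point, 0 when downwind."""
    dx, dy = monitor_xy[0] - point[0], monitor_xy[1] - point[1]
    bearing = math.degrees(math.atan2(dx, dy)) % 360.0  # clockwise from north
    return 0.5 * (1.0 + math.cos(math.radians(bearing - wind_from_deg)))


# Toy monitors: coordinates in arbitrary planar units, concentrations in ug/m3.
monitors = [((1.0, 0.0), 10.0), ((0.0, 2.0), 20.0),
            ((3.0, 0.0), 30.0), ((10.0, 10.0), 100.0)]
estimate = idw((0.0, 0.0), monitors)
```

With an exponent of 2, the nearest monitor dominates the estimate, so the difference between the network average and the IDW value carries information about local departures from the background.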
The spatial-only features were the road densities in a 200 m2 square around each monitor, subdivided by road type. The OS road data has 8 categories based on road usage: A Road, B Road, Local Access Road, Local Road, Minor Road, Motorway, Restricted Local Access Road, and Secondary Access Road. The data were available as vector files covering the areas SU95, SU95, TQ05, and TQ06 of the Ordnance Survey National Grid reference system. Each category vector was rasterized to a 4 m resolution grid containing the count of roads of that category in each pixel. Then, the pixels in a 200 m2 square around each monitor were summed. This number is proportional to the local road density of each category.
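The window sum over the rasterized roads can be sketched as below. The raster is represented as a plain list of lists, and the window is given as a half-width in pixels because the side length implied by "200 m2" is not unambiguous in the text; both choices, and the function name, are assumptions for illustration.

```python
def road_density(raster, row, col, half_width_px):
    """Sum the raster cells in a square window centred on (row, col).

    The window is clipped at the raster edges.
    """
    n_rows, n_cols = len(raster), len(raster[0])
    r0, r1 = max(0, row - half_width_px), min(n_rows, row + half_width_px + 1)
    c0, c1 = max(0, col - half_width_px), min(n_cols, col + half_width_px + 1)
    return sum(sum(raster[r][c0:c1]) for r in range(r0, r1))


# Toy 5 x 5 raster for one road category: a single road along row 2.
minor_road = [[0] * 5 for _ in range(5)]
minor_road[2] = [1] * 5

# One feature per road category; only one hypothetical category is shown.
rasters = {"Minor Road": minor_road}
features = {cat: road_density(r, 2, 2, 1) for cat, r in rasters.items()}
```

Repeating the sum for each of the 8 category rasters yields 8 road features per monitor or grid point.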
Model evaluation employed the RMSE, the symmetric mean absolute percentage error (SMAPE), the coefficient of determination (R2), and Pearson's r. A visual comparison of the model's PM distribution with the monitors' data was performed as a sanity test. The sensitivity of the model to each input feature was estimated using a one-at-a-time (OAT) sensitivity analysis adapted from Loucks et al.,15 which provides insights into feature importance and helps guide future model design. More details on the OAT analysis can be found in the ESI.† With y the target value, ŷ the calculated value, and n the number of samples, the model evaluation metrics are calculated as follows:
(3) |
(4) |
(5) |
(6) |
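For reference, the four metrics can be computed as in the sketch below, which uses their common textbook definitions. These are assumptions rather than a transcription of eqn (3)–(6); in particular, SMAPE has several variants, and the one shown divides the absolute error by the mean of |y| and |ŷ|.

```python
import math


def rmse(y, yhat):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y, yhat)) / len(y))


def smape(y, yhat):
    """Symmetric MAPE in percent (assumed variant, range 0 to 200)."""
    total = 0.0
    for t, p in zip(y, yhat):
        denom = (abs(t) + abs(p)) / 2.0
        if denom > 0:  # a pair of zeros contributes no error
            total += abs(p - t) / denom
    return 100.0 * total / len(y)


def r2(y, yhat):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, yhat))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


def pearson_r(y, yhat):
    """Pearson's linear correlation coefficient."""
    n = len(y)
    my, mp = sum(y) / n, sum(yhat) / n
    cov = sum((t - my) * (p - mp) for t, p in zip(y, yhat))
    var_y = sum((t - my) ** 2 for t in y)
    var_p = sum((p - mp) ** 2 for p in yhat)
    return cov / math.sqrt(var_y * var_p)


# Toy data: predictions offset by a constant of 1.
y = [1.0, 2.0, 3.0, 4.0]
yhat = [2.0, 3.0, 4.0, 5.0]
```

The toy data illustrate why the metrics are reported together: a constant offset leaves Pearson's r at 1 while lowering R2 and raising the RMSE and SMAPE.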
Model | RMSE [μg m−3] | SMAPE [%] | R2 [dimensionless] | r [dimensionless] |
---|---|---|---|---|
PM1 | 2.57 | 25.31 | 0.88 | 0.95 |
PM2.5 | 2.89 | 26.82 | 0.90 | 0.95 |
PM10 | 2.36 | 22.23 | 0.93 | 0.97 |
The RMSE is within the spatial differences in the dataset, indicating a capacity to distinguish spatial information, especially in highly polluted scenarios, which is relevant for issuing air pollution alerts to the population. The SMAPE indicates a bias in the model, namely an underestimation of the concentrations. The R2 shows that most of the variance of the data is explained by the model, which indicates that the model can describe the major processes that dominate the variance. Pearson's r indicates a strong linear correlation between our model and the evaluation data. In comparison to the model developed by Song et al.,16 which achieved an RMSE of 13.17 μg m−3, a SMAPE of 14.65%, and an R2 of 0.91 for PM2.5, the models developed in our study achieved an RMSE more than 10 μg m−3 lower and a similar R2 (except for PM1). However, the SMAPE of our models was on average 10.14 percentage points higher than that of Song et al.'s model. It is important to note that the average PM2.5 concentration in their study is around 40 μg m−3, as estimated by Song et al., whereas the concentration in our study is almost four times lower.
The sanity test shows the ability of the model to replicate the overall shape of the weekly variation curve for all monitors and PM sizes (Fig. 4). There is no clear difference between the evaluation and training data in this test. In agreement with the SMAPE, the model consistently underestimates the concentrations, with no apparent dependence on the day of the week or the PM concentration. The differences between the model and the measurements are within the RMSE in all cases.
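The figures quoted in the comparison with Song et al. can be checked with a few lines of arithmetic. The values are taken from the table of metrics and from the text; the variable names are illustrative only.

```python
# SMAPE values of the three models (percent), from the metrics table.
our_smape = {"PM1": 25.31, "PM2.5": 26.82, "PM10": 22.23}
song_smape = 14.65  # percent, PM2.5 model of Song et al.

mean_smape = sum(our_smape.values()) / len(our_smape)
smape_gap = mean_smape - song_smape  # percentage points

# RMSE of the two PM2.5 models (ug/m3).
rmse_gap = 13.17 - 2.89

print(round(smape_gap, 2), round(rmse_gap, 2))
```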
Fig. 4 Comparison of the weekly variation of PM1 between the model and the dataset. Stations 2, 7 and 8 are the evaluation stations. The others belong to the training set.
The OAT analysis is not sensitive to non-linear relationships between the model features. Therefore, physical interpretations of the models' internal reasoning need to be made cautiously. However, such interpretations are valuable for future feature engineering in related applications. The sensitivity analysis (Fig. S17–S19†) indicates that all models used the monitors dataset as the primary source of the final prediction. This is expected because the monitor data are the most direct information on the local PM concentration. The most influential meteorological variables were the boundary layer dissipation (BLD), boundary layer height (BLH), total precipitation, and mean total precipitation rate. The BLD is the amount of kinetic energy converted to heat due to turbulence inside the boundary layer. This turbulence is related to the mixing rate of the PM, which influences how much a local event spreads and dilutes, not considering the transport by wind. The BLD also influences the pollutant exchange between the stable boundary layer and the residual layer formed at night, which can trap nocturnal emissions closer to the ground. This is especially problematic as the local community burns wood in fireplaces for heating in the winter. The BLH dictates the volume of atmosphere available for the dispersion of pollutants, directly influencing the concentration at the surface. Precipitation governs the wash-out process, being directly related to PM removal, especially for larger particle sizes.
Regarding the differences between the models in the OAT analysis, for the PM1 model the northeast meridional wind speed was especially influential, being comparable to the interpolation of PM2.5 and the average PM1. This shows a high sensitivity of PM1 to wind speed, indicating that the transport of PM1 from distant sources is highly important. The northeast direction is the closest to London, which is potentially the main distant source for the region. The PM2.5 model was especially sensitive to northwest total precipitation, with an influence close to that of the second nearest PM10 concentration and the minor roads. Total precipitation is expected to be influential in the model; however, it is not clear why the northwest region had the most impact. The PM10 model also ranked a meteorological variable among the PM concentrations: boundary layer dissipation at the northeast and southwest points fell between the interpolation of PM2.5 and the nearest PM10 concentration. The models were less sensitive to the road data than to the monitors' data, and the sensitivities differed more among the road types than among the monitors' features. This is expected, as the monitor data are directly related to PM concentrations, whereas roads are an indirect source of information. The larger differences among road types indicate that the model learned the different relations that the roads can have with the monitors' data. The PM1 and PM2.5 models were more sensitive to minor roads and B roads. This can be attributed to more variable accelerations on the smaller roads and to the residential zones of the city. The PM10 model was sensitive to motorways, B roads, and A roads. This can be attributed to soil resuspension by heavy vehicles travelling at high velocity on higher-speed roads.
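The idea behind a one-at-a-time analysis can be illustrated with the generic sketch below. It is not the adaptation of Loucks et al. used in this work, whose details are in the ESI: the perturbation size, the use of the mean absolute change in the prediction as the sensitivity score, the toy linear model and the feature names are all assumptions.

```python
def oat_sensitivity(model, samples, delta=0.1):
    """Mean absolute change in the prediction when each feature is
    perturbed by a relative amount `delta`, one feature at a time.

    model is any callable mapping a feature dict to a number;
    samples is a list of feature dicts sharing the same keys.
    """
    baseline = [model(s) for s in samples]
    scores = {}
    for name in samples[0]:
        total = 0.0
        for sample, base in zip(samples, baseline):
            perturbed = dict(sample)  # all other features stay fixed
            perturbed[name] = sample[name] * (1.0 + delta)
            total += abs(model(perturbed) - base)
        scores[name] = total / len(samples)
    return scores


# Hypothetical stand-in for the trained ANN: a simple linear function.
def toy_model(f):
    return 2.0 * f["nearest_pm"] + 0.5 * f["blh"]


samples = [{"nearest_pm": 10.0, "blh": 4.0}, {"nearest_pm": 20.0, "blh": 8.0}]
scores = oat_sensitivity(toy_model, samples)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Because every other feature is held fixed during each perturbation, a score of this kind cannot reveal interactions between features, which is why the physical interpretations above are treated with caution.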
Fig. 5 Average PM1 (a), PM2.5 (b), and PM10 (c) over Woking, UK in the timeframe of the study. The training monitors' locations are represented by blue dots and the evaluation monitors are in red.
PM2.5 and PM10 have similar spatial distributions because their concentrations are similar (there is a low mass of particles between 2.5 and 10 μm). Their peak concentration is at East Hill/College Road, which leads to Woking's shopping centre and to a bridge that crosses the train line dividing the city. It is important to note that the city centre bridge (Victoria Arch) was blocked due to the Victoria Arch Widening Scheme. The traffic was diverted to the bridge near the peak concentrations of PM2.5 and PM10 and to the bridge in Triggs Ln. in the west of the city centre. There are also local maxima near the A3 and M25 highways. These pollution maps, showing pollution hot spots near roads and traffic intersections, are consistent with the findings of other authors.21 The concentration values are compatible with the PM model that the local authorities commissioned from CERC in 2019;18–20 however, the spatial distribution is different, lacking a higher concentration around the Victoria Arch bridge. A probable explanation is that CERC used a model based on Surrey's Department of Transport Traffic Model,18 which was fed with data from Surrey Traffic Surveys. However, in the region of Victoria Arch, there were only four days of data collection in 2019 (13 and 15 May, 9 and 11 September22).
Previous configurations generated many models with worse performance metrics. The main cause of failure was the use of a one-hour timestep. Figures of the evaluation of the best model trained with a one-hour timestep are presented in the ESI.† The low R2 (0.54) and high SMAPE (53.56%) indicate that the model could not account for major sources of variability in the physical system. The sanity tests revealed that the model could not reproduce the amplitude of the hourly variation, severely underestimating the concentration of the pollutant during the night. This may be caused by imprecision in the ERA5 planetary boundary layer height data, which are not derived from direct measurements. The use of a one-day timestep reduced the complexity of the model while maintaining its usability, as most proposed air pollution limits use daily intervals.
In the case of Woking, UK, our model successfully identified areas with high pollution levels associated with local traffic. However, it is important to note that achieving these results required multiple iterations and adjustments. We identified the influence of the planetary boundary layer as a significant challenge for the model, despite incorporating the layer height information from the ERA5 dataset. This indicates that the model is sensitive to errors in the feature dataset. Future versions of the model will also focus on improving data accessibility by using the OpenStreetMap road database instead of the UK-focused OS dataset. Additionally, we plan to investigate the model's sensitivity and experiment with different preprocessing techniques prior to data input, aiming for further performance improvements.
Footnote
† Electronic supplementary information (ESI) available: Additional dataset details, software description, plots of the PM models for PM2.5 and PM10, sensitivity analysis, and a description of the early version of the model discussed in Section 4.4 (DOC). See DOI: https://doi.org/10.1039/d3ea00126a
This journal is © The Royal Society of Chemistry 2024