Enhancing spatial inference of air pollution using machine learning techniques with low-cost monitors in data-limited scenarios

Ensuring environmental justice necessitates equitable access to air quality data, particularly for vulnerable communities. However, traditional air quality data from reference monitors can be costly and challenging to interpret without in-depth knowledge of local meteorology. Low-cost monitors present an opportunity to enhance data availability in developing countries and enable the establishment of local monitoring networks. While machine learning models have shown promise in atmospheric dispersion modelling, many existing approaches rely on complementary data sources that are inaccessible in low-income areas, such as smartphone tracking and real-time traffic monitoring. This study addresses these limitations by introducing deep learning-based models for particulate matter dispersion at the neighbourhood scale. The models utilize data from low-cost monitors and widely available free datasets, delivering root mean square errors (RMSE) below 2.9 μg m−3 for PM1, PM2.5, and PM10. The sensitivity analysis shows that the most important inputs to the models were the nearby monitors' PM concentrations, boundary layer dissipation and height, and precipitation variables. The models presented different sensitivities to each road type, and an RMSE below the regional differences, evidencing the learning of the spatial dependencies. This breakthrough paves the way for applications in various vulnerable localities, significantly improving air pollution data accessibility and contributing to environmental justice. Moreover, this work sets the stage for future research endeavours in refining the models and expanding data accessibility using alternative sources.

A calibration was performed to improve the RMSE of the PM measurements (Figures S6, S7, and S8). We used the calibration data from the ENVILUTION® chamber characterization. The PM measurement is reported in modes, which contain redundant information about the particle sizes. For this reason, we calculated the concentrations in size ranges rather than in size modes. The ranges used were 0-1, 1-2.5, and 2.5-10 μm (PM0-1, PM1-2.5, and PM2.5-10, respectively): the first is equivalent to PM1, the second is obtained by subtracting PM1 from PM2.5, and the third by subtracting PM2.5 from PM10. The calibration was performed by applying a ridge polynomial regression using the following equation:

$$\mathrm{PM}_{ref} = X_{P3}\, w$$

where $X_{P3}$ is the scaled matrix of the polynomial combinations, without the constant term, of the particulate matter, temperature, and relative humidity up to degree 3, $w$ is the matrix of ridge coefficients, and $\mathrm{PM}_{ref}$ is the particulate matter concentration measured by the reference sensor. The scaling was performed by subtracting the mean and dividing by the standard deviation. The ridge coefficients were calculated using the following equation:

$$\min_{w} \; \lVert X_{P3}\, w - \mathrm{PM}_{ref} \rVert_2^2 + \alpha \lVert w \rVert_2^2$$

where $\alpha$ is the complexity parameter. The second term of the equation constrains the minimization, forcing it to keep the regression coefficients as low as possible, which is reasonable given the physics underlying the problem. The complexity parameter prevents overfitting by restricting the size of the ridge coefficients. We selected α as 30 for PM2.5-10, 10 for PM1-2.5, and 5 for PM0-1 to reach a good compromise between representing small variances and improving the representation of peaks without overfitting the data. A weight vector was applied to the samples of the measurement matrix in order to prioritise the data within the expected range of the local background. The weights were set to 1 for samples whose PM10 concentrations were lower than 20 μg/m³ and to 0.5 for the rest.

Features description and sources
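The ridge polynomial calibration of the low-cost monitors described earlier in this supplement can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`polynomial_features`, `sample_weights`, `fit_calibration`) are hypothetical, the coefficients are obtained from the closed-form weighted ridge solution with NumPy, and the intercept handled through weighted centring is an assumption that the text does not state. The α values and the weighting rule (1 below 20 μg/m³ of PM10, 0.5 otherwise) are taken from the text.

```python
from itertools import combinations_with_replacement

import numpy as np

# Complexity parameters reported in the text for each size range.
ALPHAS = {"PM0-1": 5.0, "PM1-2.5": 10.0, "PM2.5-10": 30.0}


def polynomial_features(X, degree=3):
    """Monomials of degree 1..degree of the columns of X (no constant term)."""
    columns = [
        np.prod(X[:, list(idx)], axis=1)
        for d in range(1, degree + 1)
        for idx in combinations_with_replacement(range(X.shape[1]), d)
    ]
    return np.column_stack(columns)


def sample_weights(pm10, threshold=20.0):
    """Weight 1 for samples with PM10 below the threshold, 0.5 for the rest."""
    return np.where(pm10 < threshold, 1.0, 0.5)


def fit_calibration(pm_raw, temp, rh, pm_ref, pm10_raw, alpha):
    """Fit a weighted ridge regression on scaled degree-3 polynomial features.

    Returns the ridge coefficients and a function that calibrates new data.
    """
    X = polynomial_features(np.column_stack([pm_raw, temp, rh]))
    mean, std = X.mean(axis=0), X.std(axis=0)
    Xs = (X - mean) / std  # scaling: subtract the mean, divide by the std

    s = sample_weights(pm10_raw)
    # Assumption: an intercept is fitted implicitly by weighted centring.
    x_off = np.average(Xs, axis=0, weights=s)
    y_off = np.average(pm_ref, weights=s)
    Xc, yc = Xs - x_off, pm_ref - y_off

    # Closed-form solution of min_w sum_i s_i (x_i w - y_i)^2 + alpha ||w||^2.
    A = Xc.T @ (s[:, None] * Xc) + alpha * np.eye(Xc.shape[1])
    w = np.linalg.solve(A, Xc.T @ (s * yc))
    intercept = y_off - x_off @ w

    def calibrate(pm_new, temp_new, rh_new):
        X_new = polynomial_features(np.column_stack([pm_new, temp_new, rh_new]))
        return ((X_new - mean) / std) @ w + intercept

    return w, calibrate
```

Under these assumptions, one model would be fitted per size range, for example `fit_calibration(pm, temp, rh, pm_ref, pm10, ALPHAS["PM1-2.5"])`, where each argument is a one-dimensional array of chamber measurements.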

PM data description
In this section we present details of the PM concentrations measured by the low-cost monitors.
Figure S9 shows a comparison between the monitors' mean PM concentrations. The values are relative to the average of all monitors. Figure S10 shows the distribution of the PM10 concentrations in the dataset of each monitor. Table S3 presents the quartile distribution of the monitors' data. Figure S11 shows the calibrated PM concentration over time for each monitor. Table S4 presents a statistical description of the PM data by monitor. Figure S12 shows a heatmap of the average difference between each pair of monitors for each interquartile interval of the data, where the rows and columns correspond to the monitor numbers.

Figure S12: Average difference of PM1, PM2.5, and PM10 concentrations between the monitors for each interquartile interval of the data. The rows and columns correspond to the monitor numbers. The difference is expressed by the colour of the cell. The diagonal is zero because the difference between a monitor and itself is zero.

The ML models
To keep the manuscript at a reasonable length, the evaluation figures for the PM2.5 and PM10 models are presented here (Figures S13 and S14), alongside their sanity tests (Figures S15 and S16).

Sensitivity analysis
The algorithm used consists of the following steps:
1. Calculate the min, max and median values of the training set features.
2. Calculate the model output with the median values calculated in step 1.
3. Define a resolution constant (10 in this work).
4. Calculate an array of values equally spaced between the min and max values calculated in step 1, with a size equal to the constant defined in step 3.
5. For each feature, calculate the model output using the median values calculated in step 1, swapping the value of the target feature with each value of the array calculated in step 4.
6. For each feature, calculate the absolute difference between the output values calculated in step 5 and the output value calculated in step 2, and divide by the constant defined in step 3.
7. For each feature, sum the values calculated in step 6.
The values of the analysis cannot be directly applied in other studies, nor can they be used to extrapolate physical meaning. They can be used as a tool to rank the features in order of importance, considering only their linear influence on the model. The full analysis output is available in Figures S17, S18, and S19.
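The steps of the sensitivity analysis can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' code: the function name `sensitivity_scores` is hypothetical, and the model is assumed to be any callable that maps a list of feature values to a single number, which is a simplification of a trained neural network.

```python
from statistics import median


def sensitivity_scores(model, train_rows, resolution=10):
    """One-at-a-time sensitivity score per feature.

    model      -- callable taking a list of feature values, returning a float
                  (hypothetical interface, assumed for this sketch)
    train_rows -- training samples, one list of feature values per sample
    resolution -- number of equally spaced probe values per feature (>= 2)
    """
    columns = list(zip(*train_rows))
    lows = [min(col) for col in columns]          # step 1: min
    highs = [max(col) for col in columns]         # step 1: max
    medians = [median(col) for col in columns]    # step 1: median

    baseline = model(medians)                     # step 2

    scores = []
    for j, (low, high) in enumerate(zip(lows, highs)):
        # Step 4: equally spaced values between min and max (inclusive).
        step = (high - low) / (resolution - 1)
        grid = [low + k * step for k in range(resolution)]

        total = 0.0
        for value in grid:
            probe = list(medians)
            probe[j] = value                      # step 5: swap one feature
            # Step 6: absolute difference to the baseline, divided by the
            # resolution constant; step 7: accumulate the sum.
            total += abs(model(probe) - baseline) / resolution
        scores.append(total)
    return scores
```

Sorting the feature names by the returned scores, largest first, gives the importance ranking. As stated above, the scores only rank the features for a given model and should not be read as physical quantities.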

Figure S1: Chamber calibration data of the low-cost monitors. The variables inside the chamber (relative humidity, temperature, and particles) were individually and simultaneously controlled in order to create varied scenarios for calibration. Temperature and relative humidity were the variables chosen for calibration because they affect the sensors' structure and internal geometry through thermal expansion and contraction, and the particles' size through condensation.

Figure S2: Heatmap of the coefficient of determination of the uncalibrated low-cost monitors by the measured variables.

Figure S3: Heatmap of the root-mean-square error of the uncalibrated low-cost monitors by the measured variables.

Figure S4: Heatmap of the mean absolute percentage error of the uncalibrated low-cost monitors by the measured variables.

Figure S5: Heatmap of the Pearson's r of the uncalibrated low-cost monitors by the measured variables.

Figure S6: Calibration results for the monitors. The left graphs show the calibrated data and the right graphs show the original data.

Figure S7: Data of the uncalibrated PM monitors in relation to the reference sensor.

Figure S8: Data of the PM monitors in relation to the reference sensor after calibration.

Figure S9: Comparison of the mean difference between the PM concentrations and the average between the monitors for each monitor.

Figure S10: Histograms of the difference between the PM10 concentrations and the average between the monitors for each monitor.

Figure S11: PM concentration over time, from the calibrated monitors' data. The concentrations are in μg/m³. Each colour represents one monitor.

Figure S13: Comparison between the PM2.5 values predicted by the model (y axis) and the actual values (x axis) in the evaluation dataset. The colour of the dots is proportional to the density of points. The dashed line is the 1:1 line. The number of neurons per layer and the dropout rate are shown in the top-right corner in the format "[number of neurons, dropout rate]". The evaluation metrics are displayed over the figure, and the name of the model is below it.

Figure S14: Comparison between the PM10 values predicted by the model (y axis) and the actual values (x axis) in the evaluation dataset. The colour of the dots is proportional to the density of points. The dashed line is the 1:1 line. The number of neurons per layer and the dropout rate are shown in the top-right corner in the format "[number of neurons, dropout rate]". The evaluation metrics are displayed over the figure, and the name of the model is below it.

Figure S15: Comparison of the weekly variation of PM2.5 between the model and the dataset. Stations 2, 7 and 8 are the evaluation stations. The rest belong to the training set.

Figure S16: Comparison of the weekly variation of PM10 between the model and the dataset. Stations 2, 7 and 8 are the evaluation stations. The rest belong to the training set.

Figure S17: Sensitivity analysis of the PM1 model. (Continued below)

Figure S19: Sensitivity analysis of the PM10 model. (Continued below)

Figure S20: Comparison between the PM1 values predicted by the early version of the model (y axis) and the actual values (x axis) in the evaluation dataset. The colour of the dots is proportional to the density of points. The dashed line is the 1:1 line. The number of neurons per layer and the dropout rate are shown in the top-right corner in the format "[number of neurons, dropout rate]". The evaluation metrics are displayed over the figure, and the name of the model is below it.

Figure S21: Comparison of the daily (left) and weekly (right) variation of PM1 between the early version of the model and the dataset. Stations 2, 7 and 8 are the evaluation stations. The rest belong to the training set. The hour is in local time.

Table S1: Average scores of the uncalibrated monitors in relation to the reference sensor, with the standard deviation in parentheses.

Table S2: Features description and sources.

Table S3: Minimum, first, second, and third quartile, and maximum values of the PM data.

Table S4: Mean, median, and standard deviation of PM. The numbers are in a "mean / median (standard deviation)" format. The PM values are in μg/m³.