Khanh Do ab, Arash Kashfi Yeganeh ab, Ziqi Gao c and Cesunica E. Ivey *bd
a Department of Chemical and Environmental Engineering, University of California Riverside, Riverside, CA, USA. E-mail: iveyc@berkeley.edu
b Center for Environmental Research and Technology, Riverside, CA, USA
c Department of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
d Now at Department of Civil and Environmental Engineering, University of California, Berkeley, Berkeley, CA, USA
First published on 29th March 2024
We combine machine learning (ML) and geospatial interpolations to create two-dimensional high-resolution ozone concentration fields over the South Coast Air Basin (SoCAB) for the entire year of 2020. The interpolated ozone concentration fields were constructed using 15 building sites whose daily trends were predicted by random forest regression. Spatially interpolated ozone concentrations were evaluated at 12 sites that were independent of the machine learning sites and historical data to find the most suitable prediction method for SoCAB. Ordinary kriging interpolation had the best performance overall for 2020. The model is best at interpolating ozone concentrations inside the sampling region (bounded by the building sites), with R2 ranging from 0.56 to 0.85 for those sites. All interpolation methods poorly predicted and underestimated ozone concentrations for Crestline during summer, indicating that the site has a distribution of ozone concentrations that is independent of all other sites. Therefore, historical data from coastal and inland sites should not be used to predict ozone in Crestline using data-driven spatial interpolation approaches. The study demonstrates the utility of ML and geospatial techniques for evaluating air pollution levels during anomalous periods. Neither ML nor the Community Multiscale Air Quality model fully captures the irregularities caused by emission reductions during the COVID-19 lockdown period (March–May) in the SoCAB. Including 2020 data in the ML model training improves the model's performance and its potential to predict future abnormalities in air quality.
Environmental significance
In the spring of 2020, shifts in emissions and subsequent air pollution levels associated with COVID-19 lockdown measures were significantly different compared with any previous period in the Anthropocene. We investigate the utility of deterministic and machine learning models in capturing the observed anomalies in ozone concentrations across the South Coast Air Basin, a region with spatially heterogeneous formation of secondary pollutants. The directionality of model biases before, during, and after the lockdown period gives insight into the NOX and VOC limited characteristics of locations across the Basin, which guides future emissions reduction strategies.
Of particular interest is the exploration of possible differences in ozone prediction performance of different modeling approaches during periods of significant emissions and meteorological anomalies. The Community Multiscale Air Quality (CMAQ) modeling system, developed by the U.S. Environmental Protection Agency (EPA), is widely used for multi-day air quality simulations to estimate air pollutant concentrations with prescribed emissions and meteorology inputs (Ooka et al., 2011; Rao et al., 1996; Wong et al., 2012).4–6 From the model outputs, scientists and regulators can better predict the interactions between future emissions, meteorology, and air pollutants to strengthen recommendations for emissions control programs. Chemical transport models (CTMs), such as CMAQ, are based on first principles equations and are initiated with interpolated observation data, hence avoiding most obstacles introduced by data missingness in observations. Machine learning (ML) as an alternative modeling approach has attracted more attention from air quality researchers. Although ML and CTMs share the goal of accurately predicting air pollution, ML depends heavily on the quality and quantity of historical data. In contrast with CTMs, which produce larger scale, spatially resolved outputs, ML provides accurate predictions only at trained locations when used for ambient air quality applications.
As most ML approaches depend heavily on observational data, we introduce spatial interpolation as a central procedure for increased comparability with the CMAQ data. Also, the relative sparseness of monitoring stations and the locality of air pollutants have been shown to misrepresent spatially-varying air quality over a large area.7 Spatial interpolation methods (e.g., nearest neighbors, linear or polynomial interpolation, continuous natural neighbor interpolation, etc.) have proven useful for overcoming these limitations.8 Yu et al. evaluated 14 unique spatial modeling methods for eight air pollutants in Atlanta, Georgia for developing spatiotemporal air pollutant concentration fields.9 Wong et al. assessed four spatial interpolation methods (spatial averaging, nearest neighbor, inverse distance weighting (IDW), and kriging) to estimate ozone and PM10 concentrations.10 In California, the South Coast Air Quality Management District (SCAQMD) operates 38 air monitoring stations in Southern California over an area of approximately 10 743 square miles, including SoCAB, portions of the Salton Sea Air Basin, and Mojave Desert Air Basin, with an average of 283 square miles per monitoring station.11,12 Therefore, spatial interpolation is expected to enhance the observational analyses that follow.
This paper focuses on the performance of deterministic and ML models under rapid changes in emissions and meteorological conditions, specifically during the COVID-19 lockdown period in March through May of 2020. We compare three spatial interpolation techniques to the CMAQ model and evaluate biases related to COVID-19 lockdown anomalies. Furthermore, we aim to answer the question of whether there were other periods with emissions changes similar to the COVID-19 lockdown period within the past few decades and how those changes impacted the behavior of ozone in different regions of Southern California.
Fig. 1 Ozone design values for the South Coast Air Basin from 2006 to 2020 (https://www.epa.gov/air-trends/air-quality-design-values).
We re-projected gridded 4 km emissions from 2019 for the year 2020 using a two-step adjustment to account for changes due to the COVID-19 lockdown.15 In the first step, a linear projection factor (eqn (1)) was applied to 2019 gridded emissions based on SCAQMD basin-wide, total annual emissions spanning from 2012 to 2034, where the District's future projections began in the year 2020. The correction factor was calculated for seven air pollutant groups (total organic gases, reactive organic gases, CO, NOX, SOX, NH3, PM).
(1) |
The second step accounted for traffic reductions due to the COVID-19 lockdown, and reductions were highest from March to May 2020, then slowly but not fully rebounding to pre-lockdown levels toward the end of 2020.1 SCAQMD basin-wide projections understandably did not reflect the decrease in mobile source emissions due to unforeseen traffic reductions. Moreover, weekly traffic metrics in 2020 were acquired for the total flow, flow change, and speed change at 2991 locations in Southern California.16 Since the traffic data were not evenly distributed over the study domain, we used k-nearest neighbors (k-NN) to obtain the traffic data for grid cells (locations) that had no more than five reported data points (k value ≤ 5). For the grid cells with more than five reported data points, we normalized traffic volume and then averaged the normalized data.
Ground Monitoring Locations | Anaheim, Azusa, Banning, Compton, Fontana, Glendora, Lake Elsinore, Los Angeles International Airport (LAX), LA North Main Street, Mira Loma, Rubidoux, San Gabriel, Santa Clarita, San Bernardino, Upland |
Features | NO2, NO, temperature, relative humidity, wind speed, wind direction |
Label | Ozone |
Data sources | EPA AQS data mart, CARB air quality and meteorological information system (AQMIS) |
Training years | 2009, 2010, 2016, 2017, 2018, 2019 |
Evaluation year | 2020 |
Fig. 3 The third and inner-most domain (red boundary) with 4 km horizontal grid spacing covered the entire SCAQMD region (thick black lines).
The random forest (RF) algorithm is a supervised learning method employing a tree-based ensemble approach. Each decision tree is fitted to a subset of the training data. In our model, the input is a vector x with n features, denoted as x = (x1, …, xn)T. The goal is to find a function f(x) for predicting ozone concentrations. RF is a collection of J decision trees, h1(x), …, hJ(x). The learning function computes the average of all decision trees, expressed as f(x) = (1/J) Σj hj(x), with the sum running over j = 1, …, J. Each tree is trained on an independently drawn collection of input variables. To reduce the correlation between trees, and thus overfitting, random forest regression (RFR) selects a random subset of features from the input features for each tree, and the output of RFR is the average result from all the decision trees (Rodriguez-Galiano et al., 2015; Zhang & Ma, 2012).22,23
In this study, we selected six training features to predict ozone concentrations, which included two air quality features (NO and NO2) and four meteorological features (temperature, relative humidity, wind speed, and wind direction). The two air quality features are directly related to ozone formation in the troposphere. Ozone undergoes the photolytic cycle during the day and is removed by NOx during nighttime.24–26 The four meteorological features were well studied in our previous work and were shown as the most important features to capture the variability in annual ozone, especially in SoCAB.27–29
We used the scikit-learn 0.22 library for Python to train our RFR model. As noted above, the input features are NO2, NO, temperature, relative humidity, wind speed, and wind direction, and the label is ozone. We tuned the algorithm by varying the number of decision trees, the maximum tree depth, the minimum number of samples per split, and the minimum number of samples per leaf to obtain the best prediction accuracy. We used the same model tuning approach described in Do et al. (2023) (Table 2).21
Hyperparameter | Description |
---|---|
n_estimators = 16 | The number of trees in the forest |
max_features = ‘auto’ | The number of features to consider when looking for the best split |
max_depth = none | The maximum depth of the tree |
min_samples_split = 5 | The minimum number of samples required to split an internal node |
min_samples_leaf = 30 | The minimum number of samples required to be at a leaf node |
min_weight_fraction_leaf = 0 | The minimum weighted fraction of the sum total of weights required to be at a leaf node |
max_leaf_nodes = none | Best nodes are defined as relative reduction in impurity |
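As an illustration only, the Table 2 settings can be passed to scikit-learn's `RandomForestRegressor` roughly as follows. This is a minimal sketch, not the authors' training script: the data are synthetic placeholders, the `random_state` value is an assumption (it is not reported), and `max_features=1.0` is used in place of `'auto'`, which selects all features for regressors in scikit-learn 0.22 but is no longer accepted by recent releases.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Feature order follows Table 1. The arrays below are synthetic placeholders,
# not the AQS/AQMIS observations used in the study.
FEATURES = ["NO2", "NO", "temperature", "relative_humidity",
            "wind_speed", "wind_direction"]


def build_rfr(random_state=0):
    """Return a random forest regressor configured with the Table 2 values."""
    return RandomForestRegressor(
        n_estimators=16,
        max_features=1.0,        # all features; same effect as 'auto' in 0.22
        max_depth=None,
        min_samples_split=5,
        min_samples_leaf=30,
        min_weight_fraction_leaf=0.0,
        max_leaf_nodes=None,
        random_state=random_state,  # assumption: not reported in the paper
    )


rng = np.random.default_rng(0)
X_train = rng.random((500, len(FEATURES)))
# Hypothetical ozone-like label, used only so the example can be fitted.
y_train = 40 + 20 * X_train[:, 2] - 10 * X_train[:, 1] + rng.normal(0, 1, 500)

model = build_rfr().fit(X_train, y_train)
y_pred = model.predict(X_train[:5])
```

With `min_samples_leaf = 30`, each leaf averages at least 30 training samples, which smooths the predictions relative to fully grown trees.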
Ordinary kriging was applied to interpolate ozone concentration at 10 km resolution over the study area. Generally, kriging predicts the values for unknown locations by performing a series of linear combinations of values at known locations. Eqn (2) expresses the generic form of the estimator to predict the optimum value Z* of an unknown location by combining the known values Zi with their weights λi.30 We can write the variance σ2 as an optimization problem (eqn (3)) that can be solved using the Lagrange multiplier μ (eqn (4)).
(2) |
(3) |
(4) |
(5) |
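For reference, the textbook ordinary kriging formulation consistent with the description above (estimator Z*, known values Zi, weights λi, variance σ2, Lagrange multiplier μ) is sketched below. The semivariogram symbol γ and the target location x0 are notation assumed here for illustration, and the authors' exact equations may be written differently.

```latex
% Textbook ordinary kriging; \gamma (semivariogram) and x_0 (target location)
% are assumed notation, not taken from the source.
\begin{align}
Z^{*}(x_0) &= \sum_{i=1}^{n} \lambda_i Z_i ,
  \qquad \sum_{i=1}^{n} \lambda_i = 1 \\
\sigma^{2} &= \operatorname{Var}\!\left[ Z^{*}(x_0) - Z(x_0) \right]
  = 2 \sum_{i=1}^{n} \lambda_i \, \gamma(x_i, x_0)
  - \sum_{i=1}^{n} \sum_{j=1}^{n} \lambda_i \lambda_j \, \gamma(x_i, x_j) \\
\sum_{j=1}^{n} \lambda_j \, \gamma(x_i, x_j) + \mu &= \gamma(x_i, x_0),
  \qquad i = 1, \dots, n
\end{align}
```

Minimizing σ2 subject to the unit-sum constraint on the weights gives the last line, a linear system of n + 1 equations (including the constraint) in the n weights and μ.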
Bicubic interpolation is another method for interpolating data points on a 2-D grid. The interpolated surface can be written in terms of two variables (eqn (6)). The polynomial p consists of sixteen coefficients aij that are solved with sixteen conditions: the values of p at the four corners of the unit square ((x = 0, y = 0), (x = 1, y = 0), (x = 0, y = 1), (x = 1, y = 1)) and its derivatives with respect to x, y, and xy at those corners.32
(6) |
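For reference, the standard bicubic polynomial that matches this description (sixteen coefficients aij on the unit square) can be written as follows; the authors' exact notation may differ.

```latex
% Standard bicubic surface on the unit square, 0 <= x, y <= 1.
\begin{equation}
p(x, y) = \sum_{i=0}^{3} \sum_{j=0}^{3} a_{ij} \, x^{i} y^{j}
\end{equation}
```

The four corner values of p together with the corner values of ∂p/∂x, ∂p/∂y, and ∂²p/∂x∂y give 4 × 4 = 16 linear equations, which determine the sixteen aij.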
The IDW interpolation method accounts for the distances between the interpolated points and the measured locations. The assumption for IDW is that points close to each other are more alike and have more significant influence than those farther apart. Thus, the nearest measured values have greater weights assigned. Eqn (7) shows that the contribution of each measured value to the predicted value Z(x) is weighted inversely to the distance d(x,xi) between the measured and interpolated points.
(7) |
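The weighting idea can be sketched in a few lines of NumPy. This is a generic IDW implementation under stated assumptions, not the authors' code: the function name `idw_interpolate`, the distance exponent `power=2`, and the toy coordinates are all illustrative choices that the text does not specify.

```python
import numpy as np


def idw_interpolate(site_xy, site_values, query_xy, power=2.0, eps=1e-12):
    """Inverse distance weighted estimate at each query point.

    site_xy:     (n_sites, 2) coordinates of the measured locations
    site_values: (n_sites,) measured values, e.g. ozone in ppb
    query_xy:    (n_query, 2) coordinates of the points to interpolate
    power:       distance exponent (assumed value; not given in the text)
    """
    site_xy = np.asarray(site_xy, dtype=float)
    site_values = np.asarray(site_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)

    # Pairwise Euclidean distances, shape (n_query, n_sites).
    d = np.linalg.norm(query_xy[:, None, :] - site_xy[None, :, :], axis=2)
    # Clamp tiny distances so a query on top of a site does not divide by zero.
    w = 1.0 / np.maximum(d, eps) ** power
    # Weighted average of the site values for every query point.
    return (w @ site_values) / w.sum(axis=1)


# Toy example: two sites 2 units apart with 40 and 60 ppb.
sites = [(0.0, 0.0), (2.0, 0.0)]
ozone = [40.0, 60.0]
est = idw_interpolate(sites, ozone, [(1.0, 0.0), (0.0, 0.0)])
```

At the midpoint both sites receive equal weight, so the estimate is the plain average; at a site location the estimate collapses to the measured value.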
Fig. 5 Hourly ozone heatmap (16:00 on June 22, 2020) using ordinary kriging. The dots with white borders are the evaluation sites, and dots without borders are the training sites.
The performance of the models was evaluated based on commonly used statistical metrics: mean bias (MB), correlation coefficient, root mean square error, and R2 (equations listed in ESI†). The models were evaluated based on data from 27 air monitoring stations in SoCAB, of which 15 sites were used to evaluate the training sites, and the other 12 sites were used to evaluate the performance of the three interpolation methods at non-training sites. Tables 3 and 4 highlight R2 for daily average ozone for the bicubic, IDW, and ordinary kriging interpolations, as well as R2 for the CMAQ comparison. We used the entire year to evaluate the interpolation methods, but we only used the five highest ozone months from May to September for the CMAQ evaluation.
Sites | Bicubic R2 | IDW R2 | Ordinary kriging R2 | CMAQ R2 |
---|---|---|---|---|
Anaheim | 0.66 | 0.67 | 0.74 | 0.41 |
Azusa | 0.52 | 0.64 | 0.77 | 0.59 |
Banning | 0.17 | 0.46 | 0.73 | 0.26 |
Compton | 0.65 | 0.67 | 0.77 | 0.48 |
Fontana | 0.88 | 0.89 | 0.87 | 0.59 |
Glendora | 0.46 | 0.53 | 0.72 | 0.52 |
Lake Elsinore | 0.52 | 0.70 | 0.79 | 0.56 |
LA North Main ST | 0.36 | 0.67 | 0.78 | 0.48 |
LAX | 0.31 | 0.48 | 0.65 | 0.25 |
Mira Loma | 0.56 | 0.71 | 0.86 | 0.67 |
Rubidoux | 0.46 | 0.65 | 0.86 | 0.68 |
San Bernardino | 0.68 | 0.85 | 0.86 | 0.67 |
San Gabriel | 0.53 | 0.77 | 0.81 | 0.62 |
Santa Clarita | 0.27 | 0.72 | 0.84 | 0.61 |
Upland | 0.76 | 0.80 | 0.86 | 0.61 |
Sites | Bicubic R2 | IDW R2 | Ordinary kriging R2 | CMAQ R2 |
---|---|---|---|---|
Crestline | 0.35 | 0.42 | 0.42 | 0.23 |
La Habra | 0.75 | 0.80 | 0.77 | 0.44 |
Long Beach | 0.46 | 0.60 | 0.56 | 0.30 |
Mission Viejo | 0.15 | 0.36 | 0.49 | 0.39 |
North Hollywood | 0.67 | 0.67 | 0.79 | 0.59 |
Pasadena | 0.55 | 0.71 | 0.78 | 0.57 |
Perris | 0.55 | 0.72 | 0.80 | 0.56 |
Pomona | 0.71 | 0.83 | 0.84 | 0.68 |
Redlands | 0.60 | 0.74 | 0.71 | 0.57 |
Reseda | 0.63 | 0.63 | 0.71 | 0.01 |
West LA | 0.29 | 0.56 | 0.60 | 0.28 |
Winchester | 0.37 | 0.40 | 0.39 | 0.45 |
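The evaluation metrics behind Tables 3 and 4 are defined in the ESI. A minimal sketch of how such metrics are commonly computed is given below; it assumes that MB is the mean of prediction minus observation and that R2 is the squared Pearson correlation, which may not match the ESI definitions exactly. The function name `evaluate` and the sample values are hypothetical.

```python
import numpy as np


def evaluate(obs, pred):
    """Return mean bias, RMSE, Pearson r, and R2 for paired series."""
    obs = np.asarray(obs, dtype=float)
    pred = np.asarray(pred, dtype=float)
    diff = pred - obs
    r = np.corrcoef(obs, pred)[0, 1]
    return {
        "MB": float(np.mean(diff)),               # positive = overestimate
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),
        "r": float(r),
        "R2": float(r ** 2),                      # assumption: squared Pearson r
    }


# Hypothetical daily-average ozone (ppb): predictions biased high by 2 ppb.
stats = evaluate([30.0, 40.0, 50.0, 60.0], [32.0, 42.0, 52.0, 62.0])
```

A constant offset leaves r and R2 at 1 while MB and RMSE both equal the offset, which is why bias and correlation are reported separately.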
The bicubic R2 indicates the poorest performance of the three interpolation methods. IDW showed a significant improvement compared to bicubic interpolation. Since IDW accounts for the distances between the interpolation points and the data points, farther data points have less influence on the interpolation points. Ordinary kriging was the best interpolation method because it not only accounts for the distance between building points and interpolated data by assigning larger weights λi to the near neighbors, but also considers the variability of the data through the variance of the input data, σ2.34
ML with interpolation gave a poor performance for Crestline and Winchester locations. Crestline is located in the mountains and to the northeast of SoCAB, which is elevated terrain associated with upper air and a different air mass at times. Crestline ozone was not well-correlated with coastal or inland sites. Thus, interpolated Crestline ozone based on coastal or inland data points will likely yield poor results. The Winchester air monitoring site is located near the Skinner Reservoir (Fig. S1†), far away from other data points (Lake Elsinore and Banning). Low R2 for Winchester can be explained by the influence of the lake and local meteorology and air quality. The ordinary kriging model performed well for locations bounded by data points with R2 above 0.56. However, poor interpolation results occurred for peripheral locations in SoCAB (Crestline, Mission Viejo, and Winchester). LAX ozone levels were not well correlated with meteorology, and training the ML model with fewer meteorological features did not affect the performance of the LAX location. Overall, model performance increased from the West to the East, with better prediction for inland sites.
The distribution of the monthly mean bias (MB) for ordinary kriging interpolation centered around zero with the range between +9 ppb for Compton (August) and −11 ppb for Glendora (October). Eleven building sites have a net positive monthly MB, and four have a net negative monthly MB (Fig. 6). The results from the CMAQ simulation overestimated the ozone levels. CMAQ's best performance was from May to October when the MBs were the smallest. CMAQ underestimates the ozone concentrations at the LAX location, due to the site's proximity to the Pacific Ocean, colder model temperatures, and potential discrepancies in aviation emissions. In general, ozone concentrations in the SoCAB are highest during the summer and lowest in the winter, corresponding with the temperature. Although the CMAQ simulation captures diurnal variation, the seasonal variation is not as well-represented (Fig. S4, S5, S7, and S11†). Lower performing CMAQ results could come from uncertainties in emissions estimates. CMAQ generally overestimated ozone concentrations because the simulated nighttime ozone concentrations were higher than those observed, potentially due to underestimated nighttime NOx emissions.15 In other words, there was not enough NOx emitted in the model during the daytime for ozone formation and at night for ozone removal.35,36
Training features can be varied to study the sensitivity to modeled ozone response. For example, we can perturb the temperature, RH, or emissions values and examine the ozone levels corresponding to the change in the features. However, because the formation of ozone results from a complex combination of chemical reactions, resulting impacts are nonlinear and interdependent. Therefore, when using ML to test for sensitivity to a feature, one should consider feature dependencies. For example, in testing temperature impacts on ozone concentration, we must consider both how temperature impacts photolysis rates (NO2 degradation) as well as simultaneous correlations/anticorrelation with other meteorological variables, such as RH or wind speed.
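A one-at-a-time perturbation of this kind can be sketched as follows. The helper `feature_sensitivity` and the stand-in linear predictor are hypothetical, and the sketch deliberately ignores the feature dependencies discussed above: it shifts a single column while holding the others fixed, so it should be read as a first-order check rather than a physically consistent scenario.

```python
import numpy as np


def feature_sensitivity(predict, X, column, delta):
    """Mean change in the prediction when one feature is shifted by delta.

    predict: callable mapping an (n, n_features) array to (n,) predictions,
             e.g. the `predict` method of a fitted regressor
    X:       (n, n_features) baseline feature matrix
    column:  index of the feature to perturb
    delta:   additive perturbation applied to that feature
    """
    X = np.asarray(X, dtype=float)
    X_pert = X.copy()
    X_pert[:, column] += delta
    return float(np.mean(predict(X_pert) - predict(X)))


# Stand-in predictor (not the trained RFR): ozone = 2 * f0 + 3 * f1.
def toy_predict(X):
    return 2.0 * X[:, 0] + 3.0 * X[:, 1]


baseline = np.array([[1.0, 2.0], [3.0, 4.0]])
change = feature_sensitivity(toy_predict, baseline, column=1, delta=0.5)
```

For a tree ensemble the response is piecewise constant, so small perturbations may produce no change at all; perturbations should be comparable to the spread of the training data.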
Although the interpolation R2 values for the 15 building sites are high, the accuracies of the 12 evaluation sites are somewhat lower than those reported in other studies. In our previous work, where we employed RFR to predict O3 levels in Fontana, we achieved an R2 of 0.86. Additionally, Lyu et al. utilized the RFR method to predict ozone concentrations in the Beijing–Tianjin–Hebei region, achieving a monthly R2 of 0.93 (Lyu et al., 2022).37 Two factors contribute to the performance of the evaluation sites in our approach. First, the estimation of O3 concentrations at evaluation sites relied on historical data from neighboring building sites. However, the building sites are not evenly distributed in Southern California, and the performance at interpolated locations decreases with distance from the building sites. Second, O3 levels are more locally influenced in SoCAB, and the relationship between NOx and VOC is not strictly linear. Therefore, the estimation from interpolation might not fully capture this locality. We also note that the choice of averaging period will impact R2, such that comparison of daily vs. monthly values will lead to discrepancies that favor a longer averaging period.
Fig. 7 Averaged diurnal profiles of 2016–2019 (blue), actual 2020 (red), and ML predicted 2020 (black) ozone concentrations (ppb) at Lake Elsinore (a–c) and Fontana (d–f) for three different periods: (a and d) pre-lockdown (Jan to Feb), (b and e) lockdown (Mar to May), and (c and f) post-lockdown (after May). The shaded area is the standard deviation of the 2016–2019 measurements. Additional sites are provided in the ESI.†
Post-lockdown differences compared to the four-year average were not significant across the 15 sites. The RFR model captured ozone trends throughout 2020, although its predictions were slightly lower during lockdown, despite the observed reduction in NOx. This suggests that meteorological features, in addition to air quality features, play an important role in predicting ozone levels during anomalous episodes. Actual and modeled discrepancies also indicate anomalous ozone behavior during lockdown. For instance, several sites in the SoCAB showed an increase in ozone levels based on the diurnal profile, implying that the urban locations in the SoCAB were in VOC limited regimes, where there was NOx reduction-initiated ozone enhancement.38
The diurnal NOx concentrations at all sites in Southern California exhibit a consistent pattern, in which both pre-lockdown and post-lockdown NOx levels were significantly higher than during the lockdown period. In Fig. S13,† the diurnal changes in NOx levels are illustrated for pre-lockdown (blue), lockdown (orange), and post-lockdown (green) between the 2020 NOx and the average from 2016–2019. Positive values before the lockdown suggest an increase in NOx levels in 2020 compared to the historical average of 2016–2019. However, during the lockdown, the differences are negative, indicating a significant decrease in 2020 NOx levels compared to the historical data due to a substantial decrease in traffic and anthropogenic activities.
We computed the diurnal differences between 2020 O3 and historical O3 (average from 2016 to 2019) for both actual 2020 O3 and ML 2020 O3 (Fig. S17 and S18†) to show the trends in O3 concentrations for the pre-lockdown, lockdown, and post-lockdown periods. During the lockdown (orange line), the Lake Elsinore site exhibits negative changes of −4 ppb at 15:00, the peak O3 concentration time of the day. However, in the early morning, the O3 changes turn positive (∼3 ppb), attributed to the reduced NOX titration. Post-lockdown (green line) shows mostly positive differences, indicating an increase in O3 concentrations due to rising emissions and transition to summertime. In Fontana, O3 trends do not show significant differences across the three periods. Notably, during peak O3 hours (13:00–16:00), O3 levels are more than 3 ppb higher compared to historical values, suggesting that the reduction in emissions has an inverse effect on O3 concentrations. It's worth noting that the ML model successfully predicted O3 trends in Lake Elsinore for all three periods. However, the ML model failed to predict the behavior of O3 in Fontana, as it estimated a decrease in O3 during the lockdown. The summary of the machine learning method and its performance across different regimes in the SoCAB is illustrated in Fig. 8.
Fig. 8 Flow chart of the ML model to summarize the ML method and the evaluation results in the South Coast Air Basin.
To illustrate the variations in NOx corresponding to changes in O3 for three periods (pre-lockdown, lockdown, and post-lockdown), we calculated the O3 sensitivity using the ratio of differences in O3 and NOx between 2020 and historical data, as shown in eqn (8).
(8) |
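Based on the description above (the ratio of the 2020-minus-historical difference in O3 to the corresponding difference in NOx), the sensitivity can be sketched as below. The function name `o3_sensitivity`, the guard threshold `min_dnox`, and the sample values are assumptions added for illustration; the source does not state how near-zero NOx differences were handled.

```python
import numpy as np


def o3_sensitivity(o3_2020, o3_hist, nox_2020, nox_hist, min_dnox=0.5):
    """Ratio dO3/dNOx between 2020 and historical values (ppb per ppb).

    Inputs are arrays of equal shape, e.g. 24 hourly values of a mean
    diurnal profile. Entries where |dNOx| < min_dnox are returned as NaN
    (hypothetical guard against dividing by a near-zero difference).
    """
    d_o3 = np.asarray(o3_2020, dtype=float) - np.asarray(o3_hist, dtype=float)
    d_nox = np.asarray(nox_2020, dtype=float) - np.asarray(nox_hist, dtype=float)
    out = np.full(d_o3.shape, np.nan)
    ok = np.abs(d_nox) >= min_dnox
    out[ok] = d_o3[ok] / d_nox[ok]
    return out


# Hypothetical values: O3 falls 4 ppb while NOx falls 2 ppb in the first
# entry; NOx barely changes in the second entry.
sens = o3_sensitivity([50.0, 30.0], [54.0, 27.0], [4.0, 10.0], [6.0, 10.1])
```

A positive ratio means O3 and NOx changed in the same direction between the historical average and 2020; a negative ratio means they changed in opposite directions.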
In VOC limited regimes, we expect the sensitivity of O3 to changes in NOx to be minimal. This is evident for areas with substantial NOx emissions, such as Azusa, Fontana, and Upland (Fig. 9), where the sensitivities of O3 (dO3/dNOx) during the lockdown are minimal. Conversely, in NOx limited regimes, we expect to observe a reduction in O3 corresponding to the decrease in emissions. Therefore, the sensitivities of O3 in NOx limited regimes are maximized during the lockdown, as illustrated in Fig. 9 for Lake Elsinore and Banning. At 14:00, O3 concentrations in Lake Elsinore decreased by more than 12 ppb per 1 ppb reduction in NOx.
Fig. 9 O3 sensitivity for six locations in Southern California during the lockdown period reflecting the change in O3 with respect to the change in NOx between 2020 data and historical data.
The ML model with interpolation successfully predicted O3 trends by utilizing four meteorological parameters and two observed ozone precursors (listed in Table 1). It is important to note that O3 exhibits strong relationships with meteorology, NOx, and VOCs. Due to data availability, VOC data were omitted from the training set. The current ML model has some weaknesses for testing the sensitivity of O3 to anomalous precursor levels and meteorology. Our ML model performs well in predicting O3 levels where the test data resembles the training data. However, the model struggles to give accurate predictions when the test data significantly differs from the training sets. For instance, during the lockdown, the model failed to predict the O3 concentrations in the VOC limited regimes. This suggests that relying on ML models to predict future scenarios may be unreliable under new regimes. Considering additional features, such as VOCs in the training sets, may enhance the model's ability to predict accurately when extrapolating beyond the feature space. To our understanding, there are no ML models known for effective extrapolation of the training data to provide reliable predictions.
Ordinary kriging interpolation using the ML building sites provided daily data, addressed data missingness, and captured 2020 ozone trends with low bias despite the sudden change in emissions. The ML model with the interpolation method successfully captured ozone trends throughout three periods in 2020, particularly in locations operating under a NOx limited regime, such as Lake Elsinore. However, it faced challenges in predicting ozone levels during the lockdown period in areas characterized by a VOC limited regime, like Fontana. ML inherently relies on patterns learned from historical data to make predictions, especially for inputs that resemble past occurrences. In this study, the ML model struggled to make accurate predictions for VOC limited regimes, suggesting that events akin to the COVID-19 lockdown had not been encountered in the past. Unfortunately, due to the unavailability of speciated VOC data, we did not incorporate them as a training feature in the model. Since ozone formation exhibits a non-linear correlation with both NOx and VOC, the inclusion of speciated VOC data would likely enhance the model's accuracy, especially for regions with a VOC limited atmosphere. Our ML model provides regulators with valuable insights into NOx and VOC limited regimes across the Southern California domain, enabling policymakers to devise more effective emission reduction strategies and improve air quality at hyperlocal scales.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3ea00159h
This journal is © The Royal Society of Chemistry 2024