Open Access Article
Yusheng Wu *a, Martha Arbayani Zaidan ab, Runlong Cai† a, Jonathan Duplissy ac, Magdalena Okuljar‡ a, Katrianne Lehtipalo ad, Tuukka Petäjä a and Juha Kangasluoma a
aInstitute for Atmospheric and Earth System Research (INAR)/Physics, Faculty of Science, University of Helsinki, Finland. E-mail: yusheng.wu@helsinki.fi
bDepartment of Computer Science, Faculty of Science, University of Helsinki, Finland
cHelsinki Institute of Physics, HIP, Faculty of Science, University of Helsinki, Finland
dFinnish Meteorological Institute, Helsinki, Finland
First published on 4th February 2025
The submicron aerosol number size distribution significantly impacts human health, air quality, weather, and climate. However, its measurement requires sophisticated and expensive instrumentation that demands substantial maintenance efforts, leading to limited data availability. To tackle this challenge, we developed estimation models using advanced deep learning algorithms to estimate the aerosol number size distribution based on trace gas concentrations, meteorological parameters, and total aerosol number concentration. These models were trained and validated with 15 years of ambient data from three distinct environments, and data from a fourth station were exclusively used for testing. Our estimative models successfully replicated the trends in the test data, capturing the temporal variations of particles ranging from approximately 10–500 nm, and accurately deriving total number, surface area, and mass concentrations. The model's accuracy for particles below 75 nm is limited without the inclusion of total particle number concentration as training input, highlighting the importance of this parameter for capturing the dynamics of smaller particles. The reliance on total particle number concentration, a parameter not routinely measured at all air quality monitoring sites, as a key input for accurate estimation of smaller particles presents a practical challenge for broader application of the models. Our models demonstrated a robust generalization capability, offering valuable data for health assessments, regional pollution studies, and climate modeling. The estimation models developed in this work are representative of ambient conditions in Finland, but the methodology in general can be applied in broader regions.
Environmental significance

Aerosol particles, especially submicron particles, play a critical role in air quality, climate, and human health. However, traditional measurements of aerosol number size distributions are limited by expensive, high-maintenance instruments. This study addresses these limitations by using deep learning models to predict aerosol particle size distributions from widely available air quality data. By offering a reliable and accessible method to estimate critical environmental data, the study facilitates better health assessments and pollution management. The approach also broadens access to data that can support climate modeling, particularly in regions lacking the resources for continuous physical monitoring.
In order to understand the impacts of submicron particles, comprehensive measurements have been carried out worldwide. For example, continuous long-term observation of atmospheric variables, including ambient particles, has been performed as the key method for gaining a comprehensive understanding of the interactions between humans, nature, and the atmosphere.9 Additionally, air quality monitoring stations have been equipped with instruments to measure total particle number concentration and particle size distributions, which is crucial for understanding the health impacts of submicron particles, enhancing air quality assessments by capturing detailed data on particle size and count, identifying pollution sources, ensuring compliance with emerging regulations, and supporting environmental research on atmospheric chemistry and climate change.10–12
Unfortunately, collecting long-term particle size distribution data (typically represented as dN/dlog Dp, the particle number concentration across size bins normalized by the logarithmic span of particle diameters) over a large spatial scale poses a significant challenge due to the high cost of instrumentation and the substantial maintenance workload required. Advanced particle measurement instruments are often expensive, necessitating considerable financial investment for widespread deployment. Additionally, these instruments require regular calibration and maintenance to ensure accurate data collection, which adds to the operational burden.13,14 As a result, many regions may lack comprehensive particle size distribution data, limiting the ability to fully understand and mitigate the impacts of particulate matter on public health and the environment.15
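To make the normalization concrete, dN/dlog Dp can be computed from binned counts as in the short sketch below (the bin edges and concentrations are illustrative values, not measurement data from this study):

```python
import numpy as np

# Illustrative size-bin edges (nm) and per-bin concentrations (cm^-3);
# values are hypothetical, chosen only to demonstrate the normalization
bin_edges = np.array([10.0, 20.0, 40.0, 80.0, 160.0, 320.0])
counts = np.array([1500.0, 1200.0, 900.0, 400.0, 100.0])

# dN/dlog Dp: per-bin concentration divided by the bin width in log10(Dp)
dlogDp = np.diff(np.log10(bin_edges))
dNdlogDp = counts / dlogDp
```

Because each bin here spans a factor of two in diameter, every bin width in log10(Dp) equals log10(2), and the normalization simply rescales each count by that constant.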
Machine learning is a promising tool for addressing gaps in particle size distribution data, particularly in the context of aerosol physical properties. In the era of artificial intelligence and big data, data mining and machine learning technologies have significantly advanced atmospheric science by enabling more sophisticated data processing and analysis. These technologies have broad applications in understanding aerosol properties, including new-particle formation (NPF), but are not limited to this phenomenon. For instance, in Hyytiälä, Finland, clustering and classification methods have been used to investigate the relationship between the formation and growth of new particles and environmental variables, such as relative humidity and the condensation sink of gaseous precursors.16,17 Similarly, in the Po Valley, Italy, discriminant analysis has been employed to classify nucleation events, identifying relative humidity, O3, and radiation as significant factors influencing NPF.18 A multivariate non-linear mixed-effect model has further demonstrated that relative humidity and O3 are major predictors of NPF across multiple European sites, including the Po Valley, Melpitz in Germany, and Hohenpeissenberg in Germany.19 Beyond NPF, machine learning approaches such as mutual information have been effectively utilized to explore non-linear associations between atmospheric variables and various aerosol phenomena.17,20 Deep learning techniques, including image identification and Bayesian classification methods, have successfully classified NPF events, demonstrating the versatility of these tools in atmospheric research.21,22 Furthermore, the use of remote sensing generates substantial image data, which are well-suited for analysis through deep learning and other machine learning algorithms. 
For example, a transfer learning-based method has been applied to study temporal changes in dust properties,23 while other studies have demonstrated the potential of machine learning in remote sensing applications by providing technical tutorials and showcasing specific architectures, such as a pre-trained AlexNet with pyramid pooling for image scene classification.24,25 Machine learning algorithms have also been used to develop virtual sensors for estimating the concentration of atmospheric variables, showcasing their ability to model aerosol characteristics without the need for extensive traditional measurement infrastructures.26,27 A study28 explores the use of random forest techniques to gain quantitative insights into the impact of air mass history and coastal conditions on the formation and growth of nucleation mode particles in the atmosphere. Another recent study29 presents a convolutional neural network-based approach to identify new particle formation events from longitudinal global particle number size distribution data, providing a valuable tool for understanding new atmospheric particle formation processes. However, despite these advancements, the application of machine learning to estimate time series data of aerosol particles remains relatively uncommon. This is primarily due to the limited availability of sufficient training data and the complexities involved in hyperparameter tuning, which pose significant challenges in achieving accurate estimation models.
In this paper, we demonstrate that the particle number size distribution can be accurately estimated using data from routine air quality measurements. Estimative models for the aerosol number size distribution are developed based on pre-processed ambient trace gas concentrations, meteorological conditions, and aerosol number concentration. We utilize recurrent neural networks (RNNs) to build these models and systematically tune hyperparameters using an automated machine learning (AutoML) approach.
| Station | Location | Environment | Data type | Variables measured | Years |
|---|---|---|---|---|---|
| SMEAR I | Northern Finland | Subarctic forest | Meteorological, trace gas, and particle data | Wind speed, wind direction, temperature, relative humidity, pressure, radiation, NOx, SO2, CO, O3, Ntot, and particle size distribution (DMPS and APS) | 2005–2019 (training) |
| SMEAR II | Southern Finland | Boreal forest | Meteorological, trace gas, and particle data | Wind speed, wind direction, temperature, relative humidity, pressure, radiation, NOx, SO2, CO, O3, Ntot, and particle size distribution (DMPS and APS) | 2005–2019 (training) |
| SMEAR III | Helsinki, Southern Finland | Urban environment | Meteorological, trace gas, and particle data | Wind speed, wind direction, temperature, relative humidity, pressure, radiation, NOx, SO2, CO, O3, Ntot, and particle size distribution (DMPS and APS) | 2005–2019 (training) |
| Qvidja | Southwestern Finland | Coastal agriculture | Meteorological, trace gas, and particle data | Wind speed, wind direction, temperature, relative humidity, pressure, radiation, NOx, SO2, CO, O3, and particle size distribution (DMPS and APS) (no Ntot) | 2019 (test) |
| Test data | SMEAR Stations (I, II, and III) | Various environments | Meteorological, trace gas, and particle data | Same as the respective training data | 2020 (test) |
Interpolation of missing values: gaps in the data, where measurements are missing for periods up to 6 hours, are filled through interpolation. The threshold reflects a balanced approach for optimizing both data quality and data availability. Linear interpolation is used, which estimates the missing values based on the values of the nearest available data points. This ensures that the time series is complete and continuous, which is important for the model to learn temporal patterns effectively.
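A gap-filling step of this kind can be sketched with pandas (the timestamps, variable name, and sampling interval below are illustrative assumptions; the paper does not state the implementation):

```python
import numpy as np
import pandas as pd

# Hourly series with a 2-hour gap; timestamps and the "so2" name are illustrative
idx = pd.date_range("2019-01-01", periods=10, freq="h")
so2 = pd.Series([1.0, 2.0, np.nan, np.nan, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0], index=idx)

# Linear (time-weighted) interpolation, filling at most 6 consecutive missing hours.
# Note: pandas still fills the first 6 points of a longer gap, so strictly skipping
# gaps longer than the threshold requires extra masking.
filled = so2.interpolate(method="time", limit=6, limit_area="inside")
```

With evenly spaced hourly data, `method="time"` reduces to plain linear interpolation between the nearest valid neighbours.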
Handling of negative size distribution values: negative values in the particle size distribution data are not physically meaningful and are likely due to measurement errors or instrument noise. These negative values are replaced with a small positive number (10⁻⁵). This value is chosen to be small enough to minimise its impact on the model training process while still allowing the data to be visualised on a logarithmic scale.
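In array form this replacement is a one-liner (the matrix below is a toy stand-in for the measured distribution):

```python
import numpy as np

# Toy dN/dlog Dp matrix (rows: timestamps, cols: size bins); values illustrative
dist = np.array([[120.0, -3.0, 450.0],
                 [-0.5, 80.0, 300.0]])

# Replace non-physical negatives with the small positive floor 1e-5,
# keeping the data plottable on a logarithmic colour scale
cleaned = np.where(dist < 0, 1e-5, dist)
```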
Extraction of temporal features: the day of the year and hour of the day are extracted from the timestamps of the data and added as new features. This allows the model to learn seasonal and diurnal patterns in the data. For example, the model can learn that particle concentrations tend to be higher during certain times of the year or day.
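The two temporal features can be extracted directly from the timestamp index, for example (timestamps and column names are illustrative):

```python
import pandas as pd

# Illustrative 6-hourly timestamps
idx = pd.date_range("2019-06-15 00:00", periods=4, freq="6h")
features = pd.DataFrame(index=idx)

# Day of year and hour of day as extra inputs for seasonal/diurnal patterns
features["day_of_year"] = features.index.dayofyear
features["hour_of_day"] = features.index.hour
```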
Normalisation of input features: all input features, including meteorological parameters, trace gas concentrations, and particle number concentration, are normalised by removing the mean and scaling to unit variance. This ensures that all features have a similar range and distribution, which can improve the performance and stability of the neural network. Normalisation can prevent features with larger scales from dominating the learning process and can help the optimisation algorithm converge faster.
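The standardisation step amounts to the following (the two-feature matrix is an illustrative stand-in; this is equivalent to scikit-learn's `StandardScaler`):

```python
import numpy as np

# Two illustrative features, e.g. temperature (K) and a trace gas (ppb)
X = np.array([[280.0, 5.0],
              [290.0, 15.0],
              [300.0, 25.0]])

# Standardise: remove the mean and scale to unit variance per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
X_norm = (X - mu) / sigma
```

In practice the mean and variance must be computed on the training set only and then reused for the test data, so that no information leaks from the test period into training.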
RNNs are particularly well-suited for this task because of their ability to capture temporal dependencies in sequential data. Unlike other deep learning architectures, such as convolutional neural networks, RNNs have an internal memory that allows them to learn from past information and use it to estimate future events. This is essential for accurately modelling atmospheric processes, which are often influenced by historical conditions. RNNs inherently reduce the reliance on extensive feature engineering by automatically learning and extracting relevant features from sequential data. Their ability to capture complex dependencies, temporal patterns, and contextual information within the input data enables them to effectively process and model intricate relationships without requiring additional manual feature augmentation.36,37 By utilising RNNs, our model can effectively learn the complex relationships between meteorological variables, trace gas concentrations, and particle size distribution over time, leading to more accurate estimations.
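The recurrence that gives an LSTM (the RNN variant used here) its memory can be illustrated in a few lines of NumPy. This is a conceptual sketch of a single LSTM cell, not the authors' implementation; all shapes and weights are arbitrary:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time step: the cell state c carries information forward,
    which is how past conditions can influence later estimates."""
    n = h.shape[0]
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    z = W @ x + U @ h + b                 # stacked pre-activations for the 4 gates
    i = sigmoid(z[:n])                    # input gate
    f = sigmoid(z[n:2 * n])               # forget gate
    o = sigmoid(z[2 * n:3 * n])           # output gate
    g = np.tanh(z[3 * n:])                # candidate cell update
    c_new = f * c + i * g                 # long-term memory
    h_new = o * np.tanh(c_new)            # short-term output
    return h_new, c_new

# Run the cell over 24 random "hourly" input vectors with 6 features
rng = np.random.default_rng(0)
nx, nh = 6, 4
W = 0.1 * rng.standard_normal((4 * nh, nx))
U = 0.1 * rng.standard_normal((4 * nh, nh))
b = np.zeros(4 * nh)
h, c = np.zeros(nh), np.zeros(nh)
for _ in range(24):
    h, c = lstm_step(rng.standard_normal(nx), h, c, W, U, b)
```

The forget gate `f` decides how much of the previous cell state survives each step, which is the mechanism that lets the model weigh historical meteorology and gas concentrations against the current inputs.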
We use Mean Squared Error (MSE) as the loss function because it effectively measures the average squared difference between estimated and actual values, providing a clear indication of model accuracy and helping to minimize estimation errors by penalizing larger deviations more significantly.
In RNNs, hyperparameter tuning is a crucial step in training models. Traditionally, with many parameters to optimize, long training times, and multiple folds to prevent information leakage, this process can be cumbersome. We utilize Optuna with default settings for model tuning. Optuna is a popular AutoML tool based on Bayesian methods.38 The hyperparameters are tuned as follows: the number of layers in the long short-term memory (LSTM) network is set to 1 or 2 to balance model complexity and training efficiency; the batch size ranges from 32 to 256 to manage computational resources and training stability; the number of units per layer ranges from 8 to 128 to capture varying levels of feature representation; the output size varies from 32 to 256 to accommodate different estimation requirements; and the dropout ratio ranges from 0 to 0.5 to prevent overfitting while maintaining model generalization. The hyperparameter space includes three gradient descent optimizers: Root Mean Squared Propagation (RMSprop) with a learning rate from 10⁻⁵ to 10⁻¹, decay from 0.85 to 0.99, and momentum from 10⁻⁵ to 10⁻¹; Adaptive Moment Estimation (Adam) with a learning rate from 10⁻⁵ to 10⁻¹; and Stochastic Gradient Descent (SGD) with a learning rate from 10⁻⁵ to 10⁻¹ and momentum from 10⁻⁵ to 10⁻¹. Tuning each model required approximately two hours on a single GPU machine.
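The search space above can be written out explicitly. The sketch below is a plain-Python stand-in for Optuna's `trial.suggest_*` calls, showing only the stated ranges (the study itself used Optuna's samplers; function and key names here are illustrative):

```python
import math
import random

def sample_hyperparameters(rng):
    """Draw one configuration from the search space described in the text
    (a stand-in for Optuna's trial.suggest_* API)."""
    def log_uniform(lo, hi):
        # Sample uniformly in log space, as for learning rates
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))

    params = {
        "n_layers": rng.choice([1, 2]),
        "batch_size": rng.randint(32, 256),
        "units": rng.randint(8, 128),
        "output_size": rng.randint(32, 256),
        "dropout": rng.uniform(0.0, 0.5),
        "optimizer": rng.choice(["rmsprop", "adam", "sgd"]),
        "learning_rate": log_uniform(1e-5, 1e-1),
    }
    if params["optimizer"] == "rmsprop":
        params["decay"] = rng.uniform(0.85, 0.99)
        params["momentum"] = log_uniform(1e-5, 1e-1)
    elif params["optimizer"] == "sgd":
        params["momentum"] = log_uniform(1e-5, 1e-1)
    return params

cfg = sample_hyperparameters(random.Random(42))
```

In Optuna, a function like this becomes the body of the objective: each sampled configuration is trained, its validation MSE returned, and the Bayesian sampler concentrates subsequent trials in promising regions.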
To examine the interactions between variables from different stations, we train models using data from individual stations as well as a combined dataset from all three stations to capture both station-specific and overall patterns. In total, we build 8 models and generate 13 estimations (see Table 2).
| Input station and variables | Testing station | Model | Estimation |
|---|---|---|---|
| SMEAR I: met + gas | SMEAR I | SM1Train-MetGas | SM1Train-SM1Test-MetGas |
| SMEAR I: met + gas + Ntot | SMEAR I | SM1Train-MetGasNtot | SM1Train-SM1Test-MetGasNtot |
| SMEAR II: met + gas | SMEAR II | SM2Train-MetGas | SM2Train-SM2Test-MetGas |
| SMEAR II: met + gas + Ntot | SMEAR II | SM2Train-MetGasNtot | SM2Train-SM2Test-MetGasNtot |
| SMEAR III: met + gas | SMEAR III | SM3Train-MetGas | SM3Train-SM3Test-MetGas |
| SMEAR III: met + gas + Ntot | SMEAR III | SM3Train-MetGasNtot | SM3Train-SM3Test-MetGasNtot |
| ALL SMEAR: met + gas | SMEAR I | AllTrain-MetGas | AllTrain-SM1Test-MetGas |
| ALL SMEAR: met + gas + Ntot | SMEAR I | AllTrain-MetGasNtot | AllTrain-SM1Test-MetGasNtot |
| ALL SMEAR: met + gas | SMEAR II | AllTrain-MetGas | AllTrain-SM2Test-MetGas |
| ALL SMEAR: met + gas + Ntot | SMEAR II | AllTrain-MetGasNtot | AllTrain-SM2Test-MetGasNtot |
| ALL SMEAR: met + gas | SMEAR III | AllTrain-MetGas | AllTrain-SM3Test-MetGas |
| ALL SMEAR: met + gas + Ntot | SMEAR III | AllTrain-MetGasNtot | AllTrain-SM3Test-MetGasNtot |
| ALL SMEAR: met + gas | Qvidja | AllTrain-MetGas | AllTrain-QvidjaTest-MetGas |
Fig. 1a and d show the particle number size distribution (dN/dlog Dp) against particle diameter from SMEAR II and Qvidja. The solid lines represent the median values (Q2) of the measured and estimated data, while the shaded regions indicate the interquartile range, bounded by the first (Q1) and third (Q3) quartiles. This visualization highlights how closely the model's estimations align with the actual measurements, providing insight into the accuracy and reliability of the estimative model, particularly across different particle sizes. Fig. 1b, c, e and f show the time series of the particle size distribution, with particle diameter plotted against time and the data colored by the corresponding dN/dlog Dp values. Fig. 1b and e present the observed data, while Fig. 1c and f depict the model's estimations. These figures allow for a temporal comparison, demonstrating the model's capability to replicate not only the overall distribution but also the temporal evolution of particle concentrations across different sizes. One notable aspect is the model's underperformance in estimating particles smaller than 10 nm. This is largely due to the limitations of the Condensation Particle Counter (CPC) used in the measurements, which has a lower detection limit of 10 nm. As a result, the model does not have sufficient training data for these smaller particles, leading to discrepancies in this size range. Despite this limitation, the model performs robustly for larger particles, successfully capturing variations in particle size distribution. This capability is crucial for understanding the temporal dynamics of aerosol concentrations and their implications for environmental and health-related studies. The observations at Qvidja (Fig. 1e) indicate low values from January to March and high values from April to September, while the model (Fig. 1f) does not capture this trend well. This discrepancy likely arises because the Qvidja station is located in a coastal agricultural environment, which differs from the environments of the three SMEAR stations (subarctic forest, boreal forest, and urban) used for training the model. The training data may not fully capture the unique processes influencing the aerosol size distribution at the coastal Qvidja site. Furthermore, the Qvidja data were only used for testing and were limited to the year 2019, the only year for which a complete and usable dataset is available from that station. This limitation might also contribute to the observed discrepancies.
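The median and interquartile band in a quartile plot of this kind are computed per size bin; a synthetic lognormal sample stands in for the measured dN/dlog Dp matrix below:

```python
import numpy as np

# Synthetic stand-in for a (time x size-bin) dN/dlog Dp matrix
rng = np.random.default_rng(1)
data = rng.lognormal(mean=6.0, sigma=0.8, size=(500, 30))

# Q1, median (Q2) and Q3 per size bin, as drawn in the quartile plots
q1, q2, q3 = np.percentile(data, [25, 50, 75], axis=0)
```

Plotting `q2` as the solid line and filling between `q1` and `q3` for both measured and estimated matrices reproduces the comparison shown in the figure.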
Fig. 2 also highlights the distinct patterns in estimative accuracy across different environmental settings, as evidenced by the variation in coefficient of determination (R2) values. R2 is a statistical measure indicating the proportion of the variance in the dependent variable that is predictable from the independent variables, and it is used to evaluate the models' goodness of fit. Negative R2 values, observed in some cases, indicate that the model's estimations for those instances do not capture the variations in the observed data and perform worse than simply predicting the mean value. The convex shape of the R2 curves across all test sets suggests that the estimative models exhibit the highest accuracy in the intermediate particle size range (10–300 nm), while estimations for smaller (<10 nm) and larger (>300 nm) particles are less accurate. This pattern likely stems from the relatively higher abundance and stability of mid-sized particles, making them easier for the models to predict. Notably, the inclusion of the total particle number concentration (Ntot) as an input variable significantly enhances the model's performance for smaller particles, particularly those below 75 nm. This improvement is evident in the higher R2 values observed in the models that include Ntot, indicating that Ntot provides critical information that helps capture the dynamics of smaller particle formation and growth processes. While the inclusion of Ntot significantly improves the model's performance, particularly for smaller particles, it is essential to acknowledge that Ntot measurements are not standard at most monitoring sites. This limitation restricts the model's applicability to sites with available Ntot data, highlighting the need for either wider implementation of Ntot measurements or alternative approaches to capture the dynamics of smaller particles in the absence of these data.
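The behaviour of R2, including how it turns negative, follows directly from its definition; the helper below mirrors the standard formula, with illustrative sample values:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Negative when the model is worse than predicting the observed mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

obs = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])   # close to the observations
bad = np.array([4.0, 3.0, 2.0, 1.0])    # anti-correlated estimates
```

Evaluating `r2_score` per size bin against the held-out measurements yields the curves shown in Fig. 2; the anti-correlated example illustrates how an estimate can score well below zero.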
For the SMEAR II and III stations, the model trained on data from all stations performs comparably to those trained on station-specific data, reflecting the similarity in environmental conditions and particle formation processes between these sites. However, the performance for SMEAR I is noticeably lower, likely due to the unique subarctic environment at this station, which differs significantly from the other locations. This discrepancy underscores the importance of tailoring estimative models to the specific characteristics of the station's environment. In the case of the Qvidja station, which was not included in the training set, the model trained on all available data still shows reasonable performance within the 100–400 nm size range. This suggests that while the general patterns of particle size distribution can be captured even in previously unseen environments, the model's estimative power diminishes outside the trained size range, particularly for very small and very large particles. This highlights the need for more diverse training data to improve the generalizability of the models across different environments.
Fig. 4 Random forest feature importance (WX: north component of horizontal wind speed; WY: east component of horizontal wind speed).
The color scale represents the particle size distribution (dN/dlog Dp), with warmer colors indicating higher concentrations. Subplots (a) and (e) show the measured particle size distributions for the non-event and event days, respectively. On the event day (subplot (e)), a distinctive “banana” shape appears at particle sizes below 30 nm, signaling the occurrence of a new particle formation event. Subplots (b) and (f) present estimations from a model trained without Ntot. This model fails to accurately capture the particle formation and growth dynamics on the event day, missing the characteristic “banana” shape entirely. This indicates the model's inability to predict the burst of particles smaller than 30 nm during an NPF event when Ntot is excluded. In contrast, subplots (c) and (g) display estimations from a model trained with Ntot. These estimations successfully capture the burst of particles smaller than 30 nm on the event day, accurately reproducing the “banana” shape observed in the measured data. This demonstrates the importance of including Ntot as an input variable for accurately modeling particle formation dynamics. Furthermore, subplot (d) provides an even more detailed estimation for the event day using a model specifically trained for the SMEAR II station with Ntot included. This model not only captures the particle formation event but also offers a more precise depiction of the particle growth dynamics compared to the generalized model shown in subplot (g). The discrepancy in capturing the small event observed at 6–9 am local time between the generalized and specified models could be attributed to several factors. The generalized model, trained on data from all three SMEAR stations, may have learned a broader range of atmospheric conditions and particle formation patterns, allowing it to detect subtle variations that the specified model, trained solely on SMEAR II data, might miss.
Additionally, the small event might be associated with specific local meteorological conditions or source contributions that are less pronounced at SMEAR II compared to the other stations, contributing to the specified model's inability to capture it. Further investigation is needed to fully understand the reasons behind this difference in model performance. These comparisons underscore the critical role of including Ntot as an input variable in models estimating new particle formation events. The detailed analysis of feature importance and the performance comparison across different models clearly demonstrate that incorporating Ntot significantly enhances the accuracy of particle size distribution estimations, particularly for smaller particles during NPF events. Since the model's performance for particles below 10 nm is limited, the NPF event case study may primarily reflect the model's ability to capture overall trends and patterns of particle formation and growth, rather than providing precise quantification of particle concentrations in the sub-10 nm range.
In terms of the accuracy of the estimated results, the model including Ntot as an input can predict new particle formation events, but it is not sufficient for more detailed quantitative analysis, such as calculating the new particle formation rate and growth rate. The inability of the models to predict particles smaller than 100 nm in Qvidja (Fig. 2) may be related to the marine agricultural environment, which features fast local nucleation and growth rates; these particle formation mechanisms are not present at the other stations. Fig. 3a shows that the correlation between the derived number concentrations of the test data and the estimations is weak, because particles smaller than 100 nm contribute significantly to the total particle number concentrations.
The model's performance was found to be most accurate within the 10–500 nm particle size range, which can be attributed to the availability of more extensive and reliable data for these particle sizes. The limitations of the instruments used to measure particle size distribution, particularly for particles smaller than 10 nm and larger than 500 nm, likely contributed to the reduced accuracy in these ranges. To improve the model's ability to predict the full spectrum of particle sizes, future research should focus on incorporating data from instruments with wider detection ranges and enhanced sensitivity, particularly for the smallest and largest particle sizes.
The methods of this research can still be improved in the future. Longer data sets may yield better-trained models. When selecting the features, we did not use PM2.5, because the concentration of PM2.5 in the atmosphere in Finland is relatively low and close to the detection limit of the instruments, so the measurement results fluctuate significantly. PM2.5, as a commonly measured parameter, is likely to be useful as a model input in the future, especially in urban environments. This study is based on the understanding that Ntot, reflecting the total number of particles present at a given time, directly influences their distribution across different size ranges at the same time point. However, future studies could explore using lagged Ntot values, representing past measurements, as an alternative approach to better capture the temporal evolution of the particle size distribution. While our current analysis focused primarily on diurnal and seasonal trends, future studies could investigate the models' performance in estimating weekly variations, particularly in environments with distinct weekly patterns in anthropogenic emissions.46 The stability of the boundary layer, which can significantly influence vertical mixing and aerosol transport processes, was not explicitly considered in the current model. Incorporating parameters reflecting boundary layer stability, such as mixing layer height or stability indices, could potentially enhance the model's ability to capture the temporal variations in aerosol size distribution, particularly during periods of strong diurnal changes in atmospheric stability.47 As our models are trained exclusively on data from Finland, their applicability might be limited to geographical regions with comparable meteorological conditions, such as similar temperature ranges, relative humidity levels, and wind patterns, as well as similar atmospheric composition, including concentrations of trace gases and pre-existing aerosols.
To assess the model's generalizability, testing with data from different environments, particularly those with distinct meteorological and atmospheric characteristics, is crucial. It is foreseeable that training on data from stations in a wider range of environment types would yield models with better performance and generalization. It is also possible to train models for specific environments; for instance, models trained specifically for urban environments may be used for aerosol particle exposure estimates.
Transfer learning with pre-trained time series neural network models is worth trying as an alternative to training a model from scratch.48 It is worth mentioning that ensemble learning, which integrates multiple models into one, has demonstrated strong estimative performance in many fields,49 and it is another direction for improving the models in the future.50 This study primarily highlights the estimative capabilities of RNN models; however, a thorough evaluation of their computational efficiency compared to that of traditional models remains an important area for future exploration. Finally, causal inference has flourished in the past two decades and is considered one of the key directions for data science.51,52 If the tools of causal inference are applied in follow-up studies, the interpretability of the models will improve, potentially guiding traditional research based on physical and chemical analyses.
Footnotes

† Present address: Shanghai Key Laboratory of Atmospheric Particle Pollution and Prevention (LAP3), Department of Environmental Science & Engineering, Fudan University, Shanghai, China.

‡ Present address: International Laboratory for Air Quality and Health, School of Earth and Atmospheric Sciences, Queensland University of Technology, Brisbane, Australia.

This journal is © The Royal Society of Chemistry 2025