L. C. Richards ab, N. G. Davey a, C. G. Gill abcd and E. T. Krogh *ab
aApplied Environmental Research Laboratories, Chemistry Department, Vancouver Island University, Nanaimo, British Columbia, Canada. E-mail: Erik.Krogh@viu.ca
bDepartment of Chemistry, University of Victoria, Victoria, British Columbia, Canada
cChemistry Department, Simon Fraser University, Burnaby, B.C., Canada
dDepartment of Environmental and Occupational Health Sciences, University of Washington, Seattle, WA, USA
First published on 29th November 2019
Volatile and semi-volatile organic compounds (S/VOCs) are ubiquitous in the environment, come from a wide variety of anthropogenic and biogenic sources, and are important determinants of environmental and human health due to their impacts on air quality. They can be measured continuously, without chromatographic separation, by direct mass spectrometry techniques such as membrane introduction mass spectrometry (MIMS) and proton-transfer reaction time-of-flight mass spectrometry (PTR-ToF-MS). We report the operation of these instruments in a moving vehicle, producing full scan mass spectral data to fingerprint ambient S/VOC mixtures with high temporal and spatial resolution. We describe two field campaigns in which chemometric techniques are applied to the full scan MIMS and PTR-ToF-MS data collected with a mobile mass spectrometry lab. Principal Component Analysis (PCA) has been successfully employed in a supervised analysis to discriminate VOC samples collected near known VOC sources including internal combustion engines, sawmill operations, composting facilities, and pulp mills. A Gaussian mixture model and a density-based spatial clustering of applications with noise (DBSCAN) algorithm have been used to identify sample clusters within the full time series dataset collected, and we present geospatial maps to visualize the distribution of VOC sources measured by PTR-ToF-MS.
Environmental significance
Volatile organic compounds (VOCs) are emitted from anthropogenic and biogenic sources and are present at trace levels in the atmosphere. They contribute to degraded air quality as precursors to ground level ozone and secondary organic aerosol formation. In addition, some VOCs are toxic and some are associated with nuisance odours. We describe direct mass spectrometry continuously operated in a moving vehicle to capture changes in VOC composition over time and space. Principal component analysis was employed to discriminate sources with the results visualized to map sources, distributions, and potential impacts. This work has applications to neighbourhood scale air quality mapping, and can be generalized to other sensor data. It is an important first step in real-time source tracking and apportionment.
More recently, the miniaturization and ruggedization of analytical instrumentation have facilitated its use in moving vehicles, allowing for geospatial chemical mapping. In particular, the miniaturization of direct mass spectrometry instrumentation allows for continuous, on-road,19–23 underwater,24,25 and airborne15,26,27 measurements to map spatial and temporal distributions of S/VOC concentrations. Further, techniques such as proton-transfer reaction time-of-flight mass spectrometry (PTR-ToF-MS) and membrane introduction mass spectrometry (MIMS) allow for the simultaneous analysis of the mixture of S/VOCs in an air sample without chromatographic separation.28
MIMS employs a semipermeable membrane as an interface between the sample and the mass spectrometer.29–31 Air samples are continuously passed over a polydimethylsiloxane membrane and S/VOCs pervaporate into a carrier gas, where they are transferred to the mass spectrometer as a mixture for analysis. Electron ionization (EI) typically employed by these systems results in the initial formation of [M+˙] ions and subsequent fragmentation. MIMS is well suited for the analysis of small, hydrophobic molecules present in an air sample. The resulting full scan mass spectra represent the superposition of all of the ions produced by the mixture of VOCs permeating the membrane.19
In PTR-ToF-MS analysis, VOCs in a continuously sampled air stream are reacted with H3O+ reagent ions in a drift tube with a uniform electric field, where ion-molecule collisions occur. If the proton affinity (PA) of the analyte (M) is higher than that of water, protonated molecular ions [MH+] are formed. PTR is a softer ionization method than EI resulting in considerably less fragmentation.32 PTR-ToF-MS is an excellent method for directly analyzing both polar and non-polar trace organic compounds in ambient air without chromatographic separation.15,33 When coupled with time-of-flight mass spectrometry, full scan mass spectra are obtained in microseconds.
Both MIMS and PTR-ToF-MS systems have been operated on a mobile platform for temporally and spatially resolved quantitative analysis of ambient VOCs in the parts-per-trillion by volume (pptv) to parts-per-billion by volume (ppbv) range. PTR-ToF-MS systems have been operated on aircraft campaigns to measure and quantify VOCs in an agricultural fire plume and urban environments,26 as well as over oil and gas producing regions,34 and have been used ‘on-road’ to measure emissions from vehicles.23 Several researchers have employed PTR with quadrupole mass spectrometers to assess BVOC emissions above forested areas.35 Portable MIMS systems have been operated from a moving vehicle for quantitative VOC analysis near oil and gas activities,19,20 to detect products and impurities from methamphetamine precursor production,21 and to measure VOCs associated with traffic and woodsmoke emissions.22 These field studies focused on the identification and quantitation of individual VOCs or isomer classes by analyzing the signal intensity at specific mass-to-charge ratios (m/z) or by using targeted methods, such as tandem mass spectrometry (MS/MS) and/or selected ion monitoring (SIM) scans. Alternatively, qualitative chemometric analysis taking advantage of the wealth of non-targeted information present in a full scan mass spectrum can be used to discriminate between samples containing different VOC mixtures, but has not yet been applied to data collected on a mobile platform.28
Principal component analysis (PCA) is a multivariate dimensional reduction technique that calculates new variables, Principal Components (PCs), that are linear combinations of the original variables (m/z), as shown in eqn (1):
$$\mathrm{PC} = a_1 (m/z)_1 + a_2 (m/z)_2 + \dots + a_n (m/z)_n \tag{1}$$
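As a minimal sketch of eqn (1), the snippet below finds the coefficients (loadings) of the first PC by power iteration and then forms each sample's score as the weighted sum of its mean-centered m/z intensities. The function name, the starting vector, and the toy three-channel spectra are illustrative assumptions; the work described here used MATLAB and the PLS_Toolbox, not this code.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores) for PC 1 of a list of equal-length spectra."""
    n, p = len(rows), len(rows[0])
    # Mean-center each column so every m/z channel has a mean of zero.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Non-uniform start vector (assumption) so it is unlikely to be
    # orthogonal to the direction of maximum variance.
    a = [float(j + 1) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in a))
    a = [v / norm for v in a]
    for _ in range(iterations):
        # One power-iteration step on X^T X: a <- X^T (X a), then renormalize.
        scores = [sum(x[j] * a[j] for j in range(p)) for x in centered]
        a_new = [sum(centered[i][j] * scores[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in a_new))
        if norm == 0:
            break
        a = [v / norm for v in a_new]
    # Eqn (1): each score is a1*(m/z)1 + a2*(m/z)2 + ... + an*(m/z)n.
    scores = [sum(x[j] * a[j] for j in range(p)) for x in centered]
    return a, scores

# Toy data: two samples dominated by channel 1, two dominated by channel 2.
rows = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.1, 0.9, 0.0], [0.2, 0.8, 0.0]]
loadings, scores = first_principal_component(rows)
```

The sign of a PC is arbitrary, so only the relative positions of samples along it carry meaning.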
PCA has been applied to data collected from MIMS and PTR-MS systems for sample discrimination in environmental28,38–41 and food39,42–44 analysis. Alberici et al. analyzed the full scan MIMS spectra of the water soluble fraction of Brazilian commercial gasolines using PCA,41 and Ketola et al. have used a similar approach to distinguish commercial drink products based on the manufacturer.42 PCA has also been employed to differentiate and classify terpene photo-oxidation mechanisms in atmospheric simulation chambers using data collected with a PTR-ToF-MS system.38 We have recently employed full scan MIMS data to discriminate lab-based constructed air samples, VOCs produced in the combustion of different species of wood, and headspace samples influenced by aqueous hydrocarbon solutions.28 None of these studies have exploited the high temporal resolution capabilities of direct mass spectrometry in a moving vehicle to provide geospatial maps of air masses influenced by different VOC sources.
Here we present MIMS and PTR-ToF-MS data collected on-road in a mobile mass spectrometry laboratory during field campaigns on central Vancouver Island, British Columbia (BC), Canada. We use PCA to analyze the full scan mass spectra to discriminate VOC sources in ambient air samples impacted by a variety of activities including vehicle exhaust, pulp mills, sawmills, and composting facilities. We describe a supervised approach, where selected data from the field campaigns was analyzed based on known nearby point sources, as well as an unsupervised approach, where the entire data set was analyzed using PCA followed by a Gaussian mixture model and density-based spatial clustering of applications with noise (DBSCAN) algorithm in order to identify clusters within the data set. Both approaches allow for the discrimination of real-world VOC sources and can be used in combination with spatial information to generate neighborhood-scale maps illustrating the distribution of sources.
000, McMaster Carr, Elmhurst, IL, USA) in order to reduce the effect of the Earth's magnetic field on the measured signal intensity.45 The MIMS instrument described has low ppbv sensitivity for hydrocarbons such as the BTEX suite (benzene, toluene, ethylbenzene, xylenes) and a compound dependent response time of 15–30 seconds due to membrane transport.
The MIMS system was operated from an independent 24 VDC power supply (4 × 6 VDC lead acid batteries, Model S6-275AGM, Surrette Battery Company Ltd, Springhill, NS, Canada) in the vehicle. Ambient air was continuously flowed over the membrane interface at 2 L min−1 using a 1/4 inch outer diameter and 3/16 inch inner diameter fluorinated ethylene propylene (FEP) sampling line (Cole-Parmer, Montreal, QC, Canada) that was passed through the passenger side window and attached to the front of the vehicle 40 cm above the windshield. Two in-line stainless steel frit filters at 15 and 5 μm (Swagelok, Solon, OH, USA) were used to remove particulate matter from the sampling stream.
Field data was collected with the MIMS instrument on August 4, 5, 9, and 10, 2016 in the Nanaimo area of Vancouver Island, BC, Canada. Meteorological data for this field campaign is found in the ESI and summarized in Table S1.† Full scan MIMS mass spectra were collected every 7 seconds on August 4, 9, and 10, and every 1 second on August 5. Air was sampled near fresh paving at two different locations (asphalt samples); vehicle exhaust samples were collected roadside as vehicles were unloading from a ferry, as well as from vehicles encountered when sampling; gasoline samples were collected at a gas station (fresh) and from a storage container (aged); and additional road works samples were collected in the vicinity of the paving activities.
In addition to collecting full scan mass spectra for non-targeted VOC fingerprinting, several targeted quantitative scans were also included for context. Details of the quantitative analysis are found in the ESI, with Table S2† detailing calibration information. Ambient VOC concentrations were generally observed to be in the low ppbv range, with less than 5% of the observations above our quantitation limits (typically 3–6 ppbv) as illustrated in the whisker plots shown in Fig. S1 (ESI†). Concentration excursions above these limits were associated with sampling in close proximity to known point sources with concentrations up to 18 ppbv for benzene, 91 ppbv for toluene, 88 ppbv for ethylbenzene, and 26 ppbv for α-pinene.
Field campaign data was collected on August 21 and 22, 2017 between Nanaimo and Crofton, BC. Winds were light (2–4 m s−1) on all sampling days. Additional meteorological data for this field campaign is found in the ESI and summarized in Table S3.† Samples were collected near two different pulp mills, two sawmills, a wood storage site, two composting facilities (commercial operator and a topsoil producer), a landfill, a gas station, an auto wrecking facility, and a ferry terminal. Some sources were sampled on both days (e.g., composting facility). Mass spectra were collected at 1 Hz using the PTR-ToF-MS system, and high resolution positional data was collected using a roof mounted antenna (Hemisphere A45, Scottsdale, AZ) and GPS receiver (Hemisphere R330, GNSS Receiver, Scottsdale, AZ, USA). While the focus of this work was on qualitative analysis, direct calibrations for benzene, toluene, ethylbenzene, dimethyl sulfide, and α-pinene were also performed using certified permeation tubes in a Dynacalibrator Gas Dilution System with concentrations ranging up to 150 ppbv. Details of this work are found in the ESI, with Table S4† detailing the calibration information. Observed VOC concentrations were generally below 1 ppbv, with elevated levels up to 150 ppbv of dimethyl sulfide (m/z 63.022) being measured in the vicinity of a pulp mill. BTEX concentrations up to 25 ppbv benzene (m/z 79.049), 50 ppbv toluene (m/z 93.061), and 60 ppbv ethylbenzene (m/z 107.076) were measured near an auto wrecking facility. Terpene concentrations up to 56 ppbv (as α-pinene, m/z 137.110) were measured near a wood chip truck and sawmills. Whisker plots for VOC concentrations over the sampling period are shown in Fig. S2 in the ESI.†
[Equation (2): equation image not recovered]
is the full scan mass spectrum as a vector of unit length. Each column was mean-centered to give each variable (m/z) a mean of zero. PCA was applied to the normalized and mean-centered datasets. Data analysis was done using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA), MATLAB (MathWorks, Natick, MA, USA), MATLAB's Statistics and Machine Learning Toolbox, and the PLS_Toolbox (Eigenvector Research Inc, Manson, WA, USA).
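The preprocessing described above can be sketched as follows. Because the equation itself is not reproduced here, this sketch assumes that "unit length" refers to the Euclidean norm of each spectrum; the function names and the toy intensities are hypothetical.

```python
import math

def unit_length(spectrum):
    """Scale one spectrum so its Euclidean norm is 1 (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in spectrum))
    return [v / norm for v in spectrum] if norm > 0 else list(spectrum)

def mean_center_columns(rows):
    """Subtract the column mean so each m/z channel has a mean of zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

# Two toy spectra with three m/z channels each.
spectra = [[3.0, 4.0, 0.0], [0.0, 6.0, 8.0]]
normalized = [unit_length(s) for s in spectra]
centered = mean_center_columns(normalized)
```

Normalizing each spectrum first removes differences in total signal between samples, so the subsequent PCA responds to the relative pattern of ions rather than to overall concentration.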
000 algorithm iterations per replicate. Models were run with covariance matrices being shared among the groups, unshared among the groups, and with both full and diagonal covariance matrices using modified MATLAB code.48 Akaike information criterion (AIC), and Bayesian information criterion (BIC) values were calculated for each of the 19 clustering options. Additional AIC and BIC values were calculated for up to 40 groups using full, unshared covariance matrices. The results from these metrics were then compared to inform the number of groups within the dataset. The final number of groups used in the analysis was determined by producing geospatial maps with different colours used for each group. These maps were then interrogated, and compared with field notes to determine the optimum clustering model. For the DBSCAN algorithm, neighbourhood size and the number of neighbours needed to identify core points were varied until the algorithm would identify multiple clusters within the PCA model. Once again, the final number of clusters was determined using a combination of geospatial maps and field notes.
The PCA for the MIMS data is shown in Fig. 2, with the scores plot shown in panel D, and the loadings plot in panel E. For this dataset, PC 1 describes 42% of the variance, PC 2 describes 9%, and PC 3 describes 6%. The variance captured by each of the PCs can be visualized using a Scree plot, in which the relative variances are plotted in descending order. The Scree plot for this analysis is shown in Fig. S3 in the ESI† indicating that the first 10 PCs account for 71% of the variance in the data. Samples in the scores plot are colour-coded based on nearby sources identified in the field notes, and samples from similar hydrocarbon sources are seen clustered together. PC 1 discriminates the asphalt samples from the other measured samples. The asphalt samples generally have negative scores on PC 1 and scores close to zero on PC 2. PC 2 allows for the discrimination of vehicle exhaust from fugitive emissions from gasoline and road works samples. Generally, the vehicle exhaust samples have more positive scores on PC 2. The aged gasoline samples (red diamonds) have more positive scores than the fresh gasoline samples (red triangles) on PC 1. Weathering of gasoline samples leads to the loss of the more volatile compounds which could account for the separation of the fresh and aged gasoline samples.53 The samples associated with other road works activities plot near the gasoline samples. We hypothesize that these samples could have been influenced by fugitive emissions from hydrocarbon cleaning mixtures or improperly sealed gasoline containers. For the vehicle exhaust samples, the samples with the highest positive scores on PC 2 were collected across the street from a sawmill as vehicles were unloading from a ferry. These samples are also impacted by emissions from the sawmill, discriminating them from samples impacted solely by vehicle exhaust observed elsewhere. 
It was not possible for us to sample air impacted only by the sawmill without the confounding influence of vehicle exhaust on this field campaign. The scores plot for PC 1 versus PC 3 is found in Fig. S4 (panel A) in the ESI.† In this projection, the asphalt samples remain well separated, with negative scores on PC 1, but there is more overlap between the other hydrocarbon samples.
The loadings plot identifies which m/z lead to sample discrimination in the scores plot. Many of the measured m/z that are more dominant in the asphalt mass spectrum have negative loadings on PC 1, while m/z 91 demonstrates a high positive loading on PC 1, and was the base peak in most of the mass spectra from other sources. Ions due to the presence of α-pinene or other monoterpenes (m/z 93 (C7H9+) and 136 (C10H16+)) from fresh cut forest products have positive loadings on PC 2 allowing the vehicle exhaust samples collected near a sawmill to be discriminated from those collected elsewhere in the region. The loadings plot for PC 1 versus PC 3 is found in Fig. S4† (panel B), and indicates that it is once again the asphalt samples that are differentiated from the others due to a greater number of observed compounds. Overall, this analysis demonstrates that full scan MIMS data of ambient, volatile organic hydrocarbons collected while driving can be used to discriminate samples arising from different sources.
A total of 290 peaks were identified between m/z 30–215 in the full scan data. Table S5 in the ESI† provides a peak list of the major ions detected in the field campaign, with our measured m/z, possible chemical formula, calculated exact mass for the formula, potential compound identities, and observed sources. The range of compounds detected included hydrocarbons (e.g., BTEX, isoprene, monoterpenes), oxygenated species (e.g., methanol, acetone, acetaldehyde), and sulfur compounds (e.g., dimethyl sulfide). As the PTR-ToF-MS does not employ chromatography or tandem mass spectrometry for additional selectivity, isomers cannot be distinguished, and the signal intensity of each m/z is for the sum of isomers (e.g., m/z 59.048 would have a chemical formula of C3H7O+ which could be both protonated acetone and protonated propanal). It should be noted that peaks at m/z 93.036, 107.076, 121.085, and 135.092, attributed to the protonated molecular ions of hydrocarbons associated with fugitive emissions and/or incomplete combustion of fossil fuels, have also been observed to be derived from the fragmentation of mono and sesquiterpenes, even with the softer chemical ionization of proton transfer.54,55 This is reflected in the listed sources for these ions, as these ions had major sources near an auto wrecking facility, gas station, and ferry, but were also detected near biomass sources such as the sawmills, composting facilities, and a wood chip truck. 
For the purpose of the analysis described below, the different samples have been broadly classified as follows: aged biomass samples (collected near a landfill, municipal composting facility, and topsoil producers); fresh biomass samples (collected near sawmills, wood chip trucks, wood storage, and sawdust piles); hydrocarbon samples (collected near a gas station, ferry terminal, and auto wrecking facility); pulp mill samples (collected near two pulp mills); and farm vehicle samples (collected near a tractor carrying hay).
During the 2017 campaign there were several compounds measured exclusively at or near particular sources. For example, m/z 63.022 (C2H7S+) was only detected near the two pulp mills on the drive route. Peaks at m/z 57.070 (C4H9+), 107.076 (C8H11+), and 121.085 (C9H13+) were generally detected near a gas station, ferry, and an auto wrecking facility. Example mass spectra are shown in Fig. 3 (panels A–C), with some ions of interest labeled on the mass spectra. The mass spectrum in panel A was collected near the municipal composting facility, the mass spectrum in panel B was collected near the auto wrecker, and the mass spectrum shown in panel C was collected near a sawmill. The mass spectrum collected near the auto wrecker is dominated by hydrocarbons, while the mass spectra collected near the compost and sawmill also contain ions for oxygenated species.
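For readers who want to check a proposed ion formula against a measured peak, the following sketch computes the monoisotopic m/z of a singly charged cation from its elemental formula. The function name and the small element table are illustrative assumptions (only C, H, O, and S are included), and calculated values need not coincide exactly with the measured m/z quoted in the text.

```python
import re

# Monoisotopic masses in unified atomic mass units (most abundant isotopes).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503, "O": 15.99491462, "S": 31.97207117}
ELECTRON = 0.00054858  # mass of the electron removed to give a +1 charge

def cation_mz(formula):
    """Monoisotopic m/z of a singly charged cation written as e.g. 'C2H7S'."""
    mass = 0.0
    # Each match is an element symbol followed by an optional count.
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass - ELECTRON

# Ion formulas named in the text.
for ion in ("C2H7S", "C4H9", "C3H7O"):
    print(ion, round(cation_mz(ion), 4))
```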
For the supervised PCA, 298 average mass spectra were used. The PCA scores and loadings plots for the PTR-ToF-MS data are shown in Fig. 3 (panels D and E). PC 1 accounts for 51% of the variance in the data set, with PC 2 accounting for 17%. The Scree plot for the analysis is shown in Fig. S5 in the ESI† with the first three PCs accounting for 78% of the variance in the data set (>97% of the variance is described by the first 10 PCs). Samples in the scores plot have been colour-coded based on the source type they represent (e.g., hydrocarbon, fresh biomass), and the different shapes of the same colour indicate different encounters with the same source type. For example, we encountered three primarily hydrocarbon sources (red) and they are marked as triangles (ferry), plus signs (auto wrecking facility), and circles (gas station) in Fig. 3. In the scores plot, hydrocarbon samples have negative scores on PC 1, discriminating them from other source types. PC 2 provides discrimination between VOCs from fresh biomass, pulp mills, and aged biomass samples. The latter have positive scores on PC 2, samples with scores near zero on PC 2 are associated with pulp mill emissions, and samples with negative scores on PC 2 are due to emissions from fresh biomass near sawmills, wood chip trucks, etc. Additionally, samples collected near a farm vehicle carrying hay are located between the hydrocarbon and biomass samples on the scores plot, potentially due to the mixed nature of this source.
The loadings plot can be used to identify possible compounds leading to sample discrimination. For example, the [MH]+ ions for many hydrocarbons (m/z 93.061, 107.076, 121.084) have negative loadings on PC 1, which allows the hydrocarbon samples to be discriminated from the biogenic emissions. Acetaldehyde (m/z 45.025) has a negative loading on PC 2, while acetic acid (m/z 61.029) which was mainly detected in the vicinity of the compost facility has the highest positive loading on PC 2.
The PCA scores and loadings plots for PC 1 versus PC 3 are shown in Fig. S6 in the ESI.† In this projection, the hydrocarbon samples are still separated from the biomass samples, as that discrimination falls along PC 1, but there is more overlap between the biomass and pulp mill samples. Some of the pulp mill samples have high positive scores on PC 3 due to the presence of dimethyl sulfide (m/z 63.022) in these samples.
The mass spectra used for this analysis did not include the signal intensity for protonated methanol as this improved sample discrimination for samples impacted by biomass sources and removed the confounding influence of detecting periodic windshield washing fluid from vehicles during on-road sampling. The PCA including methanol is shown in Fig. S7 in the ESI.† When methanol is included, hydrocarbon sources remain well separated, but there is some overlap between the pulp mill and fresh biomass samples and the samples associated with pulp mills and aged biomass sources overlap completely.
A regional scale map of the samples is shown in the top panel of Fig. 4 along with the time series of the data (bottom panel). The drive route is shown by the black line, with the samples used in the supervised PCA colour-coded based on source type. The size of the dot is scaled by the total VOC concentration using the total ion current for m/z 30–215 (excluding the reagent ions). The labels on the map show the nearby VOC sources. Those labeled in black text are stationary sources (mostly industrial), and those labeled in red text were moving. The time series data shows the overlay of all measured m/z observed during the field campaign, with the exception of methanol (as it was excluded from the chemometric analysis). Mass spectral peaks of interest are labelled on the time series, and the portions of the time series data collected near the identifiable sources used in the PCA are indicated by the coloured bars along the top of the plot. Fig. S8 (panels A–D) in the ESI† shows time series of subsets of the measured compounds, with methanol in panel A, major hydrocarbon ions and their fragments in panel B, major non-methanol oxygenated species and their fragments in panel C, and sulfur compounds in panel D. It is also important to note that the on-road measurements can be influenced by mobile sources and are not always associated with stationary area sources. With the exception of acetic acid (which exhibits a slow decay time due to carry-over in the sample line), we observe only small differences in the decay times of the remaining VOCs described here (<5 s). These differences are damped out for the supervised analysis (15 s averaged mass spectra) and have only a minor influence on the unsupervised analysis using 1 second averaged mass spectra. In both cases, the decay times do not change the distribution of identified sources at the spatial resolution presented here.
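The 15 s averaged spectra used in the supervised analysis can be thought of as block averages of the 1 Hz data. The sketch below shows one way to do this; the function name is hypothetical, and discarding a trailing partial block is an assumption rather than a documented choice.

```python
def block_average(spectra, block):
    """Average consecutive groups of `block` spectra, channel by channel."""
    averaged = []
    # Step through complete blocks only; a trailing partial block is dropped.
    for start in range(0, len(spectra) - block + 1, block):
        chunk = spectra[start:start + block]
        averaged.append([sum(col) / block for col in zip(*chunk)])
    return averaged

# Five toy spectra with two channels each, averaged in blocks of two.
one_hz = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
averaged = block_average(one_hz, 2)
```

With 1 Hz data, calling `block_average(spectra, 15)` would give 15 s averages, which smooths out the few-second differences in decay time between compounds.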
Given the success of discriminating known VOC samples above, we have extended this method to identify groupings within the full data set in an unsupervised analysis. The complete dataset contains 16 737 mass spectra measured at 1 second intervals across 290 mass channels. As has been noted by others, the high sample-to-variable ratio reduces the likelihood of spurious correlations.56 Two clustering algorithms (GMM and DBSCAN) were employed to group samples based solely on their mass spectral fingerprints. The GMM algorithm clusters all data points into groups, whereas DBSCAN only assigns dense data clusters within the PCA model. A PCA model was used as the algorithm input as this greatly reduced computational time, especially in the case of the GMM. A 15 component PCA model was used as the input. The first three PCs represent 44% of the variance in the dataset. Most PCs past the 10th modelled noise, with only a few beyond that showing some structure related to specific geographic locations (i.e., PCs 12 and 15). The 15 PC model describes 79% of the variance in the data, with each subsequent PC accounting for less than 1.2% of the variance. It was also found that the addition of more PCs had little impact on the results of the clustering algorithms (data not shown).
For the calculated GMMs both the AIC and BIC values were minimized using unshared, full covariance matrices for 2–20 clusters. For these models, the AIC and BIC values decreased with increasing cluster number, with very similar values obtained for 12–20 clusters. Additional AIC and BIC values were calculated for up to 40 clusters using the models with full, unshared covariance matrices. The plots of the AIC and BIC values are found in Fig. S9 of the ESI.† The calculated AIC values are similar for models using 12 or more clusters (they continue to decrease up to 40 clusters), whereas the BIC values reach a minimum value at 27 clusters. As BIC penalizes model complexity more than AIC,46 the BIC minimum value of 27 was used as an indicator of the most complex model needed to describe the data, with models containing 9, 10, 11, 12, 13 and 14 clusters also being calculated. Geospatial maps of the different models were compared, and the 12 group model was selected as it was able to identify sources within the dataset, without overcomplicating the model.
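To make the AIC and BIC comparison concrete, the sketch below uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of free parameters, n the number of samples, and L the maximized likelihood. The parameter count assumes a GMM with full, unshared covariance matrices; the function names are illustrative, and the log-likelihood would come from the fitted model.

```python
import math

def gmm_parameter_count(n_components, n_dims):
    """Free parameters of a GMM with full, unshared covariance matrices."""
    weights = n_components - 1                      # mixing weights sum to 1
    means = n_components * n_dims                   # one mean vector per group
    covariances = n_components * n_dims * (n_dims + 1) // 2  # symmetric matrices
    return weights + means + covariances

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_samples):
    return n_params * math.log(n_samples) - 2 * log_likelihood

# A 12 group model on a 15 component PCA input has 1631 free parameters.
k = gmm_parameter_count(12, 15)
# Per-parameter penalty: 2 for AIC versus ln(n) for BIC.
bic_penalty_per_parameter = math.log(16737)
```

For a dataset of this size, ln(n) is well above 2, which is why BIC reaches a minimum at a finite number of clusters while AIC keeps decreasing.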
The PCA scores plots of PC 1 versus PC 2 and PC 1 versus PC 3 for this analysis are shown in Fig. 5 (panels A and B), with a map of the results in Fig. 5 (panel C). Many of the point sources on the map are identified by the large dots in their vicinity (gas station, compost facility, ferry terminal, pulp mill, auto wrecker). Samples are coloured based on group membership as determined by the 12 group GMM. Fig. S10 in the ESI† shows the loadings plots of PC 1 versus PC 2 and PC 1 versus PC 3 in panels A and B respectively. In the scores plots similar samples are grouped together, with many point sources grouping along the corners and edges of the data. Groups with individual sources or source types have been identified on the scores plot. For clarity, Fig. S11† (panels A–L) depicts the PC 1 versus PC 2 scores plots for the individual groups, as well as the average mass spectra for each of the groups with some ions of interest labeled. Scores along PC 2 discriminate hydrocarbon sources from pulp mill samples, fresh, and aged biomass samples. PC 1 can be used to discriminate the biomass sources, with aged biomass having negative scores on PC 1 and fresh biomass having positive scores on PC 1.
The loadings plots shown in Fig. S10† (panels A and B) indicate that similar ions lead to sample discrimination in the supervised and unsupervised data sets, with signals associated with acetic acid, acetaldehyde, formic acid/ethanol, ethylbenzene/xylenes, and toluene having high loadings. For the groups identified using GMM, some groups were associated with a single source, while others contained samples from multiple sources. For example, Group 10 (magenta triangles) was only measured near a municipal composting facility with a mass spectrum dominated by a C2H3O+ ester or acid fragment, acetaldehyde, acetone, acetic acid, and monoterpene signals (Fig. S11,† panel J) while Group 9 (red triangles) was collected mainly in the vicinity of a pulp mill with signals associated with acetaldehyde, acetone, acetic acid, dimethyl sulfide, and monoterpenes (Fig. S11,† panel I). Other groups were associated with multiple sources with similar emissions. For example, Group 5 contained samples collected near multiple sawmills, and from driving near a wood chip truck with a mass spectrum influenced by acetaldehyde and monoterpene fragments (Fig. S11,† panel E); and Group 12 is impacted by the three hydrocarbon sources, and an unknown source, with a mass spectrum dominated by the BTEX suite and other alkylated aromatics (Fig. S11,† panel L). Finally, some sources are present in multiple groups in the scores plot. For example, a group of pulp mill samples has positive scores on PC 1 and PC 2 (Group 9, red triangles), while other pulp mill samples have positive scores on PC 1 and negative scores on PC 2 (Group 3, blue diamonds). As pulp mills have multiple different sources for VOC emissions (e.g., wood chip piles, pulp production, wastewater ponds), it is not surprising that not all the pulp mill samples are grouped together. Some VOC sources encountered during the field campaigns were unidentified at the time of sampling as can be seen in the grey shaded areas in Fig. 1.
We have tentatively assigned source classes to these VOC signals based on the group membership in the unsupervised data analysis. For example, the unidentified signals that appear at 16:40 in Fig. 1 could be attributed to a hydrocarbon source, whereas those at 16:53 and 17:15–17:25 could be assigned to pulp mill emissions. The assignment of the samples collected from 17:15–17:25 is supported by field observations noting a sulfur smell. Given the limited data presented here, we have chosen not to build a classification model. Future field campaigns will include more sources and replicates to enable us to train and verify a robust predictive model.
For the DBSCAN analysis, a neighbourhood size of 0.022 was used, with 9 neighbours required to identify a core point. This algorithm identified 13 groups within the data (1 of which, Group 13, represents data that was not clustered). Unlike the GMM algorithm, DBSCAN has not assigned all data points to a cluster; however, it has identified most of the high VOC concentration samples within the dataset. The results of the analysis are seen in Fig. S12,† with the PC 1 versus PC 2 and PC 1 versus PC 3 scores plots shown in panels A and B, and maps of the analysis shown in panels C and D. The data is coloured by the DBSCAN group membership, with the black dots representing the data left ungrouped. The map in Fig. S12† (panel C) shows all data, while the map in Fig. S12† (panel D) omits the black data points for clarity. The PC 1 versus PC 2 scores plots and average mass spectra for each of the groups are shown in Fig. S13 (panels A–M) in the ESI.† Samples are clustered based on source type, with most of the clusters falling on the extremities of the scores plot. Group 8 (Fig. S13,† panel H), which has scores near the origin on each of the three PCs shown, contains high concentration samples measured in the vicinity of a pulp mill at approximately 14:15 on August 22, 2017. These samples were higher in dimethyl sulfide (m/z 63.022) than those measured elsewhere, but dimethyl sulfide does not have a significant loading until higher PCs (e.g., PCs 6, 7, 12, and 15 (data not shown)). Groups 2, 4, 5, 8, and 12 (Fig. S13,† panels B, D, E, H, and L) have identified single sources within the data (gas station, wood chip truck, compost, pulp mill, and an unknown (possibly hydrocarbon) source, respectively), while Groups 3 and 10 (Fig. S13,† panels C and J) contain samples from multiple sources of similar type (fresh biomass samples and hydrocarbon samples).
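The roles of the two DBSCAN parameters can be made concrete with a minimal sketch. The code below is an illustrative, standard-library-only implementation and is not the software used for the analysis reported here. The function name `dbscan`, the helper `grid_blob`, and the synthetic two-dimensional "scores" are hypothetical; only the neighbourhood size (0.022) and the neighbour count (9) are taken from the text, and counting the point itself among its neighbours is an assumption.

```python
import math

NOISE = -1  # label for points left unclustered

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index (0, 1, ...) or NOISE.

    A point is a core point when at least `min_pts` points (itself
    included, an assumption of this sketch) lie within distance `eps`.
    """
    def region(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep expanding
    return labels

def grid_blob(cx, cy, step=0.004, side=5):
    """Synthetic tight cluster: a side x side grid of points."""
    return [(cx + step * a, cy + step * b)
            for a in range(side) for b in range(side)]

# Two dense synthetic clusters plus three isolated points (made-up values).
points = (grid_blob(0.10, 0.10) + grid_blob(-0.10, 0.05)
          + [(0.5, 0.5), (-0.5, -0.4), (0.3, -0.6)])
labels = dbscan(points, eps=0.022, min_pts=9)
print(sorted(set(labels)))  # cluster indices found, with -1 for noise
```

In this toy example the isolated points have no neighbours within the neighbourhood size and remain unclustered, analogous to the black points in Fig. S12,† while each dense region forms its own group.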
In some cases, the clusters identified by GMM have also been identified by DBSCAN, while in other cases DBSCAN has identified multiple clusters within a single group identified by GMM. For example, GMM Group 10 (Fig. S11,† panel J) and DBSCAN Group 5 (Fig. S13,† panel E) contain almost the same set of samples, and the mass spectra associated with them have a correlation coefficient of 0.999 (Table S6†). In the case of GMM Group 12 (Fig. S11,† panel L), DBSCAN has identified 3 sub-clusters (Fig. S13,† panels B, J, and L), with correlation coefficients against the GMM Group 12 mass spectrum of 0.916, 0.990, and 0.604 for DBSCAN Groups 2, 10, and 12, respectively. In this case, DBSCAN has separately identified different hydrocarbon sources described by mass spectra containing different ratios of the measured hydrocarbons, while GMM has identified one cluster containing all the hydrocarbon samples. A full correlation matrix between the mass spectra for the clusters identified by the two methods is shown in Table S6.†
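A comparison of this kind can be sketched as follows. The example assumes that the correlation coefficient is the Pearson coefficient computed between cluster-average mass spectra, which is not stated explicitly in the text. The function names `pearson` and `mean_spectrum`, the group labels, and all intensities are hypothetical and serve only to illustrate the calculation.

```python
import math

def mean_spectrum(spectra):
    """Average a list of spectra channel by channel."""
    return [sum(channel) / len(spectra) for channel in zip(*spectra)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length spectra."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up normalized intensities over five m/z channels (not measured data).
gmm_means = {
    "GMM A": mean_spectrum([[0.50, 0.30, 0.10, 0.05, 0.05],
                            [0.48, 0.32, 0.10, 0.06, 0.04]]),
}
dbscan_means = {
    "DBSCAN 1": mean_spectrum([[0.49, 0.31, 0.10, 0.05, 0.05]]),
    "DBSCAN 2": mean_spectrum([[0.05, 0.10, 0.15, 0.40, 0.30]]),
}

# Correlation table between every GMM cluster and every DBSCAN cluster.
table = {(g, d): pearson(gs, ds)
         for g, gs in gmm_means.items()
         for d, ds in dbscan_means.items()}
for pair, r in table.items():
    print(pair, round(r, 3))
```

A value close to 1 suggests that the two algorithms have isolated samples with nearly the same fingerprint, while lower values point to sub-clusters whose ion ratios differ from the parent group.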
The purpose of the unsupervised analysis using both the GMM and DBSCAN algorithms was not to construct a definitive model of the data collected, but to demonstrate the ability of these techniques to identify different VOC sources measured using a PTR-ToF-MS operated in a moving vehicle. Both techniques were able to discriminate VOC sources using the normalized mass spectral fingerprint and associate these sources with specific geographic locations.
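To illustrate why a normalized fingerprint is useful for source discrimination, the short sketch below scales each spectrum by its total signal so that samples of the same mixture at different dilutions give the same fingerprint. Normalization to the summed signal is an assumption made for this example; the scheme actually applied to the data is not restated in this section. The function name `normalize` and the intensities are hypothetical.

```python
def normalize(spectrum):
    """Scale a spectrum so that its channel intensities sum to 1."""
    total = sum(spectrum)
    if total <= 0:
        raise ValueError("spectrum must have positive total signal")
    return [x / total for x in spectrum]

# The same hypothetical mixture sampled close to a source and after
# tenfold dilution downwind (made-up intensities, arbitrary units).
near_source = [200.0, 120.0, 40.0, 40.0]
downwind = [20.0, 12.0, 4.0, 4.0]

fp_near = normalize(near_source)
fp_far = normalize(downwind)
print(fp_near)
print(fp_far)
```

Because both fingerprints carry the same relative ion abundances, a clustering algorithm operating on them would group the samples by composition rather than by concentration.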
Mass spectrometry yields molecular level information about the sample composition, which provides chemical insight that can be used to inform targeted analysis, or be used to identify molecular markers. Although the analysis presented here cannot identify VOC sources that have not been previously encountered, it will direct the user to ‘unknowns’, which can be the subject of subsequent investigation. Future work includes the application of data fusion techniques to improve source discriminating power by incorporating data from inorganic gas and particulate matter sensors. The incorporation of this data into a Geographic Information System will allow us to map plume boundaries and inform dispersion models. We are currently exploring Positive Matrix Factorization and Multivariate Curve Resolution – Alternating Least Squares to identify sources contributing at each sampling location and apportion their relative contributions to ambient air samples.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c9em00439d
This journal is © The Royal Society of Chemistry 2020