Vanessa L. Speight*, Stephen R. Mounce and Joseph B. Boxall
Department of Civil and Structural Engineering, University of Sheffield, Sir Frederick Mappin Building, Mappin Street, Sheffield, S1 3JD, UK. E-mail: v.speight@sheffield.ac.uk
First published on 27th February 2019
Understanding the processes and interactions occurring within complex, ageing drinking water distribution systems is vital to ensuring the supply of safe drinking water. While many water quality samples are taken for regulatory compliance, the resulting data are often simply archived rather than being interrogated for deeper understanding due to their sparse nature across time and space and the difficulties of integrating with other data sources. This paper opens a new direction of research into distribution system water quality by mining large, historical drinking water quality datasets using machine learning techniques, in this case self-organizing maps (SOMs). Application of the methodology to national-scale datasets from three different UK water companies demonstrates the ability to identify the dominant mechanisms of iron release. Factors leading to discolouration such as low disinfectant residual, nitrification, and corrosion of unlined cast iron mains were identified at scales ranging from city to country, thereby enabling targeted interventions to ensure drinking water quality.
Water impact
This paper advances the management of drinking water quality in distribution systems through identification of dominant causes of discolouration. Using machine learning approaches, historical water quality databases are mined to identify site-specific relationships as well as to compare findings across scales ranging from city to region to country. The results support targeted interventions and appropriate investment to ensure water quality.
There is a lack of practicable mechanistic models for prediction of WDS water quality, so utilities rely on analysis of field data to prioritize interventions. But given the sparsity of the data collected, it is difficult to understand trends and relationships among dozens of interacting variables and to create actionable information to support decisions. For large utilities operating multiple systems, such as those in the UK, prioritizing interventions is further complicated by the site-specific nature of the data collected, driving the need for globally applicable tools that can incorporate local data and identify dominant mechanisms leading to complex water quality outcomes like discolouration. Currently decisions about WDS interventions are heavily weighted to customer complaints and single noncompliant samples, leading to reactive management such as spot flushing that does not necessarily address the underlying causes of poor water quality. The advances now emerging in data analysis tools and techniques to work with incomplete and disparate data sources offer an opportunity to improve water quality understanding and support informed decision-making about interventions.
The characterization of discolouration in WDSs is complicated by two issues. First, primary iron corrosion is an extremely complex electro-, physico-, and bio-chemical process. Second, while primary corrosion can contribute material for subsequent mobilization, the mechanisms of release of iron and other metals into the bulk water are often unrelated to primary corrosion and are not well understood.5 Factors influencing iron release and discolouration include water chemistry, microbiology, pipe material, WDS configuration, and hydraulic conditions over the entire service life of each pipe.6
Qualitative conceptual models for corrosion scale formation and degradation provide insight into mechanisms but have not been used for prediction of iron release at a WDS scale.7 Empirical models for iron release focusing on yield of colour or iron have been developed8 but may not be widely applicable to other systems and water chemistries.
Changes in hydraulic conditions are known to mobilize material accumulated at the pipe wall, resulting in discolouration events and non-compliance with drinking water standards.9 Mechanistic modelling of the hydraulic influence on water quality is hampered not only by a lack of understanding of fundamental reactions but also by uncertainty in the exact state of the network at a given time, including flow rates, pipe condition, concentration of key parameters, and degree of microbiological activity. Discolouration modelling favours empirical approaches because of this underlying complexity.6
To derive knowledge about the influence of variables on a numerical or categorical quantity, regression or classification techniques can be useful; these supervised techniques learn a mapping from predictor variables to output(s) given some training data. However, if the relationships between variables are poorly understood then it may be useful to apply an unsupervised clustering or dimensionality-reduction method, which typically requires little prior knowledge about the data. A simple unsupervised approach to identifying the relationships between variables in a multivariate dataset is principal component analysis (PCA).11 However, PCA cannot handle missing values in the input matrix and can only describe linear relationships between variables, making this method ill-suited to WDS water quality sampling data with its many unsampled parameters and complex reaction mechanisms.
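The missing-value limitation can be illustrated with a minimal Python sketch. The sample values and the helper name `covariance_matrix` are hypothetical and are not taken from the study datasets. PCA starts from the covariance matrix of the input data, and a single unsampled parameter is enough to make part of that matrix undefined.

```python
import numpy as np

def covariance_matrix(samples):
    """Sample covariance matrix, the starting point of a standard PCA."""
    centred = samples - samples.mean(axis=0)
    return centred.T @ centred / (samples.shape[0] - 1)

# Hypothetical samples: columns are iron, manganese (mg/L) and turbidity (NTU).
complete = np.array([
    [0.10, 0.02, 0.5],
    [0.30, 0.05, 1.2],
    [0.20, 0.03, 0.9],
    [0.40, 0.08, 1.6],
])

# The same samples, with turbidity not measured in the second sample.
sparse = complete.copy()
sparse[1, 2] = np.nan

cov_complete = covariance_matrix(complete)
cov_sparse = covariance_matrix(sparse)

# Every covariance term involving turbidity is now NaN, so the
# eigendecomposition that PCA relies on is no longer meaningful.
print(np.isnan(cov_sparse))
```

In practice, rows or columns with gaps would have to be dropped or filled before PCA, which discards much of a sparse sampling dataset.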
Both the missing-values and nonlinearity limitations are overcome by the Kohonen self-organizing map (SOM),12 a clustering and dimensionality-reduction algorithm. During the construction of a SOM, multidimensional data points are arranged in a (usually two dimensional) representation so that similar data points are clustered together. When visualized, this representation allows non-linear relationships between variables to be identified. SOMs have been used for analysis and modelling of water resources, including applications such as river flow, rainfall-runoff and surface water quality.13 The application of SOMs has been demonstrated in water distribution system data mining for microbiological and physico-chemical data at laboratory-scale14 as well as clustering of water quality, hydraulic modelling, and asset data for a single water supply zone.15
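As an illustration of the general idea, a minimal Python sketch of SOM training follows. It is not the implementation used in this study. It assumes a rectangular grid (rather than the hexagonal lattice seen in the figures), a simple linear decay of learning rate and neighbourhood radius, and missing values encoded as NaN that are skipped in both the distance and the update step. The function names, default parameters and toy data are all hypothetical.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=50, seed=0):
    """Train a small rectangular SOM; NaN entries are ignored per sample."""
    rng = np.random.default_rng(seed)
    observed = ~np.isnan(data)
    filled = np.where(observed, data, 0.0)
    # Initialise reference vectors within the observed range of each variable.
    low, high = np.nanmin(data, axis=0), np.nanmax(data, axis=0)
    codebook = rng.uniform(low, high, size=(rows * cols, data.shape[1]))
    # Grid coordinates of each map unit.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs
        learning_rate = 0.5 * remaining + 0.01
        radius = 0.5 * max(rows, cols) * remaining + 0.5
        for i in rng.permutation(len(data)):
            # Differences on observed variables only (masked entries become 0).
            diff = (filled[i] - codebook) * observed[i]
            bmu = np.argmin((diff ** 2).sum(axis=1))
            # Gaussian neighbourhood around the best-matching unit on the grid.
            grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
            codebook += learning_rate * influence[:, None] * diff
    return codebook

def best_matching_unit(sample, codebook):
    """Index of the unit closest to the sample on its observed variables."""
    mask = ~np.isnan(sample)
    diff = (np.where(mask, sample, 0.0) - codebook) * mask
    return int(np.argmin((diff ** 2).sum(axis=1)))

# Two hypothetical, well-separated groups of samples with some missing values.
cluster_a = np.array([[0.1, 0.0, 0.2], [0.0, 0.2, 0.1],
                      [0.2, 0.1, np.nan], [0.1, np.nan, 0.0]])
cluster_b = cluster_a + 10.0
data = np.vstack([cluster_a, cluster_b])

codebook = train_som(data)
print(best_matching_unit(data[0], codebook), best_matching_unit(data[4], codebook))
```

Samples from the two groups should map to different units, and an incomplete sample is still placed using whichever variables were measured.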
This paper demonstrates the application of SOMs to the complex multivariate problem of discolouration within WDS and synthesizes results from studies using country- and city-wide datasets from three UK water companies. The aim of this work is to develop and demonstrate a methodology that can overcome the current limitations of WDS water quality data to derive knowledge from historical datasets about complex processes so that utilities can make defensible decisions about maintenance and related interventions and move from a reactive to proactive maintenance strategy. Iron release and more generally discolouration were the focus of the work given the importance of these parameters in drinking water compliance in the UK.
Three cities within Water Company B were analysed in detail. WDS1, which serves a population of 142000 people, is served by three different surface WTW using free chlorine as a secondary disinfectant with highly variable quantities of water produced at each WTW depending on localized customer demand. The resulting water quality for WDS1 is likely to be highly variable at a given location because of the daily fluctuation in source water proportions. WDS2 serves 199000 people predominantly from a single surface water WTW (free chlorine secondary disinfectant) but with a mix of groundwater sources contributing some variability. WDS3, with a population of 261000 people, is fully served by two nearly identical WTWs using the same surface water source (chloramine secondary disinfectant) so would be expected to have the least variability due to source water quality. WDS water quality data from the period January 2009 to December 2013, along with pipe asset data and hydraulic model results, were obtained for the study.
Water Company C serves approximately 3.1 million customers from dozens of systems with 63 WTW and more than 17000 miles of pipe. Water quality sampling data from January 2008 to September 2014 along with pipe characteristics at the district metering area (DMA) level were compiled for the evaluation (1312 DMAs in total). Hydraulic model results and a detailed pipe asset database from GIS were not available for Company C.
Fig. 1 SOM for Company A, labelled with region and secondary disinfectant. See Table 1 for variable definitions.
Fig. 2 SOM and labelled component planes for pipe material (enlarged for visibility) for WDS1 from Company B. See Table 1 for variable definitions.
Fig. 3 SOM for Company C. See Table 1 for variable definitions.
Fig. 4 SOM for WDS3 in Company B, labelled with pipe material (enlarged for visibility). See Table 1 for variable definitions.
To interpret the SOMs, the plots were visually inspected to identify common patterns at the same physical location within different component and labelled planes, which indicate correlations between those variables. For example, if a red cluster for one variable occurs in the upper left corner of its component plane, a yellow cluster for a second variable in the upper left corner of that second variable's component plane can be considered to be correlated with the first variable's red cluster because it appeared at the same location. In this way, these plots can be used to identify and locate a variety of different behaviours related to discolouration as illustrated in the results to follow.
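The interpretation in this study was visual. As a supplementary illustration only, the same idea of co-located clusters can be expressed numerically by comparing component planes unit by unit. The sketch below assumes a hypothetical 2 × 2 map whose reference vectors (codebook) hold iron, turbidity and free chlorine values; the numbers and function names are invented for the example.

```python
import numpy as np

def component_plane(codebook, variable_index, rows, cols):
    """Values of one variable across the map, arranged on the map grid."""
    return codebook[:, variable_index].reshape(rows, cols)

def plane_correlation(codebook, i, j):
    """Pearson correlation between two component planes across map units."""
    return float(np.corrcoef(codebook[:, i], codebook[:, j])[0, 1])

# Hypothetical codebook: one row per map unit; columns are
# iron (mg/L), turbidity (NTU) and free chlorine (mg/L).
codebook = np.array([
    [0.05, 0.3, 0.8],
    [0.10, 0.6, 0.6],
    [0.20, 1.2, 0.4],
    [0.40, 2.4, 0.2],
])

iron_plane = component_plane(codebook, 0, rows=2, cols=2)
r_iron_turbidity = plane_correlation(codebook, 0, 1)
r_iron_chlorine = plane_correlation(codebook, 0, 2)
print(iron_plane)
print(r_iron_turbidity, r_iron_chlorine)
```

High values at the same map locations give a positive coefficient (iron and turbidity here), whereas high values opposite low values give a negative one (iron and chlorine). A single coefficient summarizes the whole map, so it cannot replace visual inspection for localized clusters.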
In this case, the SOM showed a very strong correlation over large spatial and temporal scales (all samples nationwide, multiple years) between iron, manganese and turbidity. Given the extent of local variability amongst all of the systems across Company A, the strength of this correlation is perhaps surprising but at this scale of analysis, a few nonconforming samples will be difficult to detect. The lack of correlation between high iron and low chlorine or chloramine is also perhaps counter-intuitive. Again, small local variations are not likely visible in this SOM but the SOM does indicate that low disinfectant residual is not a predominant global factor leading to high iron and discolouration for Company A. However, conclusions about individual pipes or service areas cannot be drawn using this large scale SOM, which is better used as a screening tool to point towards further areas for analysis such as the south region chloraminated systems.
There do not appear to be strong correlations between high iron and water age or velocity parameters in Fig. 2, likely because of the complicated hydraulics within WDS1 with multiple water sources and operational configurations that change the flow direction on a regular basis. Furthermore, the velocity within the pipe where the sample is collected does not fully capture the journey through the WDS that the water has taken, which may have included high velocity pipes where iron has been mobilised upstream of the sample location.
For WDS1 at Company B, there is evidence of a correlation between PVC pipe and high iron but only a small proportion of the unlined cast iron pipe appears to be correlated to elevated iron. This result indicates that a focus on cast iron pipes as the intervention to solve discolouration problems in this system may not be warranted. Furthermore, the link with high temperature and low chlorine in this case may indicate high chlorine demand and microbiological activity due to seasonal water quality changes, although this cannot be conclusively determined with the input data used in this SOM. Transport of iron from upstream WTW or unlined CI sources seems more likely than cast iron pipe deterioration as the cause of discolouration for WDS1.
For Company C, the SOM analysis was performed at the DMA level and the effect of pipe material was of particular interest. Fig. 3 shows the number of iron failures per DMA during the historical data period rather than iron concentration. This SOM has three clusters of elevated iron failures that are correlated with turbidity and manganese failures. The cluster of iron failures at the top of the component plane corresponds to DMAs with a high percentage of unlined cast iron pipe, while the other two clusters of iron failures correlate with a high percentage of other pipe material, which is predominantly plastic for Company C. Thus this example also illustrates that cast iron pipes are not necessarily the direct cause of discolouration in all DMAs.
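The failure counts used as SOM inputs for Company C can be thought of as a simple aggregation of sample results by DMA. The sketch below is illustrative only: the DMA identifiers and concentrations are invented, and the limit of 0.2 mg/L is an assumed parameter for the example rather than a value stated in this paper.

```python
def count_failures_by_dma(samples, limit):
    """Count samples above the limit for each DMA.

    samples: iterable of (dma_id, concentration) pairs.
    DMAs with no exceedance are kept with a count of zero.
    """
    counts = {}
    for dma_id, concentration in samples:
        counts.setdefault(dma_id, 0)
        if concentration > limit:
            counts[dma_id] += 1
    return counts

# Hypothetical iron results (mg/L) tagged with a DMA identifier.
iron_samples = [
    ("DMA-001", 0.05),
    ("DMA-001", 0.25),
    ("DMA-002", 0.10),
    ("DMA-001", 0.31),
    ("DMA-003", 0.22),
]

ASSUMED_IRON_LIMIT = 0.2  # mg/L, assumption for illustration only
iron_nfails = count_failures_by_dma(iron_samples, ASSUMED_IRON_LIMIT)
print(iron_nfails)  # {'DMA-001': 2, 'DMA-002': 0, 'DMA-003': 1}
```

Each DMA then contributes one input record to the SOM, combining its failure counts with its pipe material percentages.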
Much of the unlined cast iron pipe (green in the pipe material labelled component plane in Fig. 4) appears to align with these high iron clusters. Some of the high iron clusters (bottom left of planes) correlate with high nitrite, low total chlorine and elevated temperature, showing the link between nitrification, unlined cast iron pipe, and high iron.
Similar results were seen for Company A in a SOM produced to explore chloraminated systems in the south region (Fig. 5). In this SOM output plot, a strong correlation between high nitrite and high iron can be seen. These clusters are linked to high turbidity. A few clusters are also correlated with high retention time in service reservoirs (middle of component planes). One cluster of slightly elevated iron appears to be correlated with higher WTW organic carbon and higher WTW turbidity (lower right of component planes). The iron clusters correlate with high percentage of cast iron or spun iron pipe in most cases with the exception of the moderate iron cluster at the bottom of the component plane, which is more strongly correlated with high percentage of ductile iron pipe.
Fig. 5 SOM for south region, chloraminated systems in Company A. See Table 1 for variable definitions.
The strong relationship between nitrification indicators and high iron for certain unlined cast iron pipes in Fig. 4 and 5 demonstrates the impact of this phenomenon at two different scales (regional and city). This finding agrees with previous studies that have shown scale destabilization to occur when there are low oxidant concentrations,8 an increase in microbiological activity, and nitrification.16 However, not all of the nitrification is associated with increased metals or turbidity. These findings emphasize the need for good nitrification management as a discolouration intervention, including total organic carbon reduction in the treated water, residual disinfectant management and active control of water circulation.
Variable (units) | Short form variable name | Data source | Included in SOM analysis: Company A | Included in SOM analysis: Company B | Included in SOM analysis: Company C | Comments
---|---|---|---|---|---|---
Iron (mg L−1) | Fe iron conc., Fe | Water quality | Y | Y | Y | Routine WDS measurements plus additional samples to investigate events |
Manganese (mg L−1) | Mn, Mngs | Water quality | Y | Y | Y | Routine WDS measurements plus additional samples to investigate events |
Turbidity (NTU) | Turbidity, Turb | Water quality | Y | Y | Y | Routine WDS measurements plus additional samples to investigate events |
Aluminium (μg L−1) | Al | Water quality | N | Y | N | Routine WDS measurements plus additional samples to investigate events |
Nitrate (mg L−1) | NO3− | Water quality | N | Y | N | Routine WDS measurements plus additional samples to investigate events |
Nitrite (mg L−1) | NO2−, nitrite | Water quality | Y | Y | N | Routine WDS measurements plus additional samples to investigate events, chloraminated systems only |
Free chlorine (mg L−1) | Free Cl | Water quality | Y | Y | Y | Routine WDS measurements plus additional samples to investigate events |
Total chlorine (mg L−1) | Total Cl, Tot Cl | Water quality | Y | Y | Y | Routine WDS measurements plus additional samples to investigate events |
pH | pH | Water quality | N | Y | Y | Routine WDS measurements plus additional samples to investigate events |
Temperature (°C) | Temperature | Water quality | N | Y | Y | Routine WDS measurements plus additional samples to investigate events |
Iron failures | Iron_Nfails | Calculated | N | N | Y | Calculated number of regulatory failures by DMA |
Manganese failures | Mang_Nfails | Calculated | N | N | Y | Calculated number of regulatory failures by DMA |
Turbidity failures | Turb_Nfails | Calculated | N | N | Y | Calculated number of regulatory failures by DMA |
Average WTW iron (mg L−1) | Fe AVE | Water quality | Y | N | N | Annual average in finished water at WTW |
Average WTW manganese (mg L−1) | Mngs AVE | Water quality | Y | N | N | Annual average in finished water at WTW |
Average WTW total organic carbon (mg L−1) | Tot org carbon AVE | Water quality | Y | N | N | Annual average in finished water at WTW |
Pipe internal diameter (mm) | Internal diameter | Asset database | Y | Y | N | Nominal value used for Company A, actual value used for Company B |
Condition code | Cond. code | Asset database | N | Y | N | Derived from asset management system based on maintenance history, inspection data, and other related information, values from 1 (good condition) to 5 (poor condition) |
Pipe material | Pipe material | Asset database | N | Y | N | Categorised into 8 groupings of similar material for Company B: iron unlined (FeU), iron lined (FeL), polyvinyl chloride (PVC), polyethylene (PE), steel (ST), concrete (CONC), lead (PB), and unknown (?) |
Percentage of cast iron pipe in DMA | Pc CI – cast Fe, iron_unlined | Asset database | Y | N | Y | Pipe asset data, considered to be unlined pipe |
Percentage of ductile iron pipe in DMA | Pc DI – ductile Fe, iron_lined | Asset database | Y | N | Y | Pipe asset data, considered to be lined pipe |
Percentage of spun iron pipe in DMA | Pc SI – spun Fe | Asset database | Y | N | N | Pipe asset data, considered to be unlined pipe; not differentiated from cast iron for Company C |
Percentage of non-metallic pipe in DMA | Other_material | Asset database | N | N | Y | Pipe asset data |
Minimum velocity (m s−1) | Min vel. (avg demand) | Hydraulic model | N | Y | N | Samples linked to pipes by asset ID, minimum value over average hydraulic simulation with 24-hour demand patterns |
Maximum velocity (m s−1) | Max vel. (avg demand) | Hydraulic model | N | Y | N | Samples linked to pipes by asset ID, maximum value over average hydraulic simulation with 24-hour demand patterns |
Minimum water age (h) | Min age (avg demand) | Hydraulic model | N | Y | N | Samples linked to pipes by asset ID, minimum value over average hydraulic simulation with 24-hour demand patterns |
Maximum water age (h) | Max age (avg demand) | Hydraulic model | N | Y | N | Samples linked to pipes by asset ID, maximum value over average hydraulic simulation with 24-hour demand patterns |
Service reservoir retention (h) | Cur SR store retention h | Standalone data | Y | N | N | Company A, internal calculations of retention time within reservoirs (tanks) |
Pipe age | Pipe age | Calculated | N | Y | N | Based on pipe installation date |
When plotting component planes, the SOM Toolbox by default sets each colour bar scale to the numerical range of the corresponding dimension of the set of reference vectors. Under these conditions, the SOM for the analysis of each individual WDS would have a different value assigned to each colour to reflect the range in values within the given dataset only, making it difficult to compare results from different study areas. Additionally, outliers within the input data can skew the colour shading within the SOM component planes, with a resulting loss of detail for the values closer to the median. To ensure that the analysis produced consistent results over all study areas, the mapping from numerical input data ranges to colour bar ranges was standardized using the 5th (low value, blue colour) and 95th (high value, red colour) percentile for the combined dataset across all study areas (‘reference ranges’). Outliers beyond those values did not need to be removed from the datasets; instead they are shown in the high or low value colour. Retention of outliers in the analysis of water quality is desirable in that they may characterize atypical water quality events that are of interest. The size and number of hexagons making up the component plane are a function of the size of the input dataset and the strength of the clustering relationships, so the output plots look slightly different across the analyses.
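The reference-range standardization described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the MATLAB code used in the study: the function names are hypothetical, the combined dataset is a synthetic sequence, and colours are represented as a 0 to 1 scale (0 for the low value colour, 1 for the high value colour).

```python
import numpy as np

def reference_range(combined_values, low_pct=5.0, high_pct=95.0):
    """Percentile-based colour bar limits from the combined dataset (NaN ignored)."""
    low, high = np.nanpercentile(combined_values, [low_pct, high_pct])
    return float(low), float(high)

def to_colour_scale(values, low, high):
    """Map values to 0..1; outliers saturate at the end colours rather than being removed."""
    scaled = (np.asarray(values, dtype=float) - low) / (high - low)
    return np.clip(scaled, 0.0, 1.0)

# Synthetic stand-in for one variable pooled across all study areas.
combined = np.arange(101, dtype=float)  # 0, 1, ..., 100
low, high = reference_range(combined)

# Values from one study area, including outliers beyond the reference range.
colours = to_colour_scale([0.0, 5.0, 50.0, 95.0, 200.0], low, high)
print(low, high)  # 5.0 95.0
print(colours)    # [0.  0.  0.5 1.  1. ]
```

Because the limits come from the combined dataset, the same value maps to the same colour in every study area, and the outlier of 200 is retained but drawn in the high value colour.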
The SOMs presented in this study were generated using the Imputation SOM algorithm17 as implemented in the SOM Toolbox v2.1 for MATLAB,18 which provides a robust handling of missing values in the training dataset compared to the standard algorithms.
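To give a sense of how a trained map can tolerate gaps in a sample, the Python sketch below finds the best-matching unit using only the measured variables and then reads the unmeasured values from that unit's reference vector. This is a generic illustration and should not be taken as the Imputation SOM algorithm itself or as the SOM Toolbox interface; the function name, codebook and sample are hypothetical.

```python
import numpy as np

def impute_from_bmu(sample, codebook):
    """Fill NaN entries of a sample from its best-matching unit's reference vector."""
    sample = np.asarray(sample, dtype=float)
    observed = ~np.isnan(sample)
    # Distances are computed on observed variables only.
    diff = (np.where(observed, sample, 0.0) - codebook) * observed
    bmu = int(np.argmin((diff ** 2).sum(axis=1)))
    completed = np.where(observed, sample, codebook[bmu])
    return completed, bmu

# Hypothetical two-unit codebook: columns are iron (mg/L) and turbidity (NTU).
codebook = np.array([
    [0.05, 0.3],
    [0.40, 2.4],
])

# A sample with iron measured but turbidity missing.
completed, bmu = impute_from_bmu([0.35, np.nan], codebook)
print(bmu, completed)  # 1 [0.35 2.4 ]
```

The measured iron value is left unchanged, and only the missing turbidity entry is taken from the closest unit.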
Cast iron pipes are often the focus of a discolouration analysis and, as with most UK water companies and many internationally, the companies in this study have extensive quantities of unlined cast iron and similar pipes in service. This study demonstrates that not all unlined cast iron pipes were associated with high iron, meaning that some pipes are performing well despite their age and/or condition. Understanding that the dominant mechanism of discolouration risk is not necessarily the deterioration of the cast iron pipes themselves allows for appropriate interventions to be selected and could avoid unnecessary rehabilitation or replacement of cast iron pipe. For Company C, the ability to identify a few high-risk DMAs where high iron is associated with unlined cast iron pipe has successfully directed their intervention strategies.19,20
There are many different mechanisms that can result in iron release and discolouration in a WDS but not all will occur to the same extent in every system over space and time so identifying the dominant mechanism(s) is a key research and WDS management need. In this study, the data-driven analysis techniques provide insight into the dominant mechanisms influencing iron release and discolouration for each of the systems. The method has proven to be particularly robust yet flexible, given that each water company had different types of data available for the analysis and considered different questions related to discolouration over different spatial scales and time periods. The method has captured system-specific factors and has facilitated comparisons between systems and regions by using different sets of input variables selected at different groupings and scales.
Interpretation of the SOM output plots is manual and subjective so this method is not suited to all types of historical data mining analyses. However, within this study the interpretation task was found to be a valuable opportunity for operational staff to derive a deeper understanding of system performance and allowed for their expert knowledge to be incorporated. Most importantly, the concept of extracting value and knowledge from historical water quality data, which has taken considerable effort and expense to collect but is frequently archived and forgotten, needs to be embraced across the water sector so further data mining research in this field should be pursued.
This journal is © The Royal Society of Chemistry 2019