Mahshid S. Z. Farzanehsa (a), Guido Carvajal (b), John Mieog (c) and Stuart J. Khan (*a)
(a) School of Civil and Environmental Engineering, University of New South Wales, Sydney, NSW, Australia. E-mail: s.khan@unsw.edu.au
(b) Facultad de Ingeniería, Universidad Andrés Bello, Antonio Varas 880, Providencia, Santiago, Chile
(c) Melbourne Water, 990 La Trobe St, Docklands, Victoria 3008, Australia
First published on 9th December 2022
Continuous online monitoring of water treatment process performance is an essential step in ensuring reliable water quality outcomes. In particular, it is important to ensure effective removal of microbial substances during advanced wastewater treatment processes. However, most microbial indicators cannot be continuously monitored by online processes. Therefore, it is necessary to monitor treatment process performance based on surrogate measures which can be reliably and continuously monitored. For example, water quality data such as colour, turbidity and chemical oxygen demand (COD) can be measured quickly and easily. In this study, a combined ozonation–biological media filtration process (O3/BMF) was used to reduce microbial indicator concentrations. After gathering water quality data and corresponding microbial indicator concentrations, we applied machine learning to develop models for predicting the amount of change in microbial indicator concentration following O3/BMF treatment. Three microbial indicators were studied, namely Clostridium perfringens, E. coli, and somatic coliphage. The most effective physico-chemical predictors for the removal of these microbial indicators were determined by means of mutual information. Associations between changes in the predictors' concentrations during O3/BMF and the reduction of the microbial indicators were identified using a range of supervised learning algorithms including Naïve Bayes, random forest, support vector machines and generalised linear models. The impact of the type of prediction algorithm on prediction accuracy was investigated and the superior classifier was determined. Performance measures for microbial removal prediction were found to be superior for the support vector machines (SVM) classifier. Using SVM with a Gaussian kernel classifier, prediction accuracy for all microbial removals was above 75%. Moreover, other performance measures such as area under curve (AUC) and kappa statistic (KS) were higher for SVM than for the other applied classifiers (AUC ≥ 0.80; KS ≥ 0.34). From this study, we have identified an objective and efficient method that can predict the effectiveness of the O3/BMF process in removing the three microbial indicators in water from a short list of commonly measured physico-chemical parameters.
Water impact: To ensure water quality, online monitoring of pathogen removal is very important, but it is not possible with current technologies. Therefore, it is essential to find surrogate measures that can be easily monitored online. In this study, we have investigated machine learning techniques to predict microbial removal through surrogates such as water quality data in ozonation and biofiltration processes.
Current international guidelines for water recycling promote a risk management approach for the control of hazards with a primary emphasis on pathogens because of their potential acute, severe and widespread impacts.5–7 These guidelines draw principles from Hazard Analysis Critical Control Point (HACCP) standards for the monitoring and control of hazards across a multi-barrier system. The performance of treatment barriers is evaluated through validation, which is the process of ensuring that the system can effectively control the hazards. For pathogens, this performance is measured by log10 reduction values (LRV). The LRV expresses, on a logarithmic scale, how much a treatment process reduces the concentration of a pathogen. In mathematical form,
$\mathrm{LRV} = \log_{10}\left(C_{\mathrm{in}}/C_{\mathrm{out}}\right)$,
where $C_{\mathrm{in}}$ and $C_{\mathrm{out}}$ are the pathogen concentrations before and after treatment.
An LRV equal to 1 means that the pathogen is reduced by 90% from its initial value. The monitoring of critical control points requires online parameters indicative of the hazard removal performance of the process. For ozonation and biological media filtration (BMF), however, the water matrix characteristics play a crucial role in disinfection effectiveness, which has limited the use of indicators such as ozone dose and CT (disinfectant concentration × contact time), applied elsewhere in water treatment.8–10 This limitation has impeded the formal attribution of LRV credits to ozonation and BMF for wastewater in many jurisdictions. LRV credits have been awarded to post-ozonation at Melbourne Water's Eastern Treatment Plant (ETP) (the subject of this study) on the basis of CT disinfection. Post-ozonation CT disinfection is only possible due to the pre-treatment provided by pre-ozone and BMF, primarily through satisfying ozone demand. There are, however, reported studies on the use of alternative indicators for monitoring microbial removal performance, including bromate formation, O3:TOC ratio, and UVA reduction.11–13 These studies have focussed on ozonation under ideal conditions, for example by removing suspended solids and using benchtop reactors.11 The ability to obtain LRV credits for ozonation/BMF could translate into major cost savings for wastewater recycling projects by reducing the need for additional treatment processes such as high-energy photochemical processes, or by reducing chemical consumption for downstream disinfection processes. Alternatively, obtaining additional LRV credits for a treatment system can increase the resilience of recycled water production in the event of sub-optimal performance of one or more pathogen reduction barriers.
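The relationship between an LRV and the percentage removed can be checked numerically. The following is a minimal sketch, assuming the usual definition LRV = log10(C_in/C_out); the function names and the example concentrations are illustrative and do not come from the study.

```python
import math


def log_reduction_value(c_in: float, c_out: float) -> float:
    """Return log10(c_in / c_out) for positive concentrations (assumed LRV definition)."""
    if c_in <= 0 or c_out <= 0:
        raise ValueError("concentrations must be positive")
    return math.log10(c_in / c_out)


def percent_removed(lrv: float) -> float:
    """Convert an LRV into the percentage of organisms removed."""
    return 100.0 * (1.0 - 10.0 ** (-lrv))


# Hypothetical concentrations: a tenfold drop corresponds to LRV = 1.
example_lrv = log_reduction_value(1000.0, 100.0)
example_percent = percent_removed(example_lrv)
```

Under this definition each additional unit of LRV removes a further 90% of what remains, so LRV = 2 corresponds to 99% removal.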
Some previous studies have reported efforts to develop models for the prediction of microbial concentration based on water quality data and operational parameters. Using an on-line UV absorbance analyser, Gerrity (2012)11 developed a model to predict microbial inactivation during ozonation. Gamage (2013)14 used O3:TOC ratio, ΔUV254 and ΔTF to predict the inactivation of three surrogate microbes. Using a linear correlation for the prediction of microbial removal, Gamage reported that ΔUV254 and ΔTF were able to most effectively predict microbial inactivation in ozone/H2O2 systems. Gamage reported high variability in the prediction of E. coli concentration under different dosing conditions. However, traditional regression models might not be the optimum prediction tools for the complex relationships between microbial removal and operational or water quality data. Recently, multivariate predictive models based on Naïve Bayes have been developed to predict disinfection by-product (DBP) concentrations in drinking water streams.15 Although a few studies have investigated the capacity of Naïve Bayes to predict the LRV performance for pathogens in wastewater streams,16 there are still limited studies on the use of prediction tools for on-line monitoring of water treatment process performance.
The aim of this paper was to evaluate the use of several water quality and operational parameters as surrogates for the removal efficiency of microbial indicators during full-scale ozonation and biological media filtration of secondary treated wastewater. This research assessed previously reported indicators including colour and UVA. The key outcome is the presentation of a method for evaluating useful predictive variables through mutual information and calculating microbial LRVs using these predictors. This outcome is significant because it allows on-line monitoring to indicate the likely presence or removal of microbial substances in water samples without requiring cumbersome or time-consuming microbial measurements.
LRV observations from the data show that the three microbial indicators (Clostridium perfringens, E. coli, and somatic coliphage) were successfully reduced through pre-ozonation and biological media filtration. Fig. 2 shows the average reduction of each microbial indicator after pre-ozonation and biological media filtration (sample point 2).
Fig. 2 Log reduction of the three microbial indicators after pre-ozonation and biological media filtration.
Water quality data from sample points 1 and 2 were also measured for a range of parameters, including microbial indicators, suspended solids (SS), alkalinity, nitrite, nitrate, ammonia, ultraviolet transmittance (UVT), and colour.
The modelling process applied here incorporates a range of statistical methods. These are summarised in Table 1, together with common terminology, acronyms and abbreviations.
Abbreviation | Meaning | Explanation |
---|---|---|
PA | Prediction accuracy | Quantifies the number of correctly predicted values divided by the total number of cases |
PE | Prediction error | Quantifies the number of incorrectly predicted values divided by the total number of cases |
KS | Kappa statistic | Measures the agreement between model predictions and actual values as a metric in the range [−1, 1]. KS = 1 means perfect agreement, KS = 0 means that agreement is equal to chance, and KS = −1 means “perfect” disagreement |
AUC | Area under the curve for the receiver operating characteristic curve (ROC) | AUC ranges between 0 and 1, where 1 represents perfect matching, 0.5 reflects totally random models, and <0.5 indicates models generating predominantly inaccurate predictions |
TPR | True positive rate | Rate of correct positive predictions (high reductions) |
FPR | False positive rate | Failure to detect low reductions when they occurred |
TNR | True negative rate | Rate of correct negative predictions (low reductions) |
FNR | False negative rate | Failure to detect high reductions when they occurred |
NB | Naïve Bayes | Probabilistic classifier based on Bayes theorem |
GLM | Generalised linear model | Conventional linear regression models for a continuous response variable given continuous and/or categorical predictors |
RF | Random forest | Ensemble classifier that combines the predictions of many decision trees |
SVM/bn | Support vector machine/binary | SVM algorithm can find a hyperplane in an N-dimensional space that distinctly classifies the data points using binary kernel function |
SVM/GK | Support vector machine/Gaussian kernel | SVM algorithm that uses Gaussian kernel |
SVM/PK | Support vector machine/polynomial kernel | SVM algorithm that uses polynomial kernel |
SVM/rbf | Support vector machine/radial basis function | SVM algorithm that uses radial basis function as the kernel function |
In the following sections, “features” denote the system parameters which can be easily and continuously monitored (e.g., operational parameters and physico-chemical parameters with online detection). The features included: amount of change (pre O3/BMF − post O3/BMF) in total organic carbon (TOC), UV absorbance, UV transmittance, suspended solids, pH, alkalinity, colour, ammonia, nitrate and nitrite concentrations. “Outcome” denotes the amount of change (pre O3/BMF − post O3/BMF) in microbial indicators, which are more difficult and laborious to continuously measure. The features are used to make a prediction model, which can predict the outcomes through a supervised learning process.
In this study we used the minimal-redundancy-maximal-relevance algorithm (mRMR) developed by Peng (2005)23 because of its advantages in terms of both feature selection complexity and feature classification accuracy. The focus of this method is on mutual-information-based feature selection. The mutual information between two random variables X and Y is defined based on their entropies and their probability density functions, as shown in eqn (1(a)) to (1(c)):
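$I(X, Y) = H(X) - H(X|Y)$ | (1(a)) |
$H(X) = -\int p(x)\log p(x)\,dx, \quad H(X|Y) = -\iint p(x, y)\log p(x|y)\,dx\,dy$ | (1(b)) |
And thus from eqn (1(a)) and (1(b)):
$I(X, Y) = \iint p(x, y)\log\dfrac{p(x, y)}{p(x)\,p(y)}\,dx\,dy$ | (1(c)) |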
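For discretised data the integrals become sums, and the entropy form of eqn (1(a)) can be checked against the standard joint-distribution form of mutual information, the sum over x and y of p(x, y) log[p(x, y)/(p(x)p(y))]. The sketch below is illustrative only: it assumes discrete inputs and base-2 logarithms, and the function names are hypothetical rather than taken from the study's MATLAB code.

```python
import numpy as np


def joint_distribution(x, y):
    """Empirical joint probability table of two discrete sequences."""
    _, xi = np.unique(np.ravel(x), return_inverse=True)
    _, yi = np.unique(np.ravel(y), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)  # count co-occurrences
    return joint / joint.sum()


def entropy(p):
    """Shannon entropy in bits of a probability array (zero cells are skipped)."""
    p = np.ravel(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))


def mi_from_entropies(x, y):
    """I(X, Y) = H(X) - H(X|Y), using H(X|Y) = H(X, Y) - H(Y)."""
    joint = joint_distribution(x, y)
    h_x = entropy(joint.sum(axis=1))
    h_y = entropy(joint.sum(axis=0))
    h_x_given_y = entropy(joint) - h_y
    return h_x - h_x_given_y


def mi_from_joint(x, y):
    """I(X, Y) as the sum of p(x, y) * log2[p(x, y) / (p(x) p(y))]."""
    joint = joint_distribution(x, y)
    outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))
```

Two identical binary sequences share one bit of information under either form, while two independent ones share none.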
The mRMR method seeks to maximise the relevance of a feature set for a specific class and minimise the redundancy of all features in the feature set. Relevance is defined by the average value of all mutual information (MI) values between the individual feature (xj) and the specific class (c). This is shown in formula (2), where (Sm−1) is a feature set with m − 1 features. The task is to select the mth feature from the set {X − Sm−1}. This is done through an incremental search method by selecting the feature that maximizes the condition inside the square brackets. Redundancy is the average value of all MI values (denoted with I in formula (1)) between the individual feature and every other feature in the set (xi).
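$\max_{x_j \in X - S_{m-1}} \left[ I(x_j, c) - \dfrac{1}{m-1} \sum_{x_i \in S_{m-1}} I(x_j, x_i) \right]$ | (2) |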
Using the mRMR method, for each outcome variable (i.e., microbial indicator concentration), the top features were determined from a random 80% of the data and used for training the prediction model in the next step. Assessment of the number of features' effect on prediction accuracy showed that selection of the top four features resulted in the best performance (ESI† A).
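The incremental search described above can be sketched as a greedy loop: start from the most relevant feature, then repeatedly add the candidate with the largest relevance minus mean redundancy. This is a minimal illustration assuming discretised features and class labels; `mrmr_select` and the toy data are hypothetical and are not the study's implementation.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (bits) between two discrete sequences."""
    _, xi = np.unique(np.ravel(x), return_inverse=True)
    _, yi = np.unique(np.ravel(y), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))


def mrmr_select(features, target, n_select):
    """Greedy mRMR: return the column indices of the selected features."""
    features = np.asarray(features)
    n_features = features.shape[1]
    relevance = [mutual_information(features[:, j], target)
                 for j in range(n_features)]
    # With an empty selected set the criterion reduces to maximum relevance.
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(n_select, n_features):
        best_j, best_score = -1, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Mean mutual information with the features already chosen.
            redundancy = np.mean([mutual_information(features[:, j], features[:, i])
                                  for i in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected


# Toy data: columns 0 and 1 are duplicates, column 2 carries new information.
a = np.array([0, 0, 1, 1] * 2)
b = np.array([0, 1, 0, 1] * 2)
toy_features = np.column_stack([a, a, b])
toy_target = 2 * a + b
toy_selection = mrmr_select(toy_features, toy_target, 2)
```

In the toy data all three columns are equally relevant to the target, but the duplicate column is penalised for redundancy, so the second pick is the column that adds new information.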
For an n-element vector X, quantiles were computed by using a sorting-based algorithm as follows:
1. The sorted elements in X are taken as the (0.5/n), (1.5/n), …, ([n − 0.5]/n) quantiles. For example:
• For a data vector of five elements such as {6, 3, 2, 10, 1}, the sorted elements {1, 2, 3, 6, 10} respectively correspond to the 0.1, 0.3, 0.5, 0.7, 0.9 quantiles.
• For a data vector of six elements such as {6, 3, 2, 10, 8, 1}, the sorted elements {1, 2, 3, 6, 8, 10} respectively correspond to the (0.5/6), (1.5/6), (2.5/6), (3.5/6), (4.5/6), (5.5/6) quantiles.
2. Linear interpolation was used to compute quantiles for probabilities between (0.5/n) and ([n − 0.5]/n).
3. For quantiles corresponding to probabilities outside that range, the minimum or maximum value of the elements in X was assigned.
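The three steps above can be sketched in a few lines. This is an illustrative re-implementation in Python rather than the MATLAB code used in the study; `sorted_quantile` is a hypothetical name, and `np.interp` is used because it interpolates linearly between the given positions and returns the end values outside them.

```python
import numpy as np


def sorted_quantile(x, probs):
    """Quantiles of x at the given probabilities using the sorting-based rule."""
    xs = np.sort(np.asarray(x, dtype=float))
    n = xs.size
    # Step 1: the i-th sorted element is the (i - 0.5)/n quantile.
    positions = (np.arange(n) + 0.5) / n
    # Steps 2 and 3: linear interpolation inside the range, min/max outside it.
    return np.interp(probs, positions, xs)


five = [6, 3, 2, 10, 1]
at_plotting_positions = sorted_quantile(five, [0.1, 0.3, 0.5, 0.7, 0.9])
outside_range = sorted_quantile(five, [0.0, 1.0])
```

For the five-element example in the text, the probabilities 0.1 to 0.9 return the sorted elements themselves, and probabilities of 0 and 1 return the minimum and maximum.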
As a visual example, Fig. 4 shows the histogram of log removal values for E. coli. The bin edges were set so that they divided the data into four quantiles, as explained above. The data that fell in the first quantile (first histogram bin) were labelled class 1, data in the second quantile were labelled class 2 and so on for classes 3 and 4.
Fig. 4 Histogram of log removal values for the outcome variable E. coli. Bin edges denote the boundaries of the four quantiles for discretisation.
For consistency, all the classification algorithms compared in this study used discretised data values.
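The discretisation into four classes can be sketched as follows. This is a hedged illustration, assuming the bin edges sit at the 0.25, 0.5 and 0.75 quantiles computed with the sorting-based rule described earlier; the function name and the example values are hypothetical.

```python
import numpy as np


def quartile_classes(values):
    """Label each value 1-4 according to the quartile bin it falls in."""
    v = np.asarray(values, dtype=float)
    xs = np.sort(v)
    positions = (np.arange(xs.size) + 0.5) / xs.size
    # Three interior bin edges split the data into four quantile bins.
    edges = np.interp([0.25, 0.5, 0.75], positions, xs)
    # np.digitize returns 0..3 for the four bins; shift to class labels 1..4.
    labels = np.digitize(v, edges) + 1
    return labels, edges


example_lrvs = [8, 1, 5, 3, 2, 7, 4, 6]  # hypothetical values
example_labels, example_edges = quartile_classes(example_lrvs)
```

With eight distinct values, two observations fall in each class, which is the balanced split that quantile-based edges are meant to produce.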
The premise is to train a computer model by giving it features along with their corresponding outcomes, such that the model can later predict an outcome based on input features only. The algorithms to train and test the computer model were developed in MATLAB (version R2019b). The computer algorithms find patterns between features that are exclusively associated with the outcomes.
While these findings are noteworthy and useful as a guideline to the influence of changes in physico-chemical predictors on changes in microbial indicator concentration, it should be noted that the predictors are surrogate measures of the treatment performance and that there is not necessarily a direct relationship between the measurement and the presence of the microbial indicator.
Many more hidden and intricate associations may exist between these variables that cannot be detected from simple assessment of the data. Pattern recognition and sophisticated machine learning algorithms were used to detect those associations and to predict the microbial LRV due to O3/BMF solely from measurements of the four predictors before and after the water treatment process.
Pathogen | Performance measure | NB | GLM | RF | SVM/bn | SVM/GK | SVM/PK | SVM/rbf |
---|---|---|---|---|---|---|---|---|
Clostridium perfringens LRV | PA | 0.73 ± 0.03 | 0.92 ± 0.00 | 0.77 ± 0.03 | 0.76 ± 0.03 | 0.78 ± 0.02 | 0.75 ± 0.02 | 0.77 ± 0.03 |
Clostridium perfringens LRV | PE | 0.27 ± 0.03 | 0.08 ± 0.00 | 0.23 ± 0.03 | 0.24 ± 0.03 | 0.22 ± 0.02 | 0.25 ± 0.02 | 0.23 ± 0.03 |
Clostridium perfringens LRV | KS | 0.26 ± 0.10 | 0.00 ± 0.00 | 0.34 ± 0.09 | 0.32 ± 0.07 | 0.37 ± 0.06 | 0.31 ± 0.05 | 0.33 ± 0.09 |
Clostridium perfringens LRV | AUC | 0.81 ± 0.01 | | 0.79 ± 0.01 | 0.71 ± 0.04 | 0.80 ± 0.11 | 0.90 ± 0.01 | 0.76 ± 0.08 |
Clostridium perfringens LRV | TPR | 0.47 ± 0.07 | 0.00 ± 0.00 | 0.56 ± 0.07 | 0.54 ± 0.06 | 0.61 ± 0.05 | 0.49 ± 0.05 | 0.60 ± 0.08 |
Clostridium perfringens LRV | FPR | 0.18 ± 0.02 | 0.02 ± 0.00 | 0.14 ± 0.02 | 0.14 ± 0.02 | 0.13 ± 0.01 | 0.16 ± 0.02 | 0.13 ± 0.02 |
Clostridium perfringens LRV | TNR | 0.82 ± 0.02 | 0.98 ± 0.00 | 0.86 ± 0.02 | 0.86 ± 0.02 | 0.87 ± 0.01 | 0.84 ± 0.02 | 0.87 ± 0.02 |
Clostridium perfringens LRV | FNR | 0.53 ± 0.07 | 1.00 ± 0.00 | 0.44 ± 0.07 | 0.46 ± 0.06 | 0.39 ± 0.05 | 0.51 ± 0.05 | 0.40 ± 0.08 |
E. coli LRV | PA | 0.74 ± 0.02 | 0.93 ± 0.00 | 0.74 ± 0.03 | 0.75 ± 0.03 | 0.75 ± 0.02 | 0.75 ± 0.03 | 0.73 ± 0.03 |
E. coli LRV | PE | 0.26 ± 0.02 | 0.07 ± 0.00 | 0.26 ± 0.03 | 0.25 ± 0.03 | 0.25 ± 0.02 | 0.25 ± 0.03 | 0.27 ± 0.03 |
E. coli LRV | KS | 0.29 ± 0.07 | 0.00 ± 0.00 | 0.31 ± 0.07 | 0.34 ± 0.07 | 0.34 ± 0.06 | 0.32 ± 0.09 | 0.28 ± 0.08 |
E. coli LRV | AUC | 0.79 ± 0.01 | | 0.82 ± 0.01 | 0.73 ± 0.06 | 0.88 ± 0.01 | 0.70 ± 0.11 | 0.88 ± 0.01 |
E. coli LRV | TPR | 0.49 ± 0.05 | 0.00 ± 0.00 | 0.49 ± 0.05 | 0.50 ± 0.05 | 0.52 ± 0.05 | 0.50 ± 0.06 | 0.47 ± 0.06 |
E. coli LRV | FPR | 0.17 ± 0.02 | 0.02 ± 0.00 | 0.17 ± 0.02 | 0.17 ± 0.02 | 0.16 ± 0.02 | 0.17 ± 0.02 | 0.18 ± 0.02 |
E. coli LRV | TNR | 0.83 ± 0.02 | 0.98 ± 0.00 | 0.83 ± 0.02 | 0.83 ± 0.02 | 0.84 ± 0.02 | 0.83 ± 0.02 | 0.82 ± 0.02 |
E. coli LRV | FNR | 0.51 ± 0.05 | 1.00 ± 0.00 | 0.51 ± 0.05 | 0.50 ± 0.05 | 0.48 ± 0.05 | 0.50 ± 0.06 | 0.53 ± 0.06 |
Coliphage | PA | 0.75 ± 0.02 | 0.92 ± 0.00 | 0.75 ± 0.04 | 0.76 ± 0.02 | 0.78 ± 0.02 | 0.76 ± 0.02 | 0.76 ± 0.03 |
Coliphage | PE | 0.25 ± 0.02 | 0.08 ± 0.00 | 0.25 ± 0.04 | 0.24 ± 0.02 | 0.22 ± 0.02 | 0.24 ± 0.02 | 0.24 ± 0.03 |
Coliphage | KS | 0.33 ± 0.07 | 0.00 ± 0.00 | 0.34 ± 0.09 | 0.35 ± 0.06 | 0.41 ± 0.05 | 0.37 ± 0.06 | 0.37 ± 0.08 |
Coliphage | AUC | 0.83 ± 0.01 | | 0.85 ± 0.01 | 0.82 ± 0.04 | 0.91 ± 0.01 | 0.75 ± 0.14 | 0.92 ± 0.01 |
Coliphage | TPR | 0.49 ± 0.04 | 0.00 ± 0.00 | 0.51 ± 0.07 | 0.51 ± 0.04 | 0.56 ± 0.04 | 0.53 ± 0.05 | 0.53 ± 0.06 |
Coliphage | FPR | 0.17 ± 0.02 | 0.02 ± 0.00 | 0.16 ± 0.02 | 0.16 ± 0.01 | 0.15 ± 0.01 | 0.16 ± 0.01 | 0.16 ± 0.02 |
Coliphage | TNR | 0.83 ± 0.02 | 0.98 ± 0.00 | 0.84 ± 0.02 | 0.84 ± 0.01 | 0.85 ± 0.01 | 0.84 ± 0.01 | 0.84 ± 0.02 |
Coliphage | FNR | 0.51 ± 0.04 | 1.00 ± 0.00 | 0.49 ± 0.07 | 0.49 ± 0.04 | 0.44 ± 0.04 | 0.47 ± 0.05 | 0.47 ± 0.06 |
NB: Naïve Bayes, GLM: generalised linear model, RF: random forest, SVM/bn: support vector machines with binary kernel, SVM/GK: support vector machines with Gaussian kernel, SVM/PK: support vector machines with polynomial kernel, SVM/rbf: support vector machines with radial basis function kernel, LRV: log removal value, PA: prediction accuracy, PE: prediction error, KS: kappa statistic, AUC: area under curve, TPR: true positive rate, FPR: false positive rate, TNR: true negative rate, FNR: false negative rate.
The AUC score varies from 0–1, with 0.5 indicating a totally random model and 1 no error in prediction. AUC < 0.5 denotes models predicting erroneously most of the time. Values of 0.5–0.7 indicate poor classification performance; values of 0.7–0.9 indicate fair classification performance, and values higher than 0.9 indicate excellent classification performance. Prediction accuracy is calculated as the total number of correct predictions divided by the total number of cases. This metric ranges between 0 and 100% with higher values indicating better prediction. Cohen's kappa statistic measures the agreement between model predictions and actual values as a metric in the range [−1, 1] considering adjustment due to chance effects.32 Kappa = 1 means perfect agreement, kappa = 0 means that agreement is equal to chance, and kappa = −1 means “perfect” disagreement.32 Distinct levels of agreement in the range between 0 and 1 have been defined for kappa coefficient:33 <0.2 = slight; 0.2–0.4 = fair; 0.4–0.6 = moderate; 0.6–0.8 = substantial; and >0.8 = almost perfect.
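The count-based measures defined above can be computed directly from a binary confusion matrix. The sketch below is a minimal illustration, assuming labels coded as 1 (high reduction) and 0 (low reduction); the function name and the example labels are hypothetical, and AUC is left out because it requires classifier scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred):
    """Return PA, PE, Cohen's kappa and the four rates for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = tp + tn + fp + fn
    pa = (tp + tn) / n
    # Agreement expected by chance from the marginal label frequencies.
    chance = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    kappa = (pa - chance) / (1 - chance) if chance < 1 else 0.0
    nan = float("nan")
    tpr = tp / (tp + fn) if tp + fn else nan
    tnr = tn / (tn + fp) if tn + fp else nan
    return {
        "PA": pa, "PE": 1 - pa, "KS": kappa,
        "TPR": tpr, "FNR": 1 - tpr,
        "TNR": tnr, "FPR": 1 - tnr,
    }


# Hypothetical example: one of two high-reduction cases is missed.
example_metrics = binary_metrics([1, 1, 0, 0], [1, 0, 0, 0])
```

In this small example three of four cases are predicted correctly (PA = 0.75), half of the high-reduction cases are detected (TPR = 0.5), and kappa is 0.5 because half of the possible improvement over chance agreement is achieved.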
Assessing the prediction performance for the three microbial indicators showed that, with the exception of the GLM classifier, the prediction accuracy of the classifiers (NB, RF and all types of SVM) was around 75%, which is very promising. However, in order to evaluate the performance of a classifier, all performance measures should be considered simultaneously. The performance of a classifier is considered reliable when both the true positive rate (TPR) and the true negative rate (TNR) are greater than 50%. Higher KS and AUC values also indicate better prediction performance.
Although the prediction accuracy of the GLM classifier was above 90%, its TPR was around zero and its TNR was close to 1. This means that relying on prediction accuracy alone was not sufficient, and that GLM was not an effective classifier for predicting microbial removal during these water treatment processes.
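A short worked identity shows why high accuracy can coexist with a TPR of zero. The symbol $\pi$ (the fraction of cases that are truly positive) and the value used for it below are introduced here purely for illustration and are not taken from the study.

```latex
\begin{align*}
\mathrm{PA} &= \pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{TNR} \\
\text{never predicting positive:}\quad \mathrm{TPR} &= 0,\ \mathrm{TNR} = 1
  \ \Rightarrow\ \mathrm{PA} = 1-\pi \\
\text{chance agreement:}\quad P_e &= \pi \cdot 0 + (1-\pi)\cdot 1 = 1-\pi \\
\mathrm{KS} &= \frac{\mathrm{PA} - P_e}{1 - P_e} = 0 \\
\text{e.g. } \pi = 0.1:\quad \mathrm{PA} &= 0.9,\ \mathrm{KS} = 0
\end{align*}
```

A classifier that almost never predicts the positive class therefore scores well on accuracy whenever positives are rare, while kappa stays at zero. This is the same qualitative pattern as the GLM results (high PA, TPR near zero, KS of 0.00).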
With a TPR of less than 50% (47%, 49% and 49% for Clostridium perfringens, E. coli and coliphage LRV, respectively), the Naïve Bayes model was also not a good classifier for any of the microbial indicators. On the other hand, the support vector machine with Gaussian kernel had a TPR above 50% for all three microbial indicators. Its FPR was also low (0.13–0.16). Moreover, its AUC and KS values were the highest among the classifiers compared.
For coliphage LRV, although the values of PA, TPR and TNR were similar for RF and all types of SVM, the AUC and KS were higher for SVM/GK. Therefore, SVM/GK was considered the most suitable of the classifiers tested.
The significance of results lies in the capacity of the SVM/GK model to predict the microbial indicator log removal value of a sample with unknown microbial concentration based on its previously learned knowledge and four simple measurements (i.e., UVT, colour, nitrite and nitrate) before and after the O3/BMF process. As shown in ESI† A, optimal prediction accuracy resulted with these four predictors. This has great implications for faster and more cost-effective assessment of the efficacy of O3/BMF water treatment process for microbial activity removal. The prediction model, developed in the form of a MATLAB script, takes the four physico-chemical measurements as inputs and calculates the microbial removal value range associated with those inputs.
Key findings from this study are:
• Three microbial indicators (Clostridium perfringens, E. coli, and somatic coliphage) were efficiently removed by the combination of ozonation and biological media filtration.
• Removal of these three microbial indicators can be predicted from physico-chemical measurements.
• Feature selection based on mutual information showed that the top four physico-chemical predictors of microbial indicator removal were UVT, colour, nitrite and nitrate concentrations.
• The best prediction algorithm was found to be support vector machines with Gaussian kernel (SVM/GK), followed by SVM with radial basis function, and random forests.
• Using the SVM/GK classifier, prediction accuracy for all microbial removals was above 75%, AUC ≥ 0.80, and kappa statistic (KS) ≥ 0.34.
• This prediction model, developed in the form of a MATLAB script, takes the four physico-chemical measurements as inputs, and calculates the microbial removal value range associated with those inputs. Therefore, this model can be used to assess the performance of other systems based on changes in the surrogate measures from pre- to post-water treatment process.
While removal of most microbial indicators during O3/BMF cannot be continuously monitored by online processes, the methodology discussed in this study provides a fast and cost-effective alternative based on surrogate measures. This is important because continuous online monitoring of water treatment process performance is an essential step in ensuring reliable water quality outcomes.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2ew00747a
This journal is © The Royal Society of Chemistry 2023