Angela R. Harris *ab, Amy J. Pickering ac, Alexandria B. Boehm a, Mwifadhi Mrisho d and Jennifer Davis ae
a Environmental and Water Studies, Department of Civil and Environmental Engineering, Stanford University, Stanford, CA, USA
b Civil, Construction and Environmental Engineering, North Carolina State University, 2501 Stinson Drive/208 Mann Hall, Campus Box 7908, Raleigh, NC 27695-7908, USA. E-mail: aharris5@ncsu.edu; Fax: +1 919 515 7908; Tel: +1 919 515 2402
c Civil and Environmental Engineering, Tufts University, Medford, MA, USA
d Ifakara Health Institute, Bagamoyo, Tanzania
e Woods Institute for the Environment, Stanford University, Stanford, CA, USA
First published on 24th April 2019
Exposure to fecal contamination continues to be a major public health concern for low-income households in sub-Saharan Africa. Drinking water and hands are known transmission routes for pathogens in household environments. In an effort to identify explanatory variables of water and hand contamination, a variety of analytical approaches have been employed that model variation in E. coli contamination as a function of behaviors and household characteristics. Using data collected from 1217 households in Bagamoyo, Tanzania, this investigation compares the explanatory variables identified in the three different modeling methods to explain hand and water contamination: ordinary least squares regression, logistic regression, and classification tree. Although the modeling approaches varied, there were some similarities in the results, with certain explanatory variables being consistently identified as being related to hand and water contamination (e.g., water source type for the water models and activity prior to sampling for the hand models). At the same time, there were also marked differences across the models. In sum, these results suggest there are benefits to using multiple analysis methods to assess relationships in complex systems. The models were also characterized by low explanatory power, suggesting that variation in hand and water contamination is difficult to capture when analyzing one-time water and hand rinse samples. For improved model performance, future studies could explore modeling of repeat measures of water quality and hand contamination.
Environmental significance
Assessment of microbiological contamination found in stored drinking water and on hands has been conducted extensively as a part of research investigations and monitoring efforts. Researchers seek to identify covariates that explain contamination of these sources of exposure to fecal contamination in low-income countries. Data analysis is typically conducted using bivariate tests or, sometimes, more complex regression models, but with limited success in explaining variation. This study uses data collected from ∼1200 households in Tanzania to explore the use of multiple analytical techniques to explain hand and water contamination. Although approaches varied, there were similarities in the results, with certain covariates being consistently identified. The analysis highlights limitations with current practices of microbial sampling and analysis and suggests further research.
In an effort to inform the design of interventions that minimize the levels of fecal contamination found in water and on hands, researchers have sought to identify household characteristics and practices associated with microbiological contamination levels. Some have employed bivariate statistical tests, evaluating the association between the concentration of fecal indicator bacteria (such as E. coli) and household characteristics and practices one at a time.9,10 Such an approach, however, fails to account for potential confounding (i.e., a variable having a spurious association with the outcome variable because it is also associated with an independent variable). A small number of studies have employed multivariate regression models—which can, in theory, address confounding—to identify associations between explanatory variables and levels of E. coli in stored water and on hands. For example, Levy et al. used a generalized estimating equations (GEE) model and found water source type, water treatment practices, and storage time to be significantly associated with levels of E. coli in stored water in Ecuadorian households.11 Also using GEE models, a second study in Ecuador by Levy et al. (2009) found source type, water storage practices, and rainfall to be significantly correlated with E. coli in stored water.12 Pickering et al. (2010) used linear regression and generalized estimating equations in their investigation of household characteristics and behaviors associated with levels of E. coli contamination of stored water and female caregiver hands in Tanzanian households.5 Contamination on female caregiver hands was the only independent variable found to be statistically significantly associated with stored water quality. Having an infant present in the household was associated with higher hand contamination and educational attainment of mother was associated with lower hand contamination.
Substantive conclusions regarding the correlates of E. coli contamination in stored drinking water thus vary across these investigations. For example, water treatment was statistically significantly associated with stored water quality in just one of the studies described above; water source type was significant in two of the studies. Several plausible and non-mutually exclusive explanations exist for such divergence. The differences could reflect variation in the true relationship between water quality and independent variables across the study sites. In addition, authors did not evaluate the same independent variables in all analyses. It is also possible that limited variation in the values of independent variables caused them to be omitted or found not to be statistically significant in some analyses.
Whereas the substantive conclusions across these studies were compared, goodness-of-fit measures were not because comparable measures were not reported for the majority of the models. Only the hand contamination linear regression model in Pickering et al. (2010) reported an indicator reflecting model fit.5 The model was characterized by poor explanatory power, with only 3% of the variation in the level of E. coli on two hands explained by the independent variables included in the model.
Researchers who employ linear regression techniques assume that the underlying relationship between log-transformed fecal indicator bacteria concentration and each explanatory variable is best represented by a straight-line function. If the effect of an explanatory variable is believed to be moderated by another variable, then interaction effects must be modeled explicitly, which has implications for sample size requirements. It may be, however, that the relationship between fecal indicator bacteria contamination in water and on hands and commonly tested correlates (e.g., water management and hygiene behaviors, types of water sources and sanitation facilities used) is better characterized by nonlinear functions. For example, continuous explanatory variables and outcome measures may exhibit threshold effects. In such cases, the two variables are associated only above or below certain quantitative limits, or thresholds. Alternatively, such relationships could be characterized by equifinality (multiple causal pathways resulting in the same level of contamination), in which different combinations of explanatory variable values would be associated with the same level of contamination. Exploring other analytical methods that relate explanatory variables to water and hand contamination in fundamentally different ways could offer insight into these complex relationships.
Logistic regression and classification tree analysis are two such alternative analytical approaches. Logistic regression relates independent variables to the predicted probability of a categorical outcome, which can be binary or have three or more ordered or non-ordered values. Logistic regression has been used to model categorical outcomes in many different fields; for example, it has been applied extensively in medicine to model the probability of illness as a function of disease risk factors.13 Each parameter estimate produced by logistic regression modeling can be interpreted as the average effect of a unit change in a correlate value on the log odds of a case belonging to a particular outcome category. Applied to fecal indicator bacteria contamination, logistic regression thus includes an assumption of threshold effects between contamination and an independent variable. At the same time, logistic regression methods still assume that the log odds (the logit of the predicted probability) for a given outcome category are linearly related to each independent variable.14 This modeling technique also assumes that the explanatory variables are independently related to the outcome, and tests for multicollinearity must be conducted in order to ensure this assumption holds for the data analyzed.
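As a minimal sketch of the linearity assumption noted above (not the model fitted in this study), the following Python snippet uses a hypothetical intercept and coefficient to show that a unit change in a binary explanatory variable shifts the log odds by exactly the coefficient, whereas the corresponding change in predicted probability depends on the starting point. The function names and all numeric values are illustrative assumptions.

```python
import math

def predicted_probability(intercept, coefficients, values):
    """Logistic model: the log odds are a linear function of the explanatory variables."""
    z = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Logit transform of a probability."""
    return math.log(p / (1.0 - p))

# Hypothetical values, chosen only for illustration (not estimates from this study).
INTERCEPT = 0.5
COEFFICIENT = -1.0  # effect of a binary explanatory variable on the log odds

p_without = predicted_probability(INTERCEPT, [COEFFICIENT], [0])
p_with = predicted_probability(INTERCEPT, [COEFFICIENT], [1])

# The difference in log odds equals the coefficient; the difference in
# probability does not.
shift_in_log_odds = log_odds(p_with) - log_odds(p_without)
shift_in_probability = p_with - p_without
```

The same coefficient would produce a different change in predicted probability at a different intercept, which is why coefficients are read on the log-odds scale.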
Classification tree analysis is a non-linear method that assigns cases into outcome categories based on the values of the explanatory variables, which can be continuous or categorical.15 Classification tree models are particularly useful for identifying non-linear relationships and interactions among explanatory variables, as well as for predicting outlier cases.15–17 Classification trees have been used to predict recreational water quality;16,18–20 they have also been applied, either for prediction or identification of correlates, in medicine, pharmacology, ecology, computational biology, and bioinformatics.15,17,21 A recent investigation employed classification tree analysis to predict soil-transmitted helminth infections using water, sanitation, and hygiene indicators.22 The study identified latrine structure and cleanliness as the only predictors of infection.22
In this study, we employ logistic regression and classification trees, along with ordinary least squares regression, to explain variation of fecal indicator bacteria concentration in stored drinking water and hand rinse samples. We use data collected from 1217 female heads of household in Bagamoyo, Tanzania, to estimate these three models. The primary objective of the study is to compare and contrast the explanatory variables selected in each of these analysis methods. We also assess the performance of the different models in terms of explaining variation in the outcome measures, and discuss the practical implications of our findings. These models are not optimized for prediction, and thus should not be used to predict outcomes for new sample data.
Household characteristics | |
---|---|
Number of households in study | 1217 |
Median weekly expenditures per capita for household, Tsh (USD) | 6500 (4.3) |
Female head of household works outside of home | 22% |
Female caregiver completed primary education | 73% |
Child 1 year old or less present in household | 29% |
GI illness^a in household in past 48 h | 9% |
Household has dirt floor in home | 51% |
Household located within Bagamoyo town | 61% |
Household sanitation | |
Household has private latrine | 61% |
Household latrine has a roof | 27% |
Household latrine has cement floor | 35% |
Household latrine has a septic tank | 7% |
Household latrine has a pit cover | 21% |
Feces visible around household premises | 6% |
Children in household practice open defecation | 58% |
Five or more flies visible in latrine | 25% |
Household water | |
Household has JMP improved water source | 82% |
Water source on household premises | 14% |
Household actively treated drinking water | 15% |
Water sampled was stored less than 24 h | 24% |
Water storage container fully covered at time of sampling | 95% |
Drinking water extraction method risky | 89% |
Respondent hand contacted water during stored water extraction | 15% |
Household hand hygiene | |
Household has hand washing station with soap and water | 19% |
Female caregiver dries hands with fabric after handwashing | 65% |
Female caregiver pours water from jerrycan to wet hands for handwashing | 15% |
Activity prior to hand rinse sample – washing | 14% |
Activity prior to hand rinse sample – food handling | 20% |
Activity prior to hand rinse sample – sitting | 60% |
Activity prior to hand rinse sample – other | 6% |
Median time since last hand washing with soap, hours | 3 |
Household microbial indicators | |
Geometric mean CFU EC per 100 mL in stored water | 33 |
Percent of households with 0–10 CFU EC per 100 mL in stored water | 28% |
Percent of households with 11–100 CFU EC per 100 mL in stored water | 35% |
Percent of households with more than 100 CFU EC per 100 mL in stored water | 37% |
Geometric mean CFU EC per 2 hands of female caregiver | 263 |
Percent of households with EC detected on female caregiver hands | 72% |

^a GI illness defined as 3 or more loose and watery stools in a 24 h period.
The same three techniques were also used to analyze female caregiver hand contamination. Concentrations of E. coli in the hand rinse sample were reported per 2 hands rinsed and log-transformed for use as the dependent variable in the ordinary least squares regression model. The outcome measure of hand contamination used in the logistic regression and classification tree models was whether E. coli was detected (1) or not (0) in the rinse sample. Because no risk thresholds for hand contamination were found in the literature, this binary category of E. coli detection was employed. Each predicted concentration of hand contamination generated by the ordinary least squares regression model was also classified as “detect” or “non-detect” of E. coli based on the lower detection limit of the hand rinse assay (17.5 CFU per 2 hands) to match the categorical outcome of the logistic regression and classification tree models.
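The conversion of a predicted concentration into the binary outcome can be sketched as follows. This is an illustration rather than the study's code: it assumes the log transformation is base 10 and that a prediction exactly at the detection limit counts as a detect, neither of which is stated explicitly in the text. The function name is hypothetical.

```python
import math

# Lower detection limit of the hand rinse assay, CFU per 2 hands (from the text).
DETECTION_LIMIT_CFU = 17.5

def classify_prediction(predicted_log10_cfu):
    """Return 1 (detect) or 0 (non-detect) for a predicted log10 concentration.

    Assumption: base-10 logs, and a value equal to the limit is a detect.
    """
    return 1 if predicted_log10_cfu >= math.log10(DETECTION_LIMIT_CFU) else 0

# Hypothetical predicted values, for illustration only.
example_classes = [classify_prediction(v) for v in (2.3, 1.0)]
```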
Ordinary least squares regression and logistic regression modeling were conducted using PASW Statistics (SPSS Inc., Chicago, IL). These modeling techniques assume explanatory variables are independently related to the outcome. To ensure variables in the models did not violate this assumption, tests for multicollinearity were performed by reviewing tolerance and variance inflation factors (VIFs) of explanatory variables and correlations between explanatory variables. If tolerances were above 0.2, VIFs were less than two, and Pearson's r correlations were less than 0.8 between variables, then the model was assumed not to be compromised by multicollinearity.14 Neither model was found to be compromised by multicollinearity. For the ordinary least squares regression models, we also tested that residuals were normally distributed with a predicted probability (P–P) plot and found no concerning deviations from the normality line. Explanatory variables for the models were chosen a priori based on theory and prior published research. For the regression models, all a priori chosen explanatory variables were included in the first run of the model. A reduced model was then estimated by an iterative process, first removing variables with p > 0.20, and then keeping only variables with p < 0.10. The results of the full models (i.e., including all the a priori explanatory variables) are found in the ESI.†
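The screening rule described above can be sketched in Python for the simplest case of two explanatory variables, where the tolerance of each variable is 1 − r² and the VIF is its reciprocal. With more than two variables, r² would be replaced by the R² from regressing each variable on all of the others; that general case is not shown here. The analysis in this study was run in PASW Statistics, so the function names and example data below are illustrative assumptions only.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def collinearity_screen(x, y):
    """Apply the thresholds from the text to a pair of explanatory variables."""
    r = pearson_r(x, y)
    tolerance = 1.0 - r ** 2   # two-variable special case
    vif = 1.0 / tolerance
    passes = tolerance > 0.2 and vif < 2.0 and abs(r) < 0.8
    return {"r": r, "tolerance": tolerance, "vif": vif, "passes": passes}

# Made-up binary indicators (not study data).
var_a = [1, 0, 1, 0, 1, 0, 1, 0]
var_b = [1, 1, 0, 0, 1, 1, 0, 0]   # unrelated to var_a
var_c = [1, 0, 1, 0, 1, 0, 1, 1]   # nearly a copy of var_a

unrelated = collinearity_screen(var_a, var_b)
related = collinearity_screen(var_a, var_c)
```

In this made-up example the nearly duplicated pair fails on the VIF criterion even though its correlation is just under 0.8, which shows that the VIF threshold is the stricter of the three.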
Within-sample predictive power was estimated for the models as a secondary measure of comparison of model performance (see ESI†), since explanatory power (as measured by the coefficient of determination, R2) could not be evaluated for all three modeling techniques. If developing a model for predictive purposes, predictive power should be evaluated on a new ‘test’ sample set (i.e., not the sample set that built the model), as within-sample predictive power could overestimate performance.27 However, the main objective of this study was to develop explanatory models, so for model development, decisions were made to optimize the statistical power (i.e., large sample size prioritized over withholding data for a ‘test’ set).
Classification tree modeling was conducted using MATLAB & Simulink R2010a, version 7.10 (The MathWorks Inc., Natick, MA) using the ‘classregtree’ command. The classification tree starts with a parent-node containing all observations; the tree is then split into two child-nodes based on an explanatory variable and the corresponding binary decision (i.e., yes/no or threshold), such that cases in the two child-nodes are the most homogeneous with respect to the outcome variable, i.e., the error cost (percent of cases incorrectly classified) is minimized. The child-nodes then become parent-nodes to undergo further splitting, and the process is repeated until a chosen tree optimization parameter is met. Each node in the tree is assigned a pruning level (i.e., a level of branching) based on its associated error cost; nodes closer to the top of the tree have a higher pruning level. Therefore, the explanatory variables selected at higher pruning levels sort more cases into the correct outcome categories (i.e., have a lower error cost).
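A single splitting step, as described above, can be sketched in plain Python. The sketch follows the description in the text (choose the variable and threshold that minimize the share of misclassified cases) and is not a reproduction of the MATLAB routine, whose internal splitting criterion may differ. Variable names and data are hypothetical.

```python
from collections import Counter

def misclassified(labels):
    """Number of cases that are wrong if the node predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def best_split(rows, labels):
    """Return (variable, threshold, error_cost) for the best binary split.

    rows: list of dicts mapping variable name to a numeric value.
    Cases with value <= threshold go to one child-node, the rest to the other.
    """
    best = None
    n = len(labels)
    for var in rows[0]:
        # Every observed value except the largest is a candidate threshold.
        for threshold in sorted({row[var] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, labels) if row[var] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[var] > threshold]
            cost = (misclassified(left) + misclassified(right)) / n
            if best is None or cost < best[2]:
                best = (var, threshold, cost)
    return best

# Hypothetical households (not study data).
rows = [
    {"improved_source": 0, "expenditure": 3.0},
    {"improved_source": 0, "expenditure": 7.0},
    {"improved_source": 1, "expenditure": 4.0},
    {"improved_source": 1, "expenditure": 8.0},
    {"improved_source": 0, "expenditure": 5.0},
    {"improved_source": 1, "expenditure": 2.0},
]
labels = ["high", "high", "low", "low", "high", "low"]

first_node = best_split(rows, labels)
```

Growing a full tree would repeat this step on each child-node until a stopping rule is met.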
In this study, classification trees were optimized by setting the ‘minleaf’ parameter, which is the minimum number of cases in a child-node required for branching of the tree to continue. The optimal ‘minleaf’ value was determined by minimizing the pooled error cost of a 10-fold cross-validation of the tree. A 10-fold cross-validation of the tree divides the full dataset into 10 equal sets, uses 9 of the 10 sets to train a model, and then calculates an error cost with the set that was not used to train the model (i.e., test set). The error cost calculation is then repeated, alternating the set used as the test set, and a pooled error cost is calculated. Values of ‘minleaf’ parameters from 1–100 were tested to find the value associated with the minimum pooled error cost of a 10-fold cross-validation of the tree.
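The tuning procedure described above can be sketched as follows. To keep the example short and self-contained, a one-split tree (a stump) on a single variable stands in for a full classification tree, and the minimum leaf size is enforced on both child-nodes; both choices are simplifying assumptions, and the function names and synthetic data are hypothetical.

```python
import random
from collections import Counter

def node_errors(labels):
    """Cases misclassified if the node predicts its majority class."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def fit_stump(xs, ys, minleaf):
    """One-split tree; a split is allowed only if both children keep >= minleaf cases."""
    best_errors, best_t = node_errors(ys), None  # baseline: no split
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < minleaf or len(right) < minleaf:
            continue
        errors = node_errors(left) + node_errors(right)
        if errors < best_errors:
            best_errors, best_t = errors, t
    if best_t is None:
        label = majority(ys)
        return lambda x: label
    left_label = majority([y for x, y in zip(xs, ys) if x <= best_t])
    right_label = majority([y for x, y in zip(xs, ys) if x > best_t])
    return lambda x: left_label if x <= best_t else right_label

def pooled_cv_error(xs, ys, minleaf, k=10, seed=0):
    """Share of cases misclassified when each fold is predicted by a model
    trained on the other k - 1 folds."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    wrong = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        predict = fit_stump([xs[i] for i in train], [ys[i] for i in train], minleaf)
        wrong += sum(predict(xs[i]) != ys[i] for i in test)
    return wrong / len(xs)

def choose_minleaf(xs, ys, candidates):
    """Candidate with the lowest pooled cross-validation error (first one on ties)."""
    return min(candidates, key=lambda m: pooled_cv_error(xs, ys, m))

# Synthetic data with a single threshold at 50 (not study data).
xs = list(range(100))
ys = [1 if x >= 50 else 0 for x in xs]
selected = choose_minleaf(xs, ys, candidates=[1, 5, 20, 60])
```

In this synthetic example, a minimum leaf size of 60 cannot be satisfied by any split of a 90-case training set, so the model falls back to predicting the majority class and its pooled error is much higher than for the smaller candidates.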
Water source | Low contamination, 0–10 CFU EC per 100 mL | Medium contamination, 11–100 CFU EC per 100 mL | High contamination, >100 CFU EC per 100 mL |
---|---|---|---|
Unimproved water source | 10.6% (23) | 33.6% (73) | 55.8% (121) |
Improved water source off-plot | 32.1% (264) | 33.7% (277) | 34.3% (282) |
Improved water source on-plot | 32.1% (51) | 40.3% (64) | 27.7% (44) |
Variable^d | Ordinary least squares regression^a: B^e | Ordinary least squares regression^a: SE | Multinomial logistic regression, medium EC category^b: B | Multinomial logistic regression, medium EC category^b: SE | Multinomial logistic regression, high EC category^b: B | Multinomial logistic regression, high EC category^b: SE | Classification tree^c: prune level |
---|---|---|---|---|---|---|---|
Constant | 1.8 | 0.1 | 0.52 | 0.40 | 1.16*** | 0.37 | — |
Respondent works outside the home^f | −0.20*** | 0.07 | −0.45** | 0.18 | −0.50*** | 0.18 | — |
Regular weekly expenditure per capita | — | — | — | — | — | — | 4 |
House has dirt floor^f | 0.16*** | 0.06 | — | — | — | — | — |
House located within town^f | −0.13** | 0.06 | −0.30* | 0.17 | −0.41** | 0.17 | — |
Infant present in household^f | — | — | 0.21 | 0.18 | 0.40** | 0.17 | — |
Household has private latrine^f | — | — | 0.41** | 0.16 | 0.21 | 0.16 | — |
Feces visible around household^f | — | — | 0.29 | 0.36 | 0.59* | 0.35 | — |
Latrine has a cement floor^f | — | — | — | — | — | — | 5 |
Children open defecate^f | — | — | — | — | — | — | 0 |
Water source is improved^f | −0.52*** | 0.07 | −0.95*** | 0.26 | −1.56*** | 0.25 | 2 |
Water was actively treated^f | — | — | — | — | — | — | 2 |
Water extracted in risky manner^f | — | — | 0.46* | 0.25 | 0.04 | 0.23 | — |
Hand contacted water when extracting^f | 0.14* | 0.07 | 0.19 | 0.23 | 0.51** | 0.22 | — |
Water stored for less than 24 h^f | −0.16** | 0.06 | −0.40** | 0.18 | −0.25 | 0.18 | — |
Log EC CFU per 100 mL on hands of caregiver | 0.08*** | 0.03 | 0.06 | 0.08 | 0.19** | 0.08 | 3 |

^a Dependent variable is log CFU EC per 100 mL water. ^b Reference group is low contamination level category. ^c Outcome categories are low, medium, and high EC contamination categories. ^d Variables tested, found not to be significant, and excluded from models include: someone in the household has GI illness, latrine has a septic tank, latrine has a roof, latrine has a pit cover, flies present in latrine, water source on-plot, and water storage container is covered. ^e Unstandardized beta coefficient. ^f Binary variable (0 or 1). ***p < 0.01; **0.01 ≤ p < 0.05; *0.05 ≤ p < 0.10.
For some associated correlates, however, the models did not agree. For instance, reported active treatment of the water, which included boiling, adding a coagulant, filtration, and chlorination, was not statistically significant in the ordinary least squares regression or multinomial logistic regression models but was a classification node in the classification tree model. Having a latrine with a cement floor was the first classification node in the classification tree model, meaning it was the variable that best sorted the cases into homogeneous groups of the outcome variable.16 This variable was not statistically significant in either regression model, however. The modeling techniques relate explanatory variables to stored water quality in fundamentally different ways, as mentioned in the introduction, so it is not surprising that differences arise in the explanatory variables identified. Based on underlying theory, we cannot determine which of the models is ‘correct’ (i.e., reflects the true state of the world); rather, each model can offer insight for future hypothesis generation.
Uniquely, the classification tree model identified several ‘recipes’—or combinations of characteristics—for households with highly contaminated water (Fig. 1). For instance, the predicted probability of high contamination in stored water was 0.54 for a household without a cement floor in the latrine, that did not report treating their water, and that obtained their water from an unimproved source. Interestingly, if a household did not have a cement floor but reported that they did treat their water, the model still predicted high contamination. The classification tree model predicted low contamination in the stored drinking water for households that had a cement floor in the latrine and regular weekly expenditure greater than 6.29 USD (47% of households reported greater than 6.29 USD regular weekly expenditure). Also, the model predicted low contamination for households that had a cement floor in the latrine, female caregiver hand contamination less than 3.3 log CFU E. coli per 2 hands, and weekly expenditure of less than 3.48 USD (39% of households reported less than 3.48 USD regular weekly expenditure).
Variable^d | Ordinary least squares regression^a: B^e | Ordinary least squares regression^a: SE | Binary logistic regression^b: B | Binary logistic regression^b: SE | Classification tree^c: prune level |
---|---|---|---|---|---|
Constant | 2.30*** | 0.09 | 0.48** | 0.23 | — |
Respondent works outside the home^f | — | — | 0.36** | 0.17 | — |
Regular weekly expenditure per capita^g | −0.02*** | 0.01 | — | — | 1 |
House located in town^f | 0.39*** | 0.06 | 0.58*** | 0.14 | 1 |
Infant present in household^f | 0.16** | 0.07 | — | — | — |
Household has private latrine^f | −0.16** | 0.06 | −0.33** | 0.14 | — |
Feces visible around household^f | 0.25* | 0.13 | 0.84** | 0.34 | — |
Latrine has a cement floor^f | — | — | — | — | 2 |
Latrine has a septic tank^f | −0.28** | 0.12 | — | — | — |
Flies present in latrine^f | — | — | −0.26* | 0.15 | — |
Children open defecate^f | — | — | 0.27** | 0.14 | — |
Time since last hand washing 1 h or less^f | — | — | — | — | 3 |
Prior activity involved washing^h | 0.19** | 0.09 | 0.47** | 0.21 | — |
Prior activity food handling^h | — | — | 0.36** | 0.18 | — |
Prior activity (for classification tree only)^i | — | — | — | — | 1 |

^a Dependent variable is log CFU E. coli per 2 hands. ^b Reference group is no detection of E. coli. ^c Outcome categories are E. coli detected or not on female caregiver hands; pruning level represents the level of branching in the tree, with nodes at the top of the tree having a higher pruning level. ^d Variables tested, found not to be significant, and excluded from models include: house has a dirt floor, someone in household has GI illness, latrine has a roof, latrine has pit cover, respondent has primary education, hand washing station with soap present, hands dried with fabric after hand washing, and hands wetted for hand washing by pouring water. ^e Unstandardized beta coefficient. ^f Binary variable (0 or 1). ^g In (1000 Tsh). ^h Dummy variables with the reference activity of ‘sitting’ and ‘other activities’. ^i Categorical variable of activity prior to hand rinse being sitting, washing, food handling, or other. ***p < 0.01; **0.01 ≤ p < 0.05; *0.05 ≤ p < 0.10.
Other variables were identified as correlates in only one or two of the models. In the ordinary least squares regression and binary logistic regression models, a household having a private latrine was statistically significantly associated with lower levels of E. coli contamination on a respondent's hands (negative coefficients in both models), whereas feces being observed on the ground near the household was associated with higher levels. For the ordinary least squares regression model, the household having a septic tank was associated with decreased contamination on the respondent's hands. For the binary logistic regression model, household children reportedly practicing open defecation was also statistically significantly associated with increased contamination on the respondent's hands. Time since last hand washing with soap was the first node in the classification tree (i.e., highest pruning level), meaning it was the most effective variable in sorting cases into homogeneous groups of E. coli detect and E. coli non-detect (Fig. 2). Interestingly, the tree predicted E. coli detection in the hand rinse sample if the respondent reported hand washing with soap less than 1 hour prior to sampling. The classification tree model had only one branch that predicted no detection of E. coli in the rinse sample: respondents who reported hand washing more than 1 hour prior to the rinse, did not have a cement slab on their latrine, lived outside of Bagamoyo town, were sitting prior to the rinse sample, and had comparatively high regular weekly expenditures.
Some explanatory variables were identified as related to the outcome variables across all three modeling types (e.g., water source type for the water models and activity prior to sampling for the hand models), providing triangulated support for these relationships. Strikingly, having an improved water source on-plot rather than off-plot was not associated with better stored water quality (Table 2). Because an on-plot improved water source represents the top of the water ladder,28 this finding suggests that, if drinking water is still stored in the home, gaining access to improved water infrastructure on the living premises would not necessarily confer water quality gains.
The explanatory power of the regression models was lower for hand contamination than for stored water quality, and was on par with previous research.5 In addition, all three modeling techniques exhibited low within-sample predictive power (see ESI†). Similar to other studies,5,11,12 our results highlight the complexity of explaining stored water quality and hand contamination in low-resource settings.
Several possible explanations exist for the poor explanatory and within-sample predictive power of these models. The limited variation in values of the outcome variable (e.g., low percentage of cases were non-detects for the hand contamination categorical outcome models) could contribute to the poor model performance.22 Also, several of the explanatory variables used in the analyses were based on self-reported data. Biased responses could prevent the identification of an existing relationship between the outcome variable and a correlate. In particular, unreliable reporting of hygiene behaviors has been documented in other studies.29,30 Aside from biases in the collected data, omission of important correlates could limit the predictive power of the models.
Poor model performance could also stem from E. coli concentrations not being an appropriate indicator of fecal contamination in drinking water and on hands. For instance, E. coli have been found to be naturally occurring in soils (i.e., not from feces) in tropical environments.31–33 In such a case, one would not necessarily expect the concentration of these organisms in stored water to be correlated with household water management and hygiene practices. Additionally, E. coli are found in the feces from multiple animal hosts, not just humans, which is problematic since many of the sanitation-related variables (e.g., children practicing open defecation) included in the models focus on contamination from human, rather than non-human feces.
We also note that E. coli measurements can exhibit considerable intrinsic sampling variability.34 Such random sampling error can impede efforts to identify associations between E. coli and extrinsic explanatory variables using multivariate statistical modeling.35 In theory, taking replicate water quality measurements would reduce measurement error, improve model fit, and increase precision of parameter estimates.35 Future research that explores the impact of replicate sampling on the explanatory power and parameter estimates of water quality and hand contamination models would thus be a valuable contribution. Some researchers have applied the Spearman–Brown formula to determine the number of replicate measures needed for a desired level of precision in a parameter estimate.36
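For reference, the Spearman–Brown formula in its standard form relates the reliability of a single measurement, ρ, to the reliability of the mean of k replicates, kρ / (1 + (k − 1)ρ). The sketch below inverts that relationship to obtain the number of replicates needed for a target reliability. How the cited work applied the formula is not described here, so the function names and the example reliabilities are assumptions chosen only for illustration.

```python
import math

def reliability_of_mean(single_reliability, k):
    """Spearman-Brown: reliability of the mean of k replicate measurements."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)

def replicates_needed(single_reliability, target_reliability):
    """Smallest whole number of replicates whose mean reaches the target reliability."""
    k = (target_reliability * (1 - single_reliability)) / (
        single_reliability * (1 - target_reliability)
    )
    # Small tolerance so that floating-point noise does not round an exact
    # integer result up by one.
    return math.ceil(k - 1e-9)

# Hypothetical reliabilities (not estimates from this study).
n_replicates = replicates_needed(single_reliability=0.3, target_reliability=0.8)
```

With these hypothetical values the formula gives 9.33, so ten replicates would be required; nine would fall just short of the target.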
Variability in E. coli measurements within a household may also be non-random. For example, there is evidence that water quality varies systematically over both short (<1 day) and longer (>1 day) time scales.12,37 The water and hand samples taken in the present study were captured at different times of day, and no information was available regarding temporal trends in contamination among study households. In future research, incorporating both repeat measurements within a household (over a relevant time frame) and replicate measurements (at each point in time) could allow modeling efforts to estimate the share of total variance attributable to explanatory variables, systematic temporal variation, and random variability.
Despite some limitations in the models, this study provided some insight into associations between behaviors, household characteristics, and contamination in water and on hands, which could be further explored for causal relationships in experimental evaluations. This work highlighted how the use of multiple modeling techniques can be fruitful when underlying relationships between explanatory variables and an outcome remain unclear. Although the modeling approaches varied, there were some similarities in the results, with certain explanatory variables being consistently identified as being related to hand and water contamination (e.g., water source type for the water models and activity prior to sampling for the hand models). At the same time, there were also marked differences across the models. In sum, these results suggest there are benefits to using multiple analysis methods to assess relationships in complex systems.
Footnote
† Electronic supplementary information (ESI) available. See DOI: 10.1039/c8em00460a
This journal is © The Royal Society of Chemistry 2019