Murage Ngatia,* David Gonzalez, Steve San Julian and Arin Conner
Municipal Water Quality Investigations Program, Division of Environmental Services, California Department of Water Resources, P.O. Box 942836, Sacramento, CA, USA. E-mail: mngatia@water.ca.gov; Fax: +1 916-376-9692; Tel: +1 916-376-9714
First published on 21st October 2009
To evaluate whether two unattended field organic carbon instruments could provide data comparable to laboratory-generated data, we needed a practical assessment. Null hypothesis statistical testing (NHST) is commonly utilized for such evaluations in environmental assessments, but researchers in other disciplines have identified weaknesses that may limit NHST's usefulness. For example, in NHST, p-values shrink as sample size grows, so a statistically significant result can be obtained merely by increasing the sample size. In addition, p-values can indicate that observed results are statistically significantly different when in reality the differences are trivial in magnitude. Equivalence tests, on the other hand, allow the investigator to incorporate decision criteria that have practical relevance to the study. In this paper, we demonstrate the potential use of equivalence tests as an alternative to NHST. We first compare data between the two field instruments, and then compare the field instruments' data to laboratory-generated data using both NHST and equivalence tests. NHST indicated that the two field instruments' data differed significantly from each other and from the laboratory data. Equivalence tests showed that the data were equivalent because they fell within a pre-determined equivalence interval based on our knowledge of laboratory precision. We conclude that equivalence tests provide more useful comparisons and interpretation of water quality data than NHST and should be more widely used in similar environmental assessments.
Environmental impact

Environmental research is generally conducted to understand an issue of concern. Such research requires collection, analysis and interpretation of data about the issue. Currently, the majority of environmental data are evaluated using classical statistics, which utilize p-values as the standard criterion to indicate significance of the research findings. In this paper, we present equivalence statistical tests as better alternatives to p-values in evaluating an environmental water quality assessment issue. We posit that other environmental issues would benefit from data interpretations that incorporate the practical importance of the research findings. We conclude that equivalence tests offer an alternative to classical statistics' p-values in evaluating whether research findings are of practical importance in understanding any environmental problem.
An equivalence test, on the other hand, evaluates whether treatments A and B are close enough to be considered similar. The investigator sets a benchmark, then uses prior knowledge or belief to define a region around the benchmark (the equivalence interval) within which the difference is deemed unimportant (possibly because it falls in a range considered trivial). The equivalence interval must be set a priori. Detailed descriptions of the development and theory of different approaches for testing equivalency are available in the literature.1–5 One can test the null hypothesis that two (or more) treatments are inequivalent against the alternative hypothesis that they are equivalent, thereby reversing the traditional tests. This shifts the burden of proof to demonstrating that what is being tested (e.g., water quality) meets certain criteria.
Equivalence tests have been most widely used in pharmacology for approval of generic drugs since 1984.6 Before a generic drug is approved, clinical trials have to demonstrate therapeutic equivalence (bioequivalence) to the reference formulation using 20% limits. Equivalence tests have not been widely adopted in environmental studies although their use has been advocated.7,8
The following equations and discussion explain NHST and its weaknesses:
H0: A − B = 0 (i.e., A = B) (1)

Ha: A − B ≠ 0 (i.e., A − B < 0 or A − B > 0) (2)
In the above example, if the calculated p is less than 0.05, H0 is rejected. The conclusion is that treatments A and B are statistically significantly different at the chosen α level. This logic is often challenged by NHST critics. For example, Kirk9 stated, “In scientific inference, what we want to know is the probability that the null hypothesis (H0) is true given that we have obtained a set of data (D); that is, p(H0|D). What null hypothesis significance testing tells us is the probability of obtaining these data or more extreme data if the null hypothesis is true, p(D|H0).” In other words, H0 makes inferences about (more extreme) data that were never collected.
Compared with other disciplines, where NHST has been criticized for over 70 years, water quality assessments contain few discussions critical of NHST.10–17 Harlow et al.18 summarize the arguments for and against NHST. These criticisms are pertinent to environmental assessments because a large sample size can increase statistical significance, and a statistically significant result is not necessarily of practical significance.
The following are the main criticisms of NHST in other disciplines:
(1) The basic premise of NHST, where the difference between treatments (or whatever is being tested) is assumed to be zero (i.e., H0: µ1 − µ2 = 0), is unrealistic. As Tukey19 put it, “All we know about the world teaches us that the effects of A and B are always different—in some decimal place—for any A and B.” Some null hypothesis tests have been termed “gratuitous significance testing,” i.e., using statistical significance testing for what is readily obvious or already known.20 An example from the literature is given by Johnson,21 who cited a statistically significant finding that “the density of large trees was greater in unlogged forest stands than in logged stands (p = 0.02).”
(2) The p-value is arbitrary. Statistical significance has been shown to increase with sample size, i.e., the p-value gets smaller as the sample size gets larger.16,22,23 A possible solution is to predetermine a sample size that prevents the test from becoming too powerful. However, this approach is not feasible in long-term environmental monitoring programs where accumulation of data is unavoidable and is actually the main objective. (A simulation sketch after this list illustrates how the p-value shrinks as the sample size grows.)
(3) A statistically significant result does not necessarily indicate magnitude of effect or practical importance.9,24 In the social sciences, some efforts provide a measure of the practical significance of a statistically significant result using “effect sizes” to indicate the practical importance of the results.11,16,25 However, these interpretational approaches are largely absent from environmental literature where a small statistically significant p-value is still the gold standard in interpreting research findings.
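To make criticism (2) concrete, here is a minimal Python simulation (our illustration, not from the paper; the true difference of 0.02 mg/L and standard deviation of 0.5 are hypothetical) showing a one-sample t-test on paired differences becoming statistically significant simply because the sample size grows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_diff = 0.02   # hypothetical, practically trivial mean difference (mg/L)
sigma = 0.5        # hypothetical standard deviation of paired differences

for n in (50, 500, 5000, 50000):
    diffs = rng.normal(loc=true_diff, scale=sigma, size=n)
    t_stat, p_value = stats.ttest_1samp(diffs, popmean=0.0)
    print(f"n = {n:6d}  p = {p_value:.4f}")
# As n grows, p eventually falls below 0.05 even though the underlying
# difference (0.02 mg/L) is practically negligible.
```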
In this paper, we demonstrate the potential use of equivalence tests as an alternative to NHST in environmental assessments. We present equivalence tests from a practitioner's point of view. In our study, we use organic carbon (OC) data generated by two online field OC instruments, each using a different analytical method. We compare a subset of the field instruments' data to grab sample results analyzed in the laboratory. The first objective is to determine whether the two field instruments' data are comparable. The second objective is to determine whether the field instruments' data are comparable in precision to laboratory-generated data. We discuss why equivalence tests, rather than NHST, provide a more practical interpretation of the data to answer these objectives.
Fig. 1 The Sacramento River at Hood station (Hood, 38.382°N, 121.519°W) was established in 1998 to provide a platform for collecting grab samples and to support a secure enclosure for evaluating advanced analytical instruments for online field monitoring of water quality.
A Sievers 800 laboratory-grade OC analyzer (GE Analytical Instruments, Boulder, CO) using the ultraviolet (UV) persulfate chemical oxidation analytical method was installed in 1999. A Shimadzu 4100 OC analyzer (Shimadzu Scientific Instruments, Columbia, MD) using the high temperature catalytic combustion oxidation (HTC) analytical method was installed in April 2002 to run in tandem with the Sievers 800 (Fig. 2). The two analytical methods are described in Standard Methods.26 The United States Environmental Protection Agency (USEPA) has comparable analytical methods; the equivalent USEPA method is 415.3,27 which describes both HTC and UV persulfate oxidation for the analysis of OC. The Sievers 800 was replaced with a new Sievers 900 instrument in May 2005. (Although the 900 has upgraded hardware and software, its mode of operation is similar to the 800; the data from the two models are not separated in the data analyses in this paper.) We use OC data collected by the two instruments between April 2002 and April 2007 to demonstrate the potential use of equivalence tests over NHST in water quality analysis.
Fig. 2 A Shimadzu 4100 OC analyzer using the high temperature catalytic combustion oxidation method runs in tandem with a Sievers 800 organic carbon analyzer using the ultraviolet persulfate chemical oxidation method.
The instruments were calibrated and maintained to manufacturers' specifications by DWR field support staff. A Campbell Scientific CR10X data logger (Campbell Scientific, North Logan, Utah) controlled the analytical frequencies and also temporarily stored the OC data. The data were uploaded every two hours by phone modem to the online California Data Exchange Center (CDEC) website (http://cdec.water.ca.gov). The station name is “srh.” Sievers is sensor 101, and Shimadzu TOC is sensor 112.
The paired t-test is the standard method (in NHST) for comparing two groups of paired data.28 The Kruskal–Wallis (K-W) analysis of variance (ANOVA) test is one of the options available in NHST for testing more than two sets of data. We used both of these tests for the NHST analyses.
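For readers who want to reproduce this kind of NHST analysis, the following is a minimal sketch using Python's scipy.stats; the synthetic arrays stand in for the paired daily-average OC data and are our assumptions, not the study's values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the paired daily-average OC data (mg/L);
# the real study used 1665 Shimadzu/Sievers pairs.
sievers_oc = rng.normal(2.5, 0.6, size=1665)
shimadzu_oc = sievers_oc + rng.normal(0.05, 0.15, size=1665)
lab_oc = sievers_oc[:60] + rng.normal(0.0, 0.2, size=60)  # grab-sample subset

# Paired t-test between the two field instruments (H0: mean difference = 0)
t_stat, p_paired = stats.ttest_rel(shimadzu_oc, sievers_oc)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_paired:.3g}")

# Kruskal-Wallis one-way ANOVA on ranks across the three data sources
h_stat, p_kw = stats.kruskal(sievers_oc, shimadzu_oc, lab_oc)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.3g}")
```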
The macro procedure worked as follows:
Suppose that d̄ was the overall mean difference between the 1665 pairs of Shimadzu and Sievers OC daily average results, and let θ1 and θ2 be the lower and upper bounds of the pre-determined equivalence interval. The first part of the TOST performed the following hypothesis test (α = 0.05):

H01: d̄ ≤ θ1 (3)

Ha1: d̄ > θ1 (4)

The second part of the TOST performed the following hypothesis test (α = 0.05):

H02: d̄ ≥ θ2 (5)

Ha2: d̄ < θ2 (6)

Note that in the above procedures (equations 3–6), we were testing the null hypotheses of inequivalence (i.e., the null hypotheses that Shimadzu OC data were not equivalent to Sievers OC data). Shimadzu OC data were equivalent to Sievers OC data only when:

θ1 < d̄ < θ2 (7)

If (7) was true, we would conclude that Shimadzu data were equivalent to Sievers OC data.
The TOST is identical to testing whether the 90% confidence interval around d̄ (calculated using 1 − 2α) is entirely contained within the equivalence interval. If equivalence is found to be true in the TOST, the confidence interval must be completely contained within the equivalence interval.4,30 If either null hypothesis was true, the two instruments' data were not equivalent.
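As an illustration only (not the authors' SAS macro), the following Python sketch implements the TOST of eqns (3)–(7) for paired differences; the synthetic data and the bounds θ1 = −0.2 and θ2 = 0.2 mg/L are our assumptions:

```python
import numpy as np
from scipy import stats

def tost_paired(d, theta1, theta2, alpha=0.05):
    """Two one-sided tests (TOST) for equivalence of paired differences d.

    Equivalence is concluded when both H01: mean(d) <= theta1 and
    H02: mean(d) >= theta2 are rejected, i.e., both one-sided
    p-values fall below alpha.
    """
    n = len(d)
    d_bar = np.mean(d)
    se = np.std(d, ddof=1) / np.sqrt(n)

    # Test against the lower bound, H01: mu_d <= theta1 (eqns 3-4)
    t1 = (d_bar - theta1) / se
    p1 = stats.t.sf(t1, df=n - 1)

    # Test against the upper bound, H02: mu_d >= theta2 (eqns 5-6)
    t2 = (d_bar - theta2) / se
    p2 = stats.t.cdf(t2, df=n - 1)

    # Equivalent check: is the 90% (1 - 2*alpha) CI inside (theta1, theta2)?
    t_crit = stats.t.ppf(1 - alpha, df=n - 1)
    ci = (d_bar - t_crit * se, d_bar + t_crit * se)

    return (p1 < alpha) and (p2 < alpha), (p1, p2), ci

rng = np.random.default_rng(1)
d = rng.normal(0.05, 0.15, size=1665)   # synthetic paired differences (mg/L)
equivalent, p_values, ci = tost_paired(d, theta1=-0.2, theta2=0.2)
print(equivalent, p_values, ci)
```

The function also returns the 90% confidence interval, so the CI-containment check described above can be read directly off the output.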
Fig. 3 Sievers and Shimadzu daily average organic carbon results are equivalent.
Fig. 4 Kruskal–Wallis pairwise comparisons between instruments. The dotted lines (−z, z) denote the interval outside of which paired differences were statistically significantly different.
Fig. 5 The confidence interval limits between all paired differences are within the 20% equivalence interval (dotted line); thus Sievers, Shimadzu, and Bryte Lab OC data were all equivalent. We consider the deviations between the instruments to be expected reproducibility differences since all the confidence intervals are contained in the equivalence interval.
This study contained a large amount of data (more than 1000 OC data pairs), so the paired t-test would be likely to flag small, negligible differences between the two instruments as statistically significant. High frequency data may present serial correlation problems when used for time series or trend analysis. In method or instrument comparisons, however, a major goal is to block for inter-sample differences, i.e., to keep the aliquots of the replicate samples analyzed in a round-robin study of different classes of instruments as uniform as possible, so we did not consider serial correlation to be an issue in our study. The equivalence test is the appropriate test in this study because we were not interested in small statistically significant differences but in differences of practical significance, i.e., whether two unattended field instruments using different analytical methods can provide data comparable in quality to laboratory-generated results. The equivalence tests demonstrated that the field UV persulfate oxidation and HTC oxidation instruments generated comparable data to each other at Hood station. Both instruments' data were of equivalent quality to laboratory-generated results. An advantage of field instruments is that they can generate high frequency data at significantly lower cost than laboratory-analyzed data. In addition, field instruments provide data in near real time, whereas laboratory analysis of grab samples may take several weeks.
Equivalence tests offer a good alternative to tests of significance in environmental assessments because they allow the investigator to incorporate decision criteria that have practical relevance to the study. Most environmental research projects are funded to find solutions to one or more practical problems. It is therefore desirable to use a data analysis technique that can readily indicate the magnitude and/or practical importance of the study results relative to the project's objectives. In this study, it was logical to use a precision criterion comparable to laboratory duplicate analysis because we were interested in evaluating whether the unattended field instruments could provide data comparable to accepted laboratory quality standards. Equivalence tests provided the mechanism to make these kinds of comparisons. If equivalence tests are preferred in determining the effectiveness of generic drugs for humans, it is our opinion that they should be good enough for environmental assessments.
A potential difficulty in using equivalence tests is objectively determining the equivalence interval. Environmental research has historically relied on statistical significance testing for decision-making and does not have a track record of defining quantitative criteria to indicate the practical significance of statistically significant test results. Exceptions are where limits are set by a regulatory agency. Practitioners have a number of options for determining a difference of practical significance.32 The first and most straightforward is defining equivalence intervals empirically, such as by using a regulatory limit (as is the case with generic drugs) or by using calculations from previous research. In this paper we set the equivalence interval using our knowledge of the laboratory precision of OC analyses, which was an empirical approach. A second approach to constructing the equivalence interval is soliciting (or eliciting) opinions from experts in the area of interest. This approach is more complex than the empirical approach, but in many environmental situations, it may be the only viable alternative.
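As a hypothetical sketch of the empirical approach, an equivalence interval could be derived from laboratory duplicate precision as follows (all values below are made up for illustration; real bounds would come from the laboratory's own quality-control records):

```python
import numpy as np

# Made-up laboratory duplicate pairs (mg/L)
dup1 = np.array([2.1, 3.4, 1.8, 2.9, 4.2])
dup2 = np.array([2.3, 3.2, 1.9, 3.1, 4.0])

# Relative percent difference (RPD) for each duplicate pair
rpd = 200.0 * np.abs(dup1 - dup2) / (dup1 + dup2)
print(f"max duplicate RPD = {rpd.max():.1f}%")

# If duplicates routinely agree within ~20%, then +/-20% of the mean
# concentration is a defensible empirical equivalence interval.
mean_oc = 2.5   # hypothetical mean OC concentration (mg/L)
theta1, theta2 = -0.2 * mean_oc, 0.2 * mean_oc
print(f"equivalence interval: ({theta1:.2f}, {theta2:.2f}) mg/L")
```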
Footnotes
† Part of a themed issue dealing with water and water related issues.
‡ Electronic supplementary information (ESI) available: Bryte Lab quality control and accreditation details. See DOI: 10.1039/b912098j