Analytical Methods Committee AMCTB No. 93

Received
18th December 2019

First published on 31st January 2020

A significance test can be performed by calculating a test statistic, such as Student’s t or chi-squared, and comparing it with a critical value for the corresponding distribution. If the test statistic crosses the critical value threshold, the test is considered “significant”. The critical value is chosen so that there is a low probability – often 5% (for “95% confidence”) – of obtaining a significant test result by chance alone. Routine use of computers has changed this situation; software presents critical values at traditional probabilities, but now also calculates a probability, the “p-value”, for the calculated value of the test statistic. A low p-value – say, under 0.05 – can be taken as a significant result in the same way as a test statistic passing the 95% critical value. This applies to a wide variety of statistical tests, so p-values now pop up routinely in statistical software. However, their real meaning is not as simple as it seems, and the widespread use of p-values in science has recently been challenged – even banned. What does this mean for p-values in analytical science?

A p-value is the probability, given certain assumptions, that a test statistic equals or exceeds its experimental value. For example, suppose an analyst is checking the calibration of an elemental analyser to see whether it is biased. She takes five readings on a reference material and finds the difference between their mean and the reference value. She then conducts a statistical test, such as Student’s t test. Like most statistical tests, this one starts with a ‘null hypothesis’ – in this case, that there is zero bias. The software used for the test generates the ubiquitous p-value; if it is small, the analyst will usually conclude that there is evidence of bias and will take some corrective action.
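A check of this kind can be sketched in a few lines of Python using scipy (an assumption of this sketch, not part of the brief); the five readings and the reference value below are made-up illustrative numbers, not real data.

```python
# One-sample Student's t test of replicate readings against a reference value.
# Readings and reference value are illustrative, not from a real experiment.
from scipy import stats

readings = [10.12, 10.08, 10.15, 10.11, 10.09]  # five replicate readings
reference = 10.00                                # certified reference value

t_stat, p_value = stats.ttest_1samp(readings, popmean=reference)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Evidence of bias: investigate and correct the calibration.")
```

Here the mean reading (10.11) differs from the reference value by several standard errors, so the p-value is small and the analyst would investigate further.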

This p-value is the probability of obtaining the test statistic, the calculated value of Student’s t. It is – perhaps surprisingly – not the probability that the analyser is biased. Instead, it is the probability of getting the test statistic, or a greater value, if the null hypothesis is correct and with the additional assumptions listed below. It answers the question “how probable is our value of Student’s t if the instrument really has no bias?”. Clearly, if that probability is very small, it is sensible to question the null hypothesis or some other assumption; that is why the null hypothesis is ‘rejected’ when the p-value is small. Equally, if the probability of our result is high, we assume that there is no effect of interest, and would not usually undertake further investigations.
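The “test statistic or a greater value” definition can be made concrete: the two-sided p-value is the tail probability P(|T| ≥ t_obs) of Student’s t distribution with n − 1 degrees of freedom. A minimal sketch, assuming scipy is available; the observed statistic and sample size are illustrative numbers.

```python
# Two-sided p-value as the tail probability of Student's t under the null.
# t_obs and n are illustrative values, not from a real experiment.
from scipy import stats

t_obs, n = 2.5, 5                           # observed statistic, no. of readings
p = 2 * stats.t.sf(abs(t_obs), df=n - 1)    # P(|T| >= t_obs), df = n - 1
print(f"p = {p:.3f}")
```

With four degrees of freedom, t = 2.5 falls short of the 95% critical value (2.776), so the p-value lands just above 0.05.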

While this is useful, we have not addressed the exact question of interest, which is: ‘How sure are we that our instrument is unbiased, given our data?’ To answer this second question properly, we would first need to know the chances of the instrument being biased before we start the experiment, as well as the chance of getting the same result if the instrument is biased. We use hypothesis tests largely because we do not generally have this essential prior information, so must rely on a related but indirect question to decide how to proceed.

The test also assumes that errors are random, independent of one another, normally distributed around zero and that the (true) standard deviation is constant within the range of interest. Other distribution assumptions would mean different probability calculations. There is also an implicit assumption that only one such test is conducted. This is important because, if we carry out the test many times, we would have a much higher chance of seeing at least one extreme value of Student’s t: we should then be very wary of interpreting p-values below 0.05 as an indication of some important finding.
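The inflation caused by repeated testing is easy to demonstrate by simulation. The sketch below, which assumes numpy and scipy and uses arbitrary illustrative settings (20 tests per experiment, five observations each), shows that even with zero true bias the chance of at least one ‘significant’ result per experiment is far above 5% (about 1 − 0.95²⁰ ≈ 64%).

```python
# Simulate experiments of 20 t tests each with the null hypothesis true:
# each test is "significant" 5% of the time, but the chance of at least one
# spurious significant result per experiment is much higher.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_tests, n_obs = 2000, 20, 5
hits = 0
for _ in range(n_experiments):
    data = rng.normal(0.0, 1.0, size=(n_tests, n_obs))   # zero true bias
    p = stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue
    if (p < 0.05).any():          # at least one spurious "significant" test
        hits += 1
rate = hits / n_experiments
print(f"Family-wise false positive rate: about {rate:.2f}")
```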

– 0.05 is an arbitrary criterion, leading us to claim a ‘significant’ finding one time in 20 when there is in fact no real effect. False positive findings can lead researchers to conduct unnecessary follow-up work. Worse, in fields where the effects studied are nearly always very small or absent – for example, looking for genes with a large effect on a disease, among a very large population of genes with no effect – the outcome is that nearly all of the apparent ‘positives’ (at p = 0.05 for each single experiment) will be false.

– A small, and therefore ‘significant’, p-value does not say anything about the size or practical importance of the effect; inconsequential bias can lead to a significant p-value if precision is very good.

– A small p-value is unlikely to arise by chance, but it does not rule out the possibility of experimental bias caused by some factor that is not under study. For example, runs on different stability test samples on two separate days may show a significant difference, but that need not signal a change in the test materials: it may indicate instead a change in the analytical method.

– A large p-value is not proof that there is no effect. The experiment may be insufficiently precise, or too small, to find the effect of interest. In short, “absence of evidence is not evidence of absence”; we may simply not have looked hard enough!

– Probability calculations are based on theoretical assumptions that are almost always approximate for real data. Very small p-values rely on the ‘tails’ of a distribution, and as any proficiency test result will usually demonstrate, the tails of the data rarely follow a normal distribution closely.

– p-Values can be misused very easily when seeking findings for publication. If we study one large group of data, carrying out one hypothesis test, we can stay with our familiar p < 0.05 criterion. However, if that is not met, it is easy and tempting to break the data into subsets and test all of those individually, claiming every instance of p < 0.05 as a new discovery (this, with related abuses, even has a name – “p hacking”). But repeating the test on unplanned subsets greatly multiplies the chances of individual false positives. Statisticians have long had methods of correcting for this – usually by modifying the p-value criterion to retain a low false positive rate for the experiment as a whole. Unfortunately this so-called ‘multiple testing’ correction does not feature in many science or statistics courses for analytical chemists.
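One simple multiple-testing correction of the kind mentioned above is the Bonferroni adjustment: with m tests, compare each p-value with 0.05/m rather than 0.05, which keeps the false positive rate for the experiment as a whole near 5%. A minimal sketch; the p-values are made up for illustration.

```python
# Bonferroni correction: divide the significance criterion by the number of
# tests to control the family-wise false positive rate. Illustrative values.
m = 5                                      # number of subset tests performed
p_values = [0.004, 0.031, 0.048, 0.20, 0.62]
threshold = 0.05 / m                       # adjusted per-test criterion (0.01)
significant = [p for p in p_values if p < threshold]
print(f"Per-test criterion: {threshold:.3f}; significant: {significant}")
```

Note that two results which would have passed the naive p < 0.05 criterion (0.031 and 0.048) no longer count as discoveries once the correction is applied.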

– As an indication that follow-up action is not needed. Most criticisms of p-values focus on over-interpretation of significance, but most method validations, for example, rely on insignificance; we are reassured by negative findings, not seeking positive findings. When an experiment is properly carried out, it remains safe to take an insignificant p-value as ‘no cause for concern’.

– As a signal for follow-up investigation. Again, in method validation, a significant p-value is usually followed up, both to confirm that there is a problem and because it needs to be rectified. This provides very substantial protection against inadvertent over-interpretation.

– Use in conjunction with other indicators of effect size: an insignificant p-value with a small measured effect and a small confidence interval should be reassuring evidence that an effect can genuinely be neglected.

– As a protection against visual over-interpretation. Visual inspection of data often shows apparent outliers, trends, curves or regularities because human visual systems are adept at finding anomalies. Our eyes effectively consider many possible patterns at once, and we are likely to test only for the pattern we see: this is a hidden ‘multiple testing’ problem, so a significant finding need not signal a real effect. However, with that bias towards significant findings, an insignificant hypothesis test can be a good reason to disregard the visual anomaly as chance.
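Reporting the measured effect with a confidence interval, as recommended above, is straightforward. A sketch assuming numpy and scipy, with made-up readings: the 95% interval for the bias is the mean difference plus or minus the two-sided t critical value times the standard error.

```python
# Measured bias with a 95% confidence interval, to report alongside the
# p-value. Readings and reference value are illustrative numbers.
import numpy as np
from scipy import stats

readings = np.array([10.02, 9.97, 10.01, 9.99, 10.03])
reference = 10.00
n = len(readings)
bias = readings.mean() - reference                    # measured effect size
se = readings.std(ddof=1) / np.sqrt(n)                # standard error of mean
half_width = stats.t.ppf(0.975, df=n - 1) * se        # 95% CI half-width
print(f"bias = {bias:.3f} +/- {half_width:.3f} (95% CI)")
```

Here the interval comfortably includes zero and is narrow, so a small, insignificant bias can genuinely be neglected.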

Therefore, for most normal validation and QC applications, the use of p-values remains justified when properly applied. The important caveat is that, if an experiment is to be used as evidence that no further work is needed, the experiment must be sufficiently powerful to find an effect that would be important (AMC Technical Brief No. 38 (ref. 5) explains test power more fully). We cannot claim that, say, three replicates with a 15% relative standard deviation are sufficient to show that there is no bias over 5%; such an experiment is simply not sufficient for that purpose. We still need 10–15 replicates to be reasonably sure of detecting a bias as large as our available standard deviation.
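The 10–15 replicate rule of thumb can be checked with a standard power calculation. A sketch assuming the statsmodels library: for a one-sample t test, a bias equal to one standard deviation corresponds to an effect size of 1, and we solve for the sample size at conventional power targets.

```python
# Sample size needed for a one-sample t test to detect a bias equal to one
# standard deviation (effect size d = 1) at alpha = 0.05, two-sided.
from statsmodels.stats.power import TTestPower

solver = TTestPower()
n80 = solver.solve_power(effect_size=1.0, alpha=0.05, power=0.8)
n90 = solver.solve_power(effect_size=1.0, alpha=0.05, power=0.9)
print(f"power 0.8: about {n80:.1f} replicates; power 0.9: about {n90:.1f}")
```

The answers (roughly 10 and 13 replicates for 80% and 90% power respectively) agree with the range quoted above.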

Another useful alternative is graphical inspection. Computers provide many different graphical displays of our datasets instantly, and these may tell us (with a bit of experience) all we need to know without further ado. Significance tests are then most useful when visual examination is inconclusive.

• An insignificant p-value is not evidence of absence of an important effect or acceptability unless the experiment is properly designed and sufficiently powerful.

• A significant p-value should be followed up and confirmed, particularly if it implies a need for expensive or safety-critical action.

• A p-value is not a probability that the null hypothesis is correct. It is the probability that the observed result would have arisen if the null hypothesis is correct.

• Statistical significance is not the only criterion on which action should be based; look at the measured effect size and the confidence interval as well.

• 0.05 may not always be a reasonable boundary for statistical significance, for instance, in forensic science.

• A p-value of 0.05 (or some other value) should not be regarded as a sharp boundary, as it is arbitrarily chosen. A value of (say) 0.07 still suggests a possible effect, hinting that a larger experiment might be valuable. A p-value of (say) 0.03 indicates statistical significance, but there remains an appreciable probability that the null hypothesis is true.

SLR Ellison (LGC Limited)

M Thompson (Birkbeck College, London)

This Technical Brief was prepared for the Analytical Methods Committee, with contributions from members of the AMC Statistics Expert Working Group, and approved on 23 August 2019.

- R. L. Wasserstein and N. A. Lazar, The ASA’s statement on p-values: context, process and purpose, Am. Stat., 2016, 70, 129–133, DOI:10.1080/00031305.2016.1154108.
- J. P. A. Ioannidis, Why Most Published Research Findings Are False, PLoS Med., 2005, 2(8), e124, DOI:10.1371/journal.pmed.0020124.
- M. L. Head, L. Holman, R. Lanfear, A. T. Kahn and M. D. Jennions, The extent and consequences of p-hacking in science, PLoS Biol., 2015, 13(3), e1002106, DOI:10.1371/journal.pbio.1002106.
- D. Trafimow and M. Marks, Basic Appl. Soc. Psychol., 2015, 37, 1–2.
- AMC Technical Briefs Webpage, https://www.rsc.org/Membership/Networking/InterestGroups/Analytical/AMC/TechnicalBriefs.asp.
- AMC Technical Brief No. 52, Bayesian statistics in action, Anal. Methods, 2012, 4, 2213–2214, DOI:10.1039/c2ay90023h.

This journal is © The Royal Society of Chemistry 2020