Michael Thompson*
School of Biological and Chemical Sciences, Birkbeck University of London, Malet Street, London WC1E 7HX. E-mail: M.Thompson@bbk.ac.uk
A sound metrological infrastructure for chemical measurement is a requirement for universal comparability of the results used in international trade. Over the last few decades such a system has been put in place and a corresponding improvement in quality has been evident. But an undue emphasis on traceability and a particular approach to uncertainty have now become counterproductive. Rather than leading to further improvements in the quality of analytical results, these widely-held precepts have distracted analytical chemists from the real problems and progress has stalled. The counter-arguments proposed here are that: (a) broken traceability to the SI is seldom a problem in chemical measurement, both because the metrological infrastructure is very effective and because the applications of analysis seldom demand a relative uncertainty smaller than 1%; (b) the most serious sources of error in analysis usually arise from shortcomings in the chemical preparation of the test solution and matrix mismatch between test solution and calibrators, so cannot meaningfully be attributed to a broken traceability to the SI; (c) the cause-and-effect approach to uncertainty has shortcomings and leads to an ineffectual validation of measurement procedures; (d) properly validated analytical procedures are by definition already fit for purpose, and problems arise only when the analyst deviates from the procedure or when the test materials fall outside its validated scope.
Let me first convince the reader that there are still problems. A few examples suffice, brought to light by proficiency testing or blind quality control.
• GeoPT is an international proficiency testing scheme for rock analysis in which about 70 participants report results for up to 70 elements in one test material per round.3 Criteria for appropriate uncertainty used in calculating z-scores are published before the test. In round 32 (2012), about 7% of the absolute z-scores exceeded 5.0. This was by no means unusual. The proportion was higher in round 33 (2013). (An absolute z-score of 5.0 implies that the deviation of the participant's result from the assigned value was five times greater than the uncertainty defined by the scheme as fit for purpose. The proportion of the absolute z-scores exceeding 5.0 expected in a population of laboratories compliant with the prescribed criterion for uncertainty is about 0.00005%; a brief illustrative calculation follows this list.)
• FAPAS is a large international proficiency testing scheme for foodstuff analysis.4 Criteria for appropriate uncertainty used in calculating z-scores are published before the test. Data were taken from the latest 15 rounds posted on 21st May 2014. These rounds provided tests for 38 combinations of diverse food matrices and analytes. A proportion of 5.1% (78/1529) of the absolute z-scores exceeded 5.0.
• In a recent study from an undisclosed source, a series of blind duplicate samples was sent to a laboratory holding an accreditation that included in its specification a maximum level of uncertainty umax for the analysis of the named analytes and test material. The repeatability standard deviation sr estimated from the duplicate results was about 2umax. (It is impossible for the true value σr to exceed the true standard uncertainty u. Usually in analytical procedures we find σr ≈ u/2.)
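The tail probability quoted in the first example can be reproduced with a short calculation. The sketch below (Python; the function names and the symbol sigma_p are mine, not the schemes') computes a z-score and the probability that a laboratory whose results scatter exactly according to the prescribed fitness-for-purpose standard deviation nevertheless returns an absolute z-score greater than 5.0.

```python
from math import erf, sqrt

def z_score(result, assigned_value, sigma_p):
    """Proficiency-test z-score: deviation of a result from the assigned
    value, scaled by the scheme's fitness-for-purpose standard deviation."""
    return (result - assigned_value) / sigma_p

def two_sided_tail(z):
    """P(|Z| > z) for a standard normal variable Z."""
    phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))   # standard normal CDF at z
    return 2.0 * (1.0 - phi)

print(two_sided_tail(5.0))          # ~5.7e-7, i.e. of the order of 0.00005 %
print(z_score(10.5, 10.0, 0.1))     # illustrative numbers only: z = 5.0
```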
A noteworthy feature of the first two examples is that in proficiency tests the participants know that their results will be under third-party scrutiny: they are presumably striving to conform to the schemes' prescribed criteria of uncertainty and so obtain ‘satisfactory’ z-scores. Further important points emerge. (a) In most instances a properly designed internal quality control system would have flagged up any unacceptable performance before the results were released by the laboratory to the proficiency test scheme or customer. We can infer that the laboratories reporting these very unsatisfactory results were not just conducting the analysis ineptly but also employing an inadequate IQC system. (b) In food analysis, many of the procedures would have been validated by published collaborative (inter-laboratory) study, widely regarded as the most thorough type of validation. We see that using a carefully validated (and in most instances accredited) procedure does not in itself make a laboratory immune from problems. (c) In the third example the laboratory involved was apparently unable to make an adequate estimate of the uncertainty of its measurement results.
Seemingly the huge international efforts put into improving the reliability of chemical measurement have not been completely effective. In my view, this shortcoming persists because certain precepts, as interpreted today, are emphasised beyond their utility and have become counterproductive. These questionable assumptions can be summarised as follows.
• Chemical measurement is unreliable because some existing procedures are not fit for purpose.
• This unfitness for purpose is the outcome of a breakdown of traceability of the result to SI units.
• The problem could be rectified only if measurement uncertainty could be determined by a procedure appropriate for the determination of the physical constants of nature.
These assumptions have diverted analysts' attention away from the real causes of the remaining problems. In contrast I contend: (a) that most established analytical procedures are fit for purpose—validated procedures are fit for purpose by definition—but problems in the results of chemical measurement are caused by deviation from the validated procedure; (b) that there is no breakdown in traceability to the SI in the overwhelming majority of analytical procedures; and (c) that while the cause-and-effect method is useful for illustrating the uncertainty concept, it is not appropriate for routine analysis in chemical measurement—it is not uniquely valid and, in the majority of instances, gives rise to an underestimate.
Despite these difficulties surrounding fitness for purpose, evolutionary tendencies come to our aid without any high-level intervention or financial modelling. In an established application sector, such as food analysis, the suite of procedures in use evolves over time by a process of adaptation. Procedures that are unnecessarily accurate (and therefore needlessly expensive) tend to be set aside in favour of cheaper and less accurate ones. Procedures that give rise to too great a proportion of inept decisions tend to be replaced by more accurate (and more expensive) ones. The eventual outcome is a suite of procedures giving results with appropriate (fit-for-purpose) uncertainties over the whole range of relevant concentrations. This process is thought to be the origin of the empirically observed Horwitz function that describes the trend of reproducibility standard deviation as a function of mass fraction in the analysis of foodstuffs.6,7 Because of this tendency to adaptation, it is a cogent supposition that most procedures used in established fields are already fit for purpose. It is therefore only departure from such a procedure that causes serious analytical errors.
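For orientation, the Horwitz function is commonly quoted in the form RSD_R(%) = 2c^(−0.1505), equivalently σ_R = 0.02c^0.8495, where c is the mass fraction of the analyte expressed as a dimensionless fraction. The short sketch below (Python; the example concentration is arbitrary) evaluates both forms; it is offered only as an illustration of the trend described by refs 6 and 7, not as a statement of their data.

```python
def horwitz_rsd_percent(c):
    """Relative reproducibility standard deviation (%) predicted by the
    commonly quoted Horwitz function, RSD_R(%) = 2 * c**(-0.1505)."""
    return 2.0 * c ** -0.1505

def horwitz_sigma_R(c):
    """Equivalent absolute form, sigma_R = 0.02 * c**0.8495 (as a mass fraction)."""
    return 0.02 * c ** 0.8495

# Example: an analyte present at 1 mg/kg, i.e. a mass fraction of 1e-6.
c = 1e-6
print(horwitz_rsd_percent(c))   # about 16 % relative
print(horwitz_sigma_R(c))       # about 1.6e-7, i.e. ~0.16 mg/kg
```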
Many, perhaps most, of these departures can be attributed to human error.8,9 Assuming the existence of a well-found laboratory with traceable standards, analytical results are likely to be inaccurate when either of the following occurs: (a) deviation from a well validated procedure; or (b) the use of a procedure outside the scope of its validation. It is clear from this that validation plays a pivotal role in obtaining accurate results, so it is essential that the critical aspects of validation are well understood and executed. However, two aspects of validation—the definition of the applicable scope of a procedure and the estimation of the uncertainty associated with its results—commonly fall short of requirements.
A schematic diagram of a typical chemical measurement procedure (Fig. 1) helps to pinpoint the major sources of uncertainty. The actions requiring traceability to the SI are clearly marked, but other, almost always greater, causes of error cannot be referred back to the SI in any meaningful way. Setting aside uncertainty from sampling as a separate issue, the major problems encountered in analysis concern (a) the efficacy of the chemical preparation of the test solution from the test portion, (b) mistakes in the preparation of the calibrators, and (c) the comparison between the test solution and the calibrators via the selected measurement principle. The test solution can deviate from the correct concentration by virtue of loss or gain of analyte: loss through incomplete chemical treatment of the test portion, or gain by contamination from the laboratory environment. Then the comparison between the treated test solution and calibrators can suffer from loss or gain of the net analytical signal brought about by matrix mismatch. These almost ubiquitous features of chemical analysis can seldom be predicted from theory and, in any event, cannot be meaningfully referred to the SI. On those grounds we can reject the idea that shortcomings in chemical measurement are related to incomplete traceability to the SI.
Fig. 1 Schematic diagram of a chemical measurement, showing actions requiring traceability to the SI, and features that act as major sources of uncertainty (colour).
There are two types of problem inherent in the splitter approach (the ‘bottom-up’ or cause-and-effect method, in which the combined uncertainty is assembled from estimates for the individual sub-procedures) that potentially affect all measurements but beset chemical measurements in particular, namely (a) parametric, and (b) structural, shortcomings in the statistical model of the procedure. A parametric problem occurs when the variance attributed to a sub-procedure is incorrect. This could be rectified in principle by substituting the correct value into the model. But how would the analyst detect such a problem? It could be done only by comparison of the outcome with an independent estimate of uncertainty, a sort of reference value, to show that the splitter estimate was incorrect. No such reference value exists. The more serious problem occurs when the structure of the model itself is incomplete. Such an occurrence is commonplace to some degree in analytical procedures: there are usually numerous factors, both inherent and extraneous, whose levels could vary and thus conceivably affect the net analytical signal; there is also a potentially much larger number of their possible interactions. There is no way of rectifying this problem because few of these factors or interactions can be modelled adequately and most of them are neither detected nor even suspected. This problem gives rise to ‘dark uncertainty’.12,13 On this basis alone there are grounds for suspecting that the ‘splitter’ method will tend to underestimate the uncertainty associated with chemical measurement. As we shall see, there is much evidence supporting this contention.
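To make the structural problem concrete, the sketch below (Python; every numerical value and component name is hypothetical) combines in quadrature the standard uncertainties of the sub-procedures that happen to appear in a cause-and-effect model, then shows what happens when a single unmodelled factor, say a matrix effect, also contributes: the omitted term is simply invisible to the bottom-up estimate.

```python
from math import sqrt

# Hypothetical relative standard uncertainties (%) for the sub-procedures
# that appear in a bottom-up ('splitter') cause-and-effect model.
modelled_components = {
    "balance calibration": 0.05,
    "volumetric operations": 0.10,
    "calibrator purity": 0.20,
    "instrument repeatability": 0.40,
}

# A factor missing from the model, e.g. a matrix effect on the net signal.
unmodelled_component = 0.80   # hypothetical

def combine(components):
    """Combine independent standard uncertainties in quadrature."""
    return sqrt(sum(u ** 2 for u in components))

u_splitter = combine(modelled_components.values())
u_actual = combine(list(modelled_components.values()) + [unmodelled_component])

print(f"bottom-up estimate: {u_splitter:.2f} %")   # ~0.46 %
print(f"actual dispersion:  {u_actual:.2f} %")     # ~0.92 %: the gap is 'dark uncertainty'
```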
The extreme alternative method is the ‘top-down’ or ‘lumper’ approach, in which the standard uncertainty is equated with the reproducibility standard deviation sR of results of a procedure based on an inter-laboratory study such as a collaborative trial. The essential idea here is that the variable factors and their interactions that affect an analytical result will be close-to-randomly sampled in a sufficiently large study. Many practitioners object to this idea on the grounds that the complete scope for variation among the factors will be insufficiently sampled in an interlaboratory study. On that basis alone ‘splitters’ consider that sR will tend to underestimate standard uncertainty. They also object to sR on the specious grounds that the estimate does not take procedure bias into account, and (equivalently) that we can say nothing about its traceability. As we have seen above, there is no substantive issue with traceability to the SI in most chemical measurement, because we can safely assume that all of the participants in a collaborative trial will have well-found laboratories with traceably calibrated equipment. Moreover, any candidate procedure will have been subjected to a very careful consideration in respect of bias before the very costly interlaboratory study is undertaken.
We thus see two schools of thought each claiming (possibly correctly) that the rival method of estimating uncertainty will tend to produce an underestimate. The inference can only be that the method tending to give the greater estimate will be closer to correct, but may still be somewhat too small. In several instances where such comparisons have been made, we see that sR tends to exceed the splitter estimate by a factor typically approaching two.12
Additional evidence can be gleaned from interlaboratory studies in which participants report their results with associated uncertainty estimates. In such studies, the reported uncertainties usually fail to account for the variation among the results.12 Furthermore, it is an almost invariable observation that in collaborative trials of a procedure, sR exceeds the repeatability standard deviation sr by a typical factor of two.14 Were there no dark uncertainty, the two standard deviations would tend to be equal, that is, the repeatability standard deviation would account for all of the variation between laboratories. A large number of studies have thus demonstrated a tendency for analysts to underestimate their uncertainties. This seems to be an outcome of, inter alia, an incomplete validation.
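The arithmetic behind this observation follows from the standard decomposition of reproducibility into within- and between-laboratory components, sR² = sr² + sL². A brief sketch (Python; the figures are purely illustrative) shows that when sR is twice sr, the between-laboratory component, which no purely within-laboratory assessment can capture, accounts for three-quarters of the reproducibility variance.

```python
from math import sqrt

def between_lab_sd(s_R, s_r):
    """Between-laboratory standard deviation from the standard
    decomposition of reproducibility: s_R**2 = s_r**2 + s_L**2."""
    return sqrt(s_R ** 2 - s_r ** 2)

# Illustrative values: reproducibility twice the repeatability.
s_r = 1.0
s_R = 2.0 * s_r
s_L = between_lab_sd(s_R, s_r)

print(s_L)                   # ~1.73: the between-laboratory term dominates
print(s_L ** 2 / s_R ** 2)   # 0.75: three-quarters of the reproducibility variance
```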
(Note: in practice among analytical chemists there is a continuum of approaches to estimating uncertainty that fall between the extremes of ‘splitting’ and ‘lumping’.)
Finally there are psychological factors involved as well as physical: obtaining a small uncertainty tends to be seen as a measure of an analyst's skill, a laboratory's competence, or a method developer's ingenuity, a situation that tends to generate an unconscious data selection bias. If sr looks disappointingly large, it is tempting simply to repeat the experiment and, should a smaller sr emerge, to use that value.
Where a procedure has been validated, that is, with a good estimate of the uncertainty, any problems in the accuracy of the results mostly seem to stem from human error,8 a deviation by the analyst from the validated procedure. This deviation can take two forms, a departure from the documented details of the procedure, or an application of the procedure to test materials outside the scope of the validation. Moreover, either of these problems can affect a whole run of analysis or be sporadic—restricted to just a few of the test materials that comprise the run. (A ‘run’ is the period during which repeatability conditions are regarded as prevailing.) Another reported major cause of problems is undetected instrument failure. When any of these circumstances occurs, the true uncertainty is rendered completely unknown (not simply underestimated), sometimes to an extent that has practical consequences.
However, IQC (as currently understood16) cannot guard against problems arising from the use of a procedure outside the scope of its validation. Analysts must be at all times aware of the possibility of encountering test materials with compositions outside the validated scope. In validating a procedure, the defined range of matrix types among the test materials must be reasonably limited. Operational categories such as ‘food’ or ‘soil’ may in some cases be far too inclusive. Soil for example could consist largely of silica, or clay minerals, or chalk, or peat, or, in tropical countries, laterite (largely an Fe2O3–Al2O3 mixture), or any of their various mixtures. Such diverse matrices could have a marked influence on the efficacy of a chemical decomposition and, just as importantly, on matrix effects encountered during the subsequent comparison of the test solution and calibrators. The scope must also include a specification of the concentration range covered by the validation process. This is because uncertainty tends to vary widely with the concentration of the analyte.
The surrogate test material used in IQC, however, with its invariant matrix and fixed analyte concentration, would show no untoward effect in a run that included out-of-scope test materials. Sporadic (transient) problems affecting a small part of a run are also unlikely to be detected by internal quality control. Several methods are available for detecting sporadic blunders and these measures should where possible be deployed as routine alongside IQC. Methods for detecting an out-of-scope matrix are not as yet well documented.17
The reproducibility standard deviation of an unbiased procedure, estimated by interlaboratory study, is usually a reasonable estimate of uncertainty. However, collaborative trials are very costly to carry out and, outside the food sector, seldom attempted. Sometimes, when a subset of the participants use very similar procedures, a value of σR can be estimated from the results of proficiency tests. An alternative value, which might at least serve as a benchmark for comparison, could be taken as double the repeatability standard deviation, which can be estimated in a single-laboratory validation. An independent estimate that was substantially less than such a benchmark should be regarded as prima facie suspect. For this purpose, however, the repeatability value must be estimated under conditions of measurement that are all but impossible to simulate in a one-off validation exercise—the estimation must rest on real-life conditions, that is, when the procedure is ‘bedded down’ in routine use on routine test materials.16 The estimation protocol outlined in Appendix A is one such.
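A sketch of such a benchmark comparison is shown below (Python; the helper names and all figures are hypothetical, and the choice of what counts as "substantially less" is a judgment left to the reader).

```python
def benchmark_uncertainty(s_r):
    """Benchmark standard uncertainty taken as double the repeatability
    standard deviation, as suggested in the text."""
    return 2.0 * s_r

# Hypothetical figures: a claimed standard uncertainty compared against the
# benchmark derived from repeatability estimated under routine conditions.
s_r_routine = 0.31   # e.g. from the duplicate protocol in the Appendix
u_claimed = 0.25     # independently claimed standard uncertainty

print(benchmark_uncertainty(s_r_routine))               # 0.62
print(u_claimed < benchmark_uncertainty(s_r_routine))   # True: prima facie suspect
```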
I must acknowledge that there exists a so-far unresolved problem in using realistic estimates of uncertainty in a commercial context. A laboratory offering an analytical service with a realistic uncertainty is likely to lose custom to a competitor claiming an unrealistically lower uncertainty at the same price. This commercial pressure encourages the use of approaches to estimation that bias the apparent uncertainty downwards. This tendency might at first seem impossible to forestall in an open market, but there are countermeasures that could be taken at all levels of the sociology of chemical measurement.
• Analytical laboratory managers could educate their customers in the commercial benefits of having fit-for-purpose results with realistically estimated uncertainties.
• Quality managers in laboratories could ensure that an adequate IQC programme is in place, and that some attempt is made to detect sporadic errors and to identify out-of-scope test materials.
• Customers from the outset could make clear to laboratories that they intend to apply blind quality control to ensure that laboratories fulfil contractual uncertainty specifications.18
• Funding agencies could ring-fence adequate support for studies relating to uncertainty and quality in chemical measurement.
• Accreditation agencies could pay special attention to the way in which uncertainties are estimated by a candidate laboratory, and ensure that any uncertainties advertised or claimed are consistent with the laboratory's proficiency test scores and with the records of an adequate internal quality control system.
• Universities and colleges providing higher education in chemical measurement could make better provision for the coverage of quality issues in chemical analysis supporting industry, trade and public service.
• Professional bodies could set standards for such training in analytical quality and uncertainty that have to be fulfilled for the educational institution to receive endorsement of its courses, or for an analytical chemist to receive chartered status. They could further disseminate impartial information about uncertainty and quality to the wider community.
• Adhere always to the validated procedure and its defined scope, under routine conditions of measurement.
• Within each run of analysis, make the measurement on duplicate test portions of all (or a random selection) of the routine test samples. Place the duplicate test portions at individually randomised positions within the sequence of test portions in the run.
• Repeat the above in a number of separate runs.
• Estimate the repeatability standard deviation, where appropriate as a function of concentration, by considering the median of the absolute differences between corresponding pairs of results.18
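A minimal sketch of the final step, assuming approximately normally distributed differences: for duplicates the difference d = x1 − x2 has standard deviation √2·σr, and for a normal variable the median of |d| is about 0.6745 of its standard deviation, so σr ≈ median(|d|)/0.9539. The code below (Python; the duplicate data are invented for illustration) implements this estimate for a single concentration range.

```python
from statistics import median

def repeatability_sd_from_duplicates(pairs):
    """Robust estimate of the repeatability standard deviation from duplicate
    results obtained in routine runs.  For normally distributed differences
    d = x1 - x2, sd(d) = sqrt(2)*sigma_r and median(|d|) = 0.6745*sd(d),
    so sigma_r = median(|d|) / 0.9539.  Where uncertainty varies with
    concentration, apply this separately to narrow concentration bands."""
    abs_diffs = [abs(x1 - x2) for x1, x2 in pairs]
    return median(abs_diffs) / 0.9539

# Hypothetical duplicate results (same units) gathered over several runs.
duplicates = [(10.2, 10.5), (9.8, 9.6), (10.1, 10.4),
              (9.9, 10.2), (10.0, 9.7), (10.3, 10.1)]

print(repeatability_sd_from_duplicates(duplicates))   # ~0.31
```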