Benford's Law and the screening of analytical data: the case of pollutant concentrations in ambient air
Abstract
The need to ensure the robustness of very large data sets produced by analytical measurement processes is increasing. This requires data screening techniques that can identify formatting or transcription errors in large data sets that have undergone multiple data-handling and manipulation procedures. The empirical observation that the digits 1 to 9 are not equally likely to appear as the initial digit in multi-digit numbers is known as Benford's Law, and may provide a solution to this requirement. Several sets of data on the measured concentrations of pollutants in ambient air in the UK in 2004 have been analysed for their initial digit frequencies, in order to assess the potential of Benford's Law as a data screening and authenticity-checking tool for analytical data sets of this type. Benford's Law has been shown to be a robust top-level data screening tool, provided that the numerical range of the data set being considered is four orders of magnitude or greater. It has also been shown that small changes in the deviation of a data set from Benford's Law may indicate the introduction of errors during data processing. In this way, Benford's Law provides a sensitive technique for identifying data mishandling in large data sets.
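As an illustration of the screening idea described above, the following minimal Python sketch compares the initial digit frequencies of a data set with those expected under Benford's Law, P(d) = log10(1 + 1/d) for d = 1 to 9. It is a sketch under stated assumptions rather than the procedure used in this work: the function names are hypothetical, the sum of squared differences is only one possible measure of deviation, and the input values are synthetic (they are not the 2004 UK air quality measurements).

```python
import math
from collections import Counter


def leading_digit(x):
    """Return the first significant digit (1-9) of a non-zero finite number."""
    # Scientific notation always places the first significant digit first.
    return int(f"{abs(x):.15e}"[0])


def benford_expected():
    """Expected initial digit frequencies under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_frequencies(values):
    """Observed relative frequency of each initial digit; zeros are skipped."""
    digits = [leading_digit(v) for v in values if math.isfinite(v) and v != 0]
    if not digits:
        raise ValueError("no non-zero finite values to analyse")
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


def benford_deviation(values):
    """Sum of squared differences between observed and expected frequencies.

    Assumption: this metric is chosen for illustration only.
    """
    observed = first_digit_frequencies(values)
    expected = benford_expected()
    return sum((observed[d] - expected[d]) ** 2 for d in range(1, 10))


# Synthetic data (assumption): values spread evenly on a log scale
# over four orders of magnitude, from 1 up to 10^4.
n = 10_000
clean = [10 ** (4 * i / n) for i in range(n)]

# Simulated mishandling (assumption): 5% of the values are overwritten
# with a fixed placeholder, as might happen in a faulty export step.
corrupted = list(clean)
for i in range(0, n, 20):
    corrupted[i] = 9999.0

# Synthetic data spanning less than one order of magnitude (20 to 80).
narrow = [20 + 60 * i / n for i in range(n)]

print("clean:    ", benford_deviation(clean))
print("corrupted:", benford_deviation(corrupted))
print("narrow:   ", benford_deviation(narrow))
```

In this sketch the corrupted set should deviate more from Benford's Law than the clean set, and the narrow-range set should deviate most of all, which is consistent with the requirement that the data span several orders of magnitude before the test is meaningful.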