Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs)


Marianne Defernez and E. Katherine Kemsley


Abstract

Complex data analysis methods, including natural computation techniques such as artificial neural networks (ANNs), are becoming increasingly accessible to analytical chemists. Unfortunately, in many of these methods, inappropriate choices of model parameters can lead to overfitting. This study concerns overfitting issues in the use of ANNs to classify complex, high-dimensional data (where the number of variables far exceeds the number of specimens). We examine whether a parameter ρ, equal to the ratio of the number of observations in the training set to the number of connections in the network, can be used to predict overfitting. Networks possessing different ρ values were trained using as inputs either raw data or scores obtained from principal component analysis (PCA). A primary finding was that different data sets behaved very differently. For data sets with either abundant or scant information related to the proposed group structure, overfitting was little influenced by ρ, whereas for intermediate cases some dependence was found, although it was not possible to specify values of ρ that prevented overfitting altogether. The use of a tuning set to control termination of training and guard against overtraining did not necessarily prevent overfitting. However, for data containing scant group-related information, a tuning set reduced the likelihood and magnitude of overfitting, without eliminating it entirely. For other data sets, little difference in the nature of overfitting arose from the two modes of termination. Small data sets (in terms of number of specimens) were more likely to produce overfit ANNs, as were input layers comprising large numbers of PC scores. Hence, for high-dimensional data, we recommend using a limited number of PC scores as inputs, a tuning set to prevent overtraining, and a test set to detect and guard against overfitting.
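
To make the quantities above concrete, the following is a minimal sketch (in Python with scikit-learn; not the authors' original implementation) of the recommended workflow: it computes ρ for a small fully connected network and trains on a limited number of PC scores, with a tuning set terminating training and a held-out test set exposing overfitting. The data set sizes, layer sizes, and patience-based stopping rule are illustrative assumptions, not values from the paper.

```python
# A minimal sketch, assuming scikit-learn; NOT the authors' original code.
# Synthetic data, layer sizes and the patience rule are all illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-in for high-dimensional data: 120 specimens, 700 variables
# (variables >> specimens), with random (uninformative) group labels.
X = rng.normal(size=(120, 700))
y = rng.integers(0, 2, size=120)

# Partition into training, tuning and test sets.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_tune, X_test, y_tune, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Compress to a limited number of PC scores, fitted on the training set only.
n_scores = 5
pca = PCA(n_components=n_scores).fit(X_train)
T_train = pca.transform(X_train)
T_tune = pca.transform(X_tune)
T_test = pca.transform(X_test)

# rho = training observations / network connections. Here "connections"
# is taken to include bias weights; the paper's exact convention may differ.
n_hidden, n_out = 4, 2
n_connections = (n_scores + 1) * n_hidden + (n_hidden + 1) * n_out
rho = len(T_train) / n_connections
print(f"rho = {rho:.2f}")

# Train with the tuning set controlling termination (early stopping):
# stop once tuning-set accuracy has not improved for `patience` epochs.
net = MLPClassifier(hidden_layer_sizes=(n_hidden,), random_state=0)
best_tune, stall, patience = 0.0, 0, 20
for epoch in range(500):
    net.partial_fit(T_train, y_train, classes=np.unique(y))
    acc = net.score(T_tune, y_tune)
    if acc > best_tune:
        best_tune, stall = acc, 0
    else:
        stall += 1
        if stall >= patience:
            break

# The test set, untouched during training, reveals overfitting as a gap
# between training and test performance.
print(f"train accuracy: {net.score(T_train, y_train):.2f}")
print(f"test  accuracy: {net.score(T_test, y_test):.2f}")
```

Because the labels here are random, they carry no genuine group information, so any gap between training and test accuracy is pure overfitting, mirroring the "scant information" case described above. Note that the tuning set limits overtraining but, as the abstract reports, does not by itself guarantee that such a gap disappears.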

