Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs)
Complex data analysis, including natural computation methods such as artificial neural networks (ANNs), is becoming more easily accessible to analytical chemists. Unfortunately, in many of these methods, inappropriate choices of model parameters can lead to overfitting. This study concerns overfitting in the use of ANNs to classify complex, high-dimensional data (where the number of variables far exceeds the number of specimens). We examine whether a parameter ρ, equal to the ratio of the number of observations in the training set to the number of connections in the network, can be used as an indicator to forecast overfitting. Networks with different ρ values were trained using as inputs either raw data or scores obtained from principal component analysis (PCA). A primary finding was that different data sets behave very differently. For data sets with either abundant or scant information related to the proposed group structure, overfitting was little influenced by ρ, whereas for intermediate cases some dependence was found, although it was not possible to specify values of ρ that prevented overfitting altogether. The use of a tuning set, to control termination of training and guard against overtraining, did not necessarily prevent overfitting. However, for data containing scant group-related information, the use of a tuning set reduced the likelihood and magnitude of overfitting, although it did not eliminate it entirely. For other data sets, the two modes of termination (with and without a tuning set) made little difference to the nature of overfitting. Small data sets (in terms of number of specimens) were more likely to produce overfit ANNs, as were input layers comprising large numbers of PC scores. Hence, for high-dimensional data, we recommend the use of a limited number of PC scores as inputs, a tuning set to prevent overtraining, and a test set to detect and guard against overfitting.
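To make the definition of ρ concrete, the following minimal sketch computes it for a fully connected feed-forward network. Several details are assumptions rather than part of the study: the function names `count_connections` and `rho` are hypothetical, the network is assumed to be fully connected between adjacent layers, bias weights are excluded by default (the abstract does not say whether they count as connections), and the specimen and layer counts are invented for illustration only.

```python
def count_connections(layer_sizes, include_bias=False):
    """Count the weights in a fully connected feed-forward network.

    layer_sizes lists the number of nodes per layer, input first.
    Whether bias weights count as connections is an assumption,
    so it is left as an explicit option.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out      # weights between adjacent layers
        if include_bias:
            total += n_out         # one bias weight per receiving node
    return total


def rho(n_train, layer_sizes, include_bias=False):
    """Ratio of training observations to network connections."""
    return n_train / count_connections(layer_sizes, include_bias)


# Illustrative, made-up numbers: 60 training specimens, 2 output classes.
n_train = 60

# Raw high-dimensional input: 1000 variables, 5 hidden nodes.
rho_raw = rho(n_train, [1000, 5, 2])

# The same network fed with only 5 PC scores.
rho_pcs = rho(n_train, [5, 5, 2])

print(f"rho with raw inputs: {rho_raw:.4f}")
print(f"rho with PC scores:  {rho_pcs:.4f}")
```

Under these assumptions, replacing raw variables with a small number of PC scores shrinks the input layer and so raises ρ by orders of magnitude. As the findings above indicate, a larger ρ should be read as a rough indicator only and not as a guarantee against overfitting.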