Avoiding overfitting in the analysis of high-dimensional data with artificial neural networks (ANNs)

Marianne Defernez; E. Katherine Kemsley

doi:10.1039/A905556H

Abstract

Complex data analysis is becoming more easily accessible to analytical chemists, including natural computation methods such as artificial neural networks (ANNs). Unfortunately, in many of these methods, inappropriate choices of model parameters can lead to overfitting. This study concerns overfitting issues in the use of ANNs to classify complex, high-dimensional data (where the number of variables far exceeds the number of specimens). We examine whether a parameter ρ, equal to the ratio of the number of observations in the training set to the number of connections in the network, can be used as an indicator to forecast overfitting. Networks possessing different ρ values were trained using as inputs either raw data or scores obtained from principal component analysis (PCA). A primary finding was that different data sets behave very differently. For data sets with either abundant or scant information related to the proposed group structure, overfitting was little influenced by ρ, whereas for intermediate cases some dependence was found, although it was not possible to specify values of ρ which prevented overfitting altogether. The use of a tuning set, to control termination of training and guard against overtraining, did not necessarily prevent overfitting from taking place. However, for data containing scant group-related information, the use of a tuning set reduced the likelihood and magnitude of overfitting, although not eliminating it entirely. For other data sets, little difference in the nature of overfitting arose from the two modes of termination. Small data sets (in terms of number of specimens) were more likely to produce overfit ANNs, as were input layers comprising large numbers of PC scores. Hence, for high-dimensional data, the use of a limited number of PC scores as inputs, a tuning set to prevent overtraining and a test set to detect and guard against overfitting are recommended.
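The parameter ρ defined in the abstract can be illustrated with a short calculation. The sketch below is not taken from the article: it assumes a fully connected network with a single hidden layer, and it assumes that "connections" means the weights between adjacent layers (bias terms are optional via the hypothetical `include_bias` flag). The function names and the specimen, variable and node counts are invented for illustration only.

```python
def count_connections(n_inputs, n_hidden, n_outputs, include_bias=False):
    """Count weights in a fully connected, single-hidden-layer network.

    Assumption: one connection per input-to-hidden and hidden-to-output
    weight. Bias weights are added only if include_bias is True.
    """
    connections = n_inputs * n_hidden + n_hidden * n_outputs
    if include_bias:
        connections += n_hidden + n_outputs
    return connections


def rho(n_train, n_inputs, n_hidden, n_outputs, include_bias=False):
    """Ratio of training-set observations to network connections."""
    return n_train / count_connections(n_inputs, n_hidden, n_outputs, include_bias)


# Hypothetical example: 60 training specimens, 3 hidden nodes, 2 output groups.
# Raw high-dimensional data (500 variables) versus 5 PC scores as inputs.
rho_raw = rho(n_train=60, n_inputs=500, n_hidden=3, n_outputs=2)
rho_pcs = rho(n_train=60, n_inputs=5, n_hidden=3, n_outputs=2)

print(f"rho with raw inputs: {rho_raw:.4f}")
print(f"rho with PC scores:  {rho_pcs:.4f}")
```

Under these assumptions, replacing raw variables with a few PC scores raises ρ by roughly two orders of magnitude for the same training set. As the abstract notes, however, no value of ρ was found that prevented overfitting altogether, so ρ is at best a rough indicator and does not replace a tuning set and a test set.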

Article information

Analyst, 1999, 124, 1675-1681

M. Defernez and E. K. Kemsley, Analyst, 1999, 124, 1675 DOI: 10.1039/A905556H