Representative subset selection and outlier detection via isolation forest
Abstract
To build a robust and predictive model, outliers should be eliminated and representative samples should be selected. In this study, Isolation forest Outlier detection and Subset selection (IOS) is proposed, which detects outliers and selects representative subsets simultaneously. IOS differs from classical subset selection methods, which are cluster-based or rely on uniform design. A comparative study of the IOS, Kennard–Stone (KS), sample set partitioning based on joint x–y distances (SPXY) and random sampling (RS) methods was conducted. The algorithms were benchmarked on four datasets: two NIR datasets that are free of outliers (soil and diesel fuel) and two datasets that contain outliers (a milk NIR dataset and a solubility QSAR dataset, LogS). The results show that IOS can detect outliers and select representative subsets of samples simultaneously, which reduces prediction errors significantly compared with the KS, SPXY and RS methods. Moreover, IOS eliminates outliers and selects representative samples without requiring y values. Hence, the proposed method may be an advantageous alternative to the other three strategies. IOS is implemented in MATLAB and is available at https://github.com/zmzhang/IOS.
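For context, the KS baseline named above can be sketched in a few lines. This is a minimal illustration of the standard Kennard–Stone rule (start from the two most distant samples, then repeatedly add the sample whose nearest already-selected neighbour is farthest away), not the IOS method itself and not the authors' MATLAB code. The function name `kennard_stone` and the toy data are assumptions made for illustration only.

```python
import math

def kennard_stone(X, k):
    """Return indices of k samples chosen by the Kennard-Stone rule (hypothetical helper)."""
    n = len(X)
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and the number of samples")
    # Pairwise Euclidean distances in x-space only (no y values are used).
    d = [[math.dist(a, b) for b in X] for a in X]
    # Start with the two samples that are farthest apart.
    i0, j0 = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                 key=lambda p: d[p[0]][p[1]])
    selected = [i0, j0]
    remaining = [i for i in range(n) if i not in (i0, j0)]
    while len(selected) < k:
        # Add the candidate whose nearest selected sample is farthest away.
        nxt = max(remaining, key=lambda i: min(d[i][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected

# Toy data (made up for illustration): four clustered points plus one distant point.
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (10.0, 10.0)]
picked = kennard_stone(X, 3)
print(picked)
```

On this toy set the distant point (index 4) is among the first samples selected, because a purely distance-based rule favours extreme samples. This suggests why removing outliers before, or together with, subset selection can matter for such methods.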