Jintae Han, Hoeil Chung*, Sung-Hwan Han and Moon-Young Yoon
Department of Chemistry, College of Natural Sciences, Hanyang University, Haengdang-Dong, Seongdong-Gu, Seoul, Korea 133-791. E-mail: hoeil@hanyang.ac.kr; Fax: 82-2-2299-0762; Tel: 82-2-2220-0937
First published on 17th October 2006
A new discrimination method called the score–moment combined linear discrimination analysis (SMC-LDA) has been developed and its performance has been evaluated using three practical spectroscopic datasets. The key concept of SMC-LDA was to use not only the score from principal component analysis (PCA), but also the moment of the spectrum, as inputs for LDA to improve discrimination. Along with conventional score, moment is used in spectroscopic fields as an effective alternative for spectral feature representation. Three different approaches were considered. Initially, the score generated from PCA was projected onto a two-dimensional feature space by maximizing Fisher's criterion function (conventional PCA-LDA). Next, the same procedure was performed using only moment. Finally, both score and moment were utilized simultaneously for LDA. To evaluate discrimination performances, three different spectroscopic datasets were employed: (1) infrared (IR) spectra of normal and malignant stomach tissue, (2) near-infrared (NIR) spectra of diesel and light gas oil (LGO) and (3) Raman spectra of Chinese and Korean ginseng. For each case, the best discrimination results were achieved when both score and moment were used for LDA (SMC-LDA). Since the spectral representation character of moment was different from that of score, inclusion of both score and moment for LDA provided more diversified and descriptive information.
Even though PCA-LDA has been widely used, improved discrimination is always needed, especially for differentiation of minute spectral features. With this in mind, a new algorithm, called score–moment combined linear discrimination analysis (SMC-LDA), was developed. This method utilizes the moment of spectrum as an alternative descriptor of spectral features along with PCA score for LDA. Moment is frequently adopted to describe the texture of an image in image processing fields, such as image retrieval systems.7–11 It effectively describes the outline and overall shape of the object using a few scalar values (moments). Also, it is known to yield a relatively stable description under situations involving a minor shift or rotation of the image. The concept of moment has been extended to vibrational spectroscopic fields in this study and utilized as an alternative and effective descriptor of spectral features.
To demonstrate the discrimination ability of the proposed method, three spectroscopic datasets were used: (1) infrared (IR) spectra of normal and malignant stomach tissue, (2) near-infrared (NIR) spectra of diesel and light gas oil (LGO) and (3) Raman spectra of Chinese and Korean ginseng. First, moments and principal component (PC) scores of spectra were used separately for LDA. Then, both moments and scores were utilized for LDA (SMC-LDA). Discrimination performance was improved for all three datasets when using SMC-LDA.
Moment is used to describe image texture in the fields of image retrieval and image processing. In this study, moment is proposed as an alternative method of spectral representation that is comparable to score from PCA. A detailed mathematical description of moment follows. Initially, raw spectra (or preprocessed spectra, e.g. baseline-corrected ones), x(λ), are Fourier transformed to obtain X(jω) by use of eqn (1):
$X(j\omega) = \int_{-\infty}^{\infty} x(\lambda)\, e^{-j\omega\lambda}\, \mathrm{d}\lambda$ | (1) |
![]() | (2) |
![]() | (3) |
The inverse Fourier transform of eqn (2) is then performed to yield eqn (4):
![]() | (4) |
Raw spectra x(λ) can now be expressed as moments (m1, m2, …, mn) that represent the spectral features as a few scalar values. The meanings of individual moments are difficult to interpret for spectroscopic data. However, in the image processing field, it is known that the first, second, third and fourth moments represent average, standard deviation, skew and slope of a shape, respectively. Higher order moments are harder to conceptualize. Nevertheless, moment can be used as an effective “spectral feature descriptor” in spectroscopic fields.
A notable feature of moment is that it is calculated in the Fourier-transformed domain, where the original spectrum is decomposed into individual sinusoidal components (angular frequency in eqn (2)). In this domain, minor differences in the sinusoidal components derived from the original spectrum can be observed. This is the main distinction between moment and the PCA score.
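As a rough illustration of how a spectrum can be condensed into a few scalar moments, the following Python sketch computes the first n raw moments of one spectrum. It assumes the conventional definition m_n = ∫ λ^n x(λ) dλ, which is what a Taylor expansion of the Fourier kernel produces; the exact form of eqn (2)–(4) is not reproduced here, so this is an assumed definition rather than the authors' implementation. The function name `spectral_moments` and the rescaling of the spectral axis to [0, 1] are illustrative choices.

```python
import numpy as np

def spectral_moments(axis, spectrum, n_moments):
    """Raw moments m_1..m_n of one spectrum (assumed definition, see text)."""
    lam = np.asarray(axis, dtype=float)
    x = np.asarray(spectrum, dtype=float)
    # Sort so that descending axes (e.g. 1800 -> 900 cm-1) are handled too.
    order = np.argsort(lam)
    lam, x = lam[order], x[order]
    # Rescale the axis to [0, 1] so that high powers stay well conditioned
    # (an illustrative choice, not taken from the paper).
    t = (lam - lam[0]) / (lam[-1] - lam[0])
    dt = np.diff(t)
    moments = []
    for n in range(1, n_moments + 1):
        f = t**n * x
        # Trapezoidal approximation of the integral of t**n * x(t).
        moments.append(np.sum(0.5 * (f[1:] + f[:-1]) * dt))
    return np.array(moments)
```

Applied row by row to a matrix of spectra, this yields one short moment vector per sample that can be passed to LDA in the same way as a vector of PC scores.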
LDA is used to maximize the discrimination between classes. When moment is used, the moments are projected onto a two-dimensional (2D) feature space by maximizing Fisher's criterion function. Fisher's criterion function J(d) is given by:
$J(d) = \dfrac{d^{T} B\, d}{d^{T} W\, d}$ | (5) |
B and W are the between-class and within-class distances, respectively ($d^{T}$: transpose of d). B and W are defined as:
![]() | (6) |
![]() | (7) |
$D = [d_1 \;\; d_2]$ | (8) |
Moment is projected onto the feature space (formed by d1 and d2) and then the projected value y on the feature space is described as:
$y_{ij} = D^{T} m_{ij}$ | (9) |
Consequently, a two-dimensional plot of the projected value y is generated via eqn (9), and this projection plot is utilized for discrimination of classes throughout this study.
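The projection step can be sketched in Python as follows. This is a minimal illustration, assuming the textbook between-class and within-class scatter matrices for B and W; the authors' exact definitions in eqn (6) and (7) are not reproduced here and may differ. Note that with only two classes the textbook B has rank one, so the second direction then carries no between-class information. The function names `fisher_directions` and `project`, and the small ridge added to W, are hypothetical helpers for this sketch.

```python
import numpy as np

def fisher_directions(features, labels, n_dirs=2):
    """Directions d that maximise J(d) = (d^T B d) / (d^T W d)."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels)
    p = X.shape[1]
    overall_mean = X.mean(axis=0)
    B = np.zeros((p, p))  # between-class scatter (assumed textbook form)
    W = np.zeros((p, p))  # within-class scatter (assumed textbook form)
    for c in np.unique(y):
        Xc = X[y == c]
        diff = (Xc.mean(axis=0) - overall_mean)[:, None]
        B += len(Xc) * diff @ diff.T
        centred = Xc - Xc.mean(axis=0)
        W += centred.T @ centred
    # Small ridge keeps W positive definite (illustrative safeguard).
    W += 1e-10 * np.trace(W) / p * np.eye(p)
    # Whiten with the Cholesky factor of W, then solve a symmetric eigenproblem.
    L = np.linalg.cholesky(W)
    Linv = np.linalg.inv(L)
    eigvals, eigvecs = np.linalg.eigh(Linv @ B @ Linv.T)
    order = np.argsort(eigvals)[::-1][:n_dirs]
    return Linv.T @ eigvecs[:, order]  # columns are d1, d2, ...

def project(D, features):
    """Row-wise version of eqn (9): y = D^T m for every feature vector m."""
    return np.asarray(features, dtype=float) @ D
```

The rows of the array returned by `project` are the coordinates that would be drawn in a 2D projection plot.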
The same procedure [eqn (5)–(9)] was performed for score as well as score–moment combinations. From this point, three abbreviations will be used: S-LDA, M-LDA and SMC-LDA. S-LDA, that is PCA-LDA, uses score for LDA and M-LDA utilizes only moment for LDA. SMC-LDA corresponds to the use of both score and moment for LDA.
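The three input sets differ only in which columns are handed to LDA. The Python sketch below shows one plausible way to assemble them, assuming PCA scores are obtained from a singular value decomposition of the mean-centered spectra and that score and moment columns are simply concatenated for SMC-LDA. The paper used a commercial Matlab package for PCA, so `pca_scores` and `smc_features` are hypothetical names used only for illustration.

```python
import numpy as np

def pca_scores(spectra, n_scores):
    """First n_scores PC scores of mean-centered spectra (rows = samples)."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_scores] * s[:n_scores]

def smc_features(scores, moments):
    """SMC-LDA input: score columns followed by moment columns."""
    return np.hstack([np.asarray(scores, dtype=float),
                      np.asarray(moments, dtype=float)])
```

In this sketch, S-LDA would use the output of `pca_scores` alone, M-LDA the moment matrix alone, and SMC-LDA the output of `smc_features`.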
Forty-seven diesel and light gas oil (LGO) samples were acquired from a refinery in Korea.12 NIR spectra were collected over the 1100 to 1700 nm region using a dispersive NIR spectrometer (Foss NIRSystems, Silver Spring, MD) equipped with a tungsten halogen lamp, PbS detector, and a fiber optic interactance/reflectance probe. The resolution of the collected spectra was 10 nm with 2 nm intervals between data. NIR spectra were collected by positioning the fiber optic probe into a sealed bottle that contained the sample. Each NIR sample spectrum consisted of 16 co-added scans.
Forty-two Korean and Chinese ginseng samples were acquired from the National Agriculture Production Inspection Office (NAPIO) in Seoul, Korea.13 All samples were powdered using a cyclone mill fitted with a 1 mm screen. The particle size of the powder was maintained below 20 mesh. FT-Raman spectra were collected on a Bruker (Karlsruhe, Germany) IFS55 spectrometer equipped with a FRA 106 Raman module, CaF2 beam splitter and a cryogenically cooled Ge detector. The excitation source was a diode-pumped Nd:YAG laser operated at 1064 nm and a power of 300 mW. Each spectrum was collected with a resolution of 4 cm−1 in a 180° back-scattering arrangement. Descriptions of these three spectroscopic datasets are presented in Table 1.
Spectroscopy used | Sample | Number of spectra |
---|---|---|
IR | Stomach (normal) | 87 |
IR | Stomach (malignant) | 134 |
NIR | Diesel | 47 |
NIR | Light gas oil (LGO) | 47 |
Raman | Chinese ginseng | 42 |
Raman | Korean ginseng | 42 |
All algorithms were developed using Matlab Version 6.5 (The MathWorks Inc., MA, USA). For PCA, the commercial Matlab program package was used. All pre-processing steps, such as baseline correction and normalization, were also carried out using Matlab.
Fig. 1 Three different types of vibrational spectra used in this study: IR spectra of normal and malignant stomach tissues (a), NIR spectra of LGO (light gas oil) and diesel (b), and Raman spectra of Korean and Chinese ginseng samples (c).
Fig. 1(b) shows selected NIR spectra of LGO and diesel. Diesel is a blended product of three to five components, and LGO is the major component, constituting around 80% of diesel. Therefore, the NIR spectral features of LGO and diesel are similar. The bands centered at 1210 nm and 1400 nm correspond to the second overtone and combination bands of CH vibration. As shown in Fig. 1(b), only small spectral differences were observed at 1210 nm.
Fig. 1(c) shows selected Raman spectra of Korean and Chinese ginseng samples. Even with the use of a low-energy near-infrared laser, the baselines of the spectra varied slightly due to weak fluorescence. Spectral differences between Korean and Chinese samples were observed at 940, 870 and 500 cm−1. Also, the peaks in the 1500–1000 cm−1 range differed slightly from each other. Before further spectral processing, the weak fluorescence background superimposed on the Raman spectra was effectively removed using wavelet transformation (WT).14–16 A detailed investigation of the chemical and botanical origins of the differences is beyond the scope of this paper. Overall, in each of the three cases the spectral features of the two classes are similar to each other.
Fig. 2 Two-dimensional sample plots constructed by two moments (moments 3 and 4) as well as two principal component (PC) scores (scores 1 and 2) for gasoline samples. Three samples were selected and the corresponding spectra are shown on the right plots.
Fig. 3 shows two-dimensional (2D) sample plots constructed by two scores and two moments for the three cases of IR, NIR and Raman data. These plots were used to compare how score and moment describe the spectra. The scores and moments were arbitrarily selected to show the different characteristics of spectral representation between score and moment in two-dimensional space. For each 2D score plot, 5 samples that showed large differences were chosen, and the same samples were marked in the 2D moment plot. In the 2D moment plot for the IR stomach data (open circles: normal tissue, filled circles: malignant tissue), the selected 5 samples were well described and widely spread. The moment distribution of these samples was different from that of score, as seen in Fig. 2. Similar results were observed for the NIR (open circles: diesel, filled circles: LGO) and Raman data (open circles: Korean ginseng, filled circles: Chinese ginseng). Overall, the results support moment as an effective alternative spectral description method comparable to score from PCA. This is analogous to photographing the same object from a different angle. Therefore, when score and moment are combined for discrimination, more descriptive and diverse information becomes available, which should improve discrimination performance.
Fig. 3 Two-dimensional sample plots constructed by two scores and two moments for the three cases of IR, NIR and Raman data. Open and filled circles in IR stomach data correspond to normal tissue and malignant tissue, respectively. Open and filled circles in NIR data represent diesel and LGO, respectively. Open and filled circles in Raman data represent Korean and Chinese ginseng, respectively.
Cross-validation was used to minimize possible over-fitting, especially in the course of LDA. For this purpose, each data set was divided into 5 segments and cross-validated, so that in each round 80% of the samples formed the calibration set and the remaining 20% formed the prediction set. Then, the corresponding discrimination accuracies from the calibration and prediction sets were evaluated.
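The segmentation can be sketched in Python as below. The text does not state how samples were assigned to the 5 segments, so the random shuffling, the seed and the generator name `five_segment_splits` are assumptions made only for this illustration.

```python
import numpy as np

def five_segment_splits(n_samples, n_segments=5, seed=0):
    """Yield (calibration, prediction) index arrays, one pair per segment."""
    rng = np.random.default_rng(seed)
    # Random assignment to segments is an assumption, not from the paper.
    order = rng.permutation(n_samples)
    segments = np.array_split(order, n_segments)
    for k in range(n_segments):
        prediction = segments[k]  # roughly 20% of the samples
        calibration = np.concatenate(
            [segments[i] for i in range(n_segments) if i != k]
        )  # the remaining roughly 80%
        yield calibration, prediction
```

For each pair, the LDA directions would be fitted on the calibration indices only and the accuracy evaluated separately on both index sets.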
Initially, the optimal number of scores for S-LDA and of moments for M-LDA needed to be determined. To determine the optimal number of scores, the first ten scores were generated and added one at a time, starting from the first score. The discrimination accuracy was evaluated as a function of the number of scores, and the number giving the best discrimination was taken as optimal. The same procedure was used to determine the optimal number of moments. Fig. 4 shows discrimination accuracy plotted as functions of the number of factors (a) and moments (b) used to discriminate between normal and malignant tissues (IR data). The total percent variance (TPV) is also displayed in plot (a). TPV is an indicator of how much variation is accounted for by the factors (principal components, PCs). Factors represent the variation in the spectral data set, while eigenvalues are the relative weights of each individual PC. By summing the eigenvalues and expressing the sum as a percentage, it can be estimated how much variance is described by the PCs. For S-LDA, four factors were chosen as the optimum since neither the discrimination accuracy nor TPV improved significantly beyond four factors. For M-LDA, a similar trend of improving discrimination accuracy was observed as moments were added, and the use of six moments provided the optimal discrimination accuracy without over-fitting. The same procedure was followed to determine the numbers of scores and moments for the other two data sets (NIR and Raman).
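The TPV curve described above can be computed as a cumulative sum of eigenvalues. The Python sketch below assumes mean-centered spectra and takes the eigenvalues as the squared singular values of the centered data matrix; the function name `total_percent_variance` is hypothetical.

```python
import numpy as np

def total_percent_variance(spectra):
    """Cumulative percent variance explained by PC 1, PCs 1-2, PCs 1-3, ..."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)  # mean-centering is assumed here
    s = np.linalg.svd(Xc, compute_uv=False)
    eigenvalues = s**2  # proportional to the variance captured by each PC
    return 100.0 * np.cumsum(eigenvalues) / eigenvalues.sum()
```

Plotting the returned values against the number of factors, next to the cross-validated accuracy, gives the kind of comparison used to pick the number of factors.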
Fig. 4 Discrimination accuracy plotted as functions of the number of factors (a) and moment (b) used to discriminate between normal and malignant tissues (IR data). Additionally, total percent variance (TPV) is also displayed in the plot (a).
Even though score and moment describe spectra differently, their feature descriptions may overlap or interact. Therefore, a different approach was taken to determine the optimal number of scores and moments for SMC-LDA. Initially, the number of scores is fixed at the value determined as in Fig. 4, and moments are then added in a combinatorial fashion. For example, 15 combinations of moments (6C2) are possible when 2 moments are used out of a total of 6 moments, and 20 combinations (6C3) are possible when 3 moments are used out of 6. The maximum number of moments used corresponded to the optimal number of moments determined in M-LDA for each case. For each combination of moments with the fixed number of scores, the corresponding discrimination accuracy was evaluated and the best combination was determined.
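The combinatorial search can be sketched in Python with `itertools.combinations`. The helper names `moment_subsets` and `best_subset` are hypothetical, and `accuracy_of` stands for any user-supplied function that returns the cross-validated accuracy obtained with the fixed scores plus a given subset of moments; none of these names come from the paper.

```python
from itertools import combinations

def moment_subsets(n_moments, subset_size):
    """All subsets of moment indices 1..n_moments with the given size."""
    return list(combinations(range(1, n_moments + 1), subset_size))

def best_subset(n_moments, accuracy_of):
    """Search every subset of size 1..n_moments and return the best one.

    accuracy_of: hypothetical callable mapping a tuple of moment indices
    to a discrimination accuracy (scores are assumed fixed inside it).
    """
    candidates = [
        subset
        for size in range(1, n_moments + 1)
        for subset in moment_subsets(n_moments, size)
    ]
    return max(candidates, key=accuracy_of)

print(len(moment_subsets(6, 2)))  # 15, matching 6C2
print(len(moment_subsets(6, 3)))  # 20, matching 6C3
```

With six candidate moments there are 63 non-empty subsets in total, so an exhaustive search of this kind stays cheap.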
Table 2 shows the overall results of cross-validated discrimination accuracy for S-LDA, M-LDA and SMC-LDA. The numbers in parentheses in the S-LDA and M-LDA results correspond to the optimal number of scores and moments, respectively. The numbers in parentheses in the SMC-LDA results correspond to the optimal number of moments used together with the fixed number of scores from S-LDA. For IR data, the same number of moments as used in M-LDA was used for SMC-LDA. In the case of SMC-LDA for NIR data, the use of only one moment (1st moment) noticeably improved the discrimination accuracy. In the case of SMC-LDA for Raman data, the use of three additional moments (1st, 6th and 7th moments) along with three scores led to 100% discrimination accuracy. Although it is difficult to explain how these particular moments help to improve the discrimination, the results clearly show that the use of additional moments provides spectral information complementary to score from PCA.
Method | Data set | IR data (1800–900 cm−1) | NIR data (1100–1600 nm) | Raman data (1800–200 cm−1) |
---|---|---|---|---|
S-LDA | Calibration | 96.8 (4) | 94.4 (3) | 99.7 (3) |
S-LDA | Prediction | 96.8 | 92.8 | 97.6 |
M-LDA | Calibration | 96.4 (6) | 94.4 (6) | 98.8 (7) |
M-LDA | Prediction | 96.3 | 93.7 | 98.9 |
SMC-LDA | Calibration | 98.4 (6) | 97.3 (1) | 100.0 (3) |
SMC-LDA | Prediction | 97.3 | 96.9 | 100.0 |
The discrimination accuracies from S-LDA and M-LDA are similar to each other, with M-LDA slightly better in prediction for the NIR and Raman data. Therefore, moment can be a successful spectral representation tool comparable to conventional score. When scores and moments were combined for LDA (SMC-LDA), the best discrimination accuracies were achieved for all three data sets. As shown in Fig. 2 and 3, the spectral description characteristics of score and moment differed from each other, so the discrimination accuracy was improved by the input of complementary information into the LDA. The corresponding 2D projection plots generated using eqn (9) for IR (discrimination between malignant and normal stomach tissues), NIR (discrimination between LGO and diesel) and Raman (discrimination between Chinese and Korean ginseng) data are shown in Fig. 5, 6 and 7, respectively.
Fig. 5 Two-dimensional projection plots using S-LDA, M-LDA and SMC-LDA for the discrimination between normal and malignant stomach tissue using IR spectroscopy. Open and filled circles correspond to normal and malignant tissues, respectively.
Fig. 6 Two-dimensional projection plots using S-LDA, M-LDA and SMC-LDA for the discrimination between LGO and diesel using NIR spectroscopy. Open and filled circles correspond to diesel and LGO, respectively.
Fig. 7 Two-dimensional projection plots using S-LDA, M-LDA and SMC-LDA for the discrimination between Chinese and Korean ginseng using Raman spectroscopy. Open and filled circles correspond to Chinese and Korean ginseng, respectively.
This journal is © The Royal Society of Chemistry 2007 |