Cheng-Jian Xu,a Yi-Zeng Liang,*a Yang Lib and Yi-Ping Dub
aCollege of Chemistry and Chemical Engineering, Institute of Chemometrics and Intelligent Analytical Instruments, Central South University, Changsha 410083, P.R. China. E-mail: yzliang@public.cs.hn.cn; Fax: 86-731-8825637; Tel: 86-731-8822841
bCollege of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, P.R. China
First published on 10th December 2002
Some kinds of chemical data are not merely the univariate or multivariate observations of classical statistics; they are also functions observed continuously. If this special character of the data is handled efficiently, it will certainly improve predictive accuracy. In this paper, a novel method, named noise perturbation in functional principal component analysis (NPFPCA), is proposed to determine the chemical rank of two-way data. In NPFPCA, after noise is added to the measured data, smooth eigenvectors are obtained by functional principal component analysis (FPCA). The eigenvectors representing noise are sensitive to the perturbation, whereas those representing chemical components are not. Therefore, by comparing the eigenvectors obtained by FPCA under noise perturbation with those obtained by traditional principal component analysis (PCA), the chemical rank of the system can be determined accurately. Several simulated and real chemical data sets were analyzed to demonstrate the efficiency of the proposed method.
The first important step in handling this kind of data is to determine the number of chemical components in the mixture.2 An incorrect estimate will misguide further qualitative and quantitative analysis. The number of components in the mixture, usually called the ‘chemical rank’ of the measured matrix,3 corresponds one-to-one to the number of significant singular values of the data matrix X, provided that the spectral profiles of the components are linearly independent of each other.
Chemical rank estimation may seem an easy task; in practice, however, experimental errors such as uncorrected background, instrumental noise and low signal-to-noise ratio make it very difficult. Many kinds of statistical and empirical methods have therefore been proposed to resolve this problem.4–10
Most rank-estimation methods are based on principal component analysis (PCA). These methods can be roughly classified into two representative categories. The first is based on eigenvalue analysis; its typical representative is Malinowski’s F-test or indicator function (IND).4 The second focuses on frequency domain analysis5 of the eigenvectors. The rationale is that the frequency content of the spectral profiles is usually low, whereas that of the noise is usually high.
Meloun et al. gave a good critical review and comparison of many methods for predicting the number of components in spectroscopic data.2 They concluded that, in the case of real experimental data, RESO (the ratio of eigenvalues calculated by smooth PCA to those calculated by ordinary PCA),6 IND and the index methods based on knowledge of instrumental error should be preferred. In our view, it is wiser to use different methods according to the prior chemical information available, such as the kind of data set and the instrumental noise level. For instance, if the instrumental error is known beforehand, the methods based on it should be preferred.2 When the response profiles are smooth, frequency domain analysis, the morphological approach7 or RESO can be chosen. If the response profiles have an evolving character,11 pure-variable methods12 such as Simplisma (simple-to-use interactive self-modeling mixture analysis),12 OPA (orthogonal projection approach)13 or other local rank estimation methods14 are preferred. In our opinion, the performance of a rank estimation method depends strongly on the consistency between its model assumptions and the properties of the real system. Developing different methods for different kinds of chemical data sets therefore seems a promising task.
Chemical observations, such as ultraviolet, fluorescence and some electrochemical responses, often appear as continuous functions of wavelength or potential. If this special character of the data is handled efficiently, it will certainly improve the predictive accuracy. Functional data analysis, first proposed by Ramsay and Dalzell15 and then developed by Silverman and co-workers,16,17 is a nonparametric method that involves smoothing in some way. In this paper, a novel rank estimation method, called noise perturbation in functional principal component analysis (NPFPCA), is proposed to determine the chemical rank of two-way data. In NPFPCA, after noise is added to the measured data, the eigenvectors are obtained by functional principal component analysis (FPCA). Since FPCA focuses more on systematic effects and less on random effects in the data, the eigenvectors representing noise are sensitive to the noise perturbation, whereas those representing chemical components are nearly unaffected by it. Then, by comparing the eigenvectors obtained by FPCA with those obtained by traditional principal component analysis (PCA), the number of components in the system can be estimated accurately.
The noise perturbation method proposed in this paper shares some features with the data augmentation technique18 and the smoothed bootstrap method.19 The former uses noise addition to obtain sufficient relevant data to build an accurate and robust calibration model, and the latter uses a smooth estimate of the distribution instead of the empirical estimate.
Conclusions inferred from the theoretical study are used afterwards to clarify the performance of the proposed method on simulated and real examples, and to provide some general guidelines for better understanding its potential.
max riTXTXri, subject to riTrj = δi,j (i, j = 1, 2, …, n) | (1)
λi = riTXTXri (i = 1, 2, …, n) | (2)
Here δi,j is the Kronecker delta, and ri is the ith eigenvector of the covariance matrix XTX associated with eigenvalue λi. Let the eigenvectors of XTX be numbered according to the magnitude of the eigenvalues, λ1 ⩾ λ2 ⩾ … ⩾ λn; then ti = Xri is the ith principal component.
Assuming the rank of matrix X is p, with p ⩽ min(m,n), in the exact multicollinearity situation we get

λn−p+1 = … = λn = 0 | (3)
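The sharp eigenvalue drop described by eq. (3) can be checked numerically. The following sketch builds a three-component bilinear matrix X = CST from hypothetical Gaussian concentration and spectral profiles (illustrative choices, not the paper's data) and inspects the eigenvalues of XTX:

```python
import numpy as np

# Hypothetical three-component bilinear data X = C @ S.T; the Gaussian
# concentration/spectral profiles are illustrative, not the paper's data.
t = np.linspace(0, 1, 50)[:, None]                           # elution "time" axis
w = np.linspace(0, 1, 40)[:, None]                           # "wavelength" axis
C = np.exp(-((t - np.array([0.3, 0.5, 0.7])) / 0.08) ** 2)   # 50 x 3 concentrations
S = np.exp(-((w - np.array([0.2, 0.5, 0.8])) / 0.15) ** 2)   # 40 x 3 spectra
X = C @ S.T                                                  # 50 x 40, exact rank 3

# Eigenvalues of X.T @ X are the squared singular values of X.
lam = np.linalg.svd(X, compute_uv=False) ** 2
print(lam[:5] / lam[0])
```

Only the first three relative eigenvalues are non-negligible; the remainder vanish up to floating-point round-off, in agreement with eq. (3).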
In chemical systems, the rank can be determined by inspecting the significant eigenvalues obtained by PCA of the data matrix. Here only the variance is considered for rank estimation. Although the size of the variance is important, other characteristics of chemical data are also useful and can lead to more satisfactory results. For example, chemical observations often have a special property: they usually appear as smooth, continuous functions of wavelength or potential. Since principal components are linear combinations of the original variables, those representing chemical spectral signals are also smooth. If such special characteristics of the data are utilized efficiently, the accuracy of rank estimation will certainly be enhanced. Chen et al. first introduced FPCA into chemistry.6 The key idea of FPCA is the roughness penalty: one finds a set of orthogonal normalized vectors rfi (i = 1, 2, …, n) by maximizing the following criterion
λfi = rfiTXTXrfi/[rfiT(I + αQ)rfi], subject to rfiT(I + αQ)rfj = δi,j | (4)
Here δi,j is the Kronecker delta, rfi is the ith smooth eigenvector of matrix XTX associated with smooth eigenvalue λfi, and α is a positive smoothing parameter regulating the trade-off between fidelity to the measured data and roughness. The smoothing parameter α can be chosen subjectively or by cross-validation.16
In fact, the smooth eigenvector rfi and smooth eigenvalue λfi can be obtained by solving the following generalized eigenvalue problem.
XTXrfi = λfi(I + αQ)rfi (i = 1, 2, …, n) | (5) |
BXTXrfi = λfirfi (i = 1, 2, …, n) | (6) |
Here B = (I + αQ)−1. The matrix BXTX is of size n × n but not symmetric. Rather than finding its eigenvectors directly, it is easier to work with the symmetric matrix XBXT instead.
Let ci be an eigenvector of XBXT with eigenvalue μi. Then it is easy to show that
wi = BXTci | (7) |
The vector wi is then an eigenvector of BXTX with the same eigenvalue μi. In implementing the method, we used a singular value decomposition (SVD) of the matrix XB1/2 to find the eigenvectors ci of XBXT.
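The computation in eqs. (5)–(7) can be sketched as follows. The second-difference form of the roughness penalty Q is an assumption made here for concreteness (a common choice in functional data analysis); the paper does not reproduce its exact form:

```python
import numpy as np

def fpca(X, alpha=1.0):
    """Smooth eigenvalues/eigenvectors of X.T @ X via eqs. (5)-(7),
    using an SVD of X @ B^(1/2) with B = (I + alpha*Q)^(-1).
    Q is taken as a second-difference roughness penalty (an assumption;
    any symmetric positive semidefinite penalty works the same way)."""
    m, n = X.shape
    D = np.diff(np.eye(n), n=2, axis=0)     # (n-2) x n second-difference operator
    Q = D.T @ D                             # roughness penalty matrix
    B = np.linalg.inv(np.eye(n) + alpha * Q)

    # Symmetric square root of B from its eigendecomposition.
    e, V = np.linalg.eigh(B)
    B_half = V @ np.diag(np.sqrt(e)) @ V.T

    # Left singular vectors of X @ B^(1/2) are the eigenvectors c_i of
    # X @ B @ X.T; the squared singular values are the smooth eigenvalues.
    U, s, _ = np.linalg.svd(X @ B_half, full_matrices=False)
    lam_f = s ** 2

    # Eq. (7): w_i = B @ X.T @ c_i is an eigenvector of B @ X.T @ X.
    W = B @ X.T @ U
    W /= np.linalg.norm(W, axis=0)          # normalize the smooth eigenvectors
    return lam_f, W
```

As α → 0, B → I and the procedure reduces to ordinary PCA, consistent with FPCA being a smoothed variant of PCA.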
i | λfi/λoi (α = 1) | λfi/λoi (α = 1.1) | λfi/λoi (α = 10) | C (α = 1) | C (α = 1, noise)
---|---|---|---|---|---
1 | 1.0000 | 1.0000 | 0.9998 | 1.0000 | 1.0000 |
2 | 0.9988 | 0.9987 | 0.9889 | 1.0000 | 1.0000 |
3 | 0.8775 | 0.8760 | 0.8486 | 0.9362 | 0.9336 |
4 | 0.5745 | 0.5676 | 0.4599 | 0.2633 | 0.3036 |
5 | 0.5446 | 0.5412 | 0.4305 | 0.0489 | 0.0665 |
6 | 0.5430 | 0.5341 | 0.4157 | 0.3904 | 0.3538 |
7 | 0.5486 | 0.5402 | 0.4097 | 0.2327 | 0.1187 |
8 | 0.5224 | 0.5155 | 0.3841 | 0.2448 | 0.0568 |
Fig. 1 (a) The chromatographic profile and (b) spectra of a three-component simulated system.
With this consideration, artificial random noise, instead of large changes in the smoothing parameter α, is added to the original measured matrix. It should be noted that the added artificial noise is kept small so as not to influence the model of the original data. In FPCA, the degree of roughness of the eigenvectors is considered besides the variance; therefore, the smooth chemical signals are extracted by FPCA before the random noise in the data. The added artificial Gaussian noise, which is of course not smooth, contributes little to the smooth principal components representing the spectral signal. On the other hand, the added artificial noise changes the structure of the original noise; therefore, the smooth eigenvectors representing noise change dramatically.
Therefore, by comparing the space spanned by the eigenvectors of PCA with that spanned by the eigenvectors of FPCA, the chemical rank of the mixture can be obtained.
Ck = rkTrfk (k = 1, 2, …, n) | (8)
Here Ck is called the congruence coefficient of the kth eigenvectors. If Ck is close to 1 and Ck+1 ≪ 1, then rank(X) = k.
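Eq. (8) and this decision rule can be sketched directly; the function names and the 0.9 cutoff below are illustrative choices, not values fixed by the paper:

```python
import numpy as np

def congruence(R, Rf):
    """Congruence coefficients C_k = r_k^T r_fk of eq. (8), for matrices
    whose columns are the PCA (R) and FPCA (Rf) eigenvectors.
    The absolute value removes the arbitrary sign of each eigenvector."""
    return np.abs(np.einsum('ij,ij->j', R, Rf))

def rank_from_congruence(C, cutoff=0.9):
    """Estimated rank = number of leading C_k above the cutoff.
    The 0.9 cutoff is an illustrative choice, not a value from the paper."""
    k = 0
    while k < len(C) and C[k] > cutoff:
        k += 1
    return k
```

For the simulated system of Table 1, this rule reproduces the published pattern: C1 and C2 are essentially 1, C3 ≈ 0.94, and C4 drops to ≈ 0.26, giving a rank of three.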
The third congruence coefficient C3 is bigger than 0.9, but this alone is not convincing evidence of a chemical component, as such a value may occur by accident. When different artificial noises are added in different experiments, if the third congruence coefficient remains rather stable, we can draw the safer conclusion that the third eigenvector represents a chemical component.
A Monte-Carlo simulation is used here to study the change of the eigenvectors under different randomly added noises. The same level of noise, produced with different random seeds, is added to the same measured matrix. Then the standard deviation of Ck is calculated as follows.
s(Ck) = [Σj=1N(Ck(j) − C̄k)2/(N − 1)]1/2 | (9)

Here Ck(j) is the congruence coefficient obtained in the jth run, and C̄k is its mean over the N Monte-Carlo runs.
From another point of view, the above noise addition procedure can be regarded as many parallel experiments aimed at the same task. In analytical experiments, the random noises of parallel experiments also differ. If the added noises do not influence the structure of the raw data, a constant rank estimation result will certainly be obtained.
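The parallel-experiment procedure above can be sketched end-to-end. To keep the example self-contained, plain PCA (an SVD) is used for both the unperturbed and the perturbed decompositions; in NPFPCA proper, the perturbed decomposition would be the FPCA described earlier. All names and parameter values are illustrative:

```python
import numpy as np

def pca_vecs(X):
    """Right singular vectors of X = eigenvectors of X.T @ X."""
    return np.linalg.svd(X, full_matrices=False)[2].T

def perturbation_stats(X, decomp=pca_vecs, noise_sd=0.01, n_runs=10, seed=0):
    """Monte-Carlo noise perturbation: add fresh Gaussian noise in each run,
    recompute the eigenvectors, and collect the congruence coefficients C_k
    (eq. (8)) against the unperturbed eigenvectors. Returns the mean and the
    standard deviation of C_k over the runs (eq. (9)). In NPFPCA, `decomp`
    would be the paper's FPCA; plain PCA is used here only for brevity."""
    rng = np.random.default_rng(seed)
    R = pca_vecs(X)                       # eigenvectors of the raw data
    C = np.empty((n_runs, R.shape[1]))
    for j in range(n_runs):
        Rf = decomp(X + rng.normal(0.0, noise_sd, X.shape))
        C[j] = np.abs(np.einsum('ij,ij->j', R, Rf))
    return C.mean(axis=0), C.std(axis=0, ddof=1)
```

Eigenvectors of genuine chemical components should show a mean Ck near 1 with a standard deviation near zero over the runs, whereas noise eigenvectors fluctuate from run to run.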
In this paper, the comparison of eigenvectors, instead of eigenvalues, was chosen for rank estimation, for the following reasons. Firstly, for a system containing minor components, the eigenvectors can provide more information than the eigenvalues.22 Secondly, when both the column and the row space are smooth, rank estimation can be performed from the eigenvectors representing the column space as well as from those representing the row space; the results of these two procedures provide mutually validating information. Furthermore, the eigenvectors representing the column space and the row space need not all be smooth: smoothness in either one of them is enough for the NPFPCA method to be used for rank estimation.
As mentioned before, the smoothing parameter α can be chosen by cross-validation or subjectively. In fact, a subjective choice of the smoothing parameter is satisfactory, or even preferable, in many problems.17 Therefore, α = 1 is selected in all experiments. In our experience, too large or too small a parameter is often not suitable; we usually choose a parameter between 1 and 10.
The level of the added noise is another important parameter. In this paper, Gaussian noise is selected as the perturbation noise. In general, the chosen noise level should not be so high as to change the structure of the data set: too much noise submerges the minor components, and changes to the structure of the original data set give misleading results. On the other hand, too little noise does not provide enough perturbation. In general, if the added noise is around the instrumental error, the proposed method gives satisfactory results. In every set of experiments, one run is performed without added noise. By observing the difference between this run and the others, one can discern how the added noise influences the rank estimation result, i.e., whether it changes the structure of the original data or is large enough to act as a perturbation.
In our experience, the performance of the NPFPCA method is usually not sensitive to the noise level, which is attractive in practice. One can also try different noise levels to obtain a stable and satisfactory result.
Fig. 2 The congruence coefficient of eigenvectors (Ck) versus component number.
Fig. 3 The standard deviation of the congruence coefficient of eigenvectors (Ck) versus estimated component number in a simulated system.
Use the SVD to find the eigenvalues λi (i = 1, 2, …, n, assuming m > n) of the covariance matrix XTX and perform the following calculations.
IE(k) = [kΣj=k+1nλj/(mn(n − k))]1/2 | (10)

IND(k) = [Σj=k+1nλj/(m(n − k))]1/2/(n − k)2 | (11)

ER(k) = λk/λk+1 | (12)

[Eq. (13), defining the VPVRS index, is not reproduced here.] | (13)
Here m and n are the numbers of rows and columns of the measured matrix X, respectively.
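For comparison, the eigenvalue-based indices can be sketched from their standard literature forms (Malinowski's imbedded error and indicator function, and the eigenvalue ratio). These are assumed to correspond to eqs. (10)–(12) up to notation; VPVRS is omitted because its definition is not reproduced above:

```python
import numpy as np

def eigen_indices(X):
    """Standard eigenvalue-based rank indices for an m x n matrix X
    (Malinowski's forms; assumed to correspond to eqs. (10)-(12)).
    Returns arrays indexed by candidate rank k = 1 .. n-1."""
    m, n = X.shape
    lam = np.linalg.svd(X, compute_uv=False) ** 2   # eigenvalues of X.T @ X
    k = np.arange(1, n)
    tail = np.array([lam[j:].sum() for j in k])     # sum of lambda_j for j > k
    RE = np.sqrt(tail / (m * (n - k)))              # real error
    IE = RE * np.sqrt(k / n)                        # imbedded error, eq. (10)
    IND = RE / (n - k) ** 2                         # indicator function, eq. (11)
    ER = lam[:-1] / lam[1:]                         # eigenvalue ratio, eq. (12)
    return IE, IND, ER
```

The estimated rank is the k that minimizes IND (or maximizes ER, since the ratio is largest at the gap between signal and noise eigenvalues).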
The effects of concentration, noise and spectral collinearity on the rank estimates of the five indices are shown in Tables 2–5. Inspection of the results reveals that the NPFPCA method performs well under large variations in concentration level, noise level, degree of collinearity and noise pattern. IE, ER and VPVRS often identify fewer components than are actually present; these indices are often not sensitive enough, and when the condition number of the matrix is large they tend to treat minor components as noise. IND, in contrast, is sensitive enough to detect components at low concentrations and high levels of homoscedastic noise. With heteroscedastic noise, however, IND works rather badly: it regards the heteroscedastic noise as components.
Concentration | IE | IND | ER | VPVRS | NPFPCA |
---|---|---|---|---|---|
0.003 | 2 | 3 | 2 | 2 | 3 |
0.005 | 2 | 3 | 2 | 2 | 3 |
0.010 | 3 | 3 | 2 | 2 | 3 |
0.020 | 3 | 3 | 2 | 3 | 3 |
0.050 | 3 | 3 | 3 | 3 | 3 |
Noise level | IE | IND | ER | VPVRS | NPFPCA |
---|---|---|---|---|---|
0.0002 | 3 | 3 | 2 | 2 | 3 |
0.0005 | 3 | 3 | 2 | 2 | 3 |
0.0010 | 2 | 3 | 2 | 2 | 3 |
0.0015 | 2 | 3 | 2 | 2 | 3 |
0.0020 | 2 | 3 | 2 | 2 | 3 |
β | IE | IND | ER | VPVRS | NPFPCA |
---|---|---|---|---|---|
2 | 3 | 3 | 2 | 2 | 3 |
4 | 3 | 7 | 2 | 2 | 3 |
6 | 3 | n/a | 2 | 2 | 3 |
8 | 3 | n/a | 2 | 2 | 3 |
10 | 3 | n/a | 2 | 2 | 3 |
Correlation coefficient | IE | IND | ER | VPVRS | NPFPCA |
---|---|---|---|---|---|
0.900 | 2 | 3 | 2 | 2 | 3 |
0.950 | 2 | 3 | 2 | 2 | 3 |
0.970 | 2 | 3 | 2 | 2 | 3 |
0.990 | 2 | 3 | 2 | 2 | 3 |
0.999 | 2 | 3 | 2 | 2 | 3 |
IND usually works well when the noise pattern is homoscedastic, which is consistent with the model assumption of the IND method. The other three indices often cannot provide stable and satisfactory results.
Fig. 4 (a) The congruence coefficient of eigenvectors (Ck) versus component number for 10 computations. (b) The standard deviation of the congruence coefficient of eigenvectors (Ck) versus estimated component number in the pesticide system.
Another attractive characteristic of the proposed method is that it can also be used to deal with GC-MS data, which the RESO algorithm often cannot tackle.6 Since functional data analysis assumes that the data are continuous and smooth, it cannot be applied to bar-like mass spectra. However, the NPFPCA method can obtain a result from the chromatographic space, rather than the mass spectral space, because chromatographic profiles usually appear smooth and continuous. The final results are displayed in Fig. 5, and the rank estimates of the five indices are summarized in Table 6. The indices IND, IE, ER and VPVRS usually cannot obtain satisfactory results in real systems.
Fig. 5 (a) The congruence coefficient of eigenvectors (Ck) in chromatographic space versus estimated component number for 10 computations. (b) The standard deviation of the congruence coefficient of eigenvectors (Ck) in chromatographic space versus estimated component number in the Chinese medicine system.
Sample | IE | IND | ER | VPVRS | NPFPCA
---|---|---|---|---|---
Pesticide | n/a | 7 | 5 | 5 | 5 |
Chinese medicine | 2 | n/a | 1 | 1 | 3 |
It should be noted that it is usually safer to combine the congruence coefficient plot with its standard deviation plot: if the congruence coefficient of an eigenvector is close to 1 and its standard deviation is close to zero, then that eigenvector represents a chemical signal.
Footnote
† Electronic supplementary information (ESI) available: Program and test data. See http://www.rsc.org/suppdata/an/b2/b205818a/ |
This journal is © The Royal Society of Chemistry 2003 |