Sung-June
Baek
a,
Aaron
Park
*a,
Young-Jin
Ahn
a and
Jaebum
Choo
*b
aChonnam National University, Gwangju 500-757, South Korea. E-mail: tozero@jnu.ac.kr; Tel: +82-62-530-1795
bHanyang Univeristy, Ansan 426-791, South Korea. E-mail: jbchoo@hanyang.ac.kr
First published on 29th September 2014
Baseline correction methods based on penalized least squares are successfully applied to various spectral analyses. The methods change the weights iteratively by estimating a baseline. If a signal is below a previously fitted baseline, large weight is given. On the other hand, no weight or small weight is given when a signal is above a fitted baseline as it could be assumed to be a part of the peak. As noise is distributed above the baseline as well as below the baseline, however, it is desirable to give the same or similar weights in either case. For the purpose, we propose a new weighting scheme based on the generalized logistic function. The proposed method estimates the noise level iteratively and adjusts the weights correspondingly. According to the experimental results with simulated spectra and measured Raman spectra, the proposed method outperforms the existing methods for baseline correction and peak height estimation.
The diverse sources of background and additive noise make it hard to correct baseline for experimental spectral data. Furthermore as a baseline is usually varying from sample to sample, the situation is much worse. Wavelet transform was introduced to eliminate the varying background.5–7 As the method relies on the filtering capabilities of wavelet transform, a baseline should be well-separated in the transform domain. But real world signals often collide with this hypothesis. Moreover, it is rather complex to implement due to wavelet transform or related optimization.
A method without special assumption was proposed for baseline curve fitting.8 It is based on the smoothing and interpolation technique. While it is simple to implement and give some satisfactory results for various kinds of Raman spectra, it produces poor results in case a spectrum consists of peaks with various widths because the method uses a fixed smoothing span to interpolate the background curve. It could be overcome by adjusting the smoothing span adaptively, but there is no reliable method available currently.
By using a user defined subset of data which only belongs to background, a least squares polynomial fitting method was proposed without incorporating any constraints.9 However, selecting the right data is not always easy and could be burdensome because one should handle every spectrum individually. To alleviate the burden, a method minimizing a non-quadratic cost function was proposed.10 It relies on the truncated quadratic cost function's capability to reduce the effect of the high peak of the analyte. The method effectively reduces the influence of the high peak and produces satisfactory results. However it is not easy to properly set the threshold of a truncated quadratic function which is closely related to the performance. Also the method relies on an iterative algorithm to solve a non-quadratic minimization problem, which does not guarantee the global minimum.
Polynomial fitting methods were also proposed.11,12 The methods fit a baseline with a polynomial by cutting out signal peaks iteratively or by linear constraints. Although the methods adjust the threshold to cut the peaks automatically or estimate a baseline by optimization with linear programming, they rely on the smoothness of a polynomial of fixed order. Thus if the order of a polynomial is not set properly, the results are not guaranteed. This means that a user inspects every spectrum, which restricts automatic baseline correction.
Among commercial spectrum analysis tools, OPUS and OriginPro are the most widely used packages. They estimate the baseline by setting the baseline points manually or automatically and interpolating them with straight line or polynomial. For automatic baseline correction using OPUS, the spectrum is divided into n ranges of equal size. The number of ranges is predefined by the user. The minimum intensity of each range is determined first. Then connecting the minima with straight lines creates the baseline. Starting from below, a rubber band is stretched over this curve. The rubber band is the baseline. The baseline points that do not lie on the rubber band are discarded.13 It creates the smoothed baseline not exceeding the preset baseline points. However it suffers from a very loose baseline if the number of ranges is not set properly. Also it creates boosted baselines especially when there is relatively high random noise as the method relies only on the minimum intensity in the given range.
The methods based on penalized least squares were proposed to avoid the peak detection and other user interventions.14,15 The methods combine least squares smoothing together with a penalty on non-smooth behavior of an estimated baseline.16 To prevent an estimated baseline from following peaks, a weighting function is incorporated together with a penalty. According to the experimental results, they gave satisfactory results without user intervention.
The methods change weights iteratively by estimating a baseline. If a signal is below a previously fitted baseline, large weight is given. On the other hand, no weight or small weight is given when a signal is above a fitted baseline. However, it is desirable to give equal or similar weight to either case as additive noise is equally distributed along a baseline. To this end, a new weighting scheme based on the generalized logistic function is proposed in this paper.
In the following section, we give a brief review of the previous penalized least squares methods. Then we introduce a new weighting scheme and discuss some aspects of the proposed method. The experiments with simulated spectra are given to show the effectiveness of the proposed method, which is followed by experimental results with real Raman spectra.
Among the various smoothing methods, a regularized least squares smoothing method is popularly used. Let y be the signal of length N, assumed to be sampled at equal intervals. Let z be the smoothed signal to be found. The smoothed signal should follow the trend of y while keeping its smoothness. Assuming y and z are column vectors, z can be found by minimizing the following regularized least squares function.
S(z) = (y − z)T (y − z) + λzTDTDz, | (1) |
(2) |
The first term in eqn (1) expresses the fitness to the data while the second term expresses the smoothness of z. The parameter λ adjusts the balance between the two terms. In order to correct a baseline using the above smoothing method, a weight vector w is introduced. Let W be the diagonal matrix with w on its diagonal. Eqn (1) changes to the following penalized least squares function.
S(z) = (y − z)TW(y − z) + λzTDTDz. | (3) |
By finding the vector of partial derivatives and setting it to zero, i.e., ∂S/∂zT = 0, the solution of minimization problems of eqn (3) is given as follows.
(4) |
z = (W + λDTD)−1Wy. | (5) |
If peak regions are known beforehand, wi can be set to zero in those regions and set to one outside of the regions. But the existence of a baseline and noise makes it difficult to find peak regions. Eilers and Boelens proposed an AsLS (Asymmetric Least Squares) method which does not require peak finding.14,16 In the method, a new parameter p is introduced to set weights asymmetrically. The method assigns weights as follows.
(6) |
The asymmetry parameter p is recommended to set between 0.001 and 0.1. Given λ and p, a smoothed baseline is updated iteratively. Let the first solution of eqn (5) be given as z with w initialized to have ones. Get a new w according to eqn (6). Then solve eqn (5) again to get an updated baseline z. The iteration continues until the weight vector does not change anymore or it reaches the predefined number, e.g., 5 or 10.
According to Zhang et al., the method has some drawbacks. Two parameters, λ and p, need to be optimized to get a satisfactory result. More importantly asymmetry parameters in eqn (6) are all the same in the pure baseline region. But the weights in the pure baseline region are to be set according to the differences between the previously fitted baseline and the original signals. In this respect, the airPLS (adaptive iteratively reweighted Penalized Least Squares) method was proposed.15
The adaptive iteratively reweighted procedure is similar to the AsLS method, but uses a different way to assign weights and add a penalty to control the smoothness of a fitted baseline. In the method, the weight vector w is obtained adaptively using an iterative method. The w of each iteration step t is obtained with the following expression.
(7) |
The fitted vector z in the previous (t − 1) iteration is a candidate of the baseline. If a signal yi is greater than the candidate of the baseline, i.e., zi, it can be regarded as a part of the peak. So its weight is set to zero. Otherwise the weight is adjusted according to eqn (7). The iteration stops either with the maximum iteration count or when the following termination condition is satisfied.
|d| < 0.001 × |y|. | (8) |
We adopt a partially balanced weighting scheme to solve this issue. In the baseline region without peaks, noise could be assumed to be equally populated below and above a baseline. Thus we assign similar weights to the signals in that region not to underestimate the baseline. But if a signal is much greater than the baseline, weight is set to zero as it is a part of the peak. To meet these requirements, we choose the following partially balanced but asymmetric weights.
(9) |
(10) |
Given m and σ, the logistic function is depicted in Fig. 1. Considering that it is practically 1 when d < 0, i.e., yi < zi as you see in the figure, only one logistic function in eqn (10) is enough instead of two terms in eqn (9). We express the weights in that way only to emphasize its asymmetric properties.
The logistic function gives nearly the same weight to the signal below or above a baseline when the difference between the signal and the baseline is smaller than the estimated noise mean. It gradually reduces the weight as the level of the signal increases. If a signal is in the 3σ from the estimated noise mean which covers 99.7% of noise on Gaussian assumption, small weight is still given. Finally, zero weight is given when a signal is much higher than the baseline as it can be regarded as a part of the peak. In the extreme case the standard deviation is nearly zero, it becomes a shifted and reversed unit step function which smoothes and estimates a baseline while leaving the peak larger than noise mean untouched.
Modifications of eqn (10) would be possible. As the essence of the proposed method is to give a proper weight to a signal above a baseline as well as a signal below the baseline in the pure baseline region, one could push the curve of the logistic function to the left or to the right direction so long as it gives a meaningful weight to a signal above the baseline. Also squeezing the transient region would be possible. For example, one can narrow the region arbitrarily to get the result of the extreme case.
The smoothed baseline can be obtained by using the same iterative procedure as AsLS and airPLS methods. Assume that the first baseline z is computed with w initialized to have ones. Get a new w according to eqn (9). Then solve eqn (5) again to get an updated baseline z. The iteration continues until weights do not change anymore or weight changes are minimal.
Let y be the measured spectrum expressed as a column vector with N elements. Given the smoothness parameter λ, the proposed arPLS (asymmetrically reweighted penalized least squares) method can be summarized as an algorithm.
Implementation of Matlab is simple, as the following code shows. To implement the arPLS method in other programming languages, one can refer the books about linear equations with the symmetric pentadiagonal matrix.17,18 As the matrix, W + H, is sparse and symmetric band diagonal, an efficient algorithm can be easily implemented to solve eqn (5).
(11) |
Noise, n, was modeled using a uniform random number generator and a third order polynomial function was used to simulate a curved baseline in a concave and convex region. Narrow Gaussian peaks were treated as the spectra of interest. The simulated spectra were generated by adding a pure signal, a baseline, and random noise.
Two simulated data with a curved baseline are shown in Fig. 3. The SNR (Signal to Noise Ratio) of the low noise spectrum was set to 17.7 dB and that of the high noise spectrum was set to 31.7 dB. The SNR with respect to energy was measured without the baseline according to the following equation:
SNR = 10log10(Es/En) | (12) |
The maximum iteration number was set to 50 for all three methods. For early termination, the termination ratio was set to 10−6 for AsLS and arPLS while eqn (8) was used for airPLS.
The proposed method, arPLS, was compared with AsLS and airPLS methods. Before the experiments, the smoothness parameter, λ, was tuned to get a good estimation of the baseline. If λ is too large, a fitted baseline would not catch the curved baseline. On the other hand, a fitted baseline would follow peaks if λ is too small.
All three methods would show a little different performance according to various λ. So experiments with various λ were carried out to see the behaviour of the methods and find the optimum λ. As we know the exact spectrum is given as eqn (12) for the simulated spectra, we can compare the performance of three methods using RSME (root mean square error). Assuming that the baseline corrected spectrum is s with given λ, RSME(λ) is defined as
(13) |
In order to find the optimal value, λ is changed from 102 to 108 as λ is recommended to vary in the log scale.16 In Tables 1 and 2, we show the RSMEs of the baseline corrected spectra obtained from three methods.
log10λ | |||||||
---|---|---|---|---|---|---|---|
2 | 3 | 4 | 5 | 6 | 7 | 8 | |
AsLS | 23.4 | 8.63 | 3.77 | 6.25 | 15.4 | 21.1 | 22.1 |
airPLS | 30.4 | 26.3 | 5.30 | 2.92 | 5.21 | 17.8 | 22.4 |
arPLS | 39.5 | 23.6 | 1.93 | 1.22 | 1.19 | 2.98 | 6.01 |
log10λ | |||||||
---|---|---|---|---|---|---|---|
2 | 3 | 4 | 5 | 6 | 7 | 8 | |
AsLS | 23.9 | 12.3 | 11.2 | 12.6 | 22.4 | 27.9 | 29.0 |
airPLS | 31.7 | 24.6 | 10.5 | 10.9 | 11.6 | 24.2 | 29.7 |
arPLS | 44.5 | 39.65 | 23.1 | 6.10 | 5.74 | 5.86 | 7.24 |
The least RSMEs of each method in low noise are found at λ = 104, 105, and 106 while they are found at λ = 104, 104, and 106 in high noise. They are displayed in Fig. 4 for easy comparison. The RSME of arPLS is about half of the other methods, which means that the baselines are more accurately fitted by arPLS.
Let us see the baseline corrected spectra in detail. In Fig. 5 and 6, all the baseline corrected spectra by three methods are shown. As you see in the figures, the baselines are well-estimated and removed by arPLS. Especially in the non-signal region, the other two methods show some biases caused by the underestimated baseline.
There are also some biases in estimating the height of peaks. In the low noise spectrum, it is observed that airPLS underestimates the height of the second peak more than the others while AsLS and airPLS overestimate the height of the first and the second peaks more than arPLS in the high noise spectrum. That is the effect of additive noise.
In addition to them, there is one thing more to mention. As you see in the tables, the optimal λ is slightly different between the methods. As the parameter should be set manually for practical applications, it would be better if baseline correction performance is not too sensitive to λ.
In Fig. 7, the RSME of arPLS for various λ is displayed. While the lowest RSME is obtained when λ = 105 in the low noise spectrum, similar performance can be obtained when 104 ≤ λ ≤ 106.5. Even up to 107, the RSME of arPLS is comparable to the best case of the other methods. In the high noise spectrum, the arPLS method keeps the low RSME when 104.2 ≤ λ ≤ 108.2. This means that arPLS is relatively robust to the choice of λ, which is desirable for practical applications.
Another set of experiments was carried out with a simulated spectrum with a linear baseline. As linear baseline correction is rather simple, a spectrum with a strong baseline in high noise is only considered here. The simulated spectrum is shown in Fig. 8. Baseline corrected spectra obtained using AsLS, airPLS, and arPSL are given in Fig. 9. The processing results are very similar to those shown in Fig. 6. There are some bias in the non-signal region and the height of peaks is overestimated by AsLS and airPLS. The measured RSME of arPLS was 6.1 while those were 10.9, 10.5 for AsLS, airPLS, respectively.
These consistent results confirm that arPLS has the better capabilities in eliminating a baseline in the non-signal region and estimating the height of peaks. So we hope that the arPLS could be a promising alternative to the existing methods.
A single 26DNT spectrum was tested and is shown in Fig. 10 together with the baseline corrected spectra. All the three methods were used to obtain baseline corrected spectra. The figure shows that AsLS and airPLS methods underestimate the baseline especially in right half of the non-signal region, which is also observed with the simulated spectra. This might lead to overestimation of the height of peaks in the region. But we cannot confirm that as the exact heights of those peaks are not given for the experimental Raman spectrum.
The other two kinds of spectra were processed to show the capability of the proposed method. They are measured in highly fluorescent baselines. The 35DNT is chosen as an example of a spectrum with linear background in low noise while 2ADNT is chosen as an example of a spectrum with highly curved background in high noise. Two sets of 50 spectra are shown in Fig. 11 and 13. Even though they were measured by the same spectroscopy, they showed varying baselines according to the samples in issue.
The baseline corrected spectra obtained using the arPLS method are shown overlapped in Fig. 12 and 14. As you see in the figures, all the baseline corrected spectra from the same material looks quite similar, which is natural and desirable. So we could say that all the baselines of two sets are successfully removed by our method and then they can be analyzed easily.
Finally, it is worth noting that the smoothness parameter λ is set to 105 throughout the experiments with the experimental Raman spectra for the arPLS method. As the baseline corrected spectra are acceptable as you see in the figures with the value obtained from simulation data, we could convince that arPLS is robust to the variation of λ as mentioned previously.
We are currently investigating the method to adjust the smoothness parameter automatically. Except for it, the arPLS method requires no prior knowledge about the sample composition, no peak detection, and no mathematical assumption of background noise distribution. So it could be easily applied to various spectra. We hope that the proposed method would be a promising alternative to the existing baseline correction methods and widely used by many researchers.
This journal is © The Royal Society of Chemistry 2015 |