Roman M. Balabin*a and Ekaterina I. Lomakinab
aDepartment of Chemistry and Applied Biosciences, ETH Zurich, 8093, Zurich, Switzerland. E-mail: balabin@org.chem.ethz.ch; Tel: +41-44-632-4783
bFaculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, 119992, Moscow, Russia
First published on 25th February 2011
In this study, we make a general comparison of the accuracy and robustness of five multivariate calibration models: partial least squares (PLS) regression or projection to latent structures, polynomial partial least squares (Poly-PLS) regression, artificial neural networks (ANNs), and two novel techniques based on support vector machines (SVMs) for multivariate data analysis: support vector regression (SVR) and least-squares support vector machines (LS-SVMs). The comparison is based on fourteen (14) different datasets: seven sets of gasoline data (density, benzene content, and fractional composition/boiling points), two sets of ethanol gasoline fuel data (density and ethanol content), one set of diesel fuel data (total sulfur content), three sets of petroleum (crude oil) macromolecules data (weight percentages of asphaltenes, resins, and paraffins), and one set of petroleum resins data (resins content). Vibrational (near-infrared, NIR) spectroscopic data are used to predict the properties and quality coefficients of gasoline, biofuel/biodiesel, diesel fuel, and other samples of interest. The four systems presented here range greatly in composition, properties, strength of intermolecular interactions (e.g., van der Waals forces, H-bonds), colloid structure, and phase behavior. Due to the high diversity of chemical systems studied, general conclusions about SVM regression methods can be made. We try to answer the following question: to what extent can SVM-based techniques replace ANN-based approaches in real-world (industrial/scientific) applications? The results show that both SVR and LS-SVM methods are comparable to ANNs in accuracy. Due to the much higher robustness of the former, the SVM-based approaches are recommended for practical (industrial) application. This has been shown to be especially true for complicated, highly nonlinear objects.
To control the quality of industrial products in an online regime, spectroscopic methods are often used.1–9 Vibrational spectroscopy14–17 (mid-infrared (MIR), Raman, and near infrared (NIR)) is one of the best ways to obtain information about chemical structure and quality coefficients of different mixtures, even multicomponent mixtures. Alternative analytical methods include ultraviolet-visible (UV-Vis) absorption spectroscopy,18 nuclear magnetic resonance (NMR) spectroscopy,19 gas or high pressure liquid chromatography (GC/HPLC),20,21 and mass spectrometry.22 The latter is frequently combined with a soft ionization technique, such as matrix-assisted laser desorption/ionization (MALDI) or electrospray ionization (ESI).23
The relatively low cost of modern MIR/NIR/Raman spectrometers compared to mass spectrometers or NMR spectrometers makes vibrational spectroscopy the technique of choice for real-world applications. The possibility of remote quality control via fiber optics, which is easily achievable in the NIR spectrum, makes NIR spectroscopy one of the most promising analytical techniques for industrial applications.10,11,24,25
The combination of an information-rich analytical technique, such as NIR spectroscopy, with efficient regression tools, provided by modern mathematics, makes the creation of accurate and robust methods for prediction of object properties possible.26–28 The analysis of such sophisticated, multicomponent, and “dirty” samples as petroleum (whose composition, properties, and even structure29,30 can vary greatly over time or by oil source) is almost impossible without multivariate data analysis (MDA) techniques. The progress in chemometrics has a direct influence on the field of analytical chemistry.4–7 The modern petroleum industry is in need of accurate and reliable calibration methods.4–7 The same can be said about the modern and rapidly growing biofuel industry.31 Note that Geladi32 has provided a general overview of the subject, including a description of how chemometrics can be used for data analysis, classification, curve resolution, and multivariate calibration with spectroscopic data.
The partial least squares (PLS) or projection to latent structures regression method appeared many years ago and has become extremely popular.33 Together with its variants and modifications, the PLS calibration model is the most widely used regression technique for spectroscopic data analysis.33 The greatest problem in PLS methodology is that the spectrum–property relationship is assumed to be linear. This assumption is not always valid for industrial samples, and it is completely unacceptable for systems with strong intermolecular or intramolecular interactions, including π-stacking29,34,35 and hydrogen bonding.35–38 The shifts in positions of vibrational bands15–17,35–38 and non-fulfillment of the Beer–Lambert–Bouguer law35 lead to intrinsic nonlinearity of the spectrum–property relationship in these systems. Examples of such systems include crude oil, black oil, ethanol–gasoline fuel mixtures, and solutions of petroleum macromolecules.4–7,13,29,34 Even relatively weak van der Waals intermolecular forces35,39–41 in chemical systems like gasoline, biodiesel, paraffin wax, or aromatic hydrocarbons can influence the accuracy of the PLS model. Note that nonlinear relations can only be modeled by PLS in a limited way by considering more latent variables.26,27,42 Exactly the same can be said about the principal component regression (PCR) technique.4,26,27
It should be stated separately that the degree of nonlinearity can be rather different for different properties of the same chemical system. However, one can sometimes make general conclusions about object nonlinearity, or system nonlinearity,4 based on a number of system properties or rather general characteristic behavior.
One should note that a number of nonlinear PLS-based approaches exist, such as Poly-PLS43,44 and Spline-PLS.45 The only difference between these two algorithms and (linear) PLS is one step in which the linear function is changed into a polynomial function (for Poly-PLS) or a piecewise polynomial function called a spline function (for Spline-PLS). These two techniques are referred to as “quasi-nonlinear” calibration methods.4,43–45
Although partial least squares regression has been a cornerstone of MDA of chemical data for many years, it is neither perfect nor complete.4,6,19,20,26–28,33 Since the assumption about the linearity of the input–output dependence is a rough approximation for most chemical systems, usually only valid within a small interval of input/output values, alternative regression tools are needed.4,6,43–45
Modern applied mathematics offers a wide variety of nonlinear methods, and artificial neural networks, or ANNs, are among the most effective and popular methods.46 Based on Kolmogorov's theorem,46–48 one can claim that the standard multilayer feed-forward neural network with a single hidden layer that contains a finite number of neurons (see Fig. 1 in ref. 28) can be regarded as a universal approximator; that is, ANN can approximate any linear or nonlinear dependence between the input and output values with an appropriate choice of free parameters or weights.28,46 This background makes ANN one of the most pervasive nonlinear data analysis techniques in almost all fields of chemistry, from quantitative structure–property relationship studies (QSPR/QSAR)49 to quantum chemistry (QC)28,50,51 to petroleum studies.4–7
The disadvantages of the ANN approach to spectroscopic data analysis are1–9,46 as follows:
(i) the stochastic nature of the ANN training (model building) process;
(ii) the dependence of the final result on the initial parameters;
(iii) the need to repeat network training many (hundreds of) times;
(iv) the non-uniqueness of the final solution, or ANN weights, that produces the best result, given that many networks with completely different sets of free parameters can produce very similar results;
(v) the need for a relatively large sample set for effective ANN training;
(vi) the tendency to overfit; and
(vii) the training time and computational resources: ANN training can take many hours, and even days, of CPU time even with modern computers (as of mid-2010).
Note that techniques such as clamping and analysis of weights can provide detailed insights into how an ANN functions.
Does any alternative to these ANN-based methods exist? Support vector machines (SVMs) might be regarded as the perfect candidate for spectral regression purposes.1–3,52 SVM-based techniques are very interesting methods, simple in their theoretical background and very powerful in model and real-world applications. A large advantage of SVM-based techniques is their ability to model nonlinear relationships.24,52–54 Compared to neural networks, SVM has the advantage of leading to a global model that is capable of efficiently dealing with high dimensional input vectors.1–3 SVMs have the additional advantage of being able to handle ill-posed problems and lead to global models that are often unique.3 Furthermore, due to their specific formulation, sparse solutions can be found in many cases. However, finding the final SVM model can be very difficult computationally because it requires quadratic programming and the solution to a set of nonlinear equations.3
First used as a classification methodology,55 SVM has been extended to regression tasks via two approaches: support vector regression (SVR)1 and least-squares support vector machines (LS-SVMs).3 Both will be discussed in our current study. See Section 3 for the basic theoretical concepts of both methods. It should be noted that support vector machines, unlike PLS and ANN regression methods, are still relatively unknown to scientists in the field of chemometrics.1–3
A number of studies dealing with SVM-based approaches for solving chemically or industrially important problems have been published in recent years.1–3,8,9,56–63 Unfortunately, none of them are sufficiently general; only a few sets of spectra (at most) are usually used in each case. So, it is currently difficult to draw any definite conclusions about the efficiency of SVR or LS-SVM and the potential for the application of these approaches in spectroscopic data analyses. Different studies report different accuracies for SVM- and ANN-based approaches that cannot be compared because of differences in experimental or computational methodologies.1–3,8,9,56–63 The role of SVM-based regression in the area of chemometrics and multivariate data analysis is still unclear.
In the current study, we try to make a rather general comparison of SVM-based regression models, SVR and LS-SVM, with linear (PLS), “quasi-nonlinear” (Poly-PLS), and nonlinear (ANN) regression methods. Due to our previous experiences4–7,53,54 and the great importance of this particular field, petroleum systems were chosen as a representative example of real-world samples. Five very different chemical systems were studied, differing in complexity, composition, structure, and properties; these systems are gasoline, ethanol–gasoline biofuel, diesel fuel, aromatic solutions of petroleum macromolecules, and petroleum resins in benzene. Fourteen different sample sets (“NIR spectrum—sample property”, see below) were used in total. We try to rule out factors that influence SVR/LS-SVM behavior (relative to PLS, Poly-PLS, and ANN) when dealing with spectroscopic data. General conclusions are made about the applicability of SVM-based regression tools in the modern analytical chemistry of petroleum and its products.
| Petroleum system | Property | Unit | Number of samples | Property range, min. | Property range, max. | Reference method accuracy^e | Spectral range^f/cm−1, max. | Spectral range^f/cm−1, min. |
|---|---|---|---|---|---|---|---|---|
| Gasoline^a | Density at 20 °C | kg m−3 | 95 | 640 | 800 | 0.5 | 14 000 | 8000 |
| | Initial boiling point (IB) | °C | 95 | 35 | 59 | 1–5 | 14 000 | 8000 |
| | End boiling point 10% v/v (T10) | °C | 95 | 58 | 117 | 1–5 | 14 000 | 8000 |
| | End boiling point 50% v/v (T50) | °C | 95 | 93 | 128 | 1–5 | 14 000 | 8000 |
| | End boiling point 90% v/v (T90) | °C | 95 | 121 | 175 | 1–5 | 14 000 | 8000 |
| | Final boiling point (FB) | °C | 95 | 178 | 205 | 1–5 | 14 000 | 8000 |
| | Benzene content^b | % w/w | 57 | 0 | 10 | 0.10–0.25 | 13 500 | 8500 |
| Biofuel: ethanol–gasoline^b | Density at 20 °C | kg m−3 | 117 | 672 | 785 | 0.5 | 13 500 | 8500 |
| | Ethanol content^b | % w/w | 75 | 0 | 15 | 0.05^b | 13 500 | 8500 |
| Diesel fuel | Total sulfur content | ppm | 125 | 303 | 5100 | 2–20 | 11 000 | 4000 |
| Petroleum macromolecules^c | Asphaltene content | % w/w | 120 (80) | 0 | 10 | 0.01^c | 14 000 | 8000 |
| | Resin content | % w/w | 120 (80) | 0 | 30 | 0.01^c | 14 000 | 8000 |
| | Paraffin content | % w/w | 120 (80) | 0 | 10 | 0.01^c | 14 000 | 8000 |
| Petroleum resins in benzene^d | Resin content | mg L−1 | 105 (54) | 0 | 6000 | 1.1^d | 13 000 | 9000 |

^a Ref. 4. ^b Ref. 6. ^c Ref. 5. ^d Ref. 7. ^e Ref. 74. ^f The range of [14 000; 8000] cm−1 refers to [714; 1250] nm.
NIR spectra of diesel fuel were collected using an MPA Multi Purpose FT-NIR Analyzer (Bruker) at room temperature. The MPA NIR spectrometer was calibrated with benzene and cyclohexane (c-C6H12) at least twice per day to minimize the influence of variable laboratory conditions. The spectral range between 11 000 and 4000 cm−1 (909–2500 nm) was scanned with an 8 cm−1 resolution. Sixty-four scans were averaged for each spectrum. A background spectrum was measured every 45 min. A cylindrical glass cell with an 8 mm optical path was used throughout this study. Approximately 1 mL of diesel sample was required for each NIR measurement, much less than the 200 mL needed for distillation analysis to determine the fractional composition.64 The NIR spectrum collection was repeated five times, with the cell rotated inside the spectrometer between repetitions, to minimize interference from cell or glass defects. Measurement of one sample took less than five minutes. The averaged and background-corrected spectra were used for subsequent data pre-processing.
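The wavenumber and wavelength ranges quoted for the NIR measurements are related by λ (nm) = 10^7 / ν (cm−1). A minimal sketch of that conversion follows; the function name is illustrative and not taken from the original work.

```python
def wavenumber_to_wavelength_nm(wavenumber_cm1: float) -> float:
    """Convert a wavenumber in cm^-1 to a wavelength in nm (lambda = 1e7 / nu)."""
    if wavenumber_cm1 <= 0:
        raise ValueError("wavenumber must be positive")
    return 1.0e7 / wavenumber_cm1


# Diesel fuel scan limits quoted in the text: 11 000 and 4000 cm^-1
diesel_range_nm = [round(wavenumber_to_wavelength_nm(v)) for v in (11000, 4000)]
print(diesel_range_nm)  # [909, 2500]
```

The same relation reproduces the [714; 1250] nm range given for the 14 000–8000 cm−1 window.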
See ref. 4–7 for experimental spectra examples and their discussion.
The mean absolute percentage error (MAPE) was also calculated to estimate the relative accuracy of each calibration model. This is especially important for properties with a large range, such as sulfur in diesel fuel. See ref. 4–6 for the exact formulas and extra discussion.
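For orientation, the two error measures can be sketched with their common textbook definitions. This is an assumption: the exact formulas used in the study are those given in ref. 4–6 and may differ in detail, for example in how zero reference values are treated.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between reference and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mape(y_true, y_pred):
    """Mean absolute percentage error; undefined when a reference value is zero."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined for zero reference values")
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(ratios)
```

Applied to a test set, `rmse` gives the RMSEP; applied to held-out cross-validation predictions, it gives the RMSECV.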
Five-fold or ten-fold cross-validation was used to optimize the model's parameters based on the root mean squared error of cross-validation (RMSECV). It was checked that the cross-validation set consisted of samples from the entire property range. Other variants of the cross-validation procedure, e.g., 7-fold version, leave-one-out cross-validation (LOOCV), were checked and found to produce almost identical results.
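One simple way to make every cross-validation fold span the entire property range is to sort the samples by reference value and deal them into folds in turn. The sketch below illustrates that idea only; it is an assumed scheme, not the procedure actually used by the authors, and the function names are hypothetical.

```python
def range_spanning_folds(y, k):
    """Assign sample indices to k folds so that each fold covers the whole range of y."""
    if not 2 <= k <= len(y):
        raise ValueError("k must be between 2 and the number of samples")
    order = sorted(range(len(y)), key=lambda i: y[i])  # indices sorted by property value
    folds = [[] for _ in range(k)]
    for rank, idx in enumerate(order):
        folds[rank % k].append(idx)  # deal sorted samples round-robin
    return folds


def cv_splits(y, k):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    for held_out in range_spanning_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        yield train, held_out


# Hypothetical property values, e.g. densities in kg m^-3
y_demo = [640 + 8 * i for i in range(20)]
folds_demo = range_spanning_folds(y_demo, 5)
```

The RMSECV for a given parameter setting is then the RMSE over the held-out predictions pooled across all folds.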
In all cases, a negligible difference between the RMSECV and RMSEP of the PLS, Poly-PLS, and ANN methods was found, as discussed in ref. 4–7. Using either of them does not change the conclusions drawn here. This finding is not general: there are many cases, even among petroleum systems, where the RMSECV and RMSEP results can be quite different. For SVM-based methods, the prediction error was calculated.
Note that one needs to use the same dataset division for unbiased comparison with previously published results.4–7
There are, of course, some reservations about using cross-validation methods for optimizing regression models based on support vector machines. It is arguable that SV-type models cannot be compared as directly as PLS-type models. There are a number of reasons for this. First, some samples (those that are not SVs) do not contribute to the model, so removing them makes no difference to the final prediction of, e.g., SVR. This is a complicated issue: removing too many samples may change the set of SVs, but removing a single non-SV sample usually leaves the final model unchanged. Second, some parameters, such as the error penalty term (C or γ), have a “quantized” effect on the model; that is, a range of C values will result in an identical model. Neither of these issues arises when optimizing the PLS model.
See ref. 4–7 for a detailed discussion of the outlier detection scheme for each particular petroleum system. In general, all results are reported for outlier-free sample sets. Note that for traditional statistical methods (such as PLS), it is sometimes important to perform outlier detection prior to modeling, as outliers can have a huge influence on least squares approaches. For SVR, however, this is not always necessary, because its behavior with respect to outliers can be controlled by the error penalty term. SVR can therefore handle datasets with extreme outliers on which some other approaches fail. Here we do not discuss the robustness of the techniques with respect to outliers; that is why the errors are reported for outlier-free sample sets.
The optimized parameters for each model are as follows:
PLS: number of latent variables (LV);
Poly-PLS: LV and degree of polynomial (n);
ANN/MLP: number of input neurons (IN; equal to number of principal components, PC), number of hidden neurons (HN), and transfer function of hidden layer: f(x) = {logsig}; {tansig/tanh}. Detailed procedures for ANN training can be found in ref. 4. See, for example, Table 4 in ref. 4 for the ANN training procedure for gasoline data.
SVR: the error weight (C), maximal error value (ε), and kernel-related parameters. The same set of kernels (linear, polynomial, and radial basis function (RBF)) was used for SVR and LS-SVM model building. See Table 4 in ref. 24 for a detailed list of parameters. See Section 3 for the parameter definitions and other clarifications.
LS-SVM: the regularization parameter (γ), determining the trade-off between the fitting error minimization and the smoothness of the estimated function, and the kernel-related parameters (e.g., σ or σ2 for the RBF kernel, Table 2). See Section 3 for the parameter definitions and other clarifications.
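To make the ANN/MLP parameters in the list above concrete, the following sketch builds a single-hidden-layer network with IN inputs, HN hidden neurons, and one output, and evaluates it for one sample. It is illustrative only: the linear output neuron, the uniform weight range, and all function names are assumptions, and no training step is shown.

```python
import math
import random


def logsig(x):
    """Logistic sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


def init_mlp(n_inputs, n_hidden, seed):
    """Random initial weights for an IN - HN - 1 network (last entry of each row is the bias)."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)] for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(weights, scores, transfer=math.tanh):
    """Propagate one vector of PC scores through the network (linear output assumed)."""
    hidden, output = weights
    activations = [
        transfer(w[-1] + sum(wi * xi for wi, xi in zip(w, scores))) for w in hidden
    ]
    return output[-1] + sum(wo * a for wo, a in zip(output, activations))


# Hypothetical PC scores for a network with 7 inputs and 5 hidden neurons
pc_scores = [0.3, -1.2, 0.8, 0.1, -0.4, 0.9, -0.7]
prediction_a = forward(init_mlp(7, 5, seed=1), pc_scores)
prediction_b = forward(init_mlp(7, 5, seed=2), pc_scores)
```

Different seeds give different starting weights and therefore different outputs, which is the source of the dependence on initial parameters noted for ANN training.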
| Petroleum system | Property | Unit | PLS: LV^g | PLS: RMSEP | Poly-PLS: LV^g | Poly-PLS: n^e,g | Poly-PLS: RMSEP |
|---|---|---|---|---|---|---|---|
| Gasoline^a | Density at 20 °C | kg m−3 | 10 | 2.8 | 9 | 3 | 2.4 |
| | Initial boiling point (IB) | °C | 10 | 2.0 | 15 | 5 | 1.6 |
| | End boiling point 10% v/v (T10) | °C | 9 | 2.2 | 9 | 4 | 1.8 |
| | End boiling point 50% v/v (T50) | °C | 12 | 2.4 | 14 | 3 | 1.9 |
| | End boiling point 90% v/v (T90) | °C | 18 | 2.8 | 14 | 5 | 2.2 |
| | Final boiling point (FB) | °C | 19 | 2.8 | 18 | 3 | 2.1 |
| | Benzene content^b | % w/w | 5 | 0.87 | 5 | 2 | 0.85 |
| Biofuel: ethanol–gasoline^b | Density at 20 °C | kg m−3 | 11 | 2.70 | 9 | 3 | 2.40 |
| | Ethanol content^b | % w/w | 5 | 0.22 | 3 | 2 | 0.22 |
| Diesel fuel | Total sulfur content | ppm | 6 | 344 | 6 | 3 | 341 |
| Petroleum macromolecules^c | Asphaltene content | % w/w | 5 | 0.41 (0.43)^f | 5 | 2 | 0.25 |
| | Resin content | % w/w | 3 | 0.79 (0.79)^f | 5 | 2 | 0.71 |
| | Paraffin content | % w/w | 6 | 0.35 (0.39)^f | 6 | 2 | 0.35 |
| Petroleum resins in benzene^d | Resin content | mg L−1 | 3 | 2.1 (2.1)^f | 2 | 2 | 2.1 |

^a Ref. 4. ^b Ref. 6. ^c Ref. 5. ^d Ref. 7. ^e Also known as ‘D’ in Ref. 4. ^f The second number (in parentheses) refers to the smaller sample set, see Table 1. ^g The optimal values were determined by the RMSECV minimization.
RBF kernels (default) were found to produce the lowest prediction errors in all cases studied. However, the SVM-based methods were found not to be very sensitive to kernel choice; in many cases, polynomial kernels produced results very close to those of RBF kernels (compare with ref. 1–3).
Note that Spline-PLS, being a very time-consuming technique, has not shown any considerable superiority over the Poly-PLS method for petroleum system analysis.4 This is why it was not used in the current study.
So, the regression methods were optimized based on a cross-validation procedure and tested using fully independent test (validation) sets (see also above).
The four systems presented here range greatly in composition, properties, and behavior. While low molecular weight substances having 6–12 carbon atoms with low intermolecular forces (n-hexane, heptane isomers, isooctane, etc.) form gasoline,13 heavy (above 500 Da) molecules with a high tendency toward aggregation and phase separation, like resins and asphaltenes, are found in the last two systems.71 The number of effective components ranges from one in petroleum resins to millions. Therefore, rather general conclusions about algorithm behavior can be made based on the systems studied.
The fourteen properties of the four petroleum systems described above form fourteen sample sets that are very different in nature (Table 1). For gasoline, these are the density at 20 °C, fractional composition (including initial boiling point (IB), end boiling points 10%, 50%, and 90% v/v (T10, T50, and T90, respectively), and final boiling point (FB)) and finally benzene content. For ethanol–gasoline fuel, these sample sets are based on density at 20 °C and ethanol content—[EtOH]. For diesel fuel, the sample set is based on the total sulfur content ([Sulfur]). For petroleum macromolecules, the sets are asphaltene content ([A]), resins content ([R]), and paraffins content ([P]). Finally, for petroleum resins the relevant sample set is the resin concentration in benzene ([R]).
Note that the quality (accuracy, repeatability, and reproducibility) of reference data ranges greatly from one property to another (Table 1). It is important to estimate the effect of initial data quality on final prediction results. The same can be said about property ranges; some are rather limited (e.g., T50), some are very broad (e.g., [Sulfur] or [R]). In industrial applications it is usually impossible to model the quality (in either accuracy or range) of datasets. Therefore, the machine learning algorithms that show very good, even brilliant, results on model systems do not always show the same results when applied to real-world problems.46,52 In this work we have tried to use wide ranges of reference data quality to help make our conclusions as general as possible.
The spectroscopic information for most sample sets (Table 1) was recorded in the short-wave part of the NIR region (above 8000 cm−1). This is the region with the second to fifth overtones of characteristic molecular vibrations observed by standard IR and Raman techniques.14,26,51 The only exception is the diesel fuel sample set, whose spectrum lies in the 4000–11 000 cm−1 region. In this particular case, it was important to get information from the long-wave part of the NIR spectrum due to the necessity of predicting the sulfur concentration in diesel samples.14
The number of samples in the sample sets ranged from 57 to 125 (Table 1). Since the number of samples can influence the quality of the multivariate model prediction, we tried to ensure that sample set saturation was observed at least in the case of the simplest (PLS) method, similar to the basis set limit (BSL) or complete basis set (CBS) methods in quantum chemistry.50,72,73 Table 2 shows some representative examples of varying the number of training examples.
The principles of SVM can easily be extended to regression tasks. For a detailed theoretical background on SVMs for both classification and regression, the reader is referred to ref. 1–3, 52 and 55. No equations will be used in the following text; see ref. 1–3 for all necessary equations and formalism.
Similar to the approach of ordinary least squares (OLS) and PLS, SVR also finds a linear relation between the regressors (input variables, X) and the dependent variables (y).1 The cost function (the function that is minimized to obtain the best regression model) consists of a two-norm penalty on the regression coefficients, an error term multiplied by the error weight, C, and a set of constraints. Using this cost function, the goal is to simultaneously minimize both the coefficients' size and the prediction errors (function smoothness and accuracy). The first point is important because large coefficients might hamper generalization due to their tendency to cause excessive variance.1
In SVR, the prediction errors are penalized linearly, except for deviations below a certain value, ε, in accordance with Vapnik's ε-insensitive loss function. Only predictions deviating by more than ε (|y − ypred| > ε, where ypred is the SVR model prediction) are taken into account. The objects with prediction errors larger than ε are called “support vectors”, and only these vectors determine the final prediction of the SVR model. Because only the inner product is used in all calculations, it is possible to use kernel functions, or kernels, that enable nonlinear regression in a very efficient way. The values of ε and the parameter C have to be defined by the user; both are problem- and data-dependent.1,55
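The ε-insensitive loss described above can be written in a few lines. This is a minimal sketch of the loss function only, not of SVR training; the function names and the numerical values are illustrative.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the epsilon tube, linear in the excess deviation outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


def outside_tube(y_true, y_pred, epsilon):
    """Indices of samples whose prediction error exceeds epsilon."""
    return [
        i for i, (t, p) in enumerate(zip(y_true, y_pred)) if abs(t - p) > epsilon
    ]


# Hypothetical reference values and predictions
y_ref = [10.0, 10.0, 10.0]
y_hat = [10.3, 11.0, 8.0]
penalised = outside_tube(y_ref, y_hat, epsilon=0.5)
```

With ε = 0.5, the first sample incurs no penalty, while the second and third are penalized in proportion to how far they fall outside the tube.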
The underlying idea of the LS-SVM method is very close to that of SVR, but in this case the more usual sum of the squares of the errors is minimized, and no ε-based selection is made between samples. This is a general feature of least-squares (LS) methods.3 This can make the final model more accurate and less computationally expensive; see ref. 3 for extra details. Parameter γ, the analog of parameter C in the SVR model, controls the smoothness of the fit.
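Because LS-SVM uses a squared-error term, training reduces to solving one set of linear equations. The sketch below follows the commonly cited LS-SVM regression formulation with an RBF kernel, in which the bias b and coefficients α solve [[0, 1ᵀ], [1, K + I/γ]] [b; α] = [0; y]. It is an illustration under that assumption, written in pure Python with hypothetical function names, and is not the implementation used in this study.

```python
import math


def rbf_kernel(a, b, sigma):
    """Gaussian RBF kernel between two feature vectors."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def solve_linear_system(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit_ls_svm(X, y, gamma, sigma):
    """Return (bias, alphas) obtained from the LS-SVM linear system."""
    n = len(y)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # ridge term arising from the squared-error penalty
        A.append(row)
    solution = solve_linear_system(A, [0.0] + list(y))
    return solution[0], solution[1:]


def predict_ls_svm(X_train, bias, alphas, sigma, x_new):
    """Evaluate the fitted model at a new point."""
    return bias + sum(a * rbf_kernel(x_new, xi, sigma) for a, xi in zip(alphas, X_train))


# Toy nonlinear data (y = x^2), purely illustrative
X_demo = [[0.0], [1.0], [2.0], [3.0]]
y_demo_ls = [0.0, 1.0, 4.0, 9.0]
bias_demo, alphas_demo = fit_ls_svm(X_demo, y_demo_ls, gamma=1e6, sigma=1.0)
```

Every training sample receives a coefficient here, which reflects the absence of ε-based selection; a smaller γ gives a smoother, less closely fitting function.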
So, setting aside kernel-specific parameters, the error weight (C) and maximal error value (ε) were optimized for SVR, and the regularization parameter (γ) was optimized for LS-SVM.
As described above, SVM-based regression techniques solve many of the intrinsic ANN problems, such as its stochastic nature, the necessity to repeat network training many times, and the non-uniqueness of the final ANN solution. This makes SVR and LS-SVM interesting and promising alternatives to ANN. Note that the most important advantage, namely the possibility of building a nonlinear model, is still valid in the SVM regression case. Here we will try to understand the extent to which SVM-based techniques can replace ANN-based approaches in real-world (industrial) applications. Are SVR and LS-SVM models accurate enough to really be regarded as alternatives to neural networks?
The structure of PLS-based models, namely the number of latent variables and the degree of the polynomial, is in line with previous results for petroleum systems. The general trend is that the more complicated the quality (that is, the greater the nonlinearity), the greater the number of latent variables needed to extract all necessary information and to take into account the deviation from linear spectrum–property dependence (Table 2).
Note that in all cases, the Poly-PLS approach shows an RMSEP that is not worse than that of the linear PLS analog.4,6,45 In other words, for all petroleum systems under study, some kind of nonlinearity was observed and modeled with differing success by the Poly-PLS model.4 The only property for which the Poly-PLS approach was really effective was the asphaltene content, for which the RMSEP was decreased by almost 40%. Almost no effect was observed for benzene, ethanol, and sulfur contents, where the RMSEP was decreased by only 2 ± 1%. Therefore, the Poly-PLS approach is not the best model for increasing the accuracy of the calibration model, even though some effect (∼10%) can be observed in a number of cases.43,44
| Petroleum system | Property | Unit | ANN (MLP)^e: IN (PC)^f,g | ANN (MLP)^e: HN^g | ANN (MLP)^e: RMSEP |
|---|---|---|---|---|---|
| Gasoline^a | Density at 20 °C | kg m−3 | 10 | 7 | 2.0 |
| | Initial boiling point (IB) | °C | 16 | 8 | 1.3 |
| | End boiling point 10% v/v (T10) | °C | 19 | 6 | 1.4 |
| | End boiling point 50% v/v (T50) | °C | 15 | 9 | 1.6 |
| | End boiling point 90% v/v (T90) | °C | 14 | 9 | 1.7 |
| | Final boiling point (FB) | °C | 18 | 7 | 1.7 |
| | Benzene content^b | % w/w | 12 | 5 | 0.58 |
| Biofuel: ethanol–gasoline^b | Density at 20 °C | kg m−3 | 9 | 7 | 1.90 |
| | Ethanol content^b | % w/w | 8 | 5 | 0.13 |
| Diesel fuel | Total sulfur content | ppm | 7 | 5 | 155 |
| Petroleum macromolecules^c | Asphaltene content | % w/w | 5 | 3 | 0.15 |
| | Resin content | % w/w | 5 | 4 | 0.30 |
| | Paraffin content | % w/w | 5 | 3 | 0.13 |
| Petroleum resins in benzene^d | Resin content | mg L−1 | 3 | 2 | 1.9 |

^a Ref. 4. ^b Ref. 6. ^c Ref. 5. ^d Ref. 7. ^e ANN architecture is the following: IN − HN − 1; so, in the case of diesel fuel it will be “7 − 5 − 1”. ^f The (optimal) number of input neurons (IN) is equal to the (optimal) number of principal components (PC) used for principal component analysis (PCA) of near infrared spectra.4 Compare with LV in Table 2. ^g The optimal values were determined by the RMSECV minimization.
An average prediction error decrease relative to PLS of 41 ± 15% (±σ) was observed. The largest error decrease was observed for the asphaltene concentration (−63%), with resins and paraffin contents also showing large, and almost identical, decreases. This fact can be explained by the extremely high tendency of petroleum macromolecules to form dimers, oligomers, clusters, and aggregates.12,29,71 Even phase separation, or asphaltene onset, can easily be observed in many petroleum systems. This is the process that is responsible for many troubles in the petroleum industry, from crude oil production to refining and transportation.7,12,13,29 Since all of the described processes are concentration-dependent, a high degree of nonlinearity in spectrum–concentration dependence is expected. This leads to the need for nonlinear treatment of systems containing petroleum macromolecules (especially asphaltenes). ANN is the technique of choice in this case.
The absence of such a pronounced effect of ANN application for the solution of pure resins in benzene (−10% only) can be explained as follows. First, the system is simple and ANN is just not needed. Second, the PLS approach is itself highly accurate, close to the accuracy of the reference method (Table 1), and neither ANN nor any other multivariate method can do better than the reference data allow (see below).26,27
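The average decrease quoted above can be reproduced from the rounded RMSEP values listed in Tables 2 and 3. The short sketch below does this with the sample standard deviation; the choice of sample rather than population standard deviation is an assumption, and the variable names are illustrative.

```python
from statistics import mean, stdev

# RMSEP values in table order (gasoline density ... resins in benzene)
rmsep_pls = [2.8, 2.0, 2.2, 2.4, 2.8, 2.8, 0.87, 2.70, 0.22, 344, 0.41, 0.79, 0.35, 2.1]
rmsep_ann = [2.0, 1.3, 1.4, 1.6, 1.7, 1.7, 0.58, 1.90, 0.13, 155, 0.15, 0.30, 0.13, 1.9]


def relative_decrease_percent(reference, candidate):
    """Percentage by which the candidate error is lower than the reference error."""
    return 100.0 * (reference - candidate) / reference


decreases = [relative_decrease_percent(p, a) for p, a in zip(rmsep_pls, rmsep_ann)]
print(f"{mean(decreases):.0f} ± {stdev(decreases):.0f} %")  # 41 ± 15 %
```

The largest entry in `decreases` is the asphaltene content (about 63%) and the smallest is the resins in benzene set (about 10%), in agreement with the values discussed in the text.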
In general, one can state that the ANN approach is extremely efficient for analysis of NIR spectra of petroleum systems, regardless of boiling range or composition. Very different properties and quality coefficients of industrially important products can be accurately predicted by neural networks.4,46
| Petroleum system | Property | Unit | SVR: PC^e | SVR: RMSEP | LS-SVM: PC^e | LS-SVM: RMSEP |
|---|---|---|---|---|---|---|
| Gasoline^a | Density at 20 °C | kg m−3 | 5 | 2.0 | 6 | 2.0 |
| | Initial boiling point (IB) | °C | 8 | 1.4 | 8 | 1.3 |
| | End boiling point 10% v/v (T10) | °C | 8 | 1.4 | 8 | 1.4 |
| | End boiling point 50% v/v (T50) | °C | 7 | 1.5 | 7 | 1.6 |
| | End boiling point 90% v/v (T90) | °C | 10 | 1.8 | 9 | 1.8 |
| | Final boiling point (FB) | °C | 10 | 1.8 | 10 | 1.7 |
| | Benzene content^b | % w/w | 5 | 0.53 | 6 | 0.58 |
| Biofuel: ethanol–gasoline^b | Density at 20 °C | kg m−3 | 7 | 1.91 | 6 | 1.92 |
| | Ethanol content^b | % w/w | 5 | 0.14 | 6 | 0.16 |
| Diesel fuel | Total sulfur content | ppm | 7 | 136 | 7 | 131 |
| Petroleum macromolecules^c | Asphaltene content | % w/w | 4 | 0.15 | 4 | 0.13 |
| | Resin content | % w/w | 6 | 0.29 | 4 | 0.26 |
| | Paraffin content | % w/w | 5 | 0.12 | 5 | 0.12 |
| Petroleum resins in benzene^d | Resin content | mg L−1 | 3 | 2.3 | 3 | 2.0 |

^a Ref. 4. ^b Ref. 6. ^c Ref. 5. ^d Ref. 7. ^e The optimal values were determined by the RMSECV minimization.
One can see that, in general, both SVR and LS-SVM models show results not worse than those of ANN models. In cases of [Sulfur] prediction and petroleum macromolecules analysis, the SVM-based regression models have lower prediction error (−15 ± 1%) than ANN models. Good results are also shown by the SVR model for benzene concentration prediction (−9%). For T90, [EtOH] and petroleum resins in benzene, SVM regression models have higher RMSEP than ANN models (by 7%, 11%, and 7%, respectively). Note that in the last case all the methods show approximately the same results (±8%), so these data are not that representative (Table 4). The cause for this could be the relative system simplicity. In the five other cases, the results of the SVM approach are very close to those of neural networks (±2%).
The difference between the SVR and LS-SVM results is small: −3 ± 7%, in favor of the LS-SVM regression model. A relatively significant difference (>10%) is observed for [Benzene], [A], [R] in toluene, and [R] in benzene. In the last three cases, the RMSEP of the LS-SVM model is lower.
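The per-dataset comparison between SVR and LS-SVM can be sketched directly from the RMSEP columns of Table 4. The snippet below is only an illustration: the choice of SVR as the reference (denominator) is an assumption, since the exact formula behind the quoted percentages is not stated here, and because the table entries are rounded the recomputed values may differ somewhat from the figures given in the text. The dictionary keys are shorthand labels, not names used by the authors.

```python
# RMSEP values transcribed from Table 4 as (SVR, LS-SVM) pairs.
# Keys are shorthand labels chosen for this sketch.
TABLE4_RMSEP = {
    "Gasoline density": (2.0, 2.0),
    "IB": (1.4, 1.3),
    "T10": (1.4, 1.4),
    "T50": (1.5, 1.6),
    "T90": (1.8, 1.8),
    "FB": (1.8, 1.7),
    "[Benzene]": (0.53, 0.58),
    "Biofuel density": (1.91, 1.92),
    "[EtOH]": (0.14, 0.16),
    "[Sulfur]": (136.0, 131.0),
    "[A]": (0.15, 0.13),
    "[R]": (0.29, 0.26),
    "[P]": (0.12, 0.12),
    "[R] in benzene": (2.3, 2.0),
}


def relative_difference(rmsep_svr, rmsep_ls_svm):
    """Percent change in RMSEP when going from SVR to LS-SVM.

    Negative values favor LS-SVM. Using SVR as the denominator is an
    assumption of this sketch.
    """
    return 100.0 * (rmsep_ls_svm - rmsep_svr) / rmsep_svr


differences = {
    name: relative_difference(svr, ls_svm)
    for name, (svr, ls_svm) in TABLE4_RMSEP.items()
}

for name, diff in differences.items():
    print(f"{name:18s} {diff:+6.1f}%")
```

A negative entry means that LS-SVM has the lower RMSEP for that dataset; identical table entries give exactly zero.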
Based on the data in Table 4, one can claim that both SVM-based methods are very effective for building calibration models (compare with Table 2). Both methods are recommended for the analysis of petroleum products and biofuels (compare with Fig. 2 in ref. 3). Mostly for computational reasons, the LS-SVM regression model is preferred. This conclusion supports the earlier analysis by Buydens and co-workers3 of NIR spectra affected by temperature-induced spectral variation. Further support for LS-SVM comes from the evidence that it yields models that are robust to spectral variations caused by nonlinear interferences.3
Fig. 1 Correlation between the decrease in relative error (%) using ANN and SVM (LS-SVM) regression methods: (x-axis) 100% × (RMSEP_ANN − RMSEP_PLS)/RMSEP_PLS; (y-axis) 100% × (RMSEP_LS-SVM − RMSEP_ANN)/RMSEP_ANN. Note the use of fourteen (14) different datasets.
This observation can be explained by the fact that the ANN method tends to overfit highly nonlinear objects. This behavior can significantly lower the generalization ability of the network. The same is not observed for the LS-SVM calibration model.
Note that the point with the smallest absolute x-value in Fig. 1 corresponds to the resins-in-benzene dataset (see also the Discussion above).
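The two axes of Fig. 1 follow directly from the formulas in its caption. As a minimal sketch, the function below evaluates both coordinates for one dataset; the function name and the numerical inputs are hypothetical and serve only to show the arithmetic, not to reproduce any point of the figure.

```python
def fig1_coordinates(rmsep_pls, rmsep_ann, rmsep_ls_svm):
    """Return the (x, y) pair defined in the Fig. 1 caption, in percent.

    x: relative change in RMSEP going from PLS to ANN.
    y: relative change in RMSEP going from ANN to LS-SVM.
    """
    x = 100.0 * (rmsep_ann - rmsep_pls) / rmsep_pls
    y = 100.0 * (rmsep_ls_svm - rmsep_ann) / rmsep_ann
    return x, y


# Hypothetical RMSEP values, chosen only to illustrate the calculation.
x, y = fig1_coordinates(rmsep_pls=2.0, rmsep_ann=1.5, rmsep_ls_svm=1.2)
print(f"x = {x:.1f}%, y = {y:.1f}%")
```

With these definitions, a small absolute x-value means that the ANN changes little relative to PLS for that dataset, and a negative y means that LS-SVM has a lower RMSEP than the ANN.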
The maximum accuracy achieved by each technique is the main, but not the only, characteristic of a model's applicability to real-world (industrial) tasks. For example, one of the many benefits of the SVM approach is its deterministic nature. As a result, the range of prediction errors over different training/test splits is much smaller for the SVM-based techniques than for the top-20 ANNs: [130–133] vs. [147–281] ppm for diesel fuel analysis and [0.15–0.17] vs. [0.13–0.30] % w/w for [EtOH] in biofuel (LS-SVM vs. ANN, respectively), among others. In other words, one needs to repeat the ANN training many times to get a really accurate result.
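The spread of prediction errors over different training/test splits can be summarized as a [min–max] interval of RMSEP values. The sketch below assumes the usual definition of RMSEP (square root of the mean squared prediction error on the test set); the helper names and the toy predictions are hypothetical and do not come from the datasets of this study.

```python
import math


def rmsep(y_true, y_pred):
    """Root mean squared error of prediction for one test set."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def rmsep_range(runs):
    """Return (min, max) RMSEP over several runs.

    `runs` is a sequence of (y_true, y_pred) pairs, one pair per
    training/test split (or per repeated training of the same model).
    """
    errors = [rmsep(y_true, y_pred) for y_true, y_pred in runs]
    return min(errors), max(errors)


# Toy example: the same reference values predicted in two different runs.
reference = [10.0, 20.0, 30.0]
runs = [
    (reference, [11.0, 19.0, 31.0]),  # every prediction off by 1
    (reference, [12.0, 18.0, 32.0]),  # every prediction off by 2
]
low, high = rmsep_range(runs)
print(f"RMSEP range: [{low:.2f}-{high:.2f}]")
```

A narrow interval indicates that the result depends little on the particular split or training run, which is the sense in which the text compares the LS-SVM and ANN ranges.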
Fig. 2 Results of petroleum systems analysis by different multivariate techniques: LS-SVM vs. ANN and SVR vs. LS-SVM. Sample sets and properties: (top, from left to right) density: gasoline density at 20 °C; IB: initial boiling point; T10: end boiling point 10% v/v; T50: end boiling point 50% v/v; T90: end boiling point 90% v/v; FB: final boiling point; [Benzene]: benzene content in gasoline; (bottom, from left to right) density: ethanol–gasoline fuel density at 20 °C; [EtOH]: ethanol content; [Sulfur]: total sulfur content in diesel fuel; [A]: asphaltene content in petroleum macromolecule solution; [R]: resins content in petroleum macromolecule solution; [P]: paraffins content in petroleum macromolecule solution; [R]: petroleum resin concentration in benzene.7 Calibration models: PLS: partial least squares regression (projection to latent structures); Poly-PLS: polynomial partial least squares regression; ANNs: artificial neural networks (multilayer perceptron); SVR: support vector regression; LS-SVM: least-squares support vector machine regression. The root mean squared errors of prediction (RMSEP) are presented. The errors are normalized for comparison among different systems.
(1) Fourteen different sample sets were studied by linear (PLS), quasi-nonlinear (Poly-PLS), and three nonlinear (ANN, SVR, and LS-SVM) multivariate methods. NIR spectroscopy data were used in all cases.
(2) The accuracy of the SVM-based calibration models, SVR and LS-SVM, is comparable with the accuracy of the ANN-based approach.
(3) There is a correlation between the relative accuracies of the ANN- and SVM-based approaches.
(4) For highly nonlinear objects like petroleum macromolecules, SVM-based regression models are preferable to neural networks.
(5) Regression methods based on support vector machines are recommended for practical implementation. These models are sufficiently accurate and robust to be used for gasoline, biofuel, or diesel fuel analysis.
We hope that the role of SVM-based regression in chemometrics and multivariate data analysis is clearer after this study and that the possibilities of SVM-based approaches and obstacles to their application have become more evident to both analytical and industrial communities.
We believe that our results will help future chemometric investigations and investigations in the sphere of vibrational (IR, NIR, and Raman) spectroscopy of multicomponent systems.1–3,56–63,75–82 The results presented herein can help achieve rapid and accurate analysis or classification of biofuels, products of petroleum refining, and petrochemicals. The use of NIR spectroscopy in other fields of analytical chemistry, such as pharmaceutical quality control, food quality control, and active pharmaceutical ingredient/pharmacon (pharmakon) analysis of tablets, can be enhanced by the application of modern methods of multivariate data analysis, including support vector machines and artificial neural networks as well as other machine learning techniques.
This journal is © The Royal Society of Chemistry 2011