Ensemble-based support vector machine classifiers as an efficient tool for quality assessment of beef fillets from electronic nose data

Over the past years, the application of electronic nose devices has been investigated as a potential tool for assessing food freshness. This relies on the application of various pattern recognition methods to provide accurate classification and regression models. The accuracy of these models depends on the number of samples used during the training process. In food quality assessment, where the number of samples is typically fewer than 200 for a given experiment, this often leads to unstable and unreliable classifiers. The aim of this work is to tackle this problem through the development of a series of ensemble-based classifiers and regression models using support vector machines (SVMs) and electronic nose datasets based on the previously published work of this group. The developed ensemble provided a higher prediction accuracy than the single-model approach when estimating the freshness score assigned by the sensory panel, achieving an overall accuracy of 84.1% compared with 72.7% for the single classifier model. Another set of calibration ensembles was developed based on SVM regression in order to predict bacterial species counts, raising the average overall performance to 85.0%, compared with 76.5% when a single model was applied. This increase in predictive power suggests that combining an electronic nose with ensemble-based systems can be used as an innovative method to assess the freshness of beef fillets.


Introduction
The current practice of assessing food quality and safety relies heavily on regulatory inspection and sampling regimes. For example, according to EU authorities [1], the quality of fresh meat is evaluated only by viable counts of bacteria able to grow on a very generic medium or by counts of the Enterobacteriaceae family. It is well established that counting colonies is time-consuming and does not allow an online response, which would be needed to trigger appropriate corrective measures. Moreover, both the analysis of a limited number of samples and low counts can significantly underestimate the microbial contribution to meat quality, because the contribution of certain microbial taxa through growth and release of key spoilage molecules can be overlooked, with a consequent negative effect on spoilage prevention and handling by the major operators in the meat chain. The conventional approach described above seems inadequate because it cannot sufficiently guarantee consumer protection, since 100% inspection and sampling is technically, financially and logistically impossible. Instead, the meat industry needs rapid analytical methods or tools to determine and select suitable processing procedures for its raw material and to predict the remaining shelf life of its products. Furthermore, meat business operators in the wholesale and retail sectors need these methods to ensure the freshness and safety of their products and to resolve potential disputes between buyers and sellers. Tools and approaches are also desirable for the reliable indication of the safety and quality status of meat at retail and through to consumption. It is therefore crucial to have valid methods and tools to monitor freshness and safety so that consumers can be assured of quality.
Electronic noses (E-noses) are among the instruments that may be potentially useful to the meat industry. Technically, E-noses comprise an array of electronic chemical sensors with partial specificity in tandem with an appropriate pattern recognition system, allowing the recognition of simple or complex odours [2]. So far, these instruments have been applied in a diverse range of applications in the food industry, even on-line, such as process monitoring, shelf-life determination, spoilage evaluation, authenticity assessment, and quality control studies (for a comprehensive review see ref. 3). For instance, Hasan et al. [4] successfully deployed an E-nose to identify decayed products within meat products by identifying the smell signature of fresh beef mixed with decayed fish, and of fresh fish mixed with decayed beef. An E-nose has also been used to detect volatile compounds produced by foodborne bacteria in contaminated beef [5]. Further applications of E-noses include profiling of seasoning and grading in beef and chicken products [6], spoilage profiling in beef products [7], and discrimination between storage periods in cod-fish [8] and eggs [9]. The data generated by E-nose instruments are too abstract to be of use without some kind of processing to map them to commonly used freshness metrics such as microbiological counts or sensory scores. This mapping can be performed by the application of advanced statistical methods (partial least squares discriminant analysis [8], clustering algorithms [10], and other methods under the chemometrics banner) and machine learning methodologies (artificial neural networks [9] and support vector machines [11]).
One common problem often associated with machine learning classifiers is poor performance when tested against unseen data despite promising performance on the training set, which is usually an indication of model over-fitting. A good training performance of a given classifier does not necessarily mean a good generalization performance (i.e. performance of the classifier on data not seen during the training process). Furthermore, a set of classifiers with similar training performance may have different generalization performance. This variability becomes even more evident when the classifier's performance is evaluated against a new dataset generated by a different experiment [12]. For this reason, an ensemble of several classifiers has been shown in other applications to overcome this limitation by combining the different results (or votes) obtained by all the classifiers within the ensemble [13]. The overall performance of a given ensemble depends to a large extent on the quality of the training set and how representative it is of the field data.
Another problem that is often associated with electronic nose outputs, and with food spoilage datasets in general, is the limited number of available samples, usually not more than 150 samples per experiment. In the absence of an adequately sized training subset, resampling techniques can be applied in order to generate a series of random overlapping subsets, each of which can be used to train one classifier of the ensemble [12]. In this work, we present the first ensemble-based predictive tool for assessing the freshness of beef fillets using an electronic nose dataset. Generally speaking, the freshness of meat products is assessed using two methods. The first is based on a sensory score, assigned to a given sample by highly trained taste panels based on the perception of colour and smell before and after cooking [14]. The second is based on the enumeration of bacterial counts in a given sample as a quantitative indicator of spoilage. This includes total viable counts (TVC), Pseudomonas spp., Brochothrix thermosphacta, Enterobacteriaceae and lactic acid bacteria. For this purpose, two sets of ensembles were developed, based on our previously published data [11]. These are: a classification set to predict the sensory quality of fillets stored aerobically under different isothermal conditions (0, 4, 8, 12, and 16 °C), and a regression set of ensemble-based systems to estimate the microbial counts directly from the sensor array data (electronic nose). In this approach, we compare the predictive power of a set of single SVM models for predicting sensory score values and bacterial counts with that of a series of ensemble-based systems, each consisting of 200 individual models.

Experimental analyses
A detailed description of the microbiological analyses carried out in this work is presented elsewhere [15]. In brief, fresh beef fillets (M. longissimus dorsi, pH = 5.6) obtained from different carcasses were purchased from the Central Meat Market in Athens and transported under refrigeration to the laboratory within 30 min, then divided into portions of 50 g in a laminar flow cabinet and packed aerobically.
Samples were stored under controlled isothermal conditions at 0, 4, 8, 12, and 16 °C in high precision (±0.5 °C) incubators for up to 430 h, depending on storage temperature, until spoilage was pronounced. Total viable counts (TVC), Pseudomonas spp., Brochothrix thermosphacta, Enterobacteriaceae and lactic acid bacteria were enumerated in parallel with the sensory evaluation of beef fillets as reported elsewhere [16,17]. A three-class evaluation scheme was employed in this experiment. The first class (fresh) corresponded to acceptable meat quality and absence of off-flavours; the second class (semi-fresh) corresponded to the presence of slight off-flavours but not spoiled (still acceptable quality); and the third class (spoiled) corresponded to clear development of off-flavours (unacceptable quality). Semi-fresh was the first indication of meat spoilage (incipient spoilage), in which the sample was marginally accepted. Overall, 177 beef fillet samples were scored by the taste panel and discriminated into the defined groups as fresh (42), semi-fresh (63), and spoiled (72).
For electronic nose measurements, a gas sensor array system (LibraNose, Technobiochip, Napoli, Italy) was used to generate a chemical fingerprint of the volatile compounds of beef fillet samples during storage. The system is implemented with an array of 8 quartz crystal microbalance (QMB) non-selective sensors coated with different poly-pyrrole derivatives synthesized at Technobiochip. The active matrix (poly-pyrrole polymers) used to coat the quartz microbalance sensors of the LibraNose and the sensitivity of each of the 8 sensors to particular volatile compounds are reported elsewhere [18]. Further details on the LibraNose instrumentation and mode of action can be found elsewhere [19]. A schematic representation of the LibraNose system is provided in ref. 20.
For each measurement, a beef fillet sample of 5 g was introduced into a 100 ml glass jar and left at room temperature (20 °C ± 2 °C) for 15 min to enhance desorption of volatile compounds from the meat into the headspace. The headspace was then pumped over the sensors of the electronic nose, and the generated signal was recorded continuously in real time and stored on a laptop computer.

Ensemble-based support vector machines
Support vector machines (SVMs) are a relatively new tool, introduced by Vapnik [21], that has gained popularity over the past decade as a promising machine learning technique for pattern classification and regression problems. The SVM is a supervised learning method for object classification in n-dimensional hyperspace, in which advances in optimisation and generalisation methods are used to increase efficiency and prevent "over-fitting" [22]. SVMs can simultaneously minimise estimation errors and model dimensions [23]. More background details about SVMs can be found in ref. 22. Statistical analysis was performed using the open-source software environment R. SVM classification and regression models were developed using the R library "e1071". The library allows a modification of the original SVM classification approach to be applied to a multi-class problem (fresh, semi-fresh, and spoiled). In order to reduce the computation time needed to generate the classifier ensemble, the libraries "doMC" and "foreach" were deployed to allow parallel development of the ensemble models. Firstly, the function "registerDoMC()" is used to register the number of CPU cores that can be allocated for the analysis; the function "foreach" is then used to develop and optimise each ensemble model in parallel.
The rationale behind the ensemble-based method is to develop a group of classifiers where the final prediction output is the result of combining the individual predictions of all classifiers within the ensemble [24]. The key for an ensemble to predict more accurately than any of its individual members is to ensure that the classifiers are diverse and individually accurate [25]. For instance, consider an ensemble $\mathcal{E}$ of $T$ classifiers, where $\mathcal{E} = \{D_1, \ldots, D_T\}$, and an unseen test sample $x$. If there is no diversity among $D_1, \ldots, D_T$ and $D_1(x)$ is wrong, then $D_2(x)$ to $D_T(x)$ are also wrong. However, if the errors made by the different classifiers are uncorrelated, then when $D_1(x)$ is wrong and the majority of $D_2(x)$ to $D_T(x)$ are right, a majority voting system will classify $x$ correctly. This of course requires that the individual classifiers within $\mathcal{E}$ have good accuracy. If there are a total of $T$ classifiers for a $c$-class problem, the ensemble decision will be correct if at least $\lfloor T/c \rfloor + 1$ classifiers choose the correct class [12]. Assume that each classifier of $\mathcal{E}$ has a probability $p$ of choosing the right class; the number of correct classifiers then follows a binomial distribution, and the probability that at least $\lfloor T/c \rfloor + 1$ of the $T$ classifiers are correct is:

$$P = \sum_{k=\lfloor T/c \rfloor + 1}^{T} \binom{T}{k} p^{k} (1-p)^{T-k}$$

The general outline for developing a classifier ensemble is shown in Fig. 1.

2.2.1. Bagging. Bootstrap aggregating, or "bagging", is one of the earliest approaches for developing ensemble-based systems [26]. Bagging is an ensemble method that creates the classifiers for its ensemble by training each classifier on a random redistribution of the training set obtained by resampling. It therefore incorporates the benefits of both the bootstrap and the aggregating approaches [27]. Since its introduction, bagging has gained a lot of attention, mainly due to its simple implementation and good performance [12]. Diversity is achieved in bagging by resampling the training subset using bootstrapping: for each classifier within the ensemble $\mathcal{E} = \{D_1, D_2, \ldots, D_n\}$, a different training subset is drawn from the original training set using resampling with replacement, resulting in $n$ subsets. Each of the generated subsets is used to train one classifier within the ensemble. The developed ensemble is then used to predict a subset of unseen testing data, where the outputs of all classifiers within the ensemble are combined using an appropriate voting technique. Several approaches have been developed for voting aggregation, such as majority voting, weighted majority voting, naïve Bayes, and continuous counts. The bagging applied in this work used three of these: majority voting, weighted majority voting, and naïve Bayes.

Fig. 1 Flowchart showing the processes involved in developing a pattern recognition ensemble-based system. The experimental data are divided into training and testing subsets. Depending on the size of the training subset and the classification algorithm being applied, a suitable resampling technique (e.g. bootstrapping) is applied to produce overlapping random subsets of the training subset, one for each classifier. The ensemble is then used to classify the unseen testing subset samples, where the outputs of all classifiers forming the ensemble are fused by applying a suitable voting technique (e.g. majority voting, weighted majority voting and naïve Bayes).
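The binomial argument given earlier in this section can be checked numerically. The following is a minimal sketch rather than part of the published analysis (which was carried out in R); the function name `ensemble_success_probability` is hypothetical, and the calculation assumes, as the text does, that the classifiers err independently and share the same accuracy p.

```python
from math import comb, floor


def ensemble_success_probability(T: int, c: int, p: float) -> float:
    """Upper-tail binomial probability that at least floor(T/c) + 1 of T
    independent classifiers, each correct with probability p, choose the
    right class (hypothetical helper, for illustration only)."""
    k_min = floor(T / c) + 1  # smallest number of correct votes counted
    return sum(
        comb(T, k) * p**k * (1 - p) ** (T - k)
        for k in range(k_min, T + 1)
    )


# Three classifiers, two classes, each 60% accurate:
# P = 3 * 0.6^2 * 0.4 + 0.6^3 = 0.648
print(round(ensemble_success_probability(3, 2, 0.6), 3))
```

Under these assumptions the tail probability grows with T whenever p exceeds the voting threshold fraction, which is the motivation for building large ensembles from individually accurate and diverse members.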

Majority voting.
Majority voting is one of the oldest methods used for decision making and is considered to be the simplest way of fusing the ensemble votes [28]. The rationale behind majority voting is to take the ensemble output for a given sample $x$ to be the class that receives the maximum number of votes from the individual ensemble classifiers. Let us assume that the label outputs of the classifiers are given as $c$-dimensional binary vectors $[d_{i,1}, \ldots, d_{i,c}]^{T} \in \{0, 1\}^{c}$, $i = 1, \ldots, T$, where $d_{i,j} = 1$ if $D_i$ labels $x$ in class $\omega_j$, and 0 otherwise. The majority vote will result in an ensemble decision for class $\omega_k$ if:

$$\sum_{i=1}^{T} d_{i,k} = \max_{j=1,\ldots,c} \sum_{i=1}^{T} d_{i,j}$$

Voting ties are resolved arbitrarily. Despite the simplicity of implementing the majority voting concept, the method has one main drawback: it does not take into account the accuracy of the individual classifiers within the ensemble. This is a minor issue if the ensemble does not suffer from a large variation between individual performances and classifier accuracy is generally good. If this is not the case, however, the final accuracy of prediction will be affected, and it is more appropriate to apply the weighted output fusion method.
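The majority voting rule described above can be sketched in a few lines. This is an illustrative Python sketch, not the authors' R implementation; the function name `majority_vote` and the seeded tie-breaking are assumptions made for the example (the text only states that ties are resolved arbitrarily).

```python
import random
from collections import Counter


def majority_vote(labels, rng=None):
    """Return the class label that received the most votes.

    `labels` holds one predicted label per ensemble member for a single
    sample. Ties are broken at random among the tied classes; the seeded
    generator is an assumption made so the example is reproducible.
    """
    counts = Counter(labels)
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    if len(winners) == 1:
        return winners[0]
    rng = rng or random.Random(0)
    return rng.choice(winners)


votes = ["fresh", "spoiled", "spoiled", "semi-fresh", "spoiled"]
print(majority_vote(votes))  # spoiled
```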
2.2.3. Weighted majority voting. If the ensemble classifiers do not have similar prediction accuracy, it is more appropriate to give more voting weight to the classifiers with higher accuracy [28]. This approach is called weighted majority voting. Systems that combine resampling of the training data with weighted majority voting aggregation belong to a category of ensemble-based systems called "boosting".
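A minimal sketch of weighted vote fusion follows. It is illustrative only: the function name `weighted_majority_vote` is hypothetical, and using each classifier's validation accuracy as its weight is an assumption, since the weighting scheme is not specified in this section.

```python
from collections import defaultdict


def weighted_majority_vote(labels, weights):
    """Return the class with the largest total voting weight.

    labels[i] is the label predicted by classifier i and weights[i] its
    voting weight (assumed here to be a validation accuracy).
    """
    if len(labels) != len(weights):
        raise ValueError("labels and weights must have the same length")
    totals = defaultdict(float)
    for label, weight in zip(labels, weights):
        totals[label] += weight
    # Iterate over sorted labels so that exact ties resolve deterministically.
    return max(sorted(totals), key=totals.get)


# Two weak classifiers vote "fresh", one strong classifier votes "spoiled".
print(weighted_majority_vote(["fresh", "fresh", "spoiled"], [0.2, 0.2, 0.9]))
```

With equal weights the rule reduces to plain majority voting, so the two fusion methods differ only when the individual accuracies vary.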
2.2.4. Naïve Bayes. Naïve Bayes [29], which is also known as the "independence model" or "idiot's Bayes" [30], is an aggregation method that assumes that the classifiers are mutually independent [28]. The principle behind naïve Bayes is provided in the accompanying ESI S1.†

2.2.5. Boosting. The concept of boosting originates from an on-line learning algorithm named "Hedge(β)" [31] that allocates weights to a set of strategies to improve the outcome of a certain event. The idea is to assign higher weights to classifiers showing high accuracy during the training process and lower weights to classifiers with lower accuracy, thereby increasing the probability of a correct final output for the ensemble. Adaptive boosting, or "Adaboost" [31], is the most popular boosting technique available, and has been successfully applied since its introduction to improve classification performance [32-34]. Adaboost, like bagging, generates a set of models and combines the final ensemble output using weighted majority voting. However, individual Adaboost classifiers are developed by training a weak model using samples drawn from an updated distribution of the training data. This distribution update ensures that the samples misclassified by the previous classifier are more likely to be included in the training data of the next classifier. Hence, the training data of consecutive classifiers are geared towards increasingly hard-to-classify instances [12]. Adaboost was initially developed to solve a binary class problem, and was then extended to multiple classes. Adaboost.M1 is the most straightforward multi-class extension of Adaboost.
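The distribution update at the heart of Adaboost.M1 can be illustrated with a single boosting round. The sketch below follows the standard formulation of the algorithm (weighted error ε, β = ε/(1 − ε), correctly classified samples down-weighted by β, voting weight log(1/β)); it is not the authors' code, which is given in ESI S2, and the function name `adaboost_m1_update` is hypothetical.

```python
import math


def adaboost_m1_update(dist, correct):
    """One Adaboost.M1 round for a classifier that has already been trained.

    dist    : current sampling weights of the training samples (sum to 1)
    correct : True where the classifier labelled the sample correctly
    Returns the updated distribution and the classifier's voting weight.
    """
    # Weighted error: total weight of the misclassified samples.
    eps = sum(w for w, ok in zip(dist, correct) if not ok)
    if not 0.0 < eps < 0.5:
        raise ValueError("Adaboost.M1 requires a weighted error in (0, 0.5)")
    beta = eps / (1.0 - eps)
    # Down-weight correctly classified samples, then renormalise, so that
    # misclassified samples are more likely to be drawn in the next round.
    new_dist = [w * beta if ok else w for w, ok in zip(dist, correct)]
    total = sum(new_dist)
    return [w / total for w in new_dist], math.log(1.0 / beta)


dist, vote_weight = adaboost_m1_update([0.25] * 4, [True, True, True, False])
print(dist, vote_weight)
```

In this toy round the single misclassified sample ends up carrying half of the total weight, which shows how consecutive classifiers are steered towards the hard-to-classify instances.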

Models implementation
2.3.1. Single classier. The rst classication is obtained using a single radial SVM classier. The model is initially optimised in order to identify the best regulization parameter C for the training criterion and the bandwidth g of the Gaussian kernel. 35 In order to achieve this, a grid search is performed using the parameter ranges C ¼ [1, 2, 3, ., 30] and g ¼ [0.1, 0.2, 0.3, ., 5]. The entire dataset Z is rst divided into training T and testing subset S on a 3 : 1 ratio respectively. The training subset is divided further into training Ts and testing Ss subsets using the same ratio (3 : 1). For each C and g parameter combination, the Ts subset is used to train the SVM model while the accuracy is measured by classifying the Ss subset.
The same approach was followed for developing a series of regression SVM calibration models. SVM regression (SVM-R) models were built in an attempt to correlate the population of selected microbial groups, namely total viable counts (TVC), Pseudomonas spp., B. thermosphacta, Enterobacteriaceae, and lactic acid bacteria, to the responses of the electronic nose sensors. In this case, the signals of the sensors were used as input variables in the SVM regression models and the output was the counts of each individual microbial group. For a given regression problem, the goal of SVM is to nd the optimal hyper-plane from which the distance to all the data points is minimum. The kernel function type selected in the development of SVM regression models was also the radial basis function (RBF).

Ensemble-based systems
Bagging. The overall implementation of the bagging approach is shown in Fig. 2. Firstly, the original dataset Z is split into a training subset T and a testing subset S. The training subset T is divided further into training Ts and testing Ss subsets. Ts is then bootstrapped into a subset of 330 samples using resampling, and a grid search using the parameter ranges C = [1, 2, 3, ..., 30] and γ = [0.1, 0.2, 0.3, ..., 5] is performed in order to identify the optimum C and γ values. The final SVM classifier is then developed using these parameters before being added to the ensemble. This procedure is repeated 200 times until the entire ensemble is generated. The ensemble is then used to label the samples of the unseen testing subset S. The final ensemble classification is calculated using a voting aggregation method. In this work, the voting output fusion methods applied were majority voting, weighted majority voting, and naïve Bayes.
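The data-handling part of this procedure can be sketched as follows. This is a hedged Python illustration using sample indices only; the helper names `split_3_to_1` and `bootstrap` are hypothetical, the rounding of the 3 : 1 split is an assumption, and the SVM fitting and grid search that would follow each bootstrap draw are omitted.

```python
import random


def split_3_to_1(items, rng):
    """Shuffle and split into two parts in an approximate 3 : 1 ratio."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = (3 * len(shuffled)) // 4
    return shuffled[:cut], shuffled[cut:]


def bootstrap(items, size, rng):
    """Draw `size` items with replacement (a bootstrap resample)."""
    return [rng.choice(items) for _ in range(size)]


rng = random.Random(42)
Z = list(range(177))                    # indices of the 177 scored samples
T_train, S_test = split_3_to_1(Z, rng)  # outer split: S is never used in training
Ts, Ss = split_3_to_1(T_train, rng)     # inner split used by the grid search

# One bootstrap subset of 330 samples per ensemble member (200 members).
subsets = [bootstrap(Ts, 330, rng) for _ in range(200)]
print(len(T_train), len(S_test), len(Ts), len(Ss), len(subsets))
```

Because each subset is drawn with replacement from the same pool, the 200 subsets overlap but differ, which is the source of diversity among the ensemble members.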
Boosting: Adaboost.M1. The initial splitting of the data into training and testing subsets for Adaboost.M1, as well as the classifier optimisation procedure applied, is similar to the approach followed for bagging as described in Fig. 2. The algorithm developed to generate the ensembles is provided in the accompanying ESI S2.†

Performance metrics. The performance of the classification and regression models developed was assessed in terms of accuracy of prediction. In the case of the sensory score classification models, the performance was obtained by calculating the percentage of correctly classified samples across the three sensory classes out of the total number of samples within the dataset. For the bacterial count regression models, a similar process was followed: for a given sample $i$, the prediction was considered a mismatch if the difference between the predicted value $\hat{y}_i$ and the actual value $y_i$ (both in log cfu g⁻¹) was larger than 1 log unit, as follows:

$$|\hat{y}_i - y_i| > 1$$

The overall model performance is the percentage of correctly predicted samples out of the total number of samples analysed.
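The regression performance metric described above can be expressed directly in code. The sketch below is illustrative; the function name `regression_accuracy` and the example counts are invented for the demonstration and are not data from the study.

```python
def regression_accuracy(predicted, observed, tolerance=1.0):
    """Percentage of samples whose predicted count lies within `tolerance`
    log units of the observed count (both in log cfu per gram)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(p - o) <= tolerance for p, o in zip(predicted, observed))
    return 100.0 * hits / len(observed)


# Invented example values: absolute differences are 0.7, 1.5, 0.3 and 0.9.
predicted = [7.2, 5.0, 9.1, 6.0]
observed = [7.9, 6.5, 8.8, 6.9]
print(regression_accuracy(predicted, observed))  # 75.0
```

Expressing regression performance as a hit rate makes it directly comparable with the classification accuracy reported for the sensory score models.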

Results and discussion
The microbiological analyses carried out in this work are presented elsewhere [15]. In brief, the total viable counts (TVC), Pseudomonas spp., Brochothrix thermosphacta, Enterobacteriaceae and lactic acid bacteria were enumerated in parallel with the sensory evaluation of beef fillets as reported elsewhere [16,17]. The sensory panel judged a meat sample as semi-fresh after 73, 73, 58, 30 and 24 h at 0, 4, 8, 12 and 16 °C respectively. Furthermore, when the sensory panel identified a sample as spoiled, the total viable count was found to be 6.9-9.57 log cfu g⁻¹ (mean = 8.6 log cfu g⁻¹), which is in line with previous findings that bacterial counts of 7-8 log cfu g⁻¹ can cause off-odours and slime [36]. A series of six single RBF-SVM models were developed based on electronic nose measurements, in order to predict the sensory score as well as bacterial species counts. The optimum parameters for each model were identified using grid search, and these parameters were used to build the final set of models using the training set T. Model accuracy was measured using the testing subset S. The overall classification accuracy and individual classification parameters are summarised in Table 1, while the graphical presentation of observed vs. predicted values for the various groups of microorganisms is shown in Fig. 3. The sensory score model achieved 72.7% overall classification accuracy when tested against the randomly selected testing subset S, showing a performance of 60%, 64.2% and 93.3% for the fresh, semi-fresh and spoiled sample classes respectively. The confusion matrix for sensory score prediction is shown in Table 2. The bacterial species count models showed a performance ranging from 70.4% for total viable counts to 87.2% for B. thermosphacta. The bagging approach [26] was followed to develop an ensemble-based system [24] for predicting quality based on the sensory evaluation scores given by the panel as described earlier.
The dataset was split into a training and a testing subset. For each of the classifiers forming the ensemble, the training subset was bootstrapped to form a subset of 330 samples, which was then divided further into training and testing subsets in order to perform the grid search optimization process as described in Fig. 2. The ensemble prediction accuracy was calculated using the testing subset, and the final output was computed using various aggregation methods. As shown in Table 3, the bagging approach improved the overall prediction accuracy by more than 10 percentage points compared with the single classifier when assessed using the same unseen testing subset, showing a performance of ~83%. All aggregation methods applied for output fusion performed comparably. Naïve Bayes aggregation showed the best classification performance, with an overall accuracy of 84.1%.
Another set of ensemble-based classifiers was also developed using the boosting approach, with the final ensemble output fused using weighted majority voting aggregation (Adaboost.M1) [28]. This showed overall and per-class accuracies similar to those of the bagging ensemble, as shown in Table 4. The root mean square errors of calibration (RMSEC) and prediction (RMSEP) were calculated for each developed ensemble, as shown in Table 5.
Furthermore, a set of ensemble-based models was developed based on RBF-SVM for regression, to predict bacterial species count values. As for sensory score prediction, two ensemble systems were developed for each species count type, using the bagging and boosting approaches. The same procedures were followed for bootstrapping and grid search parameter optimization. The ensemble prediction accuracies were computed using majority and weighted majority voting aggregation for bagging, and weighted majority voting in the case of boosting, as shown in Fig. 3. The prediction accuracy for bagging combined with weighted majority voting was similar to that of Adaboost.M1, and both were significantly higher than that of bagging combined with majority voting. The best prediction accuracy for total viable counts and Enterobacteriaceae was achieved using Adaboost, with a performance of 79.5% (RMSEP = 1.06) and 87.3% (RMSEP = 0.77) respectively. On the other hand, bagging combined with weighted majority voting gave the best prediction for Pseudomonas spp., B. thermosphacta, and lactic acid bacteria, with prediction accuracies of 85.9% (RMSEP = 1.15), 84.5% (RMSEP = 1.19) and 88.0% (RMSEP = 0.84) respectively. Each ensemble system developed included a total of 200 classifiers. This was found to be a sufficient number to stabilise the ensemble prediction accuracy. In order to assess stability, each ensemble was built in a cumulative manner, where one SVM model was added to the ensemble at a time and the overall prediction accuracy was assessed using the unseen testing subset S. The stabilisation process was assessed individually for each voting aggregation method applied, as shown in Fig. 3.
The sensory score ensemble shows similar stabilisation patterns for all algorithms applied (Fig. 4a); however, naïve Bayes was found to stabilise with fewer SVM models, and provided the best overall prediction accuracy at 84.1%, as shown in Table 4. For the bacterial count prediction ensembles, bagging combined with weighted majority voting aggregation and Adaboost were found to stabilise with a smaller number of SVM models, and showed the best overall accuracy when compared with bagging combined with majority voting (Fig. 4b-f). This stabilisation pattern suggests that the individual models within the bacterial count ensembles are less stable than those of the sensory score ensemble, which is somewhat expected for regression models. Model stability was, however, increased by applying weighted majority voting aggregation. The ensemble approach followed in this work is comparable to other machine learning approaches based on the inclusion of individually trained and optimised multi-model systems to improve prediction performance. These include Genetic Programming (GP) [37,38] and the Successive Progression Algorithm (SPA) [39]. GP was previously applied by Ellis et al., 2004 (ref. 40) to successfully estimate meat spoilage based on Fourier transform infrared (FTIR) spectroscopy, using genetic programming to determine the wavenumbers associated with the bacterial spoilage of fresh beef over 24 h.
The grid search performed for hyper-parameter optimisation is a very computationally intensive process, especially when repeated over 200 models for each ensemble developed and for the various voting aggregation methods. For this purpose, the analysis was performed on a dedicated computing facility with two Intel Xeon processors (six cores each) and 64 GB of RAM, yet each ensemble optimization process took 8 to 12 hours, which limits the use of this approach in web-based applications. The deployment of the R libraries "doMC" and "foreach" was extremely useful, as it reduced the processing time needed to develop the ensemble almost 10-fold when 12 processor cores were used. The parallelization of the optimization process becomes a necessity particularly when the input data are of larger dimension, as is the case for spectral data such as nuclear magnetic resonance or near-infrared.
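The cumulative stability assessment described above (adding one model at a time and re-scoring the ensemble on the testing subset) can be sketched as follows. This is an illustrative Python version using plain majority voting; the function name `cumulative_accuracy`, the deterministic tie-breaking and the toy labels are assumptions made for the example.

```python
from collections import Counter


def cumulative_accuracy(votes, truth):
    """Ensemble accuracy (%) after adding each model in turn.

    votes[m][i] is the label predicted by model m for test sample i and
    truth[i] is the reference label. Plain majority voting is used, with
    ties resolved in favour of the alphabetically first label.
    """
    tallies = [Counter() for _ in truth]
    curve = []
    for model_votes in votes:
        for tally, label in zip(tallies, model_votes):
            tally[label] += 1
        hits = sum(
            max(sorted(tally), key=tally.get) == reference
            for tally, reference in zip(tallies, truth)
        )
        curve.append(100.0 * hits / len(truth))
    return curve


truth = ["a", "b"]
votes = [["a", "a"], ["a", "b"], ["b", "b"]]
print(cumulative_accuracy(votes, truth))  # [50.0, 50.0, 100.0]
```

Plotting the returned curve against the number of models shows where the accuracy levels off, which is how a sufficient ensemble size can be judged.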

Conclusions
While SVMs have already been applied in the food sector and have proven to be successful in a number of practical applications (e.g. ref. 13, 23 and 41), this work presents the first application of ensemble-based SVM systems to assess freshness in beef fillets based on electronic nose datasets. The results obtained in this study demonstrated the potential of using an electronic nose system as a rapid and non-destructive method for spoilage identification of aerobically packaged beef fillets regardless of storage temperature. The collected signal responses could be considered as a volatile fingerprint of an active biological system, containing information for discriminating meat samples into sensory classes corresponding to different spoilage levels. The application of ensemble classifiers increased prediction accuracy compared with the application of single classifier models. The classification performance for sensory classes increased from 72.7% to 84.1% when the same unseen testing subset was used. The overall prediction accuracy of the regression models for bacterial species counts also increased, from 76.5% to 85.0%. This approach therefore highlights the potential of applying the electronic nose as a method of assessing freshness in beef fillets. However, ensemble development is a computationally expensive task. Future improvement of the presented methodology can be achieved by reducing the processing time needed for model optimization through parallelization of this process using, for example, General Purpose Graphical Processing Units (GPGPU).