Wei Wang (a),
Hui Jiang* (a),
Guohai Liu (a),
Quansheng Chen (b),
Congli Mei (a),
Kangji Li (a) and
Yonghong Huang (a)
(a) School of Electrical and Information Engineering, Jiangsu University, Zhenjiang 212013, PR China. E-mail: h.v.jiang@ujs.edu.cn; h.v.jiang@hotmail.com; Fax: +86 511 88780088; Tel: +86 511 88791245
(b) School of Food and Biological Engineering, Jiangsu University, Zhenjiang 212013, PR China
First published on 10th May 2017
To improve the yield of industrial fermentation, we report a method based on Fourier-transform near-infrared (FT-NIR) spectroscopy to predict the growth of yeast. First, spectra were acquired with an FT-NIR spectrometer during yeast cultivation. Each spectrum was recorded over the range from 10000 to 4000 cm⁻¹, giving 1557 variables per spectrum. The optical density (OD) of each fermentation sample was determined by the photoelectric turbidity method. Then, characteristic wavelength variables were selected from the preprocessed spectral data by competitive adaptive reweighted sampling (CARS), and a Gaussian mixture regression (GMR) algorithm was used to develop the prediction model for OD. With only 13 characteristic wavelength variables selected by CARS, the GMR model achieved a coefficient of determination $R_p^2$ of 0.98842 and a root mean square error of prediction (RMSEP) of 0.07262 in the validation set. Finally, compared with kernel partial least squares regression (KPLS), support vector machine (SVM), and extreme learning machine (ELM) models, the GMR model showed excellent prediction and generalization performance. This study demonstrates that FT-NIR spectroscopy combined with appropriate chemometric approaches can be used to monitor the growth process of yeast, and that GMR showed its superiority in model calibration.
Fourier-transform near-infrared (FT-NIR) spectroscopy can potentially serve as a noninvasive technique for the quantitative analysis of the growth process of yeast because it responds to molecular groups associated with process parameters such as biomass (C–H group) and organic acid and moisture (O–H group), as well as to scattering from microstructures.5,6 Most of the near-infrared absorption bands associated with these groups are overtone or combination bands of the fundamental absorption bands in the mid-infrared region, which are due to vibrational and rotational transitions.7 In recent years, FT-NIR spectroscopy has been applied in the field of yeast fermentation.8,9 These studies show that FT-NIR spectroscopy is a highly promising technique for the analysis of the growth process of yeast.
However, FT-NIR spectroscopy is an indirect measurement technique. In recent years, a number of studies have shown that near-infrared spectral information has complicated backgrounds with overlapping peaks and weak signals. Generally, an NIR spectrum has hundreds of variables, among which there are uninformative variables, redundant variables, and serious multicollinearity. Model calibration using the complete spectral data will not only reduce the modeling speed, but also affect the accuracy and robustness of the model. Therefore, it is necessary to screen the characteristic wavelength variables with an appropriate variable selection method prior to model calibration.10
Additionally, the application of a proper multivariate analysis method in model calibration has been proven to be greatly beneficial for obtaining more reliable and parsimonious models. During the last few decades, many different algorithms, such as partial least squares regression (PLS),11 kernel partial least squares regression (KPLS),12 neural network (NN),13 support vector machine (SVM),14 extreme learning machine (ELM),15 mixture Poisson regression (MPR),16 and Gaussian mixture regression (GMR),17 have been developed for model calibration. Among these, GMR is a relatively new algorithm, which not only has the advantages of low computational cost and few parameters, but is also suitable for data that are not normally distributed.17 Thus, in this study, GMR was applied to construct a regression model for the prediction of the growth process of yeast.
In the process of microbial culture, optical density (OD) is often used as an index of the growth state of a microorganism.18 Therefore, in this study, FT-NIR spectroscopy combined with proper multivariate data analysis was employed to carry out quantitative analysis of the growth process of yeast culture (i.e. OD values). The specific objectives of this study were:
(1) to eliminate the influence of suspended particles, surface scattering, and optical path changes by SNV;
(2) to filter out the characteristic information variables and compress the spectral data dimension by CARS;
(3) to use optimal spectral data for the construction of a prediction model via Gaussian mixture regression (GMR).
To highlight the prediction precision of the GMR algorithm adopted in this study, the results of the GMR model were compared with those of three other regression algorithms: kernel partial least squares (KPLS), support vector machine (SVM), and extreme learning machine (ELM). The parameters of all models were optimized by cross-validation.
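Model comparisons of this kind are scored here with the coefficient of determination and the root mean square error reported in the abstract. The following is a minimal sketch of both metrics, assuming the common definition R² = 1 − SS_res/SS_tot (the text does not state which definition was used). The function names and the OD values are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between reference and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical reference and predicted OD values, for illustration only.
y_ref = np.array([0.2, 0.5, 1.1, 1.8])
y_hat = np.array([0.25, 0.45, 1.0, 1.9])
print(rmse(y_ref, y_hat), r_squared(y_ref, y_hat))
```

Applied to the validation set these give RMSEP; applied to held-out folds of the training set they give RMSECV.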
For each set of yeast culture experiments, sampling was carried out at 19 different time points during the yeast culture, from loading to the end of culture (0, 4, 8, 12 … 72 h). In addition, to avoid contamination of the sterile malt medium by repeated sampling, the 19 sampling time points were divided into three parts: the first 7 samples (0, 4, 8, 12, 16, 20, and 24 h) were taken from the volumetric flasks numbered I, the next 6 samples (28, 32, 36, 40, 44, and 48 h) from the volumetric flasks numbered II, and the last 6 samples (52, 56, 60, 64, 68, and 72 h) from the volumetric flasks numbered III. Thus, 19 samples were obtained for each set of experiments, and a total of 114 samples were obtained from the 6 sets. Four of the six sets of experimental data were chosen as the training set, and the remaining two sets were used as the validation set.
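The sample counts above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the variable names are hypothetical, and the text does not say which four sets were assigned to training.

```python
# Sampling schedule described in the text: every 4 h from 0 to 72 h.
time_points = list(range(0, 73, 4))

# Assign each time point to the flask it was sampled from.
flask_of = {}
for t in time_points:
    if t <= 24:
        flask_of[t] = "I"
    elif t <= 48:
        flask_of[t] = "II"
    else:
        flask_of[t] = "III"

per_flask = {f: sum(1 for v in flask_of.values() if v == f) for f in ("I", "II", "III")}

n_sets = 6            # experimental sets (groups)
n_train_sets = 4      # sets used for training
samples_per_set = len(time_points)

n_total = n_sets * samples_per_set
n_train = n_train_sets * samples_per_set
n_valid = n_total - n_train
print(per_flask, n_total, n_train, n_valid)
```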
Fig. 1(a) shows the raw FT-NIR spectra of the 114 yeast cultivation samples. FT-NIR spectra are affected by many conditions, such as changes in temperature, diffusion of light, baseline shift, and instrument noise.8 In addition, FT-NIR spectra contain chemical as well as physical information, some of which can be useless or can mask important information.19 Therefore, to ensure the predictive performance of the calibration model, it was essential to select a suitable pretreatment method to weaken the physical and chemical interference. At present, many spectral preprocessing methods, such as first and second derivatives, standard normal variate transformation (SNV), and multiplicative scatter correction (MSC), have been reported. On comparing these preprocessing methods, SNV was found to be superior to the others in this study. In this experiment, gaps or bubbles were observed in the yeast culture medium in the cuvette, which resulted in scattering of light. SNV has advantages in correcting scattered light and removing slope variation. Therefore, SNV was employed for light scatter correction and for reducing the effect of changes in light path length. The SNV-preprocessed spectra are presented in Fig. 1(b).
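As an illustration of the SNV step, the sketch below centers and scales each spectrum by its own mean and standard deviation, which is the usual definition of SNV. It is a sketch under assumptions: the function name `snv`, the use of the sample standard deviation (`ddof=1`), and the random stand-in spectra are not taken from the paper.

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Stand-in data: 3 random "spectra" with 1557 variables each (not real measurements).
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 1557))
Z = snv(X)
```

Because each row is standardized separately, a multiplicative scaling or additive offset applied to a whole spectrum (a rough model of scatter and path-length effects) leaves its SNV-transformed version unchanged.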
CARS can work in four successive steps:23
Step 1. A Monte Carlo approach was applied for model sampling: 80% of the samples were randomly selected to build a PLS model, and the regression coefficients β of the corresponding model were retained. The weight wi of the ith variable can be defined as follows:
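$$w_i=\frac{|\beta_i|}{\sum_{m=1}^{p}|\beta_m|}\tag{1}$$
where $\beta_i$ is the PLS regression coefficient of the ith variable and p is the number of wavelength variables.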
Step 2. An exponentially decreasing function was employed to perform enforced wavelength selection. The wavelength retention rate was calculated directly using the following function:
$$r_i = a\,e^{-ki}\tag{2}$$
Step 3. The adaptive reweighted sampling (ARS) method was adopted to realize a competitive selection of wavelengths. Wavelength variables with larger weights were selected to form subsets of wavelengths. After repeating this step N times, CARS had sequentially selected N subsets of wavelengths, each used to build a PLS model.
Step 4. Five-fold cross-validation was used to evaluate each subset. The subset with the lowest RMSECV value was chosen as the optimal subset.
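Steps 1 to 3 can be sketched for a single sampling run as follows. This is a hedged illustration, not the authors' implementation: the constants a and k are chosen so that all variables are retained in the first run and two in the last (the choice made in the original CARS paper, which this text does not spell out), the PLS fit is replaced by random stand-in coefficients, and all function names and the value `n_runs=50` are hypothetical.

```python
import numpy as np

def cars_weights(beta):
    """Weight of each variable: |beta_i| normalized over all variables (Step 1)."""
    beta = np.abs(np.asarray(beta, dtype=float))
    return beta / beta.sum()

def retention_ratio(run, n_runs, n_vars):
    """r_i = a * exp(-k * i), with a and k assumed such that r_1 = 1 and r_N = 2 / p."""
    a = (n_vars / 2.0) ** (1.0 / (n_runs - 1))
    k = np.log(n_vars / 2.0) / (n_runs - 1)
    return a * np.exp(-k * run)

def cars_select(beta, run, n_runs, rng):
    """One run: enforced selection by the decreasing function, then ARS."""
    w = cars_weights(beta)
    p = w.size
    n_keep = max(1, int(round(p * retention_ratio(run, n_runs, p))))
    kept = np.argsort(w)[::-1][:n_keep]            # Step 2: keep the largest weights
    probs = w[kept] / w[kept].sum()
    drawn = rng.choice(kept, size=n_keep, replace=True, p=probs)  # Step 3: ARS
    return np.unique(drawn)                        # indices of surviving variables

# Stand-in regression coefficients for 1557 variables (not from a real PLS model).
rng = np.random.default_rng(0)
beta = rng.normal(size=1557)
subset = cars_select(beta, run=10, n_runs=50, rng=rng)
```

In a full CARS loop, a PLS model would be refitted on the surviving variables in every run and its RMSECV recorded, so that Step 4 can pick the subset with the lowest value.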
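In GMR, the joint probability density of the input X and the output Y is modeled by a Gaussian mixture model (GMM) with K components:
$$f_{X,Y}(x,y)=\sum_{j=1}^{K}\pi_j\,\phi(x,y;\mu_j,\Sigma_j)\tag{3}$$
where $\pi_j$, $\mu_j$, and $\Sigma_j$ are the probabilistic weight, mean, and covariance of the jth component, and $\phi$ denotes the multivariate Gaussian density.
In addition, mean and covariance can be divided into input and output parts as follows:
$$\mu_j=\begin{bmatrix}\mu_{jX}\\ \mu_{jY}\end{bmatrix},\qquad \Sigma_j=\begin{bmatrix}\Sigma_{jXX}&\Sigma_{jXY}\\ \Sigma_{jYX}&\Sigma_{jYY}\end{bmatrix}$$
The marginal probability density $f_X(x)$ and mixing weight $w_j(x)$ can be calculated by27
$$f_X(x)=\sum_{j=1}^{K}\pi_j\,\phi(x;\mu_{jX},\Sigma_{jXX})\tag{4}$$
$$w_j(x)=\frac{\pi_j\,\phi(x;\mu_{jX},\Sigma_{jXX})}{\sum_{l=1}^{K}\pi_l\,\phi(x;\mu_{lX},\Sigma_{lXX})}\tag{5}$$
From eqn (3)–(5), we can obtain the global GMR function as
$$f_{Y|X}(y|x)=\sum_{j=1}^{K}w_j(x)\,\phi\big(y;m_j(x),\sigma_j^2\big)\tag{6}$$
The mean and variance of the conditional distribution can be estimated as follows:
$$m_j(x)=\mu_{jY}+\Sigma_{jYX}\Sigma_{jXX}^{-1}(x-\mu_{jX})\tag{7}$$
$$\sigma_j^2=\Sigma_{jYY}-\Sigma_{jYX}\Sigma_{jXX}^{-1}\Sigma_{jXY}\tag{8}$$
For a given input variable, its prediction can be achieved by calculating the expectation over the conditional distribution $f_{Y|X}(y|x)$27
$$\hat{y}=E[Y\,|\,X=x]=\sum_{j=1}^{K}w_j(x)\,m_j(x)\tag{9}$$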
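The prediction step can be illustrated with a short NumPy sketch that evaluates the expectation of the output over the conditional distribution for a scalar output, given the parameters of an already fitted GMM. It assumes the standard GMR formulation (input-marginal mixing weights and per-component conditional means); the function names and the toy parameters are hypothetical and do not come from the paper.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian evaluated at a single point x."""
    d = mean.size
    diff = x - mean
    log_det = np.linalg.slogdet(cov)[1]
    quad = diff @ np.linalg.solve(cov, diff)
    return np.exp(-0.5 * (d * np.log(2.0 * np.pi) + log_det + quad))

def gmr_predict(x, priors, means, covs, n_in):
    """Conditional expectation of a scalar output given input x.

    means[j] has length n_in + 1 and covs[j] has shape (n_in + 1, n_in + 1);
    the last entry / row / column belongs to the output variable.
    """
    x = np.asarray(x, dtype=float)
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    dens = np.empty(len(priors))
    cond_means = np.empty(len(priors))
    for j, pi_j in enumerate(priors):
        mu_x, mu_y = means[j, :n_in], means[j, n_in]
        s_xx = covs[j, :n_in, :n_in]
        s_yx = covs[j, n_in, :n_in]
        dens[j] = pi_j * gaussian_pdf(x, mu_x, s_xx)            # numerator of w_j(x)
        cond_means[j] = mu_y + s_yx @ np.linalg.solve(s_xx, x - mu_x)
    weights = dens / dens.sum()                                  # mixing weights w_j(x)
    return float(weights @ cond_means)

# Toy one-component model with one input: prediction reduces to linear regression.
y_single = gmr_predict([3.0], [1.0], [[1.0, 2.0]], [[[2.0, 1.0], [1.0, 3.0]]], n_in=1)

# Toy symmetric two-component model evaluated midway between the components.
y_mix = gmr_predict(
    [0.0], [0.5, 0.5],
    [[-1.0, -2.0], [1.0, 2.0]],
    [np.eye(2), np.eye(2)], n_in=1,
)
```

With a single component the result is the ordinary conditional mean of a joint Gaussian; with several components the local linear predictions are blended by how likely each component is to have generated the input.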
To build the GMM, the number of mixture components K was set to 4, and the unknown parameter set θ, including the probabilistic weights, had to be estimated first. For this purpose, maximum likelihood estimation (MLE) with the expectation-maximization (EM) algorithm was adopted. With a set of given data (X, Y), the model parameters θ in eqn (3) are estimated by maximizing the log-likelihood function L(θ), which can be expressed as28
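$$L(\theta)=\sum_{i=1}^{n}\log\sum_{j=1}^{K}\pi_j\,\phi(z_i;\mu_j,\Sigma_j)\tag{10}$$
where $z_i=(x_i,y_i)$ is the ith of n training samples, and $\pi_j$, $\mu_j$, and $\Sigma_j$ are the weight, mean, and covariance of the jth Gaussian component $\phi$.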
For the given training data, θ was calculated by maximizing this function iteratively via the EM algorithm, which includes two steps:29
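(1) E step (expectation step):
$$\gamma_{ij}=\frac{\pi_j\,\phi(z_i;\mu_j,\Sigma_j)}{\sum_{l=1}^{K}\pi_l\,\phi(z_i;\mu_l,\Sigma_l)}\tag{11}$$
(2) M step (maximization step):
$$\pi_j=\frac{1}{n}\sum_{i=1}^{n}\gamma_{ij}\tag{12}$$
$$\mu_j=\frac{\sum_{i=1}^{n}\gamma_{ij}\,z_i}{\sum_{i=1}^{n}\gamma_{ij}}\tag{13}$$
$$\Sigma_j=\frac{\sum_{i=1}^{n}\gamma_{ij}\,(z_i-\mu_j)(z_i-\mu_j)^{T}}{\sum_{i=1}^{n}\gamma_{ij}}\tag{14}$$
where $\gamma_{ij}$ is the posterior probability that the ith training sample $z_i=(x_i,y_i)$ belongs to the jth component, and n is the number of training samples.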
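The two EM steps can be sketched in NumPy as below. This is a minimal illustration under assumptions, not the authors' code: it uses the textbook EM updates for a GMM, initializes the means from randomly chosen samples, and adds a small ridge term `reg` to each covariance for numerical stability. The demo uses 2 components on synthetic data rather than the K = 4 used in the study.

```python
import numpy as np

def log_gauss(Z, mean, cov):
    """Log-density of a multivariate Gaussian for every row of Z."""
    d = Z.shape[1]
    diff = Z - mean
    solved = np.linalg.solve(cov, diff.T).T
    log_det = np.linalg.slogdet(cov)[1]
    return -0.5 * (d * np.log(2.0 * np.pi) + log_det + np.sum(diff * solved, axis=1))

def fit_gmm_em(Z, n_components, n_iter=100, reg=1e-6, seed=0):
    """Fit a GMM by EM; returns weights, means, covariances, log-likelihood trace."""
    rng = np.random.default_rng(seed)
    n, d = Z.shape
    priors = np.full(n_components, 1.0 / n_components)
    means = Z[rng.choice(n, n_components, replace=False)].copy()
    base_cov = np.atleast_2d(np.cov(Z.T)) + reg * np.eye(d)
    covs = np.array([base_cov.copy() for _ in range(n_components)])
    log_liks = []
    for _ in range(n_iter):
        # E step: posterior probability of each component for each sample.
        log_p = np.stack(
            [np.log(priors[j]) + log_gauss(Z, means[j], covs[j])
             for j in range(n_components)], axis=1)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        log_liks.append(float(log_norm.sum()))   # log-likelihood before this update
        # M step: re-estimate weights, means, and covariances.
        nk = resp.sum(axis=0)
        priors = nk / n
        means = (resp.T @ Z) / nk[:, None]
        for j in range(n_components):
            diff = Z - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return priors, means, covs, log_liks

# Synthetic two-cluster data standing in for (X, Y) training samples.
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(-3, 1, size=(200, 2)), rng.normal(3, 1, size=(200, 2))])
priors, means, covs, log_liks = fit_gmm_em(Z, n_components=2, n_iter=50)
```

The recorded log-likelihood should not decrease from one iteration to the next, which is a convenient sanity check when tuning the number of components.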
Fig. 2(a) shows the process of characteristic wavelength selection by CARS. The plot of the number of retained wavelengths against the number of sampling runs shows that the number of selected wavelength variables decreases as the number of runs increases. The decrease is rapid at first and then slows down, reflecting a rough selection stage followed by a refined selection stage. Fig. 2(b) shows the trend of the root mean square error of cross validation (RMSECV), which first descends and then ascends. RMSECV reached its minimum value of 0.1736 at the 28th sampling run. After the 28th run, some of the relevant variables started to be eliminated, which increased the RMSECV value. In Fig. 2(c), the “*” line perpendicular to the horizontal axis marks the 28th sampling run, at which the minimum RMSECV was obtained. According to the principle of minimum RMSECV, 13 characteristic variables were finally selected. Fig. 3 shows the distribution of the 13 selected characteristic variables over the entire spectral region after the CARS operation.
In addition, comparison of these methods suggests several explanations for this result. KPLS and SVR are common techniques for the regression of complex non-linear data sets. The key to these models is to map the data into a higher-dimensional feature space using a kernel transformation. However, the disadvantage of using a kernel function is that the correlation between the obtained regression model and the original input space is lost. As a result, some useful information variables may be lost, which would reduce the prediction precision of the model. Moreover, because of the kernel function, the running time of the KPLS and SVR programs is longer than that of the other models. Compared with traditional neural network methods, ELM has a simple structure, high learning speed, and good generalization performance; however, the dimension of spectral data is usually very high, so more hidden nodes have to be incorporated into the original ELM model. Therefore, the output matrix of the hidden layer of the ELM model suffered from high dimensionality and high collinearity. Because yeast grows in a complex environment, the process data did not originate from a single operating region; moreover, the data distribution may be complicated, with arbitrary non-Gaussian patterns. As a mixture model can represent arbitrarily complex probability density functions, GMR is one of the ideal tools for modeling complex multi-class data sets. Moreover, GMR not only has the compact structure of a parametric model, but also retains the flexibility of a nonparametric model. Given a sufficient number of linearly combined single multivariate Gaussian distributions, a GMM can smoothly approximate a probability distribution of arbitrary shape. Therefore, GMR offers excellent generalization in theory, which brings a slightly better prediction effect than the other regression algorithms.
This study not only broadens the scope of application of the CARS and GMR algorithms, but also provides a new theoretical basis for the rapid and non-destructive detection of microbial growth processes. Moreover, it serves as a reference for research on improving the informatization of fermentation technology and the intelligent monitoring of other fermentation processes, and it has broad application prospects.
This journal is © The Royal Society of Chemistry 2017