Silviu Florin Acaru,*a Rosnah Abdullah,b Daphne Teck Ching Laic and Ren Chong Lim*a
aCentre for Advanced Material and Energy Sciences (CAMES), Universiti Brunei Darussalam, Jalan Tungku Link, Gadong BE1410, Brunei Darussalam. E-mail: s.f.acaru@outlook.com; renchong.lim@ubd.edu.bn
bFaculty of Science (FOS), Universiti Brunei Darussalam, Jalan Tungku Link, Gadong BE1410, Brunei Darussalam
cSchool of Digital Science (SDS), Universiti Brunei Darussalam, Jalan Tungku Link, Gadong BE1410, Brunei Darussalam
First published on 18th July 2023
Energy from fossil fuels is forecasted to contribute to 28% of the energy demand by 2050. Shifting to renewable, green energy is desirable to mitigate the adverse effects on the climate posed by the resultant gases. Continuous flow hydrothermal liquefaction holds promise to convert biomass into renewable energy. However, sustainable conversion of biomass feedstocks remains a considerable challenge, and more process optimization studies are necessary to achieve positive net energy ratios (NERs). To fast-track this process development, we investigated the integration of Fourier transform infrared spectroscopy (FTIR) for data collection coupled with a support vector machine classifier (SVC). We trained the model on data labeled after analysis of the aqueous stream by high-performance liquid chromatography (HPLC). Test data from liquefied wood, liquefied cotton, and dissolved glucose were used to classify the aqueous streams. The results showed that the fused original data achieve 84% accuracy. The accuracy increased to 93% after merging synthetic data from generative adversarial networks (GANs) and hand-crafted statistical features. The effect of Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) on accuracy was also studied. We noticed that UMAP increases accuracy on some variations of the datasets, but it does not exceed the highest reported value. Shapley Additive Explanations (SHAP) were used to investigate the contribution of the top 20 features. We discovered that features representative of glucose contribute positively to the model's performance, whereas those found in water have a negative influence.
Despite considerable progress achieved through different configurations, the sustainability of HTL systems is still not favourable for adhering to positive conversion principles. Continuous flow HTL emerges as a more appealing option over the batch and semi-continuous types due to its ability to generate fuel materials in large quantities and extract desired compounds while controlling the biomass retention periods.8 During the conversion process in continuous flow HTL, two primary by-products are generated: a solid residue and an aqueous phase. The aqueous phase is rich in fermentable sugars and other valuable compounds. The sugars can be further exploited for their calorific properties or enhanced through fermentation to generate high-yield liquid fuels, a form of renewable energy.9
The sustainability of the continuous flow HTL system is deemed favourable when the Net Energy Ratios (NERs) exceed 100%. The NER serves as a measure of the energy yield of a compound relative to the energy input into the system. Unlike other reported metrics such as energy recovery, which solely consider the energy content of the resulting fuel or bio-oil obtained, NER offers a more comprehensive analysis by taking into account the total energy efficiency of the HTL system.
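Expressed as a formula (our formulation of the definition above, not stated explicitly in the text):

$$\mathrm{NER} = \frac{E_{\mathrm{recovered}}}{E_{\mathrm{input}}} \times 100\%$$

where $E_{\mathrm{recovered}}$ is the energy content of the compound recovered (e.g., the calorific value of the glucose yield) and $E_{\mathrm{input}}$ is the total energy supplied to the HTL system; a value above 100% therefore indicates a net-positive conversion.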
A preliminary study focusing on the conversion of pre-treated wood waste residues has demonstrated that continuous flow HTL can achieve glucose NER values as high as 63%.10 To enhance the optimization of biomass conversion in continuous flow HTL, additional studies are required to refine parameter optimization, biomass load-to-weight ratios, and pre-treatment methods. Nevertheless, the rapid optimization of biomass conversion in continuous flow HTL encounters two primary challenges.
The first challenge arises from traditional optimization studies, which necessitate significant resources such as consumables, energy, time, and skilled labour. For instance, high-performance liquid chromatography (HPLC), an offline analysis technique used to determine compound concentrations in the aqueous phase, provides highly reliable data. However, the HPLC analysis entails a series of labour-intensive steps, including sample preparation, instrument qualification, compound identification, and quantification.
The second challenge lies in the intricate nature of HPLC analysis, which hampers the swift adjustment and control of HTL parameters during testing. To address these limitations, alternative methods that provide fast, cost-effective, inline measurements can be employed. Fourier transform infrared spectroscopy (FTIR) is one such technique: it can analyse complex mixtures simultaneously, revealing qualitative and quantitative information of sufficient accuracy, making it a valuable complement for overcoming these challenges.11 With regards to aqueous solutions, FTIR has applications in several fields, such as diabetes monitoring,12 food additives,13 allergens14 and bio-hybrid fuel cells.15 Compounds of interest resulting from conversion processes have also been analysed, such as the sugar content in the enzymatic hydrolysis of alkali-pretreated biomasses,16 the quantification of glucose in aqueous solutions,17 and the quantification of aqueous phases (bio-crudes) derived from HTL.18 Additionally, apart from the analysis bottleneck, the sheer number of experimental runs needed to reach conclusive results also slows down process optimization.
Research applying machine learning (ML) algorithms to solve problems associated with energy studies is increasingly prevalent, ranging from material design models and the discovery of unknown compounds to the acceleration of innovations such as high-performance fast-charging batteries.19,20 The increasing importance of incorporating these concepts into hydrothermal liquefaction yields cannot be overstated. However, ML algorithms generally perform well when trained on large datasets. To compensate for the lack of data in determining the best HTL parameters, researchers resort to compiling data from various published literature.21,22 In the case of continuous flow HTL, this approach is not feasible for two reasons:
a. there are not enough studies that have published results using a similar HTL setup, and
b. the parameters and outputs are specific to the level of control and handling of the biomass.
Acknowledging the inadequacy of assuming uniform handling of all experiments, deep learning (DL), a subfield of ML, offers powerful algorithms that effectively tackle the challenges posed by limited data availability and expedite the optimization process. These DL algorithms play a crucial role in enhancing learning capabilities and facilitating more efficient decision-making.
Among the notable techniques in DL, generative adversarial networks (GANs) stand out as a preeminent approach for augmenting data from real-world examples, particularly in low data scenarios. GANs have demonstrated their effectiveness across various domains, including the design of materials models,19 generation of realistic medical images,23 object detection,24 augmentation of sensory signals,25 and improvement of Raman spectroscopy data.26 The latter is analogous to the infrared spectra obtained through FTIR.
Nonetheless, correct identification of the molecules in aqueous solution by FTIR is challenging due to the contribution of water molecules to the absorption spectrum. Absorption peaks of chemical bonds under aqueous mid-infrared radiation are broad, spreading across several wavelengths.11 Feature engineering using statistical values has been shown to capture the interconnection of movements recorded by depth sensors.27 Similarly, hand-crafted statistical features could be applied to the vibrational intensity across wavelengths to amplify the response.28 However, FTIR spectra contain regions that are not significantly important in explaining the presence of a compound, and the generation of statistical features introduces insignificant values for each sample. Training a model on irrelevant data can degrade its performance. Dimensionality reduction techniques, such as the Uniform Manifold Approximation and Projection (UMAP) algorithm, can be used to improve a model's performance. UMAP selects the essential features using nearest neighbours to construct the simplicial set.29 The question remains whether the final ML model is reliable; to confirm this, one needs to ensure that the significant features are the determining ones. Shapley Additive Explanations (SHAP) can be used to interpret each feature value and understand the respective contribution of the vibrational spectral wavelengths.30
Therefore, the aim of this study is to implement a ML model into a continuous flow HTL system that could rapidly classify samples with high accuracy and confidence during biomass conversion into biofuel materials. The study's objectives are as follows:
1. Investigate the suitability of GANs for synthetic data generation from FTIR spectra to increase the dataset size and improve ML classification performance.
2. Enhance the model's performance using hand-crafted statistical features and a dimensionality reduction technique.
3. Verify whether the features with significant importance in glucose compounds are contributing positively to the model's performance.
The proposed framework will accelerate the recognition of glucose in the aqueous phase of the continuous flow HTL conversion process when its level is above a set threshold. The framework involves numerical data collected from three different experiments. The first dataset is derived from wood (W) waste and represents the minimum viable real data of the lignocellulosic biomass conversion process. The second dataset is derived from the conversion of cotton (C), a cleaner lignocellulosic biomass representative. The last dataset is obtained from dissolved glucose (DG) with a high purity content. The dissolved glucose dataset is meant to reinforce the model training with more samples representative of the target material.
$$\min_G \max_D V(D,G) = \mathbb{E}_{x\sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z\sim p_z(z)}[\log(1 - D(G(z)))] \qquad (1)$$
In other words, the generator model is responsible for generating new data samples from a given dataset. In contrast, the discriminator model acts as a classifier and tries to distinguish whether the new data sample is real or fake by comparing the training and fake data.
Numerous variations over the traditional GAN have been proposed. For example, the Wasserstein model (WGAN) improved training stability by introducing the Earth-Mover distance (or Wasserstein-1) into the loss function.34 Still, the model experienced difficulties in generating accurate samples due to weight clipping. As a remedy, improvements such as a gradient penalty (WGAN-GP) added to the original critic loss showed promising results.35 While some architectures focused on generating new image variations, others concentrated on tabular data types. The implicit joint distribution of columns, which is the probability of two variables occurring together, can be learned from the real data, and synthetic data can then be produced from the resulting distribution. Algorithms such as tabular GAN (TGAN) and conditional tabular GAN (CTGAN), which are based on recurrent networks, outperformed previous statistical ways of augmenting tabular data (e.g., classification and regression trees, and Bayesian networks).36 Table-GAN, which is based on convolutional neural networks, is another model that generates valuable synthetic tabular data.37 Interest in synthetic data and the proven capability of this new form of data augmentation are in their incipient stages. Continuous improvements are being reported at a very fast pace, but no studies have looked at generating synthetic data from infrared spectra captured by ATR-FTIR. In this study, the standard GAN structure outlined in ref. 33 is adopted. Detailed implementation instructions can be found within the Data processing and augmentation section.
Data pre-processing encompassed the manipulation necessary to adhere to a matrix structure. For instance, the wood dataset was established with dimensions of 24 rows and 900 columns, the cotton dataset consisted of 39 rows and 900 columns, and the dissolved glucose dataset was shaped into 40 rows and 900 columns. Each row in these datasets represents a sample of the aqueous fluid analyzed by FTIR, with the features designated as wavelengths. Each sample was labeled according to the glucose concentration determined by the HPLC analysis.
To prepare the datasets for augmentation, they were individually loaded and subjected to further processing steps. These steps involved removing irrelevant columns, scaling the data using the Min-Max scaling technique, and dividing it into feature and label components.
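A minimal sketch of this preparation step is shown below, assuming a CSV layout with one spectrum per row; the file name, the dropped column names, and the label column name are hypothetical placeholders.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def load_and_prepare(path, label_col="glucose_class"):
    df = pd.read_csv(path)
    # Remove columns that carry no spectral information (names are assumptions)
    df = df.drop(columns=["sample_id"], errors="ignore")
    y = df[label_col].values                 # labels derived from HPLC
    X = df.drop(columns=[label_col]).values  # 900 wavelength features
    X = MinMaxScaler().fit_transform(X)      # Min-Max scaling to [0, 1]
    return X, y

X_wood, y_wood = load_and_prepare("wood.csv")  # expected shape: (24, 900)
```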
The configuration of the GAN algorithm employed a multilayer perceptron architecture. Within the code (available at https://github.com/silviu20/GAN_IR_Spectroscopy.git), various essential functions were specified to facilitate the augmentation process. One such function was “generate_latent_points(latent_dim, n_samples),” which generates random points (latent space vectors) by sampling from a standard normal distribution. These points serve as input for the generator model. Another crucial function is “generate_fake_samples(generator, latent_dim, n_samples),” which generates counterfeit samples by feeding randomly generated latent points into the generator model. The resulting samples are labeled as “fake” (y = 0). The function “generate_real_samples(n)” randomly selects genuine samples from the dataset, labeling them as “real” (y = 1).
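The three helper functions could look as follows, consistent with the descriptions above; the bodies in the linked repository are authoritative and may differ in detail.

```python
import numpy as np

# `dataset` is assumed to be a module-level (n_samples, 900) array of real,
# scaled spectra, matching the one-argument signature quoted above.

def generate_latent_points(latent_dim, n_samples):
    # Latent vectors sampled from a standard normal distribution
    return np.random.randn(n_samples, latent_dim)

def generate_fake_samples(generator, latent_dim, n_samples):
    x_input = generate_latent_points(latent_dim, n_samples)
    X = generator.predict(x_input, verbose=0)  # counterfeit spectra
    y = np.zeros((n_samples, 1))               # label 0 = "fake"
    return X, y

def generate_real_samples(n):
    idx = np.random.randint(0, dataset.shape[0], n)  # random genuine rows
    return dataset[idx], np.ones((n, 1))             # label 1 = "real"
```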
In order to define the structure of the generator model, the function “define_generator(latent_dim, n_outputs)” is utilized. This function employs the Keras sequential model API and consists of two dense hidden layers with the ‘relu’ activation function. The first hidden layer comprised 15 nodes, while the second comprised 30 nodes. The generator takes latent points as input and produces synthetic samples as output. The sequential model facilitates the creation of a linear stack of layers.
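A sketch of the generator consistent with this description is given below; the linear output activation is an assumption, as the text does not state it.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

def define_generator(latent_dim, n_outputs=900):
    # Two hidden ReLU layers of 15 and 30 nodes, as described
    return Sequential([
        Input(shape=(latent_dim,)),
        Dense(15, activation="relu"),
        Dense(30, activation="relu"),
        Dense(n_outputs, activation="linear"),  # one value per wavelength
    ])
```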
Similarly, the function “define_discriminator(n_inputs)” is used to establish the structure of the discriminator model, also utilizing the Keras library. The discriminator takes input samples, including the counterfeit samples generated by the generator, and evaluates their authenticity. Through its layers, the discriminator extracts features and processes them using weighted connections and activation functions. This transformation enables the capture of relevant information. The discriminator architecture incorporates hidden layers with the ‘relu’ activation function: the first hidden layer had 25 nodes and the second had 50 nodes, while the output layer contained a single node. As the data flows through the discriminator's layers, it gradually learns to differentiate between real and fake samples based on the acquired features. The last layer of the discriminator employs a sigmoid activation function, producing a binary output ranging from 0 to 1. This output represents the probability of the input sample being real or fake, with a value close to 1 indicating high authenticity and a value close to 0 indicating low authenticity. Using multiple layers in the generator and discriminator offers the benefit of enhancing the models' capacity to comprehend and depict intricate patterns within the data. This advantage translates into improved performance, enabling the models to generate more realistic samples and achieve greater accuracy in distinguishing between real and fake samples.38 By increasing the number of nodes in the discriminator, it was expected that the network would extract more information from the generator.39
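A corresponding sketch of the discriminator is shown below; the compilation settings (binary cross-entropy loss, Adam optimizer) are assumptions not stated in the text.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

def define_discriminator(n_inputs=900):
    # Hidden ReLU layers of 25 and 50 nodes, followed by a single sigmoid
    # node emitting the probability that the input spectrum is real
    model = Sequential([
        Input(shape=(n_inputs,)),
        Dense(25, activation="relu"),
        Dense(50, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    # Loss and optimizer are assumptions; the repository is authoritative
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model
```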
Combining the aforementioned generator and discriminator models results in the construction of the GAN model. The GAN model takes latent points as input, generates counterfeit samples using the generator, and predicts their authenticity using the discriminator. Finally, the program trains the GAN by utilizing a combination of real and counterfeit samples. The discriminator and generator models were alternately trained for 100 epochs. The GAN algorithm was configured to produce an output three times the size of the data it was generating from. The training progress was monitored through the evaluation of the discriminator and generator losses. These losses were visualized in a history plot to provide insights into the dynamics of the GAN model (Fig. A II in ESI†).
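The composite model and the alternating training loop might be sketched as follows, reusing the helper functions above; the batch size and optimizer settings are assumptions.

```python
import numpy as np
from tensorflow.keras.models import Sequential

def define_gan(generator, discriminator):
    discriminator.trainable = False  # freeze D while G is being updated
    model = Sequential([generator, discriminator])
    model.compile(loss="binary_crossentropy", optimizer="adam")
    return model

def train(gan, generator, discriminator, latent_dim, epochs=100, half_batch=16):
    for epoch in range(epochs):
        # Alternate updates: first the discriminator, on real + fake batches
        X_real, y_real = generate_real_samples(half_batch)
        X_fake, y_fake = generate_fake_samples(generator, latent_dim, half_batch)
        discriminator.train_on_batch(X_real, y_real)
        discriminator.train_on_batch(X_fake, y_fake)
        # Then the generator, labelling its fakes as "real" (y = 1) so the
        # gradients push it toward spectra the discriminator accepts
        z = generate_latent_points(latent_dim, 2 * half_batch)
        gan.train_on_batch(z, np.ones((2 * half_batch, 1)))

# After training, draw three times the original sample count, as configured:
# X_synth, _ = generate_fake_samples(generator, latent_dim, 3 * dataset.shape[0])
```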
To incorporate the three datasets into a unified framework, a low-level data fusion technique was employed. This technique involved stacking the data from different sources on top of each other, resulting in the creation of a new matrix.40 In order to augment the data, two distinct modes were employed, as described in ref. 41:
1. Posterior (post-fusion) to the merging of the datasets (e.g., W + C + DG + GAN)
2. Interstitial (pre-fusion) of the datasets (e.g., W + GAN_W + C + GAN_C + DG + GAN_DG)
Applying GAN to the posteriorly merged dataset results in the generation of synthetic data that exhibits variations across different dataset types. In contrast, the interstitial dataset contains more individual and homogeneous data types.42 Moving forward, the datasets generated through HTL will be referred to as the “original” datasets. The datasets consisting of the original data along with the synthetic samples generated by GAN will be referred to as the “hybrid” datasets.
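The two fusion orders can be illustrated with simple row-stacking; W, C, DG, and gan_augment() are placeholder names for the three feature matrices and the GAN sampling step described earlier.

```python
import numpy as np

# Post-fusion: merge first, then augment the combined matrix
fused = np.vstack([W, C, DG])                        # low-level data fusion
hybrid_post = np.vstack([fused, gan_augment(fused)])

# Interstitial (pre-fusion): augment each dataset, then merge
hybrid_pre = np.vstack([W, gan_augment(W),
                        C, gan_augment(C),
                        DG, gan_augment(DG)])
```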
In general, the application of feature engineering techniques can significantly enhance the accuracy of classifiers for various reasons.
Firstly, these techniques facilitate the capture of crucial distributional properties of the data, assisting classifiers in distinguishing between various classes or patterns. Analyzing the distributional properties of features can also aid in outlier identification and handling. Outliers, being data points that deviate significantly from the majority, have the potential to distort the distribution and impact classifier performance. Detecting and potentially treating or removing outliers can enhance the accuracy of the classification process.43
Secondly, feature engineering techniques can exhibit discriminative power, meaning they possess distinct values for different classes or patterns within the data.44 For example, in the case of spectra of IR spectroscopy, calculating the differences of statistical values can help highlight the unique characteristics of different classes, making it easier for the classifier to differentiate between them.
Thirdly, feature engineering can help reduce the impact of noise by emphasizing the relative changes in the spectra rather than absolute intensity values.45 For instance, in the context of spectra from IR spectroscopy, calculating differences between statistical values can help emphasize the variations that are relevant for classification while reducing the impact of noise or absolute intensity values that may be subject to fluctuations.
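As an illustration, the sketch below appends per-wavelength difference features derived from the mean (M), variance (V), skewness (Sk), and kurtosis (K) of each spectrum; the exact “difference” construction is our reading of the wavelength-plus-statistic feature names reported later (e.g., ∼1364 cm−1 + M), so the repository remains authoritative.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def difference_features(X):
    # Per-sample summary statistics over the 900 wavelengths
    m = X.mean(axis=1)[:, None]        # M
    v = X.var(axis=1)[:, None]         # V
    sk = skew(X, axis=1)[:, None]      # Sk
    k = kurtosis(X, axis=1)[:, None]   # K
    # Per-wavelength differences from each statistic, emphasizing relative
    # changes in the spectrum over absolute intensity values
    return np.hstack([X - m, X - v, X - sk, X - k])

X_aug = np.hstack([X, difference_features(X)])  # append engineered columns
```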
Non-linear feature extraction techniques have demonstrated superior performance compared to classical approaches like linear principal component analysis (PCA) or linear discriminant analysis (LDA) on datasets with a similar tabular structure, such as the time-series ECG200.46 In this study, the UMAP method was employed as a feature selection technique to reduce the dimensionality of the dataset, focusing on the most valuable features. Dimension reduction techniques have been found to improve classification performance, prevent overfitting and underfitting of the SVC, and enhance the runtime efficiency of the classification algorithm.47 Hyperparameter selection was done by plotting the UMAP results for purposely selected values of n_neighbors and n_components, as applied in these studies.48,49 An example of the datapoint distribution is plotted in Fig. A IV in the ESI.† A guide to the code used to generate and plot the figure can be found in ref. 50. Following the investigation of hyperparameters, the embedding dimension was set to 65 components (n_components). To ensure a comprehensive overview of the data's overall structure, the size of the local neighborhood (n_neighbors) was limited to 15. This constraint enabled UMAP to effectively capture the inherent structure of the data. Notably, in the context of infrared spectroscopy, the interaction between atoms and infrared radiation occurs across multiple wavenumbers. The Euclidean metric parameter was used to compute the distance between data points.
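With the stated hyperparameters, the reduction step might be implemented as follows using the umap-learn package; the SVC settings shown are assumptions.

```python
import umap
from sklearn.svm import SVC

# Embedding with the hyperparameters stated above
reducer = umap.UMAP(n_components=65, n_neighbors=15,
                    metric="euclidean", random_state=42)
X_emb = reducer.fit_transform(X_train)

# Downstream classifier; kernel choice and probability=True (needed later
# for SHAP's predict_proba interface) are assumptions
clf = SVC(kernel="rbf", probability=True).fit(X_emb, y_train)
print(clf.score(reducer.transform(X_test), y_test))
```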
| Dataset | Precision | Recall | F1 score | Accuracy/% |
|---|---|---|---|---|
| W + C + DG | 0.9397 | 0.7790 | 0.8382 | 84 |
| W + C + DG + GAN | 0.7594 | 0.3429 | 0.4696 | 88 |
| W + GAN + C + GAN + DG + GAN | 0.9286 | 0.8592 | 0.8903 | 91 |
The accuracy of base model 1, the hybrid dataset with GAN applied posteriorly to the merging (W + C + DG + GAN), was 88%. In the first ablation study (ablation study A1, Fig. 4), fifteen permutations showed more than a 10% decrease in accuracy, two returned similar values, while the others showed incremental increases, with three reaching 92% (highlighted by green borders). For the best-performing models, this represents a 4% increase in accuracy over base model 1. UMAP application (ablation study A2, Fig. 4) showed similar performance to base model 1, with the exception of three outliers that reached 91% accuracy. Interestingly, UMAP stabilized the performance of the models that had fit poorly in ablation study A1. This could be the result of retaining only the glucose-contributory features.
The accuracy of base model 2, the hybrid dataset with GAN applied interstitially (W + GAN_W + C + GAN_C + DG + GAN_DG), was approximately 91%. Feature engineering improved the models to 92% and 93%, respectively, a small but valuable contribution to the classification of glucose in aqueous solution (ablation study A3, Fig. 4). UMAP application (ablation study A4, Fig. 4) performed poorly compared to base model 2, reducing the accuracies to the 70–80% range.
| Ablation study | Dataset permutation | Accuracy/% | Standard deviation | Precision | Recall |
|---|---|---|---|---|---|
| A1 | M | 91.80 | 2.88 | 0.9737 | 0.8981 |
| A2 | St + V + Sk | 90.96 | 3.64 | 0.9237 | 0.9254 |
| A3 | M + V + K | 93.49 | 2.35 | 0.9588 | 0.9313 |
| A4 | M + Sk | 81.56 | 4.50 | 0.8873 | 0.8070 |
In Fig. 5, the SHAP values and their contribution to the classification model based on the W + C + DG dataset are shown. Absorption values from across the spectrum are present, from the O–H group stretching characteristic of the 3000 to 4000 cm−1 region (10 out of 20 features) to the C–O group stretching. The impact of these features is shown by the coloured dots. Preponderantly high values are present in the absorption of the C–H stretching in CH3 at ∼1364 cm−1, the syringyl ring breathing at ∼1267 cm−1 and ∼1215 cm−1, and the C–O stretching at ∼1073 cm−1. The O–H group stretching has a lesser impact on the model output, as highlighted by the blue dots. The even distribution of feature impacts might be the reason for the average accuracy of 84%.
In Fig. 6, the feature contributions of the dataset W + GAN_W + C + GAN_C + DG + GAN_DG + M + V + K (highlighted in Table 3) are presented. The classification model using this dataset showed the highest performance accuracy, an average of 93.49% ± 2.35%.
Fig. 6 Top 20 feature contributions towards the model classification for the W + GAN_W + C + GAN_C + DG + GAN_DG + M + V + K dataset.
Compared to the model in Fig. 5, feature engineering played a more significant role in the order of importance of values. In this case, only two values from the original dataset are among the top 20 most important features, namely the absorption from the O–H stretching, ∼3291 cm−1, and the aliphatic C–H stretching in CH3, ∼1364 cm−1. The former has a negative impact on the classification model, whereas the latter has a high positive impact. Having the C–H stretching contribute to the model is valuable, since this stretching is part of the glucose composition, as seen in Fig. A I (in ESI†). The resulting mean-difference feature from the same wavelength also positively influences the model output, and it tops the list as the most influential feature (∼1364 cm−1 + M). Similarly, the engineered feature of the O–H stretching containing the kurtosis-difference value also shows a negative contribution towards the model output. Other significant features captured in the top 20 include the mean differences at ∼1468 cm−1, ∼1162 cm−1, ∼2963 cm−1, ∼1140 cm−1, ∼1431 cm−1, ∼1032 cm−1, ∼1405 cm−1, and ∼1103 cm−1, as well as the kurtosis difference at ∼1032 cm−1. These frequencies correspond directly to those seen in solid and aqueous glucose solution spectra. Additionally, the O–H stretching and C–O group ranges (3000 to 4000 cm−1) contribute negatively to the model output, because they represent groups of compounds found in water, which are not important in identifying compounds specific to glucose.
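For reproducibility, a summary plot like those in Fig. 5 and 6 could be generated along these lines; the use of KernelExplainer, the background-sample size, and the variable names are assumptions, and clf is assumed to be an SVC fitted with probability=True.

```python
import shap

# Summarize the classifier over a small background sample to keep the
# kernel-SHAP estimate tractable; 50 is an arbitrary choice
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(clf.predict_proba, background)
shap_values = explainer.shap_values(X_test)

# Beeswarm of the 20 most influential features; index [1] selects the
# positive ("glucose above threshold") class in older SHAP versions
shap.summary_plot(shap_values[1], X_test,
                  feature_names=feature_names, max_display=20)
```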
First, individual datasets were used to build a classification model. Second, GAN was applied under two data fusion circumstances. It was found that the classification of the hybrid datasets depends on the fusion type: GAN applied posteriorly scored a lower accuracy than GAN applied interstitially. Furthermore, hand-crafted features were added to improve the classification models. The results showed an average accuracy increase of more than 9% over the base model, from 84% to more than 93%. Under the same argument, we also applied UMAP. The dimensionality reduction method did not exceed the earlier reported accuracy, but it improved over the base model, from 84% to 91%. The best-performing model was explained by employing SHAP values. It was found that, within the top 20 features, those related to the glucose compounds positively influence the classification model, whereas those found in water contribute negatively towards the model output. Although this framework was tested on the HTL biomass conversion system, it opens new avenues for integrating FTIR in continuous process monitoring.
For example, the integration of data augmentation using generative AI and IR spectroscopy for process monitoring has the potential to revolutionize costly and lengthy research and development activities such as monoclonal antibody production, gene therapy manufacturing, and cultured meat production. Generative AI techniques enable the generation of synthetic data, augmenting existing datasets and providing greater volume and variability. This augmented dataset improves machine learning model training, enhancing accuracy and generalization. Consequently, it accelerates the research cycle by enabling simulation, prediction, and optimization of process parameters without extensive physical experimentation. FTIR as a sensory technique allows real-time process monitoring, continuously analyzing critical quality attributes and parameters to ensure consistency, reproducibility, and early detection of deviations. When coupled with a classifier such as an SVC, it can even outperform traditional process control techniques (e.g., proportional–integral–derivative control). This enables timely interventions and corrective actions, reducing batch rejections and enhancing overall product quality. Ultimately, the implementation of generative AI and IR spectroscopy mitigates risks in the aforementioned research and development activities, resulting in cost savings by minimizing production failures and optimizing process performance.
Consequently, the current method offers the distinct benefit of being a decentralized AI system, addressing the issue of biases found in master datasets. Master datasets, typically sourced from large-scale platforms, may unknowingly harbor biases and dominant features that contribute to inequalities or reinforce societal imbalances. However, by training decentralized AI models using local data, such as the data generated under the HTL conditions outlined in this study, this method potentially mitigates these biases and fosters fairer and more inclusive machine learning applications.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d3ya00236e
This journal is © The Royal Society of Chemistry 2023 |