Predicting lignin removal efficiency in deep eutectic solvent-based biomass fractionation: an explainable machine learning approach

Zijing Zhong ab, Babbiker Mohammed Taher Gorish ab, Yue Bai ab, Waha Ismail Yahia Abdelmula ab, Wenqian Dang ab and Daochen Zhu *ab
aInternational Joint Laboratory on Synthetic Biology and Biomass Biorefinery, Biofuels Institute, School of the Environment and Safety Engineering, Jiangsu University, Zhenjiang 212013, China. E-mail: dczhucn@ujs.edu.cn
bJiangsu Collaborative Innovation Center of Technology and Material of Water Treatment. Suzhou University of Science and Technology, Suzhou 215009, China

Received 13th May 2025 , Accepted 1st August 2025

First published on 11th August 2025


Abstract

Efficient and sustainable lignin extraction is essential for advancing green biorefinery processes. Deep eutectic solvents (DESs), as environmentally benign and tunable media, offer promising alternatives to conventional solvents, yet their rational design and optimization remain highly resource-intensive. In this study, we developed a novel machine learning-guided framework that integrates computational chemistry and data-driven modeling to accelerate the design of DES systems for lignocellulosic biomass pretreatment. A comprehensive dataset of 467 pretreatment conditions was constructed, featuring molecular descriptors and fingerprints to characterize DES chemical structures and key operational parameters. Among several models evaluated, the XGBoost model achieved the best predictive performance (R2 = 0.8259, RMSE = 0.0928, MAE = 0.0672). SHAP analysis provided mechanistic insights, revealing the dominant influence of hydrogen bond donor–acceptor structures and solvent-to-solid ratio on lignin solubilization. The model demonstrated robustness when validated against an independent dataset (R2 = 0.8034, MAE = 0.0661). This work offers a sustainable and mechanistically interpretable strategy for green solvent development, significantly reducing experimental workload and chemical waste, and contributing to advancing biomass valorization technologies.



Green foundation

1. This study proposes a sustainable approach to optimizing lignin removal from lignocellulosic biomass by assisting resource-intensive, trial-and-error Deep Eutectic Solvent (DES) experimental methods with an explainable machine learning framework.

2. Our XGBoost model (R2 = 0.8259) for predicting lignin removal efficiency in DES systems significantly reduces chemical waste, energy consumption, and raw material usage by minimizing the need for extensive physical synthesis and testing of numerous DES formulations.

3. By providing mechanistic insights through SHAP analysis, this data-driven framework accelerates the rational design of more efficient and environmentally benign DES formulations, directly advancing the green and sustainable development of biomass valorization technologies.


1. Introduction

Effective fractionation of lignin and cellulose is critical for maximizing biomass valorization, but conventional pretreatment methods remain plagued by high energy demands and poor selectivity, severely limiting biorefinery performance. For example, the kraft pulping process-which accounts for over 70% of global industrial lignocellulose pretreatment-consumes on the order of 4.4 GJ per tonne of pulp and generates “black liquor” rich in degraded lignin fragments that require combustion or energy–intensive recovery steps.1–3 Acid pretreatments (e.g., dilute sulfuric acid at 160–200 °C) can achieve substantial hemicellulose solubilization and partial lignin removal but often produce 4–18 g L−1 of inhibitory compounds such as furfural and hydroxy methyl furfural, complicating downstream bioconversion.4,5 Alkali methods (e.g., NaOH at 120–180 °C) similarly disrupt lignin–carbohydrate complexes but lead to extensive lignin degradation, loss of phenolic functionality, and the formation of condensed, recalcitrant structures that resist further valorization.6,7 Organic–solvent approaches (organosolv) can yield relatively pure lignin fractions. Still, the economic viability of this approach depends critically on efficient solvent recycling, typically achieved by distillation, which incurs substantial energy costs. Moreover, the high volatility of these solvents introduces significant safety hazards, requiring rigorous operational controls and strict adherence to safety protocols.8,9 To achieve high lignin yield, the technical lignin extraction process typically employs harsh reaction conditions, that frequently trigger lignin condensation via C–C coupling, further lowering monomer yields and complicating catalyst or biocatalyst access to reactive sites.1 These challenges underscore the urgent need for environmentally benign, energy-efficient, and highly selective fractionation technologies that preserve lignin's chemical integrity and unlock its full potential as a feedstock for high-value chemicals.10

Deep eutectic solvents (DESs) have gained traction as next-generation green solvents due to their low toxicity, negligible volatility, non-flammability, facile preparation from inexpensive precursors, and intrinsic biodegradability.11 They are typically formed by mixing a hydrogen-bond donor (HBD) (e.g., organic acids, polyols) with a hydrogen–bond acceptor (HBA) (e.g., choline chloride) in defined molar ratios to produce a eutectic mixture whose melting point is drastically lower than that of its components.12,13 In lignocellulose fractionation, ternary DES systems-such as choline chloride/oxalic acid/ethylene glycol-have been shown to preserve up to 53 β-O-4 linkages per 100 aromatic units in the recovered lignin while achieving delignification yields exceeding 80% at 120 °C within four h, thereby maintaining lignin's structural integrity for downstream valorization.14 A recent systematic investigation of over forty choline chloride-based DESs further demonstrated that increasing the HBD-to-HBA ratio and employing shorter-chain HBDs (e.g., glycerol vs. ethylene glycol) can enhance industrial lignin solubility from 50% to over 70% at 100–140 °C.15 Despite these promising results, the enormous diversity of possible DES formulations-spanning combinations of quaternary ammonium salts, carboxylic acids, polyols, sugars, and urea derivatives—and the sensitivity of lignin solubility to process parameters (temperature, reaction time, moisture content, solvent-to-solid ratio) generate a complex, multidimensional design space.16 Synergistic and antagonistic hydrogen-bonding interactions among DES components further complicate the prediction of solvent performance. Traditional trial-and-error screening is therefore insufficient to navigate these nonlinear, multivariate relationships, and there remains a lack of systematic understanding linking specific molecular descriptors (e.g., hydrogen bond strength, polarity, functional group) to lignin dissolution efficiency, limiting the rational design of optimized DES formulations.

Machine learning (ML) has emerged as a transformative tool for accelerating solvent discovery and process optimization in sustainable chemistry. By mining historical experimental datasets, ML models can uncover hidden patterns, build predictive relationships, and dramatically reduce the experimental burden of navigating complex design spaces.17,18 For instance, Taylor et al. critically reviewed the integration of ML with quantum and molecular-scale simulations for ionic-liquid–lignin systems, highlighting how data-driven approaches can predict solvent–lignin interactions with improved accuracy over purely empirical methods.19 Similarly, in lignocellulose pretreatment studies, XGBoost models achieved R2 values between 0.756 and 0.980 in predicting lignin destabilization under varying thermal pretreatment conditions, demonstrating ML's potential to optimize key parameters such as temperature and time.20 Despite these successes, the application of ML to DES-based lignin extraction remains nascent. A recent survey of ML in lignocellulosic biorefineries noted that while artificial neural networks (ANN) dominate bioconversion predictions—appearing in nearly 90% of published studies-few efforts have extended such methods to DES solvent design or provided mechanistic insights into the underlying driving forces.21 There remains no systematic framework linking DES molecular descriptors, biomass composition, and operational parameters to lignin dissolution efficiency. This “black-box” nature of conventional ML models limits their scientific interpretability and impedes the derivation of generalizable design principles for green solvents. To overcome these challenges, there is an urgent need for ML frameworks that integrate multi-scale descriptors, including solvent molecular structure, biomass characteristics, and process variables, and employ explainable AI techniques to reveal how specific features (e.g., hydrogen bond strength, polarity, molar mass) govern biomass fractionation outcomes.

In this study, we developed a machine learning-guided framework to optimize lignin extraction from lignocellulose using DESs (Fig. 1). We trained and compared multiple representative machine learning models by constructing a comprehensive dataset integrating solvent molecular descriptors, process parameters, and extraction outcomes. Furthermore, explainable AI techniques were employed to elucidate the key molecular and process factors influencing lignin dissolution, providing mechanistic insights into DES performance. This work offers a data-driven strategy for the rational design of green solvent systems, advancing the sustainable valorization of biomass toward circular economy and carbon neutrality goals.


image file: d5gc02370j-f1.tif
Fig. 1 Schematic illustrating the conventional and ML formulation development methods for fractionating biomass to extract lignin using deep eutectic solvents.

2. Green chemistry concept on optimizing deep eutectic solvents for sustainable lignin removal using machine learning

Our study introduces a machine learning-guided framework for designing DES systems for lignin removal, offering a significantly greener approach than previous traditional experimental (trial-and-error) DES methods (Table 1). While traditional DES methods are inherently greener than conventional biomass fractionation techniques due to their solvent choice, they still incur substantial environmental costs during the research and development phase. Our method revolutionizes this R&D process by integrating computational chemistry and data-driven modeling, drastically reducing experimental workload and chemical waste. By accurately predicting lignin removal efficiency with an XGBoost model (R2 = 0.8259) and gaining mechanistic insights through SHAP analysis, we enable a more rational and efficient design of DESs, minimizing the need for extensive physical experimentation. This approach directly translates to a substantial reduction in raw material consumption, energy usage, and chemical waste generation during the optimization of DES-based biomass pretreatment, making the discovery process considerably more sustainable than prior art that relied heavily on repeated laboratory synthesis and testing.
Table 1 Comparative analysis of machine learning-guided vs. traditional DES development for lignin extraction
Risk type Metric Previous methods (data & references) The current study
Mass of waste E factor/process mass intensity High (due to repeated trial-and-error experiments) Low (ML predicts optimal conditions, reducing failed trials)
Ecotoxicity LC50 (aphnia magna, mg L−1) Moderate (some solvents had high aquatic toxicity) Low (ML selects DES with benign HBD/HBA components)
Energy usage Heating/distillation required? High (frequent heating/mixing for each trial) Low (only validated conditions are tested)
Human toxicity LD50 (rat, oral, mg kg−1) Variable (depends on solvent choice) Optimized (non-toxic DES components prioritized) Safer chemical profiles due to explainable ML guidance.
Persistence Half-life (days) Some solvents were persistent (e.g., ionic liquids) Lower environmental persistence via greener solvent design.
Resource depletion Rare elements used Possible (if metal catalysts are used) Minimal (avoids precious metals, uses abundant HBD/HBA)


It is important to note that although DESs are frequently regarded as green solvents, our model was developed solely to predict lignin removal efficiency without applying explicit environmental constraints to the DES components. Given the broad chemical space of DESs, not all predicted combinations may adhere to green chemistry principles. Future work should integrate environmental and sustainability metrics—such as toxicity, biodegradability, and renewability of components—into the modeling framework to guide the selection of truly green and efficient solvent systems.

3. Materials and methods

3.1. Data collection

A comprehensive dataset of DES–lignin pretreatment experiments was compiled from peer-reviewed publications (Tables S1 and S2). In assembling this dataset, we observed that herbaceous and woody biomass exhibit pronounced differences in monomer composition, inter-unit linkages (e.g., β-O-4 vs. C–C), and cell-wall architecture, leading to divergent fractionation behaviors under identical DES conditions; Baran et al.22 demonstrated via FreeViz projection of ionic-liquid pretreatment parameters that these two biomass classes occupy distinct regions in multivariate feature space. Mixing both courses in a single training set would therefore obscure model fitting, degrade predictive accuracy, and produce ambiguous feature-importance interpretations, limiting practical guidance for DES formulation. To maximize model generalizability and interpretability, we accordingly restricted our dataset to herbaceous-biomass studies. For each of the 467 unique experimental conditions, we extracted solvent descriptors (HBD/HBA identity and ratio, physicochemical properties), process parameters (temperature, reaction time, moisture content, solvent-to-solid ratio) and lignin yield (gravimetric separation efficiency), standardizing all values to standard units (°C, minutes, wt% moisture, g solvent per g biomass, % yield). Missing values (<5% of entries) were imputed via k-nearest neighbors (k = 5) based on similarity in solvent composition and pretreatment conditions.

To enhance the practical value of the ML model's predictions, the dataset only uses prior features. It excludes experimental outcomes, such as post-treatment lignin content, as input features to avoid data leakage. The feature variables of the dataset were categorized into three groups: biomass particle size and composition (cellulose, hemicellulose, and lignin content), process parameters (temperature, reaction time, solid-to-liquid ratio, DES molar ratio, and severity factor log[thin space (1/6-em)]R0), and chemical structural descriptors of the DES solvent components (molecular descriptors (MDs), molecular fingerprints (MFs)). Due to the varying methods of lignocellulose component determination in different studies, this work selected data points measured using the same detection methods (NREL standard methods23 and modified variants). However, due to factors such as reagent quality and experimental conditions, even under similar process conditions, the lignin fractionation results of DES solvents may still show some discrepancies across different studies. In cases of inconsistent information, the source articles were carefully reviewed to avoid data with missing feature information, and more reliable data were selected based on the evaluation of the experimental protocols used.

The final curated dataset comprises 467 experimental records covering 19 different herbaceous biomass types, including wheat straw, rice straw, switchgrass, and Arabidopsis thaliana mutants. The compositional diversity of biomass inputs is illustrated in Fig. 2a, which shows the range of cellulose, hemicellulose, and lignin contents. These variations reflect the chemical heterogeneity inherent to herbaceous lignocellulose and underscore the broad applicability of the ML framework. The DES formulations were composed of 27 HBAs and 36 HBDs, with varied molar ratios across different combinations. The molar mass distributions of HBA and HBD species are summarized in Fig. 2b. The experimental conditions span a wide range of pretreatment temperatures (Fig. 2c), solid-to-liquid ratios (L[thin space (1/6-em)]:[thin space (1/6-em)]S) (Fig. 2d), and reaction times (Fig. 2e), capturing the operational diversity commonly used in DES fractionation protocols. The distribution of the target variable, lignin removal yield, is shown in Fig. 2f. To ensure numerical stability and accelerate model convergence, all continuous input features were normalized using min–max scaling before model training. Normalization also improves generalization performance and enhances model interpretability by eliminating scale-based feature dominance. For features involving nonlinear process intensity, the pretreatment severity factor log[thin space (1/6-em)]R0 was calculated according to eqn (1).

 
Log[thin space (1/6-em)]R0 = log(t × exp(T − 100)/14.7)(1)
where t represents the pretreatment time (min); T represents the pretreatment temperature (°C).


image file: d5gc02370j-f2.tif
Fig. 2 Distribution of partial input characteristics and lignin removal rates. (a) Biomass composition (cellulose, hemicellulose, lignin); (b) molar mass of HBAs and HBDs; (c) pretreatment temperature (°C); (d) liquid-to-solid (L[thin space (1/6-em)]:[thin space (1/6-em)]S) ratio; (e) pretreatment time (h); (f) lignin removal yield (%).

3.2. Data pre-processing

Biomass composition and process parameters were manually extracted from the literature, whereas solvent structural representations were generated computationally to capture their chemical complexity. Because molecular structure underpins physicochemical behavior, the choice of encoding scheme critically affects model performance. We therefore employed two complementary, QSPR-proven approaches-MDs and molecular MFs to featurize each DES component.24–27 MDs (e.g., molecular weight, topological polar surface area, H-bond donor/acceptor counts) and MFs (binary vectors encoding substructure presence) were computed via the open-source cheminformatics libraries RDKit and Mordred,28,29 enabling rapid, large-scale extraction of over 1800 features per molecule. For datasets of this size, the MD + MF combination delivers a stable, interpretable feature space that obviates the need for more complex graph-based embeddings while ensuring robust predictive accuracy.22

For each DES formulation, we encoded both HBAs and HBDs using MDs and MFs, labeling features with an “HBA_” or “HBD_” prefix. Morgan fingerprints were generated with radius = 1 and nBits = 6000, following the original algorithm (for detailed information about the reasons for setting the Morgan fingerprint, see Text S1; Fig. S1). Because a DES is a mixture rather than a single compound, we aggregated the component fingerprints by summing the HBA and HBD bit vectors in proportion to their molar ratios – a procedure commonly used in multi-component cheminformatic studies. However, this summation produces entries greater than 1, which violates the binary “presence/absence” logic of Morgan fingerprints and complicates downstream interpretation. This limitation underscores the need for alternative aggregation schemes (e.g., bitwise OR, feature hashing, or continuous vector embeddings) that preserve the interpretability of substructure counts while accurately reflecting mixture composition.

To overcome the limitations of binary Morgan fingerprints, we adopted count-based Morgan fingerprints (C-MFs), in which each bit can take integer values ≥1 corresponding to the number of distinct atom-centered environments mapped to that bit (Table S3).30 C-MFs preserve substructure count information, enabling straightforward aggregation of HBA and HBD contributions in proportion to their molar ratios and providing a richer, continuous feature space that enhances dynamic feature influence and mitigates low-variance issues common with binary vectors.

MFs provide qualitative representations of molecular structures via binary vectors or sequence encodings, whereas MDs numerically capture structural, physicochemical, and topological properties. When constructing the different QSAR models, both MFs-based and MDs-based approaches exhibit distinct strengths.24,26,31 Their predictive performance hinges on the intrinsic structure–property relationships within the dataset and the selected machine learning algorithm, and it isn't easy to judge which is the optimum molecular representation.32 Prior works have shown that combining MFs (including C-MF) with MDs can yield synergistic gains in predictive accuracy across diverse chemical datasets,32,33 This dual-encoding strategy is well-suited for robust ML modeling of DES performance.

The extensive set of MDs and MFs can easily exceed the number of observations, invoking the “curse of dimensionality”, which undermines model generalization and computational efficiency. Mitigation strategies include increasing data size or reducing feature dimensionality, the latter achieved by either feature selection-removing constant or highly correlated variables- or feature extraction-projecting the data onto a lower-dimensional space (e.g., via principal component analysis, PCA). In our study, constant features in the MF matrix (bits with zero variance) were first eliminated to remove noninformative descriptors. We opted against PCA for fingerprint reduction (Text S2) to preserve the interpretability of substructure counts.

For the >1800 MDs generated by Mordred, many descriptors represent intermediate computational quantities or redundantly correlate with more interpretable features, complicating mechanistic insight. For this purpose, we addressed the high dimensionality of the MDs dataset by performing targeted feature selection based on both domain-specific chemical knowledge and mechanistic insights reported in DES-biomass fractionation literature.34 Specifically, we curated 13 descriptors to describe HBA and HBD, respectively, totaling 26 features. These variables were selected not only for their statistical variability but also for their direct relevance to known physicochemical mechanisms underlying lignin disruption, such as hydrogen bonding potential, acid–base interactions, solvation polarity, and molecular size. The 13 selected MD features represented a broad spectrum of DES molecular properties, including acidity/basicity (e.g., pKa for acidic/neutral species and pKb for basic groups), polarity indices (e.g., polar surface area, polarizability), partition coefficient (log[thin space (1/6-em)]P), molecular weight, and carbon atom count. These descriptors collectively capture the key physicochemical parameters that govern the solvent's ability to disrupt lignin–hemicellulose linkages. Prior mechanistic studies have established that the delignification process by DESs often involves cleavage of ether or ester bonds, which can be catalyzed or facilitated by the presence of solvated H+ or OH species in the medium.34–36 Therefore, the inclusion of acid–base dissociation constants and the number of proton-donating or -accepting functional groups in the feature set was essential to account for these catalytic aspects.

Moreover, the hydrogen bond network formed between DES components and lignin molecules plays a central role in the dissolution mechanism. Specifically, the capability of the DES to establish strong hydrogen bonds with lignin's hydroxyl-rich aromatic macromolecules is critical for disrupting its supramolecular structure.37 Parameters such as HBD and HBA counts, polar surface area, and polarizability are therefore key indicators of the strength and density of such interactions. Log[thin space (1/6-em)]P values offer additional insight into solvent hydrophobicity, which may influence phase compatibility and interfacial interactions during pretreatment. The number of carbon atoms and the molecular weight of each component further inform solvent properties such as viscosity, diffusivity, and molecular mobility, all of which influence the kinetics of biomass deconstruction.

In conclusion, guided by domain knowledge and prior QSPR work on lignin solubility, we preselected MDs likely to influence dissolution, such as hydrogen-bond donor/acceptor counts, molecular polarizability, and topological polar surface area.11,22,38–40 The specific data of the relevant MDs can be found in Table S4. We then performed agglomerative hierarchical clustering on these MDs alongside biomass and process parameters, using Euclidean distance and average linkage to group strongly correlated features into clusters.41 At each clustering level, we evaluated the impact of cluster-specific feature subsets on lignin-removal prediction accuracy, iteratively pruning the least contributive clusters to arrive at a final, parsimonious feature set for subsequent model training. This combined domain-guided and data-driven filtering approach ensures both robustness and interpretability of our ML framework.

The use of MFs in conjunction with chemically grounded MDs allowed us to construct hybrid models that combine structural specificity with mechanistic interpretability, facilitating both predictive accuracy and explanatory insight into DES-driven lignin removal.

3.3. Algorithm

To benchmark predictive performance for deep eutectic solvent-mediated lignin extraction, we compared five machine learning algorithms representing four core paradigms. Support Vector Regression (SVR) is based on statistical learning theory and excels at high-dimensional nonlinear regression with limited samples.42 Random Forest (RF) builds an ensemble of decision trees on bootstrap samples and averages their outputs, which captures complex interactions in data.43 eXtreme Gradient Boosting (XGBoost) uses gradient boosting with regularization and parallel computation to achieve high accuracy on structured data, and it handles missing values robustly.44 Categorical Boosting (CatBoost) extends gradient boosting by processing categorical variables without extensive encoding; it uses ordered boosting and symmetric tree structures to reduce bias from target leakage.45 Finally, the ANN model is a fully connected feed-forward network capable of approximating complex nonlinear relationships in high-dimensional feature spaces.46 This diverse set of algorithms provides a comprehensive evaluation of modeling strategies for the rational design of solvent systems.

3.4. Construction and optimization of machine learning models

To predict lignin removal efficiency in DES systems, five ML models were developed using rationally engineered molecular representations. After rigorous feature selection, various combinations of molecular fingerprints, molecular descriptors, and process parameters were integrated into a unified feature matrix. This feature set was used to construct five ML models: SVR, RF, ANN, XGBoost, and CatBoost. The SVR, RF, and ANN models were built using the Scikit-learn library, while XGBoost and CatBoost were implemented using their respective optimized platforms.47,48

Model development followed a structured and reproducible workflow. The dataset was randomly divided into a training set and a testing set at a ratio of 80 to 20. The training set was used exclusively to construct and tune the models, while the testing set was held out for final evaluation of model generalization. Bayesian optimization was applied in conjunction with five-fold cross-validation to optimize hyperparameters.30 The optimization objective was to maximize average validation performance. During this process, candidate hyperparameter combinations were selected and updated iteratively. The hyperparameter search ranges were defined based on prior knowledge of their impact on model complexity and predictive stability, as detailed in Table S5.

Due to the limited size of the dataset, ten-fold cross-validation was employed to reduce the risk of overfitting and to exploit the available data fully. The data was divided into ten mutually exclusive subsets. In each iteration, one subset served as the validation set while the remaining nine subsets were used for model training. This process was repeated ten times, and the average validation score was calculated to ensure robustness and statistical reliability. This approach improves generalization performance and ensures that the observed results are not artifacts of a specific data split.

3.5. Optimization of molecular features

In this work, we optimized molecular representations to enhance predictive modeling of DES interactions with lignocellulose. C-MFs were applied to convert raw structural data into meaningful chemical features. To improve model efficiency, a variance-based bit selection method was utilized, reducing dimensionality from 6000 to 157 bits. This process effectively eliminated non-discriminative features while retaining critical substructure patterns. To minimize redundancy in MD descriptors, we applied Spearman's rank correlation and agglomerative hierarchical clustering. This approach allowed us to group highly correlated descriptors and eliminate overlapping information, ensuring that only essential features remained for model construction.

3.6. Model evaluation

To quantitatively assess the performance of the machine learning models developed for predicting lignin removal efficiency, three commonly used regression evaluation metrics were employed: the coefficient of determination (R2), root-mean-square error (RMSE), and mean absolute error (MAE). These metrics were applied consistently to both training and testing phases to ensure a robust and comprehensive evaluation of predictive accuracy and model generalization.

The MAE reflects the average magnitude of prediction errors without considering their direction. It is defined as follows:

 
image file: d5gc02370j-t1.tif(2)
where n is the total number of samples, yi denotes the actual observed value, and Yi is the model-predicted value. A lower MAE indicates smaller average deviations between the predicted and actual values, implying improved predictive accuracy.

The RMSE measures the standard deviation of the residuals (prediction errors) and is particularly sensitive to large deviations due to the squared error term:

 
image file: d5gc02370j-t2.tif(3)

Compared to MAE, RMSE penalizes significant errors more heavily, thus providing insight into the model's robustness against outliers. Together, MAE and RMSE offer a balanced view of model reliability and resilience across varying error magnitudes.49

The R2 evaluates the proportion of variance in the dependent variable that is captured by the model, serving as a direct measure of model fit:

 
image file: d5gc02370j-t3.tif(4)
where image file: d5gc02370j-t4.tif is the mean of the actual target values. R2 values closer to 1 indicate a stronger explanatory capability of the model, while values approaching 0 suggest a poor fit and limited predictive power. In the context of green chemistry applications, achieving high R2 with minimal error metrics is essential to ensure data-driven process optimization and sustainable decision-making.

3.7. Model interpretation

To improve interpretability, we selected the XGBoost model with the best predictive performance for feature attribution analysis using the SHapley Additive exPlanations (SHAP) method.48 SHAP quantifies the contribution of each feature to the model's output by assigning it a SHAP value, which represents the impact of that feature on the predicted target value. Features were ranked by their average absolute SHAP values to identify the most influential variables affecting lignin removal efficiency. This approach provides both local and global interpretability, revealing not only how individual predictions are made but also which features consistently drive model behavior. The analysis was conducted in Python 3.8 using the SHAP package, enabling transparent and chemically meaningful interpretation of the model's results.

3.8. Experimental materials

Corn straw (CS) was collected from a local farm in Lianyungang, Jiangsu Province, China. The biomass was rinsed thoroughly with water to remove surface impurities, oven-dried at 60 °C to constant weight, ground into fine powder, and passed through a 60-mesh sieve. The processed straw powder was stored in airtight containers at room temperature until further use. On a dry weight basis, the CS composition was determined to be 43.80% cellulose, 24.69% hemicellulose, and 16.82% lignin, which included 13.18% acid-insoluble lignin and 3.68% acid-soluble lignin.

Choline chloride (ChCl, 98%), lactic acid (85%), and urea (95%) were purchased from Macklin Biochemical Technology Co., Ltd (Shanghai, China). Glycerol (98%) and Nile Red (98%) were obtained from Shanghai Shenggong, and p-nitroaniline (99%) was purchased from J&K Scientific (Beijing, China). All chemicals were of analytical grade and used without further purification.

3.9. DES synthesis and biomass pretreatment

DESs were prepared by mixing ChCl with HBDs such as lactic acid, urea, and glycerol in a 1[thin space (1/6-em)]:[thin space (1/6-em)]2 molar ratio. The mixtures were heated to 80 °C in an oil bath with continuous magnetic stirring until a clear and homogeneous liquid was formed. The solutions were then left at room temperature for 6 hours to assess clarity and phase stability, followed by drying in a vacuum oven at 60 °C for 24 hours to remove residual moisture. The dried DESs were sealed and stored at ambient conditions. For pretreatment, 2.0 g of corn straw was combined with 10.0, 16.0, or 40.0 g of DES to obtain solid to liquid ratios of 1[thin space (1/6-em)]:[thin space (1/6-em)]5, 1[thin space (1/6-em)]:[thin space (1/6-em)]8, and 1[thin space (1/6-em)]:[thin space (1/6-em)]20, respectively. The sealed bottles were heated in an oil bath at 100 °C, 120 °C, 130 °C, or 160 °C with magnetic stirring at 300 rpm for durations of 0.5, 1.25, or 2 hours. After reaction, the bottles were cooled to room temperature, and an acetone–water solution (v/v 3[thin space (1/6-em)]:[thin space (1/6-em)]7) was added. The resulting slurry was filtered under vacuum, and the solid residue was washed three times using 20 mL of an acetone–water solution. The recovered solids were dried at 90 °C for 12 hours and then weighed. Lignin content of the pretreated residues was analyzed according to NREL protocols, and lignin removal efficiency was calculated based on changes in lignin content before and after pretreatment. All experiments were performed in triplicate, and results are presented as mean values to ensure accuracy and reproducibility.

4. Results and discussion

4.1. Feature engineering

A total of 35 MDs were analyzed using pairwise Spearman correlation, as shown in Fig. 3a. Strong correlations were observed between various descriptors, such as HBA molecular weight and polarizability (r = 0.95), HBA pKa/pKb (r = 0.89), HBA carbon count and number of rotatable bonds (r > 0.95), HBD molecular weight and polarizability (r > 0.92), and HBD acidity count and pKa/pKb (r < –0.89). To address this, we employed agglomerative hierarchical clustering (Fig. 3b) to group similar features and eliminate redundancies. The optimized descriptor set, summarized in Table 2, was subsequently used for final model construction.
image file: d5gc02370j-f3.tif
Fig. 3 (a) Heat map of Spearman's rank correlation between the 35 features initially selected. Dark red indicates a Spearman rank correlation of 1, and dark blue indicates a Spearman rank correlation of −1. (b) A dendrogram obtained through agglomerative hierarchical clustering represents the hierarchical clustering of features obtained through cluster analysis.
Table 2 Hierarchical clustering selected feature groups
Groups of variables Abbreviation of variables Variables interpretation Group
0 1 2 3 4 5 6 7
Composition of biomass Cellulose Cellulose content of raw biomass, %
Hemicellulose Hemicellulose content of raw biomass, %
Lignin's Lignin content of raw biomass, %
Size Particle size of raw biomass, mm
Technological parameter Temp Pretreatment temperature, °C
Time Pretreatment time, h
HBD[thin space (1/6-em)]:[thin space (1/6-em)]HBA Molar ratio
L[thin space (1/6-em)]:[thin space (1/6-em)]S Mass ratio of DES to raw biomass
Log[thin space (1/6-em)]R0 Severity factors of pretreatment reaction
MDs HBA-Sp Polarizability OF HBA
HBD-Sp Polarizability OF HBD
HBA-pKa/pKb Acidity coefficient/alkalinity coefficient of HBA
HBD-pKa/pKb Acidity coefficient/alkalinity coefficient of HBD
HBA-MW Exact molecular weight of HBA
HBD-MW Exact molecular weight of HBD
HBA-TopoPSA Topological polar surface area of HBA
HBD-TopoPSA Topological polar surface area of HBD
HBA-nHBAcc Number of hydrogen bond acceptors of HBA
HBD-nHBAcc Number of hydrogen bond acceptors of HBD
HBA-nHBDcc Number of hydrogen bond donors of HBA
HBD-nHBDcc Number of hydrogen bond donors of HBD
HBA-S[thin space (1/6-em)]log[thin space (1/6-em)]P_VSA1 The estimated partition coefficient by to the atoms’ contribution to the molecular surface area of HBA
HBD-S[thin space (1/6-em)]log[thin space (1/6-em)]P_VSA1 The estimated partition coefficient by to the atoms’ contribution to the molecular surface area of HBD
HBA-S[thin space (1/6-em)]log[thin space (1/6-em)]P Hydrophobic constants of HBA
HBD-S[thin space (1/6-em)]log[thin space (1/6-em)]P Hydrophobic constants of HBD
HBA-nAromAtom The aromatic atom count of HBA
HBD-nAromAtom Aromatic atom count of HBD
HBA-nRot Rotatable bond count of HBA.
HBD-nRot Rotatable bond count of HBD.
HBA-nAcid Number of acid groups of HBA
HBD-nAcid Number of acid groups of HBD
HBA-nBase Number of alkaline groups of HBA
HBD-nBase Number of alkaline groups of HBD
HBA-nC Number of carbon atoms in HBA
HBD-nC Number of carbon atoms in HBD
MFs C-MFs DES's count-based Morgan fingerprint


Our feature engineering process is grounded in both chemical interpretability and statistical rigor, ensuring that the final feature space captures critical information related to DES–lignin interactions. This approach improves model robustness and generalizability while maintaining chemical relevance. Compared with previous studies, such as those by Ge et al.,50 which relied on broad solvent parameters, our method provides a more detailed, molecularly resolved view of solvent–biomass interactions. Specifically, our model considers molecular features like hydrogen bond donor/acceptor counts, polarizability, and acid–base dissociation constants, facilitating more rational solvent design. Furthermore, Sumer et al.51 highlighted the use of Kamlet–Taft parameters and COSMO-RS predictions for assessing solvent-lignin affinity, but their approach was computationally intensive and not easily scalable. In contrast, the descriptors we selected are not only chemically meaningful but also easily computable using tools like RDKit, PaDEL, or AlvaDesc, making our approach more scalable for large molecular libraries.

The feature clusters identified through correlation-based hierarchical clustering were validated against known structure–property trends in open-access chemical databases, such as PubChem, ChemSpider, and the OECD QSAR Toolbox. These clusters, including molecular weight, polarizability, and number of rotatable bonds, align with existing quantitative structure–activity relationship (QSAR) models for solvent design in biomass pretreatment. By integrating data-driven correlation analysis with chemically informed feature selection, our framework offers improved predictive power and chemical transparency compared to existing models. It provides a foundation for developing interpretable and generalizable models, which can aid not only in lignin removal predictions but also in broader biorefinery process optimization, including the solubilization of hemicellulose and the generation of lignin-derived aromatic platform chemicals.

4.2. Model construction

To evaluate the influence of molecular representations on model performance, four feature sets were tested: (0) no molecular representation (baseline), (1) MDs only, (2) MFs only, and (3) a combined MDs + MFs representation. As summarized in Table 3, models without molecular-level input (feature group 0) showed consistently poor performance across all algorithms, emphasizing that process-level parameters alone (e.g., temperature, time, solid–liquid ratio) are insufficient for capturing the physicochemical diversity of DES systems. Introducing MDs or MFs significantly improved model accuracy, underlining the importance of structure–function relationships in solvent–biomass interactions.
Table 3 The model performances for each dataset, ML algorithm, and molecular characterization method
Feature group Model MAEtest RMSEtest R 2 test
All data in the table are results obtained through ten-fold cross-validation; the bolded indicates the optimal overall performance.
0 XGBoost 0.0905 ± 0.0078 0.1216 ± 0.0121 0.6982 ± 0.0662
CatBoost 0.0913 ± 0.0073 0.1215 ± 0.0122 0.7024 ± 0.0477
RF 0.0948 ± 0.0096 0.1250 ± 0.0113 0.6814 ± 0.0680
SVR 0.1045 ± 0.0078 0.1381 ± 0.0134 0.6157 ± 0.0588
ANN 0.1321 ± 0.0144 0.1666 ± 0.0159 0.4356 ± 0.1108
1 XGBoost 0.0708 ± 0.0062 0.973 ± 0.0069 0.8076 ± 0.0355
CatBoost 0.0743 ± 0.0067 0.0997 ± 0.0093 0.7999 ± 0.0296
RF 0.0830 ± 0.0088 0.1100 ± 0.0990 0.7540 ± 0.0510
SVR 0.0870 ± 0.0115 0.1190 ± 0.0184 0.7058 ± 0.0950
ANN 0.1108 ± 0.0141 0.1433 ± 0.0183 0.5875 ± 0.0705
2 XGBoost 0.0741 ± 0.0082 0.1023 ± 0.0116 0.7886 ± 0.0391
CatBoost 0.0758 ± 0.0074 0.1026 ± 0.0126 0.7887 ± 0.0340
RF 0.0826 ± 0.0077 0.1106 ± 0.0112 0.7527 ± 0.0439
SVR 0.0957 ± 0.0106 0.1309 ± 0.0170 0.6488 ± 0.0886
ANN 0.1156 ± 0.0116 0.1532 ± 0.0192 0.5266 ± 0.0961
3 XGBoost 0.0688 ± 0.0041 0.0953 ± 0.0055 0.8163 ± 0.0267
CatBoost 0.0734 ± 0.0083 0.0996 ± 0.0136 0.7984 ± 0.0488
RF 0.0823 ± 0.0086 0.1100 ± 0.0101 0.7554 ± 0.0429
SVR 0.0987 ± 0.0115 0.1379 ± 0.0225 0.6072 ± 0.1247
ANN 0.1183 ± 0.0126 0.1546 ± 0.0184 0.5210 ± 0.0704


Among single molecular representations, MDs (feature group 1) performed better than MFs (feature group 2). This is expected, as MDs encode physicochemical properties like acidity, polarity, molecular size, and hydrogen bonding, which directly relate to lignin solubilization mechanisms. These properties exhibit greater variance and lower sparsity, making them more suitable for practical machine learning models. MFs, though useful for molecular similarity screening, remain sparse and high-dimensional, which limits their effectiveness in regression-based modeling of complex biochemical processes.

The combination of MDs and MFs (feature group 3) aimed to exploit complementary information from both representations. Tree-based models (XGBoost, CatBoost, and RF) generally performed well with this configuration, showing modest improvements in R2_test and reductions in MAE_test and RMSE_test compared to models using MDs alone. This suggests that, although MDs capture most predictive information, MFs provide additional structural signals that refine predictions. Models like XGBoost and RF, which have built-in feature selection and can handle high-dimensional data, particularly benefited from this combination.30 In contrast, for SVR and ANN, the addition of MFs increased the total feature count from 35 to 194, leading to decreased performance. In SVR, the increased dimensionality led to overfitting, with the kernel-based projection in high-dimensional spaces resulting in hypersurface models that generalized poorly. ANN performance was similarly reduced due to increased feature sparsity and noise disrupting gradient descent convergence.

Despite these challenges, MFs remain valuable from an interpretability perspective. By encoding functional groups like hydroxyl, carbonyl, or amide groups, MFs provide chemical insights into how these groups in HBAs and HBDs influence lignin extraction. Such qualitative insights are difficult to derive from continuous MDs alone, which focus on quantitative aspects. While MDs offer superior predictive performance due to their direct statistical suitability, MFs contribute structural insights that enhance model interpretability. This combined approach can improve prediction accuracy for robust models like XGBoost, although caution is needed when using them in models sensitive to high-dimensional noise. These findings highlight the importance of aligning molecular representation strategies with model characteristics to achieve both predictive power and chemical transparency in machine learning-guided solvent design.

As shown in Fig. 4a, models trained with various feature groups showed distinct predictive performances. XGBoost and CatBoost exhibited high accuracy, with predictions clustering closely along the ideal diagonal line, indicating high predictive accuracy. RF models showed a more dispersed pattern but remained symmetrical, suggesting stable generalization. Conversely, SVR and ANN models displayed a wider scatter, especially at the extremes, suggesting lower precision and generalization compared to tree-based models.


image file: d5gc02370j-f4.tif
Fig. 4 (a) Scatter plot of prediction performance of five ML models based on feature group 3; (b) scatter plot of prediction performance of five ML models based on feature group 4; (c) scatter plot of prediction performance of five ML models based on feature group 5; (d) scatter plot of prediction performance of five ML models based on feature group 6; (e) scatter plot of prediction performance of five ML models based on feature group 7; (f) rader plot of R2(f1), MAE(f2), RMSE(f3) performanceof ML model based on feature group 4 (all data in the table are results obtained through tenfold cross-validation).

XGBoost consistently outperformed other models, achieving the highest R2 values and the lowest RMSE and MAE across both training and test sets. In feature group 4, XGBoost achieved an R2 of 0.83 on the test set, outperforming all other configurations, demonstrating its superior generalization capability. This was particularly notable as XGBoost performed effectively even with a relatively small dataset, highlighting its robustness in handling high-dimensional chemical descriptors and unbalanced data distributions, a challenge in materials modeling.18,33

Feature group 4 consistently outperformed all other sets across the five algorithms tested, highlighting its superiority over the more extensive feature group 3, which included the complete set of MDs and MFs. This enhanced performance stems from a targeted reduction of redundant descriptors—specifically, HBA-Sp, HBD-Sp, and HBA-MW—identified through hierarchical clustering. The elimination of these polarizability-related descriptors, which exhibited strong correlations with molecular weight (as shown in Fig. 4b), led to improvements in model efficiency without sacrificing predictive accuracy. Notably, while both polarizability and molecular weight provide insights into molecular structure, molecular weight captures a broader range of chemical information. This is reflected in the evaluation results (Table S6), where models retaining only polarizability performed worse than those using feature group 4, underscoring its lower informational value in this context. Further improvements were achieved by removing acidic group counts that were correlated with pKa and pKb, thereby reducing feature redundancy and enhancing computational efficiency. These findings demonstrate that informed feature selection—combining statistical analysis with chemical domain knowledge—is a powerful approach for optimizing predictive models in biomass solvent screening.34,37

The presence of log[thin space (1/6-em)]R0 in feature group 4 was critical, as it has a functional correlation with reaction temperature and time, helping models capture their synergistic effect. Contrastive modeling (Table S7) confirmed that log[thin space (1/6-em)]R0 significantly enhanced model performance by capturing the temperature–time interaction, outweighing potential artifacts from cross-correlation. Beyond feature group 4, further feature reduction in groups 5 to 7 led to performance degradation for most models, highlighting the importance of balancing dimensionality reduction with information retention. Features in group 4, such as pKa, log[thin space (1/6-em)]P, hydrogen bonding counts, and polar surface area, are known to influence solvation capacity and lignin solubility, crucial for DES-mediated lignin dissolution.52–54 Previous studies have emphasized the role of polarity and hydrogen bonding in lignin deconstruction, and our approach quantifies their impact on predictive model accuracy.34,37

The superior performance of feature group 4 across all models validates the effectiveness of our feature refinement strategy for DES-based biomass pretreatment. This method combines hierarchical clustering with chemical interpretability to retain the most informative features, ensuring high model accuracy while maintaining mechanistic relevance. The strategy can be adapted to other modeling tasks, such as optimizing organocatalytic reactions or estimating enzymatic hydrolysis efficiency across different biomass types.

This modular approach contrasts with prior studies, which relied on predefined descriptor sets without structural refinement.39,50,55 Our pipeline is adaptive and allows for tailored feature reduction across different systems, with the features retained in feature group 4 demonstrating relevance in other fields such as green solvent screening, polymer solubility prediction, and pharmaceutical crystallization. Thus, this strategy offers a promising framework for data-driven molecular design across diverse applications. Complete evaluation metrics for all machine learning models and feature groups are provided in Table S8.

4.3. Model interpretation

Fig. 5a presents the SHAP summary plot for the XGBoost model trained on feature group 4. Each point in the plot represents a SHAP value for a specific input instance, where the magnitude indicates the feature's contribution strength and the sign shows whether it has a positive or negative influence on lignin removal efficiency. The plot highlights that process parameters and biomass composition are the primary drivers of the model's predictions. Key influential features include severity factor (log[thin space (1/6-em)]R0), temperature, reaction time, liquid-to-solid ratio, and HBD[thin space (1/6-em)]:[thin space (1/6-em)]HBA molar ratio, which are consistent with known factors in lignin fractionation. Elevated temperatures, for instance, increase reaction rates and reduce DES viscosity, improving mass transfer and bond cleavage efficiency. Log[thin space (1/6-em)]R0 captures both temperature and time effects, reflecting process intensity.
image file: d5gc02370j-f5.tif
Fig. 5 Feature importance analysis of the XGBoost model with feature group 4 as input dataset. (a) SHAP analysis. The average of the absolute SHAP values of all points can be used as an indicator of the importance of the feature. (b) Variables F-score analysis.

To validate the consistency of SHAP-based findings, we also examined the F-score values in the XGBoost model (Fig. 5b). The F-score quantifies how frequently a feature is used as a decision node across the ensemble of decision trees. The top 20 features identified by F-score largely overlap with those highlighted by SHAP, with process-related and biomass-related features accounting for 51.0% and 28.6% of the top-ranked variables, respectively. While slight differences exist in ranking, both methods converge on a standard set of dominant features, supporting the robustness of the model interpretation. In addition, to verify the reliability of the results obtained by the K-fold variant of the SHAP method, we analyzed the optimal model using the traditional SHAP method (Fig. S2). It can be seen that the top 20 features contributing to the model are consistent with those obtained by the K-fold variant of the SHAP method. The traditional SHAP method only utilized 92 data points in the test set, thus appearing sparser. Some differences in the ranking of feature importance indicate the impact of different dataset divisions on modeling.

It is particularly noteworthy that the SHAP dependence analysis reveals non-monotonic relationships between several key process parameters and lignin removal efficiency, underscoring the limitations of assuming “more is better” in biomass pretreatment. As shown in Fig. 6a, the SHAP values for reaction time plateau after approximately 4 hours, indicating that extending the duration beyond this point offers limited additional benefit. A similar pattern is observed for reaction temperature (Fig. 6b), where SHAP values peak between 130 °C and 140 °C and decline thereafter. The same non-linear trend emerges in the log[thin space (1/6-em)]R0 dependence plot (Fig. 6c), reinforcing the notion that excessive reaction severity may hinder rather than enhance delignification.


image file: d5gc02370j-f6.tif
Fig. 6 The dependence plots of SHAP value with (a) log[thin space (1/6-em)]R0, (b) temperature, (c) time, (d) the molar ratio of HBA to HBD, (e) the ratio of liquid to solid, (f) particle size of raw biomass, (g) cellulose, (h) hemicellulose, (i) lignin content of raw biomass.

The decline in lignin removal efficiency can be linked to polycondensation reactions occurring under high-severity conditions.56–59 Some studies in the datasets have observed and documented a decrease in the delignification rate of biomass under elevated temperatures or extended reaction times, attributing this trend to structural alterations in lignin under harsh conditions that promote its re-deposition on the biomass surface. Under the influence of acidic or basic DES components and elevated temperatures, cleavage of ether and ester linkages within the lignin macromolecule can lead to the formation of new C–C bonds. This re-polymerization increases the molecular weight of lignin fragments, reduces solubility, and promotes re-deposition onto cellulose surfaces, ultimately decreasing the net removal efficiency. Similar observations have been reported in acidic organosolv and IL-based pretreatments, highlighting a general trade-off between depolymerization and recondensation at high temperature and acidity.37,60–62

The L[thin space (1/6-em)]:[thin space (1/6-em)]S (liquid-to-solid) ratio emerged as a key parameter influencing lignin removal, as indicated by SHAP analysis. Higher L[thin space (1/6-em)]:[thin space (1/6-em)]S ratios consistently showed positive SHAP values (Fig. 6a), suggesting improved solvent penetration, increased surface contact, and enhanced mass transfer. This observation aligns with experimental studies reporting that excess solvent volume can alleviate diffusion limitations during biomass delignification.63 The HBD[thin space (1/6-em)]:[thin space (1/6-em)]HBA molar ratio also plays a critical role in tuning DES properties. Higher HBD content generally reduces viscosity and enhances solvent mobility, thereby facilitating molecular diffusion (Fig. 6b). In addition, HBDs influence the proton-donating ability of the DES, which can alter acidity and hydrogen bonding behavior. Variations in HBD[thin space (1/6-em)]:[thin space (1/6-em)]HBA ratios—even within similar component classes like choline chloride–acid systems—can lead to markedly different delignification outcomes.11

Biomass composition features revealed more complex SHAP patterns (Fig. 6d and i). A high initial lignin content (>25%) was associated with negative SHAP contributions, possibly due to structural resistance and limited porosity in high-lignin feedstocks such as bamboo.64 These materials often contain tightly cross-linked lignin–carbohydrate complexes (LCCs), which may hinder solvent accessibility. In contrast, cellulose content above 43% correlated with improved lignin removal. This may reflect a more open biomass structure or better dispersion of lignin across the cellulose matrix, facilitating solvent access and dissolution.

The SHAP dependence plots further highlight optimal operational thresholds for effective delignification. Conditions such as log[thin space (1/6-em)]R0 > 3, temperature > 120 °C, time > 2.5 h, HBD[thin space (1/6-em)]:[thin space (1/6-em)]HBA > 3.5, and L[thin space (1/6-em)]:[thin space (1/6-em)]S > 10 were consistently associated with positive SHAP values. These parameter ranges offer practical guidance for DES pretreatment optimization, supporting efficient lignin removal while minimizing solvent use and energy demand in line with green chemistry objectives.

Compared to process parameters and biomass composition, molecular structure features such as MDs and MFs contribute less to overall model performance but are crucial for capturing subtle chemical effects. Key descriptors, particularly HBD-pKa/pKb, which reflect the proton-donating or -accepting capacity of the hydrogen bond donor, were highly influential, with low pKa/pKb values associated with improved lignin removal. This suggests DES pretreatment not only solubilizes lignin but also facilitates bond cleavage between lignin and hemicellulose, supporting the view of DESs as mild acid–base catalytic systems that cleave bonds in the lignin structure.65 Kwok et al.,66 similarly found that delignification efficiency is governed more by chemical cleavage mechanisms than solubility.

Other MDs and MFs, although ranked lower in overall feature importance, contributed interpretable and chemically meaningful insights. For example, the descriptor HBD-nRot (number of rotatable bonds in the hydrogen bond donor) had a value range between 0 and 5, and its negative SHAP contribution (Fig. 7a and c) suggests that longer alkyl chains reduce DES effectiveness, likely due to increased viscosity and steric hindrance. This trend aligns with experimental observations: DESs composed of ChCl-formic acid (C1) exhibit better lignin solubilization than those with longer-chain HBDs such as butanoic acid (C4).67 Teles et al.68 similarly reported that short alkyl chains in HBDs and HBAs enhance proton transfer capacity by increasing electron density around hydroxyl oxygen atoms, thereby strengthening hydrogen bonding.


image file: d5gc02370j-f7.tif
Fig. 7 The dependence plots of SHAP values with (a) HBD-nRot, (b) HBD-nC. (c) The relationship between SHAP value and the HBD-nRot related substructure for HBD-MW. The relationship between SHAP value and the (d) H2O group, (e) α =, (f) α = O related substructure for DESs.

Molecular weight (MW) of the HBD also showed a complex relationship with lignin removal, as indicated by the vertical spread of SHAP values in Fig. 7c. While high MW typically corresponds to higher viscosity and lower diffusivity, a subset of data showed unexpectedly positive SHAP contributions. This was associated with the MF-390 feature, corresponding to an H2O substructure from bound water in hydrated oxalic acid (Fig. 7d). Compared to the anhydrous form, the hydrated DES displayed improved lignin removal, likely due to lower viscosity and enhanced molecular mobility. Even small amounts of bound water can reduce DES viscosity and increase polarity, facilitating lignin solubilization.69 Conversely, low SHAP values within the MW distribution (Fig. 7e and f) were linked to MF-321 and MF-2999, representing unsaturated moieties such as C[double bond, length as m-dash]C or aldehyde (–CHO) side chains. These substructures may interfere with effective DES–lignin interactions, potentially through electronic or steric effects.

Other descriptors also provided chemical context. HBD-nC, related to molecular size and carbon chain length, showed a negative SHAP impact (Fig. 7b), possibly due to coupled effects with nRot and MW. S[thin space (1/6-em)]log[thin space (1/6-em)]P, representing the octanol–water partition coefficient, also exhibited an inverse relationship with lignin removal, suggesting that lower polarity reduces compatibility with lignin's polar functionalities. In contrast, the topological polar surface area (TopoPSA), associated with polarity and hydrogen-bonding capacity, showed a positive correlation with lignin extraction efficiency. This aligns with previous studies highlighting the importance of solvent polarity in dissolving polyphenolic and aromatic lignin structures.50 Collectively, these results underscore that even lower-ranked structural features contribute valuable mechanistic insights, supporting improved model generalization and rational design of DESs through an enhanced understanding of structure–property relationships.

Compared to MDs, the structural representations provided by MFs offer a more intuitive connection to discrete chemical substructures. However, this representation approach presents several inherent limitations. First, the radius of substructure encoding (typically 1–2 atomic layers in circular fingerprints) may be too small to capture chemically relevant motifs, resulting in fragments that are difficult to interpret. Conversely, increasing the fingerprint radius can result in over-specific representations that closely resemble entire molecules, significantly reducing the generalizability and statistical variance across the dataset. Second, MFs are often characterized by low variance, especially in datasets with structurally similar molecules. This diminishes the individual information gain of each feature and limits their importance in models that combine MFs with MDs.

Despite challenges in interpretability, MFs significantly enhance model accuracy. As shown in Fig. S3, their cumulative contribution, measured by Gain in the XGBoost model, reduces prediction error more effectively than F-score, which only reflects feature usage frequency. Gain quantifies the reduction in squared error attributable to a feature, making it sensitive to interaction effects and marginal contributions of low-variance inputs. Table S1 summarizes the impact of individual MFs on lignin removal predictions, highlighting their role in differentiating structurally similar DES solvents under identical process conditions. These limitations are reflected in the SHAP analysis. In the model that incorporates both MDs and MFs (feature group 3), MFs exhibited more complex and dispersed SHAP value distributions compared to when they were used alone (feature group 2). As illustrated in Fig. S4, fingerprints with identical binary values showed scattered SHAP contributions across different samples. This suggests that in the presence of MDs, MFs do not act as primary explanatory variables but instead contribute through nonlinear interaction effects with other features. Consequently, individual MFs are challenging to interpret in isolation, mainly when the prediction arises from feature synergies rather than direct effects.

Specific MFs displayed expected chemical behavior. For example, MF_4272, representing CH2 units, complements the nRot descriptor, and both negatively impact lignin extraction due to increased viscosity or steric hindrance. MF_2311, linked to –OH or –COOH substructures, was relevant but challenging to isolate due to functional group overlap. MF_4234, indicating the absence of chloride anions, was negatively associated with lignin removal, aligning with the known role of Cl in choline chloride-based DES systems, where it aids solubilization by forming hydrogen bonds with lignin and disrupting lignin–carbohydrate complexes.37,70

These insights, which are difficult to extract from MDs alone, demonstrate the complementary value of MFs in capturing molecular topology and functional group presence. Their interaction with process variables and physicochemical descriptors allows for more nuanced predictions, emphasizing the importance of hybrid molecular representations. The SHAP-based interpretation of the XGBoost model quantifies the contributions of process parameters, biomass composition, and DES molecular features to lignin removal. While process and compositional variables dominate globally, chemically meaningful MDs and specific MFs provide essential mechanistic insights, supporting solvent design strategies at both the formulation and molecular levels. This integrative modeling approach offers a rational and interpretable pathway for optimizing DES-based biomass pretreatment systems.

4.4. Blind test evaluation

To assess the generalization capacity of the developed model, a blind test was performed using an independent dataset comprising 42 experimental data points (Tables S9 and S10). This dataset includes lignin removal efficiencies for corn straw treated with three representative DES systems, choline chloride: urea, choline chloride: lactic acid, and choline chloride: glycerol, under varying process parameters. None of these data points were included in the model training phase. The blind test thus serves to evaluate the extrapolative robustness of the XGBoost model when confronted with new chemical compositions and process conditions.

For prediction, we utilized the optimized XGBoost model trained entirely on the original dataset and employed feature group 4 as input variables. As shown in Fig. 8a, the predicted values for the blind dataset align closely with the experimental values across most data points. The ChCl: urea-based DES exhibited firm agreement, with minimal deviation from the diagonal reference line, suggesting that the model accurately captures the physicochemical behavior of this solvent system. Predictions for ChCl: lactic acid were also satisfactory, though slightly more scattered, likely due to the use of only 85% pure lactic acid in the experimental formulation. Impurities may have introduced deviations from previously reported performance values (as training set data).


image file: d5gc02370j-f8.tif
Fig. 8 (a) Scatter plot of the relationship between the predicted lignin removal rate of the XGBoost model and the experimental value of the blind test set. Dashed lines indicate that the predicted values agree with the experimental values. (b) Residual plots of the expected lignin removal rate of the XGBoost model. The red dashed reference lines, positioned at residuals of −0.1, 0, and 0.1, were employed to assess the distribution of residuals within the model. (c) SHAP diagram of the XGBoost model in the blind test set.

Notably, significant discrepancies were observed for ChCl: glycerol at extreme process conditions (e.g., 160 °C or L[thin space (1/6-em)]:[thin space (1/6-em)]S = 5), with model predictions overestimating lignin removal. These deviations are highlighted with red circles in both the scatter and residual plots (Fig. 8a and b). In contrast to the model's upward trend, the experimental data showed that ChCl: glycerol DES peaked in efficiency at around 130 °C, with further heating resulting in diminished performance. This behavior, while partially captured by the model, was not accurately predicted under high-severity conditions.

Several factors may contribute to these deviations. From a machine learning perspective, every model operates within a defined Domain of Applicability (DoA), and predictions made outside the training feature distribution are inherently less reliable. In this case, the L[thin space (1/6-em)]:[thin space (1/6-em)]S ratio in the training dataset ranged from 9 to 30, whereas the blind test dataset included values as low as 5. Two of the four outlier points correspond to L[thin space (1/6-em)]:[thin space (1/6-em)]S = 5, suggesting that extrapolation beyond this range led to error. A broader and more balanced training set may mitigate this issue in future iterations.

From an experimental standpoint, discrepancies may stem from the thermal instability of glycerol at temperatures exceeding 150 °C, where partial decomposition has been reported.71 Such degradation could alter the hydrogen bonding network or viscosity of the DES, reducing its delignification efficiency. Furthermore, the hygroscopic nature of DESs, particularly those containing glycerol, could introduce variability in solvent properties due to unintended water uptake. Although care was taken to avoid external water addition, condensation was observed during sealed reactor operation, indicating that environmental humidity may have altered solvent polarity and flow characteristics during the reaction.72

Despite these challenges, the SHAP summary plot for the blind dataset (Fig. 8c) confirms that the relative importance of input features remains consistent with prior observations. This suggests that the model maintains interpretive coherence and stable predictive logic even when applied to previously unseen data.

Overall, the XGBoost model achieved an R2 of 0.8034 on the blind test set, with a MAE of 0.0661. This translates to an average deviation of approximately 6–7 percentage points, lower than the 10-pp error reported in a previous lignin removal model based on regression approaches.22 Given the intrinsic variability of biomass and the experimental uncertainty inherent in fractionation processes, these results demonstrate that the model possesses reliable generalization capacity and can serve as a decision-support tool for solvent screening and process optimization in lignin-first biorefinery systems.

5. Conclusions

In this study, we developed an interpretable machine learning framework to predict lignin extraction efficiency during DES-based biomass fractionation. By integrating chemically meaningful MDs and MFs, the model effectively captured the physicochemical and structural properties of DES systems. Following a comprehensive feature selection and clustering process, an optimized input feature set (feature group 4) was identified, enabling the XGBoost algorithm to achieve superior predictive performance with an R2 of 0.8259, RMSE of 0.0928, and MAE of 0.0672 on the cross-validation test set.

To address the black-box nature of machine learning models, we employed the SHAP method to interpret feature contributions. The results revealed that process conditions such as temperature, severity factor, and L[thin space (1/6-em)]:[thin space (1/6-em)]S ratio were dominant predictors of lignin removal. Additionally, key molecular descriptors, including HBD-pKa/pKb, the number of rotatable bonds (nRot), molecular weight, S[thin space (1/6-em)]log[thin space (1/6-em)]P, and polar surface area, were shown to play mechanistically consistent roles, reinforcing the model's chemical interpretability. Substructure-level contributions captured by MFs further elucidated the role of molecular topology in solvent performance, although their interpretability was more context-dependent. A blind test using an experimentally obtained external dataset containing 42 previously unseen data points confirmed the generalization ability of the model, with an R2 of 0.8034 and MAE of 0.0661. While most predictions closely matched the experimental values, a few deviations, particularly in the ChCl: glycerol system under extreme conditions, highlighted the limitations of extrapolating beyond the domain of applicability. Nonetheless, the SHAP analysis on the blind test set demonstrated consistent feature attributions, further validating the model's robustness.

Overall, this study demonstrates that the data-driven XGBoost model, trained on rationally engineered features, can reliably predict lignin removal outcomes in DES-based biomass pretreatment systems. Compared to traditional trial-and-error experimentation, this approach significantly reduces time, cost, and material usage. More importantly, the incorporation of SHAP interpretation enables mechanistic insights and parameter optimization, thereby bridging predictive modeling with fundamental understanding. The framework presented here provides a transferable platform for accelerating DES formulation, guiding experimental design, and advancing green and efficient lignocellulose biorefinery strategies. Looking ahead, this interpretable ML framework can be further extended to predict other biomass component behaviors (e.g., hemicellulose solubilization), solvent recyclability, or synergistic effects in multi-component DESs. Such developments will accelerate the transition toward data-driven and sustainable solvent engineering.

6. Limitations and future perspectives

The top-performing model (R2 = 0.83) exhibits strong and reliable predictive accuracy across a wide range of herbaceous biomass types and standard pretreatment conditions. While the model is broadly applicable, its predictions may be less precise for underrepresented scenarios—such as non-herbaceous feedstocks or extreme HBA[thin space (1/6-em)]:[thin space (1/6-em)]HBD molar ratios—which are relatively rare in the current dataset. This study excludes ash content and inorganic components from our analysis for practical reasons. Although DES solvents mainly interact with lignin–carbohydrate complexes and hydrogen bonds, high ash content in feedstock can still impact lignin's suitability for applications like carbon fibers and adhesives, indicating a need for further research. Moreover, the current scope does not include ternary DES systems, which present more intricate molecular interactions and offer a valuable direction for future model expansion. Furthermore, it is essential to note that the dataset used in this study is predominantly composed of Type III DESs, with only minimal representation of other traditional DES classes. As a result, the current model is best suited for predicting the performance of Type III DES systems. Care should be taken when applying the model to other DES types, which remain underexplored and warrant further experimental investigation to expand the predictive scope of future models.

Author contributions

Zijing Zhong: conceptualization, methodology, investigation, formal analysis, writing, review & editing. Babbiker Mohammed Taher Gorish: methodology, investigation, formal analysis, and writing. Yue Bai: methodology, investigation, formal analysis, and writing. Waha Ismail Yahia Abdelmula: methodology, investigation, formal analysis, and writing. Dang Wenqian: formal analysis, and writing. Daochen Zhu: conceptualization, supervision, project administration, funding acquisition, and writing, review & editing.

Conflicts of interest

The authors declare no competing financial interest.

Data availability

The relevant data of the dataset and the MFs (which refer to the chemical structures) are published and submitted in the form of an Excel file. The model-related data are submitted in the form of a Word file. No layout or editing is required. The responsibility for scientific accuracy and content remains entirely with the authors.

Supplementary information is available. See DOI: https://doi.org/10.1039/d5gc02370j.

Acknowledgements

This work was supported by the National Key R & D Program of China (2023YFC3403600).

References

  1. F. Brienza, D. Cannella, D. Montesdeoca, I. Cybulska and D. P. Debecker, RSC Sustainability, 2024, 2, 37–90 RSC.
  2. L. Reyes, C. Nikitine, L. Vilcocq and P. Fongarland, Green Chem., 2020, 22, 8097–8115 RSC.
  3. K. Braimakis, D. Magiri-Skouloudi, D. Grimekis and S. Karellas, Renewable Energy, 2020, 154, 252–269 CrossRef.
  4. M. Madadi, G. Song, V. K. Gupta, M. Aghbashloh, C. Sun, F. Sun and M. Tabatabaei, Green Chem., 2023, 25, 7141–7156 RSC.
  5. L.-Q. Wang, L.-Y. Cai and Y.-L. Ma, RSC Adv., 2020, 10, 38409–38415 RSC.
  6. A. Toledano, L. Serrano and J. Labidi, Fuel, 2014, 116, 617–624 CrossRef CAS.
  7. H. Kaur and D. Goyal, J. Environ. Manage., 2024, 372, 123334 CrossRef CAS PubMed.
  8. M. Margarida Martins, F. Carvalheiro and F. Gírio, Biomass Convers. Biorefin., 2022, 14, 3183–3207 CrossRef.
  9. S. C. Rabelo, P. Y. S. Nakasu, E. Scopel, M. F. Araújo, L. H. Cardoso and A. C. d. Costa, Bioresour. Technol., 2023, 369, 128331 CrossRef CAS PubMed.
  10. S. Sethupathy, G. Murillo Morales, L. Gao, H. Wang, B. Yang, J. Jiang, J. Sun and D. Zhu, Bioresour. Technol., 2022, 347, 126696 CrossRef CAS PubMed.
  11. M. Zhou, O. A. Fakayode, A. E. Ahmed Yagoub, Q. Ji and C. Zhou, Renewable Sustainable Energy Rev., 2022, 156, 111986 CrossRef CAS.
  12. A. P. Abbott, D. Boothby, G. Capper, D. L. Davies and R. K. Rasheed, J. Am. Chem. Soc., 2004, 126, 9142–9147 CrossRef CAS.
  13. Y. Ma, Y. Yang, T. Li, S. Hussain and M. Zhu, Green Chem., 2024, 26, 3627–3669 RSC.
  14. Y. Liu, N. Deak, Z. Wang, H. Yu, L. Hameleers, E. Jurak, P. J. Deuss and K. Barta, Nat. Commun., 2021, 12, 5424 CrossRef CAS PubMed.
  15. H. Chen, A. Wang, C. Yan, S. Liu, L. Li, Q. Wu, Y. Liu, Y. Liu, G. Nie, S. Nie, S. Yao and H. Yu, Sustainability, 2023, 15, 7118 CrossRef CAS.
  16. P. Li, C. Yang, Z. Jiang, Y. Jin and W. Wu, J. Bioresour. Bioprod., 2023, 8, 33–44 CAS.
  17. S. Kumar, G. Ignacz and G. Szekely, Green Chem., 2021, 23, 8932–8939 RSC.
  18. D. Shi, F. Zhou, W. Mu, C. Ling, T. Mu, G. Yu and R. Li, Phys. Chem. Chem. Phys., 2022, 24, 26029–26036 RSC.
  19. B. R. Taylor, N. Kumar, D. K. Mishra, B. A. Simmons, H. Choudhary and K. L. Sale, Molecules, 2024, 29, 5073 CrossRef CAS.
  20. Đ. Kovačić, D. Radočaj, D. Samac and M. Jurišić, AgriEngineering, 2024, 6, 171–184 CrossRef.
  21. A. Coşgun, M. E. Günay and R. Yıldırım, J. Bioresour. Bioprod., 2023, 10, 1859–1875 Search PubMed.
  22. K. Baran, B. Barczak and A. Kloskowski, Sci. Total Environ., 2024, 935, 173234 CrossRef CAS PubMed.
  23. A. Sluiter, B. Hames, R. Ruiz, C. Scarlata, J. Sluiter, D. Templeton and D. Crocker, Lab. Anal. Proced., 2008, 1617, 1–16 Search PubMed.
  24. A. Tayyebi, A. S. Alshami, Z. Rabiei, X. Yu, N. Ismail, M. J. Talukder and J. Power, J. Cheminf., 2023, 15, 99 CAS.
  25. S. Zhu, B. H. Nguyen, Y. Xia, K. Frost, S. Xie, V. Viswanathan and J. A. Smith, Green Chem., 2023, 25, 6612–6617 RSC.
  26. T. Tang, D. Song, J. Chen, Z. Chen, Y. Du, Z. Dang and G. Lu, Processes, 2024, 12, 384 CrossRef CAS.
  27. Y. Ding, M. Chen, C. Guo, P. Zhang and J. Wang, J. Mol. Liq., 2021, 326, 115212 CrossRef CAS.
  28. H. Moriwaki, Y.-S. Tian, N. Kawashita and T. Takagi, J. Cheminf., 2018, 10, 4 Search PubMed.
  29. G. Landrum, 2013, RDkit documentation, Release 1, pp. 1–79, https://www.rdkit.org.
  30. S. Zhong and X. Guan, Environ. Sci. Technol., 2023, 57, 18193–18202 CrossRef CAS PubMed.
  31. P. Gramatica, S. Cassani and A. Sangion, Green Chem., 2016, 18, 4393–4406 RSC.
  32. S. Zhong, Y. Chen, J. Li, T. Igou, A. Xiong, J. Guan, Z. Dai, X. Cai, X. Qu and Y. Chen, Environ. Sci. Technol. Lett., 2024, 11, 1193–1199 CrossRef CAS.
  33. F. Zhou, D. Shi, W. Mu, S. Wang, Z. Wang, C. Wei, R. Li and T. Mu, Green Chem., 2024, 26, 7857–7868 RSC.
  34. S. Hong, X.-J. Shen, Z. Xue, Z. Sun and T.-Q. Yuan, Green Chem., 2020, 22, 7219–7232 RSC.
  35. X.-D. Hou, A.-L. Li, K.-P. Lin, Y.-Y. Wang, Z.-Y. Kuang and S.-L. Cao, Bioresour. Technol., 2018, 249, 261–267 CrossRef CAS.
  36. C. He, F. Luo, Y. Zhu, A. Zhan, J. Fan, J. H. Clark, J. Lv and Q. Yu, Green Chem., 2025, 27, 2019–2034 RSC.
  37. Q. Xia, Y. Liu, J. Meng, W. Cheng, W. Chen, S. Liu, Y. Liu, J. Li and H. Yu, Green Chem., 2018, 20, 2711–2721 RSC.
  38. H. Xu, C. Dong, W. Wang, Y. Liu, B. Li and F. Liu, Ind. Crops Prod., 2023, 196, 116431 CrossRef CAS.
  39. H. Ge, Y. Liu, B. Zhu, Y. Xu, R. Zhou, H. Xu and B. Li, Ind. Crops Prod., 2023, 203, 117138 CrossRef CAS.
  40. H. Xu, Y. Kong, J. Peng, X. Song, Y. Liu, Z. Su, B. Li, C. Gao and W. Tian, Bioresour. Technol., 2021, 319, 124209 CrossRef CAS.
  41. A. K. Jain, M. N. Murty and P. J. Flynn, ACM Comput. Surv., 1999, 31, 264–323 CrossRef.
  42. B. Mahesh, Int. J. Sci. Res., 2020, 9, 381–386 Search PubMed.
  43. G. Biau and E. Scornet, Test, 2016, 25, 197–227 CrossRef.
  44. T. Chen and C. Guestrin, presented in part at the Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, California, USA, 2016.
  45. A. V. Dorogush,V. Ershov and A. Gulin, arXiv, 2018, preprint, arXiv:1810.11363, 10.48550/arXiv.1810.11363.
  46. M. Mohan, O. Demerdash, B. A. Simmons, J. C. Smith, M. K. Kidder and S. Singh, Green Chem., 2023, 25, 3475–3492 RSC.
  47. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss and V. Dubourg, J. Mach. Learn. Res., 2011, 12, 2825–2830 Search PubMed.
  48. S. M. Lundberg and S.-I. Lee, presented in part at the Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, California, USA, 2017.
  49. T. O. Hodson, Root-mean-square error (RMSE) or mean absolute error (MAE): when to use them or not, Geosci. Model Dev., 2022, 15, 5481–5487 CrossRef.
  50. H. Ge, Y. Bai, R. Zhou, Y. Liu, J. Wei, S. Wang, B. Li and H. Xu, ACS Sustainable Chem. Eng., 2024, 12, 7578–7590 CrossRef CAS.
  51. Z. Sumer and R. C. Van Lehn, ACS Sustainable Chem. Eng., 2023, 11, 187–198 CrossRef CAS.
  52. N. Wang, B. Xu, X. Wang, J. Lang and H. Zhang, J. Mol. Liq., 2022, 366, 120294 CrossRef CAS.
  53. E. Scelsi, A. Angelini and C. Pastore, Biomass, 2021, 1, 29–59 CrossRef.
  54. M. A. Hubbe, R. Alén, M. Paleologou, M. Kannangara and J. Kihlman, BioResources, 2019, 14, 1 Search PubMed.
  55. Z. S. H. Chan, H. W. Ngan, A. B. Rad, A. K. David and N. Kasabov, Neurocomputing, 2006, 70, 409–419 CrossRef.
  56. R. Lou, R. Ma, K.-t. Lin, A. Ahamed and X. Zhang, ACS Sustainable Chem. Eng., 2019, 7, 10248–10256 CrossRef CAS.
  57. H. Ma, P. Fu, J. Zhao, X. Lin, W. Wu, Z. Yu, C. Xia, Q. Wang, M. Gao and J. Zhou, Molecules, 2022, 27, 7955 CrossRef CAS.
  58. Y. Chen, C. Ma, W. Tang and Y.-C. He, Bioresour. Technol., 2023, 386, 129485 CrossRef CAS.
  59. T.-J. Ao, K. Li, M. A. Mehmood, X.-Q. Zhao, F.-W. Bai, R. Boopathy and C.-G. Liu, Biomass Convers. Biorefin., 2024, 14, 6215–6228 CrossRef CAS.
  60. W. Wang, F. Gu, J. Y. Zhu, K. Sun, Z. Cai, S. Yao, W. Wu and Y. Jin, Ind. Crops Prod., 2020, 150, 112423 CrossRef CAS.
  61. L. Chen, J. Dou, Q. Ma, N. Li, R. Wu, H. Bian, D. J. Yelle, T. Vuorinen, S. Fu, X. Pan and J. Zhu, Sci. Adv., 2017, 3, e1701735 CrossRef.
  62. J.-L. Wen, T.-Q. Yuan, S.-L. Sun, F. Xu and R.-C. Sun, Green Chem., 2014, 16, 181–190 RSC.
  63. Zahoor, M. Madadi, M. Nazar, S. W. A. Shah, N. Li, M. Imtiaz, Z. Zhong and D. Zhu, Ind. Crops Prod., 2023, 200, 116815 CrossRef CAS.
  64. F. Monlau, A. Barakat, E. Trably, C. Dumas, J.-P. Steyer and H. Carrère, Crit. Rev. Environ. Sci. Technol., 2013, 43, 260–322 CrossRef CAS.
  65. K. Singh, S. Mehra and A. Kumar, Green Chem., 2024, 26, 1062–1091 RSC.
  66. T. Kwok, J. Bright, M. Realff and A. Bommarius, BioResources, 2019, 14, 5988–6003 CAS.
  67. J. G. Lynam, N. Kumar and M. J. Wong, Bioresour. Technol., 2017, 238, 684–689 CrossRef CAS.
  68. A. R. R. Teles, E. V. Capela, R. S. Carmo, J. A. Coutinho, A. J. Silvestre and M. G. Freire, Fluid Phase Equilib., 2017, 448, 15–21 CrossRef CAS.
  69. V. Alizadeh, F. Malberg, A. A. H. Pádua and B. Kirchner, J. Phys. Chem. B, 2020, 124, 7433–7443 CrossRef CAS.
  70. Y. Dai, R. Verpoorte and Y. H. Choi, Food Chem., 2014, 159, 116–121 CrossRef CAS.
  71. B. Dou, V. Dupont, P. T. Williams, H. Chen and Y. Ding, Bioresour. Technol., 2009, 100, 2613–2620 CrossRef CAS PubMed.
  72. Y. Chen, D. Yu, W. Chen, L. Fu and T. Mu, Phys. Chem. Chem. Phys., 2019, 21, 2601–2610 RSC.

This journal is © The Royal Society of Chemistry 2025
Click here to see how this site uses Cookies. View our privacy policy here.