Zijing
Zhong
ab,
Babbiker Mohammed Taher
Gorish
ab,
Yue
Bai
ab,
Waha Ismail Yahia
Abdelmula
ab,
Wenqian
Dang
ab and
Daochen
Zhu
*ab
aInternational Joint Laboratory on Synthetic Biology and Biomass Biorefinery, Biofuels Institute, School of the Environment and Safety Engineering, Jiangsu University, Zhenjiang 212013, China. E-mail: dczhucn@ujs.edu.cn
bJiangsu Collaborative Innovation Center of Technology and Material of Water Treatment. Suzhou University of Science and Technology, Suzhou 215009, China
First published on 11th August 2025
Efficient and sustainable lignin extraction is essential for advancing green biorefinery processes. Deep eutectic solvents (DESs), as environmentally benign and tunable media, offer promising alternatives to conventional solvents, yet their rational design and optimization remain highly resource-intensive. In this study, we developed a novel machine learning-guided framework that integrates computational chemistry and data-driven modeling to accelerate the design of DES systems for lignocellulosic biomass pretreatment. A comprehensive dataset of 467 pretreatment conditions was constructed, featuring molecular descriptors and fingerprints to characterize DES chemical structures and key operational parameters. Among several models evaluated, the XGBoost model achieved the best predictive performance (R2 = 0.8259, RMSE = 0.0928, MAE = 0.0672). SHAP analysis provided mechanistic insights, revealing the dominant influence of hydrogen bond donor–acceptor structures and solvent-to-solid ratio on lignin solubilization. The model demonstrated robustness when validated against an independent dataset (R2 = 0.8034, MAE = 0.0661). This work offers a sustainable and mechanistically interpretable strategy for green solvent development, significantly reducing experimental workload and chemical waste, and contributing to advancing biomass valorization technologies.
Green foundation1. This study proposes a sustainable approach to optimizing lignin removal from lignocellulosic biomass by assisting resource-intensive, trial-and-error Deep Eutectic Solvent (DES) experimental methods with an explainable machine learning framework.2. Our XGBoost model (R2 = 0.8259) for predicting lignin removal efficiency in DES systems significantly reduces chemical waste, energy consumption, and raw material usage by minimizing the need for extensive physical synthesis and testing of numerous DES formulations. 3. By providing mechanistic insights through SHAP analysis, this data-driven framework accelerates the rational design of more efficient and environmentally benign DES formulations, directly advancing the green and sustainable development of biomass valorization technologies. |
Deep eutectic solvents (DESs) have gained traction as next-generation green solvents due to their low toxicity, negligible volatility, non-flammability, facile preparation from inexpensive precursors, and intrinsic biodegradability.11 They are typically formed by mixing a hydrogen-bond donor (HBD) (e.g., organic acids, polyols) with a hydrogen–bond acceptor (HBA) (e.g., choline chloride) in defined molar ratios to produce a eutectic mixture whose melting point is drastically lower than that of its components.12,13 In lignocellulose fractionation, ternary DES systems-such as choline chloride/oxalic acid/ethylene glycol-have been shown to preserve up to 53 β-O-4 linkages per 100 aromatic units in the recovered lignin while achieving delignification yields exceeding 80% at 120 °C within four h, thereby maintaining lignin's structural integrity for downstream valorization.14 A recent systematic investigation of over forty choline chloride-based DESs further demonstrated that increasing the HBD-to-HBA ratio and employing shorter-chain HBDs (e.g., glycerol vs. ethylene glycol) can enhance industrial lignin solubility from 50% to over 70% at 100–140 °C.15 Despite these promising results, the enormous diversity of possible DES formulations-spanning combinations of quaternary ammonium salts, carboxylic acids, polyols, sugars, and urea derivatives—and the sensitivity of lignin solubility to process parameters (temperature, reaction time, moisture content, solvent-to-solid ratio) generate a complex, multidimensional design space.16 Synergistic and antagonistic hydrogen-bonding interactions among DES components further complicate the prediction of solvent performance. Traditional trial-and-error screening is therefore insufficient to navigate these nonlinear, multivariate relationships, and there remains a lack of systematic understanding linking specific molecular descriptors (e.g., hydrogen bond strength, polarity, functional group) to lignin dissolution efficiency, limiting the rational design of optimized DES formulations.
Machine learning (ML) has emerged as a transformative tool for accelerating solvent discovery and process optimization in sustainable chemistry. By mining historical experimental datasets, ML models can uncover hidden patterns, build predictive relationships, and dramatically reduce the experimental burden of navigating complex design spaces.17,18 For instance, Taylor et al. critically reviewed the integration of ML with quantum and molecular-scale simulations for ionic-liquid–lignin systems, highlighting how data-driven approaches can predict solvent–lignin interactions with improved accuracy over purely empirical methods.19 Similarly, in lignocellulose pretreatment studies, XGBoost models achieved R2 values between 0.756 and 0.980 in predicting lignin destabilization under varying thermal pretreatment conditions, demonstrating ML's potential to optimize key parameters such as temperature and time.20 Despite these successes, the application of ML to DES-based lignin extraction remains nascent. A recent survey of ML in lignocellulosic biorefineries noted that while artificial neural networks (ANN) dominate bioconversion predictions—appearing in nearly 90% of published studies-few efforts have extended such methods to DES solvent design or provided mechanistic insights into the underlying driving forces.21 There remains no systematic framework linking DES molecular descriptors, biomass composition, and operational parameters to lignin dissolution efficiency. This “black-box” nature of conventional ML models limits their scientific interpretability and impedes the derivation of generalizable design principles for green solvents. To overcome these challenges, there is an urgent need for ML frameworks that integrate multi-scale descriptors, including solvent molecular structure, biomass characteristics, and process variables, and employ explainable AI techniques to reveal how specific features (e.g., hydrogen bond strength, polarity, molar mass) govern biomass fractionation outcomes.
In this study, we developed a machine learning-guided framework to optimize lignin extraction from lignocellulose using DESs (Fig. 1). We trained and compared multiple representative machine learning models by constructing a comprehensive dataset integrating solvent molecular descriptors, process parameters, and extraction outcomes. Furthermore, explainable AI techniques were employed to elucidate the key molecular and process factors influencing lignin dissolution, providing mechanistic insights into DES performance. This work offers a data-driven strategy for the rational design of green solvent systems, advancing the sustainable valorization of biomass toward circular economy and carbon neutrality goals.
![]() | ||
Fig. 1 Schematic illustrating the conventional and ML formulation development methods for fractionating biomass to extract lignin using deep eutectic solvents. |
Risk type | Metric | Previous methods (data & references) | The current study |
---|---|---|---|
Mass of waste | E factor/process mass intensity | High (due to repeated trial-and-error experiments) | Low (ML predicts optimal conditions, reducing failed trials) |
Ecotoxicity | LC50 (aphnia magna, mg L−1) | Moderate (some solvents had high aquatic toxicity) | Low (ML selects DES with benign HBD/HBA components) |
Energy usage | Heating/distillation required? | High (frequent heating/mixing for each trial) | Low (only validated conditions are tested) |
Human toxicity | LD50 (rat, oral, mg kg−1) | Variable (depends on solvent choice) | Optimized (non-toxic DES components prioritized) Safer chemical profiles due to explainable ML guidance. |
Persistence | Half-life (days) | Some solvents were persistent (e.g., ionic liquids) | Lower environmental persistence via greener solvent design. |
Resource depletion | Rare elements used | Possible (if metal catalysts are used) | Minimal (avoids precious metals, uses abundant HBD/HBA) |
It is important to note that although DESs are frequently regarded as green solvents, our model was developed solely to predict lignin removal efficiency without applying explicit environmental constraints to the DES components. Given the broad chemical space of DESs, not all predicted combinations may adhere to green chemistry principles. Future work should integrate environmental and sustainability metrics—such as toxicity, biodegradability, and renewability of components—into the modeling framework to guide the selection of truly green and efficient solvent systems.
To enhance the practical value of the ML model's predictions, the dataset only uses prior features. It excludes experimental outcomes, such as post-treatment lignin content, as input features to avoid data leakage. The feature variables of the dataset were categorized into three groups: biomass particle size and composition (cellulose, hemicellulose, and lignin content), process parameters (temperature, reaction time, solid-to-liquid ratio, DES molar ratio, and severity factor logR0), and chemical structural descriptors of the DES solvent components (molecular descriptors (MDs), molecular fingerprints (MFs)). Due to the varying methods of lignocellulose component determination in different studies, this work selected data points measured using the same detection methods (NREL standard methods23 and modified variants). However, due to factors such as reagent quality and experimental conditions, even under similar process conditions, the lignin fractionation results of DES solvents may still show some discrepancies across different studies. In cases of inconsistent information, the source articles were carefully reviewed to avoid data with missing feature information, and more reliable data were selected based on the evaluation of the experimental protocols used.
The final curated dataset comprises 467 experimental records covering 19 different herbaceous biomass types, including wheat straw, rice straw, switchgrass, and Arabidopsis thaliana mutants. The compositional diversity of biomass inputs is illustrated in Fig. 2a, which shows the range of cellulose, hemicellulose, and lignin contents. These variations reflect the chemical heterogeneity inherent to herbaceous lignocellulose and underscore the broad applicability of the ML framework. The DES formulations were composed of 27 HBAs and 36 HBDs, with varied molar ratios across different combinations. The molar mass distributions of HBA and HBD species are summarized in Fig. 2b. The experimental conditions span a wide range of pretreatment temperatures (Fig. 2c), solid-to-liquid ratios (L:
S) (Fig. 2d), and reaction times (Fig. 2e), capturing the operational diversity commonly used in DES fractionation protocols. The distribution of the target variable, lignin removal yield, is shown in Fig. 2f. To ensure numerical stability and accelerate model convergence, all continuous input features were normalized using min–max scaling before model training. Normalization also improves generalization performance and enhances model interpretability by eliminating scale-based feature dominance. For features involving nonlinear process intensity, the pretreatment severity factor log
R0 was calculated according to eqn (1).
Log![]() | (1) |
For each DES formulation, we encoded both HBAs and HBDs using MDs and MFs, labeling features with an “HBA_” or “HBD_” prefix. Morgan fingerprints were generated with radius = 1 and nBits = 6000, following the original algorithm (for detailed information about the reasons for setting the Morgan fingerprint, see Text S1; Fig. S1). Because a DES is a mixture rather than a single compound, we aggregated the component fingerprints by summing the HBA and HBD bit vectors in proportion to their molar ratios – a procedure commonly used in multi-component cheminformatic studies. However, this summation produces entries greater than 1, which violates the binary “presence/absence” logic of Morgan fingerprints and complicates downstream interpretation. This limitation underscores the need for alternative aggregation schemes (e.g., bitwise OR, feature hashing, or continuous vector embeddings) that preserve the interpretability of substructure counts while accurately reflecting mixture composition.
To overcome the limitations of binary Morgan fingerprints, we adopted count-based Morgan fingerprints (C-MFs), in which each bit can take integer values ≥1 corresponding to the number of distinct atom-centered environments mapped to that bit (Table S3).30 C-MFs preserve substructure count information, enabling straightforward aggregation of HBA and HBD contributions in proportion to their molar ratios and providing a richer, continuous feature space that enhances dynamic feature influence and mitigates low-variance issues common with binary vectors.
MFs provide qualitative representations of molecular structures via binary vectors or sequence encodings, whereas MDs numerically capture structural, physicochemical, and topological properties. When constructing the different QSAR models, both MFs-based and MDs-based approaches exhibit distinct strengths.24,26,31 Their predictive performance hinges on the intrinsic structure–property relationships within the dataset and the selected machine learning algorithm, and it isn't easy to judge which is the optimum molecular representation.32 Prior works have shown that combining MFs (including C-MF) with MDs can yield synergistic gains in predictive accuracy across diverse chemical datasets,32,33 This dual-encoding strategy is well-suited for robust ML modeling of DES performance.
The extensive set of MDs and MFs can easily exceed the number of observations, invoking the “curse of dimensionality”, which undermines model generalization and computational efficiency. Mitigation strategies include increasing data size or reducing feature dimensionality, the latter achieved by either feature selection-removing constant or highly correlated variables- or feature extraction-projecting the data onto a lower-dimensional space (e.g., via principal component analysis, PCA). In our study, constant features in the MF matrix (bits with zero variance) were first eliminated to remove noninformative descriptors. We opted against PCA for fingerprint reduction (Text S2) to preserve the interpretability of substructure counts.
For the >1800 MDs generated by Mordred, many descriptors represent intermediate computational quantities or redundantly correlate with more interpretable features, complicating mechanistic insight. For this purpose, we addressed the high dimensionality of the MDs dataset by performing targeted feature selection based on both domain-specific chemical knowledge and mechanistic insights reported in DES-biomass fractionation literature.34 Specifically, we curated 13 descriptors to describe HBA and HBD, respectively, totaling 26 features. These variables were selected not only for their statistical variability but also for their direct relevance to known physicochemical mechanisms underlying lignin disruption, such as hydrogen bonding potential, acid–base interactions, solvation polarity, and molecular size. The 13 selected MD features represented a broad spectrum of DES molecular properties, including acidity/basicity (e.g., pKa for acidic/neutral species and pKb for basic groups), polarity indices (e.g., polar surface area, polarizability), partition coefficient (logP), molecular weight, and carbon atom count. These descriptors collectively capture the key physicochemical parameters that govern the solvent's ability to disrupt lignin–hemicellulose linkages. Prior mechanistic studies have established that the delignification process by DESs often involves cleavage of ether or ester bonds, which can be catalyzed or facilitated by the presence of solvated H+ or OH− species in the medium.34–36 Therefore, the inclusion of acid–base dissociation constants and the number of proton-donating or -accepting functional groups in the feature set was essential to account for these catalytic aspects.
Moreover, the hydrogen bond network formed between DES components and lignin molecules plays a central role in the dissolution mechanism. Specifically, the capability of the DES to establish strong hydrogen bonds with lignin's hydroxyl-rich aromatic macromolecules is critical for disrupting its supramolecular structure.37 Parameters such as HBD and HBA counts, polar surface area, and polarizability are therefore key indicators of the strength and density of such interactions. LogP values offer additional insight into solvent hydrophobicity, which may influence phase compatibility and interfacial interactions during pretreatment. The number of carbon atoms and the molecular weight of each component further inform solvent properties such as viscosity, diffusivity, and molecular mobility, all of which influence the kinetics of biomass deconstruction.
In conclusion, guided by domain knowledge and prior QSPR work on lignin solubility, we preselected MDs likely to influence dissolution, such as hydrogen-bond donor/acceptor counts, molecular polarizability, and topological polar surface area.11,22,38–40 The specific data of the relevant MDs can be found in Table S4. We then performed agglomerative hierarchical clustering on these MDs alongside biomass and process parameters, using Euclidean distance and average linkage to group strongly correlated features into clusters.41 At each clustering level, we evaluated the impact of cluster-specific feature subsets on lignin-removal prediction accuracy, iteratively pruning the least contributive clusters to arrive at a final, parsimonious feature set for subsequent model training. This combined domain-guided and data-driven filtering approach ensures both robustness and interpretability of our ML framework.
The use of MFs in conjunction with chemically grounded MDs allowed us to construct hybrid models that combine structural specificity with mechanistic interpretability, facilitating both predictive accuracy and explanatory insight into DES-driven lignin removal.
Model development followed a structured and reproducible workflow. The dataset was randomly divided into a training set and a testing set at a ratio of 80 to 20. The training set was used exclusively to construct and tune the models, while the testing set was held out for final evaluation of model generalization. Bayesian optimization was applied in conjunction with five-fold cross-validation to optimize hyperparameters.30 The optimization objective was to maximize average validation performance. During this process, candidate hyperparameter combinations were selected and updated iteratively. The hyperparameter search ranges were defined based on prior knowledge of their impact on model complexity and predictive stability, as detailed in Table S5.
Due to the limited size of the dataset, ten-fold cross-validation was employed to reduce the risk of overfitting and to exploit the available data fully. The data was divided into ten mutually exclusive subsets. In each iteration, one subset served as the validation set while the remaining nine subsets were used for model training. This process was repeated ten times, and the average validation score was calculated to ensure robustness and statistical reliability. This approach improves generalization performance and ensures that the observed results are not artifacts of a specific data split.
The MAE reflects the average magnitude of prediction errors without considering their direction. It is defined as follows:
![]() | (2) |
The RMSE measures the standard deviation of the residuals (prediction errors) and is particularly sensitive to large deviations due to the squared error term:
![]() | (3) |
Compared to MAE, RMSE penalizes significant errors more heavily, thus providing insight into the model's robustness against outliers. Together, MAE and RMSE offer a balanced view of model reliability and resilience across varying error magnitudes.49
The R2 evaluates the proportion of variance in the dependent variable that is captured by the model, serving as a direct measure of model fit:
![]() | (4) |
Choline chloride (ChCl, 98%), lactic acid (85%), and urea (95%) were purchased from Macklin Biochemical Technology Co., Ltd (Shanghai, China). Glycerol (98%) and Nile Red (98%) were obtained from Shanghai Shenggong, and p-nitroaniline (99%) was purchased from J&K Scientific (Beijing, China). All chemicals were of analytical grade and used without further purification.
Groups of variables | Abbreviation of variables | Variables interpretation | Group | |||||||
---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |||
Composition of biomass | Cellulose | Cellulose content of raw biomass, % | √ | √ | √ | √ | √ | √ | ||
Hemicellulose | Hemicellulose content of raw biomass, % | √ | √ | √ | √ | √ | √ | |||
Lignin's | Lignin content of raw biomass, % | √ | √ | √ | √ | √ | √ | √ | √ | |
Size | Particle size of raw biomass, mm | √ | √ | √ | √ | √ | √ | √ | √ | |
Technological parameter | Temp | Pretreatment temperature, °C | √ | √ | √ | √ | √ | |||
Time | Pretreatment time, h | √ | √ | √ | √ | √ | √ | |||
HBD![]() ![]() |
Molar ratio | √ | √ | √ | √ | √ | √ | √ | √ | |
L![]() ![]() |
Mass ratio of DES to raw biomass | √ | √ | √ | √ | √ | √ | √ | √ | |
Log![]() |
Severity factors of pretreatment reaction | √ | √ | √ | √ | √ | √ | √ | √ | |
MDs | HBA-Sp | Polarizability OF HBA | √ | √ | ||||||
HBD-Sp | Polarizability OF HBD | √ | √ | |||||||
HBA-pKa/pKb | Acidity coefficient/alkalinity coefficient of HBA | √ | √ | √ | ||||||
HBD-pKa/pKb | Acidity coefficient/alkalinity coefficient of HBD | √ | √ | √ | √ | √ | √ | |||
HBA-MW | Exact molecular weight of HBA | √ | √ | |||||||
HBD-MW | Exact molecular weight of HBD | √ | √ | √ | √ | √ | ||||
HBA-TopoPSA | Topological polar surface area of HBA | √ | √ | √ | √ | |||||
HBD-TopoPSA | Topological polar surface area of HBD | √ | √ | √ | √ | √ | ||||
HBA-nHBAcc | Number of hydrogen bond acceptors of HBA | √ | √ | √ | √ | √ | ||||
HBD-nHBAcc | Number of hydrogen bond acceptors of HBD | √ | √ | √ | ||||||
HBA-nHBDcc | Number of hydrogen bond donors of HBA | √ | √ | √ | √ | |||||
HBD-nHBDcc | Number of hydrogen bond donors of HBD | √ | √ | √ | √ | √ | √ | |||
HBA-S![]() ![]() |
The estimated partition coefficient by to the atoms’ contribution to the molecular surface area of HBA | √ | √ | √ | √ | √ | ||||
HBD-S![]() ![]() |
The estimated partition coefficient by to the atoms’ contribution to the molecular surface area of HBD | √ | √ | √ | √ | |||||
HBA-S![]() ![]() |
Hydrophobic constants of HBA | √ | √ | √ | ||||||
HBD-S![]() ![]() |
Hydrophobic constants of HBD | √ | √ | √ | √ | √ | √ | |||
HBA-nAromAtom | The aromatic atom count of HBA | √ | √ | √ | √ | √ | ||||
HBD-nAromAtom | Aromatic atom count of HBD | √ | √ | √ | √ | √ | ||||
HBA-nRot | Rotatable bond count of HBA. | √ | √ | √ | ||||||
HBD-nRot | Rotatable bond count of HBD. | √ | √ | √ | √ | |||||
HBA-nAcid | Number of acid groups of HBA | √ | √ | |||||||
HBD-nAcid | Number of acid groups of HBD | √ | √ | |||||||
HBA-nBase | Number of alkaline groups of HBA | √ | √ | √ | √ | √ | √ | |||
HBD-nBase | Number of alkaline groups of HBD | √ | √ | √ | √ | √ | ||||
HBA-nC | Number of carbon atoms in HBA | √ | √ | |||||||
HBD-nC | Number of carbon atoms in HBD | √ | √ | √ | ||||||
MFs | C-MFs | DES's count-based Morgan fingerprint | √ | √ | √ | √ | √ | √ |
Our feature engineering process is grounded in both chemical interpretability and statistical rigor, ensuring that the final feature space captures critical information related to DES–lignin interactions. This approach improves model robustness and generalizability while maintaining chemical relevance. Compared with previous studies, such as those by Ge et al.,50 which relied on broad solvent parameters, our method provides a more detailed, molecularly resolved view of solvent–biomass interactions. Specifically, our model considers molecular features like hydrogen bond donor/acceptor counts, polarizability, and acid–base dissociation constants, facilitating more rational solvent design. Furthermore, Sumer et al.51 highlighted the use of Kamlet–Taft parameters and COSMO-RS predictions for assessing solvent-lignin affinity, but their approach was computationally intensive and not easily scalable. In contrast, the descriptors we selected are not only chemically meaningful but also easily computable using tools like RDKit, PaDEL, or AlvaDesc, making our approach more scalable for large molecular libraries.
The feature clusters identified through correlation-based hierarchical clustering were validated against known structure–property trends in open-access chemical databases, such as PubChem, ChemSpider, and the OECD QSAR Toolbox. These clusters, including molecular weight, polarizability, and number of rotatable bonds, align with existing quantitative structure–activity relationship (QSAR) models for solvent design in biomass pretreatment. By integrating data-driven correlation analysis with chemically informed feature selection, our framework offers improved predictive power and chemical transparency compared to existing models. It provides a foundation for developing interpretable and generalizable models, which can aid not only in lignin removal predictions but also in broader biorefinery process optimization, including the solubilization of hemicellulose and the generation of lignin-derived aromatic platform chemicals.
Feature group | Model | MAEtest | RMSEtest | R 2 test |
---|---|---|---|---|
All data in the table are results obtained through ten-fold cross-validation; the bolded indicates the optimal overall performance. | ||||
0 | XGBoost | 0.0905 ± 0.0078 | 0.1216 ± 0.0121 | 0.6982 ± 0.0662 |
CatBoost | 0.0913 ± 0.0073 | 0.1215 ± 0.0122 | 0.7024 ± 0.0477 | |
RF | 0.0948 ± 0.0096 | 0.1250 ± 0.0113 | 0.6814 ± 0.0680 | |
SVR | 0.1045 ± 0.0078 | 0.1381 ± 0.0134 | 0.6157 ± 0.0588 | |
ANN | 0.1321 ± 0.0144 | 0.1666 ± 0.0159 | 0.4356 ± 0.1108 | |
1 | XGBoost | 0.0708 ± 0.0062 | 0.973 ± 0.0069 | 0.8076 ± 0.0355 |
CatBoost | 0.0743 ± 0.0067 | 0.0997 ± 0.0093 | 0.7999 ± 0.0296 | |
RF | 0.0830 ± 0.0088 | 0.1100 ± 0.0990 | 0.7540 ± 0.0510 | |
SVR | 0.0870 ± 0.0115 | 0.1190 ± 0.0184 | 0.7058 ± 0.0950 | |
ANN | 0.1108 ± 0.0141 | 0.1433 ± 0.0183 | 0.5875 ± 0.0705 | |
2 | XGBoost | 0.0741 ± 0.0082 | 0.1023 ± 0.0116 | 0.7886 ± 0.0391 |
CatBoost | 0.0758 ± 0.0074 | 0.1026 ± 0.0126 | 0.7887 ± 0.0340 | |
RF | 0.0826 ± 0.0077 | 0.1106 ± 0.0112 | 0.7527 ± 0.0439 | |
SVR | 0.0957 ± 0.0106 | 0.1309 ± 0.0170 | 0.6488 ± 0.0886 | |
ANN | 0.1156 ± 0.0116 | 0.1532 ± 0.0192 | 0.5266 ± 0.0961 | |
3 | XGBoost | 0.0688 ± 0.0041 | 0.0953 ± 0.0055 | 0.8163 ± 0.0267 |
CatBoost | 0.0734 ± 0.0083 | 0.0996 ± 0.0136 | 0.7984 ± 0.0488 | |
RF | 0.0823 ± 0.0086 | 0.1100 ± 0.0101 | 0.7554 ± 0.0429 | |
SVR | 0.0987 ± 0.0115 | 0.1379 ± 0.0225 | 0.6072 ± 0.1247 | |
ANN | 0.1183 ± 0.0126 | 0.1546 ± 0.0184 | 0.5210 ± 0.0704 |
Among single molecular representations, MDs (feature group 1) performed better than MFs (feature group 2). This is expected, as MDs encode physicochemical properties like acidity, polarity, molecular size, and hydrogen bonding, which directly relate to lignin solubilization mechanisms. These properties exhibit greater variance and lower sparsity, making them more suitable for practical machine learning models. MFs, though useful for molecular similarity screening, remain sparse and high-dimensional, which limits their effectiveness in regression-based modeling of complex biochemical processes.
The combination of MDs and MFs (feature group 3) aimed to exploit complementary information from both representations. Tree-based models (XGBoost, CatBoost, and RF) generally performed well with this configuration, showing modest improvements in R2_test and reductions in MAE_test and RMSE_test compared to models using MDs alone. This suggests that, although MDs capture most predictive information, MFs provide additional structural signals that refine predictions. Models like XGBoost and RF, which have built-in feature selection and can handle high-dimensional data, particularly benefited from this combination.30 In contrast, for SVR and ANN, the addition of MFs increased the total feature count from 35 to 194, leading to decreased performance. In SVR, the increased dimensionality led to overfitting, with the kernel-based projection in high-dimensional spaces resulting in hypersurface models that generalized poorly. ANN performance was similarly reduced due to increased feature sparsity and noise disrupting gradient descent convergence.
Despite these challenges, MFs remain valuable from an interpretability perspective. By encoding functional groups like hydroxyl, carbonyl, or amide groups, MFs provide chemical insights into how these groups in HBAs and HBDs influence lignin extraction. Such qualitative insights are difficult to derive from continuous MDs alone, which focus on quantitative aspects. While MDs offer superior predictive performance due to their direct statistical suitability, MFs contribute structural insights that enhance model interpretability. This combined approach can improve prediction accuracy for robust models like XGBoost, although caution is needed when using them in models sensitive to high-dimensional noise. These findings highlight the importance of aligning molecular representation strategies with model characteristics to achieve both predictive power and chemical transparency in machine learning-guided solvent design.
As shown in Fig. 4a, models trained with various feature groups showed distinct predictive performances. XGBoost and CatBoost exhibited high accuracy, with predictions clustering closely along the ideal diagonal line, indicating high predictive accuracy. RF models showed a more dispersed pattern but remained symmetrical, suggesting stable generalization. Conversely, SVR and ANN models displayed a wider scatter, especially at the extremes, suggesting lower precision and generalization compared to tree-based models.
XGBoost consistently outperformed other models, achieving the highest R2 values and the lowest RMSE and MAE across both training and test sets. In feature group 4, XGBoost achieved an R2 of 0.83 on the test set, outperforming all other configurations, demonstrating its superior generalization capability. This was particularly notable as XGBoost performed effectively even with a relatively small dataset, highlighting its robustness in handling high-dimensional chemical descriptors and unbalanced data distributions, a challenge in materials modeling.18,33
Feature group 4 consistently outperformed all other sets across the five algorithms tested, highlighting its superiority over the more extensive feature group 3, which included the complete set of MDs and MFs. This enhanced performance stems from a targeted reduction of redundant descriptors—specifically, HBA-Sp, HBD-Sp, and HBA-MW—identified through hierarchical clustering. The elimination of these polarizability-related descriptors, which exhibited strong correlations with molecular weight (as shown in Fig. 4b), led to improvements in model efficiency without sacrificing predictive accuracy. Notably, while both polarizability and molecular weight provide insights into molecular structure, molecular weight captures a broader range of chemical information. This is reflected in the evaluation results (Table S6), where models retaining only polarizability performed worse than those using feature group 4, underscoring its lower informational value in this context. Further improvements were achieved by removing acidic group counts that were correlated with pKa and pKb, thereby reducing feature redundancy and enhancing computational efficiency. These findings demonstrate that informed feature selection—combining statistical analysis with chemical domain knowledge—is a powerful approach for optimizing predictive models in biomass solvent screening.34,37
The presence of logR0 in feature group 4 was critical, as it has a functional correlation with reaction temperature and time, helping models capture their synergistic effect. Contrastive modeling (Table S7) confirmed that log
R0 significantly enhanced model performance by capturing the temperature–time interaction, outweighing potential artifacts from cross-correlation. Beyond feature group 4, further feature reduction in groups 5 to 7 led to performance degradation for most models, highlighting the importance of balancing dimensionality reduction with information retention. Features in group 4, such as pKa, log
P, hydrogen bonding counts, and polar surface area, are known to influence solvation capacity and lignin solubility, crucial for DES-mediated lignin dissolution.52–54 Previous studies have emphasized the role of polarity and hydrogen bonding in lignin deconstruction, and our approach quantifies their impact on predictive model accuracy.34,37
The superior performance of feature group 4 across all models validates the effectiveness of our feature refinement strategy for DES-based biomass pretreatment. This method combines hierarchical clustering with chemical interpretability to retain the most informative features, ensuring high model accuracy while maintaining mechanistic relevance. The strategy can be adapted to other modeling tasks, such as optimizing organocatalytic reactions or estimating enzymatic hydrolysis efficiency across different biomass types.
This modular approach contrasts with prior studies, which relied on predefined descriptor sets without structural refinement.39,50,55 Our pipeline is adaptive and allows for tailored feature reduction across different systems, with the features retained in feature group 4 demonstrating relevance in other fields such as green solvent screening, polymer solubility prediction, and pharmaceutical crystallization. Thus, this strategy offers a promising framework for data-driven molecular design across diverse applications. Complete evaluation metrics for all machine learning models and feature groups are provided in Table S8.
To validate the consistency of SHAP-based findings, we also examined the F-score values in the XGBoost model (Fig. 5b). The F-score quantifies how frequently a feature is used as a decision node across the ensemble of decision trees. The top 20 features identified by F-score largely overlap with those highlighted by SHAP, with process-related and biomass-related features accounting for 51.0% and 28.6% of the top-ranked variables, respectively. While slight differences exist in ranking, both methods converge on a standard set of dominant features, supporting the robustness of the model interpretation. In addition, to verify the reliability of the results obtained by the K-fold variant of the SHAP method, we analyzed the optimal model using the traditional SHAP method (Fig. S2). It can be seen that the top 20 features contributing to the model are consistent with those obtained by the K-fold variant of the SHAP method. The traditional SHAP method only utilized 92 data points in the test set, thus appearing sparser. Some differences in the ranking of feature importance indicate the impact of different dataset divisions on modeling.
It is particularly noteworthy that the SHAP dependence analysis reveals non-monotonic relationships between several key process parameters and lignin removal efficiency, underscoring the limitations of assuming “more is better” in biomass pretreatment. As shown in Fig. 6a, the SHAP values for reaction time plateau after approximately 4 hours, indicating that extending the duration beyond this point offers limited additional benefit. A similar pattern is observed for reaction temperature (Fig. 6b), where SHAP values peak between 130 °C and 140 °C and decline thereafter. The same non-linear trend emerges in the logR0 dependence plot (Fig. 6c), reinforcing the notion that excessive reaction severity may hinder rather than enhance delignification.
The decline in lignin removal efficiency can be linked to polycondensation reactions occurring under high-severity conditions.56–59 Some studies in the datasets have observed and documented a decrease in the delignification rate of biomass under elevated temperatures or extended reaction times, attributing this trend to structural alterations in lignin under harsh conditions that promote its re-deposition on the biomass surface. Under the influence of acidic or basic DES components and elevated temperatures, cleavage of ether and ester linkages within the lignin macromolecule can lead to the formation of new C–C bonds. This re-polymerization increases the molecular weight of lignin fragments, reduces solubility, and promotes re-deposition onto cellulose surfaces, ultimately decreasing the net removal efficiency. Similar observations have been reported in acidic organosolv and IL-based pretreatments, highlighting a general trade-off between depolymerization and recondensation at high temperature and acidity.37,60–62
The L:
S (liquid-to-solid) ratio emerged as a key parameter influencing lignin removal, as indicated by SHAP analysis. Higher L
:
S ratios consistently showed positive SHAP values (Fig. 6a), suggesting improved solvent penetration, increased surface contact, and enhanced mass transfer. This observation aligns with experimental studies reporting that excess solvent volume can alleviate diffusion limitations during biomass delignification.63 The HBD
:
HBA molar ratio also plays a critical role in tuning DES properties. Higher HBD content generally reduces viscosity and enhances solvent mobility, thereby facilitating molecular diffusion (Fig. 6b). In addition, HBDs influence the proton-donating ability of the DES, which can alter acidity and hydrogen bonding behavior. Variations in HBD
:
HBA ratios—even within similar component classes like choline chloride–acid systems—can lead to markedly different delignification outcomes.11
Biomass composition features revealed more complex SHAP patterns (Fig. 6d and i). A high initial lignin content (>25%) was associated with negative SHAP contributions, possibly due to structural resistance and limited porosity in high-lignin feedstocks such as bamboo.64 These materials often contain tightly cross-linked lignin–carbohydrate complexes (LCCs), which may hinder solvent accessibility. In contrast, cellulose content above 43% correlated with improved lignin removal. This may reflect a more open biomass structure or better dispersion of lignin across the cellulose matrix, facilitating solvent access and dissolution.
The SHAP dependence plots further highlight optimal operational thresholds for effective delignification. Conditions such as logR0 > 3, temperature > 120 °C, time > 2.5 h, HBD
:
HBA > 3.5, and L
:
S > 10 were consistently associated with positive SHAP values. These parameter ranges offer practical guidance for DES pretreatment optimization, supporting efficient lignin removal while minimizing solvent use and energy demand in line with green chemistry objectives.
Compared to process parameters and biomass composition, molecular structure features such as MDs and MFs contribute less to overall model performance but are crucial for capturing subtle chemical effects. Key descriptors, particularly HBD-pKa/pKb, which reflect the proton-donating or -accepting capacity of the hydrogen bond donor, were highly influential, with low pKa/pKb values associated with improved lignin removal. This suggests DES pretreatment not only solubilizes lignin but also facilitates bond cleavage between lignin and hemicellulose, supporting the view of DESs as mild acid–base catalytic systems that cleave bonds in the lignin structure.65 Kwok et al.,66 similarly found that delignification efficiency is governed more by chemical cleavage mechanisms than solubility.
Other MDs and MFs, although ranked lower in overall feature importance, contributed interpretable and chemically meaningful insights. For example, the descriptor HBD-nRot (number of rotatable bonds in the hydrogen bond donor) had a value range between 0 and 5, and its negative SHAP contribution (Fig. 7a and c) suggests that longer alkyl chains reduce DES effectiveness, likely due to increased viscosity and steric hindrance. This trend aligns with experimental observations: DESs composed of ChCl-formic acid (C1) exhibit better lignin solubilization than those with longer-chain HBDs such as butanoic acid (C4).67 Teles et al.68 similarly reported that short alkyl chains in HBDs and HBAs enhance proton transfer capacity by increasing electron density around hydroxyl oxygen atoms, thereby strengthening hydrogen bonding.
Molecular weight (MW) of the HBD also showed a complex relationship with lignin removal, as indicated by the vertical spread of SHAP values in Fig. 7c. While high MW typically corresponds to higher viscosity and lower diffusivity, a subset of data showed unexpectedly positive SHAP contributions. This was associated with the MF-390 feature, corresponding to an H2O substructure from bound water in hydrated oxalic acid (Fig. 7d). Compared to the anhydrous form, the hydrated DES displayed improved lignin removal, likely due to lower viscosity and enhanced molecular mobility. Even small amounts of bound water can reduce DES viscosity and increase polarity, facilitating lignin solubilization.69 Conversely, low SHAP values within the MW distribution (Fig. 7e and f) were linked to MF-321 and MF-2999, representing unsaturated moieties such as CC or aldehyde (–CHO) side chains. These substructures may interfere with effective DES–lignin interactions, potentially through electronic or steric effects.
Other descriptors also provided chemical context. HBD-nC, related to molecular size and carbon chain length, showed a negative SHAP impact (Fig. 7b), possibly due to coupled effects with nRot and MW. Slog
P, representing the octanol–water partition coefficient, also exhibited an inverse relationship with lignin removal, suggesting that lower polarity reduces compatibility with lignin's polar functionalities. In contrast, the topological polar surface area (TopoPSA), associated with polarity and hydrogen-bonding capacity, showed a positive correlation with lignin extraction efficiency. This aligns with previous studies highlighting the importance of solvent polarity in dissolving polyphenolic and aromatic lignin structures.50 Collectively, these results underscore that even lower-ranked structural features contribute valuable mechanistic insights, supporting improved model generalization and rational design of DESs through an enhanced understanding of structure–property relationships.
Compared to MDs, the structural representations provided by MFs offer a more intuitive connection to discrete chemical substructures. However, this representation approach presents several inherent limitations. First, the radius of substructure encoding (typically 1–2 atomic layers in circular fingerprints) may be too small to capture chemically relevant motifs, resulting in fragments that are difficult to interpret. Conversely, increasing the fingerprint radius can result in over-specific representations that closely resemble entire molecules, significantly reducing the generalizability and statistical variance across the dataset. Second, MFs are often characterized by low variance, especially in datasets with structurally similar molecules. This diminishes the individual information gain of each feature and limits their importance in models that combine MFs with MDs.
Despite challenges in interpretability, MFs significantly enhance model accuracy. As shown in Fig. S3, their cumulative contribution, measured by Gain in the XGBoost model, reduces prediction error more effectively than F-score, which only reflects feature usage frequency. Gain quantifies the reduction in squared error attributable to a feature, making it sensitive to interaction effects and marginal contributions of low-variance inputs. Table S1 summarizes the impact of individual MFs on lignin removal predictions, highlighting their role in differentiating structurally similar DES solvents under identical process conditions. These limitations are reflected in the SHAP analysis. In the model that incorporates both MDs and MFs (feature group 3), MFs exhibited more complex and dispersed SHAP value distributions compared to when they were used alone (feature group 2). As illustrated in Fig. S4, fingerprints with identical binary values showed scattered SHAP contributions across different samples. This suggests that in the presence of MDs, MFs do not act as primary explanatory variables but instead contribute through nonlinear interaction effects with other features. Consequently, individual MFs are challenging to interpret in isolation, mainly when the prediction arises from feature synergies rather than direct effects.
Specific MFs displayed expected chemical behavior. For example, MF_4272, representing CH2 units, complements the nRot descriptor, and both negatively impact lignin extraction due to increased viscosity or steric hindrance. MF_2311, linked to –OH or –COOH substructures, was relevant but challenging to isolate due to functional group overlap. MF_4234, indicating the absence of chloride anions, was negatively associated with lignin removal, aligning with the known role of Cl− in choline chloride-based DES systems, where it aids solubilization by forming hydrogen bonds with lignin and disrupting lignin–carbohydrate complexes.37,70
These insights, which are difficult to extract from MDs alone, demonstrate the complementary value of MFs in capturing molecular topology and functional group presence. Their interaction with process variables and physicochemical descriptors allows for more nuanced predictions, emphasizing the importance of hybrid molecular representations. The SHAP-based interpretation of the XGBoost model quantifies the contributions of process parameters, biomass composition, and DES molecular features to lignin removal. While process and compositional variables dominate globally, chemically meaningful MDs and specific MFs provide essential mechanistic insights, supporting solvent design strategies at both the formulation and molecular levels. This integrative modeling approach offers a rational and interpretable pathway for optimizing DES-based biomass pretreatment systems.
For prediction, we utilized the optimized XGBoost model trained entirely on the original dataset and employed feature group 4 as input variables. As shown in Fig. 8a, the predicted values for the blind dataset align closely with the experimental values across most data points. The ChCl: urea-based DES exhibited firm agreement, with minimal deviation from the diagonal reference line, suggesting that the model accurately captures the physicochemical behavior of this solvent system. Predictions for ChCl: lactic acid were also satisfactory, though slightly more scattered, likely due to the use of only 85% pure lactic acid in the experimental formulation. Impurities may have introduced deviations from previously reported performance values (as training set data).
Notably, significant discrepancies were observed for ChCl: glycerol at extreme process conditions (e.g., 160 °C or L:
S = 5), with model predictions overestimating lignin removal. These deviations are highlighted with red circles in both the scatter and residual plots (Fig. 8a and b). In contrast to the model's upward trend, the experimental data showed that ChCl: glycerol DES peaked in efficiency at around 130 °C, with further heating resulting in diminished performance. This behavior, while partially captured by the model, was not accurately predicted under high-severity conditions.
Several factors may contribute to these deviations. From a machine learning perspective, every model operates within a defined Domain of Applicability (DoA), and predictions made outside the training feature distribution are inherently less reliable. In this case, the L:
S ratio in the training dataset ranged from 9 to 30, whereas the blind test dataset included values as low as 5. Two of the four outlier points correspond to L
:
S = 5, suggesting that extrapolation beyond this range led to error. A broader and more balanced training set may mitigate this issue in future iterations.
From an experimental standpoint, discrepancies may stem from the thermal instability of glycerol at temperatures exceeding 150 °C, where partial decomposition has been reported.71 Such degradation could alter the hydrogen bonding network or viscosity of the DES, reducing its delignification efficiency. Furthermore, the hygroscopic nature of DESs, particularly those containing glycerol, could introduce variability in solvent properties due to unintended water uptake. Although care was taken to avoid external water addition, condensation was observed during sealed reactor operation, indicating that environmental humidity may have altered solvent polarity and flow characteristics during the reaction.72
Despite these challenges, the SHAP summary plot for the blind dataset (Fig. 8c) confirms that the relative importance of input features remains consistent with prior observations. This suggests that the model maintains interpretive coherence and stable predictive logic even when applied to previously unseen data.
Overall, the XGBoost model achieved an R2 of 0.8034 on the blind test set, with a MAE of 0.0661. This translates to an average deviation of approximately 6–7 percentage points, lower than the 10-pp error reported in a previous lignin removal model based on regression approaches.22 Given the intrinsic variability of biomass and the experimental uncertainty inherent in fractionation processes, these results demonstrate that the model possesses reliable generalization capacity and can serve as a decision-support tool for solvent screening and process optimization in lignin-first biorefinery systems.
To address the black-box nature of machine learning models, we employed the SHAP method to interpret feature contributions. The results revealed that process conditions such as temperature, severity factor, and L:
S ratio were dominant predictors of lignin removal. Additionally, key molecular descriptors, including HBD-pKa/pKb, the number of rotatable bonds (nRot), molecular weight, S
log
P, and polar surface area, were shown to play mechanistically consistent roles, reinforcing the model's chemical interpretability. Substructure-level contributions captured by MFs further elucidated the role of molecular topology in solvent performance, although their interpretability was more context-dependent. A blind test using an experimentally obtained external dataset containing 42 previously unseen data points confirmed the generalization ability of the model, with an R2 of 0.8034 and MAE of 0.0661. While most predictions closely matched the experimental values, a few deviations, particularly in the ChCl: glycerol system under extreme conditions, highlighted the limitations of extrapolating beyond the domain of applicability. Nonetheless, the SHAP analysis on the blind test set demonstrated consistent feature attributions, further validating the model's robustness.
Overall, this study demonstrates that the data-driven XGBoost model, trained on rationally engineered features, can reliably predict lignin removal outcomes in DES-based biomass pretreatment systems. Compared to traditional trial-and-error experimentation, this approach significantly reduces time, cost, and material usage. More importantly, the incorporation of SHAP interpretation enables mechanistic insights and parameter optimization, thereby bridging predictive modeling with fundamental understanding. The framework presented here provides a transferable platform for accelerating DES formulation, guiding experimental design, and advancing green and efficient lignocellulose biorefinery strategies. Looking ahead, this interpretable ML framework can be further extended to predict other biomass component behaviors (e.g., hemicellulose solubilization), solvent recyclability, or synergistic effects in multi-component DESs. Such developments will accelerate the transition toward data-driven and sustainable solvent engineering.
Supplementary information is available. See DOI: https://doi.org/10.1039/d5gc02370j.
This journal is © The Royal Society of Chemistry 2025 |