Stephen Yaw
Owusu
Department of Chemistry, Missouri S&T, Rolla, MO 65409, USA. E-mail: sadnd@mst.edu
First published on 9th June 2025
This project is among the pioneering works that incorporate machine learning (ML) modeling into the development of biomass-derived sodium-ion battery anodes for sustainable energy storage technologies. It was conceived to use computational techniques to fill a research gap in a 2021 paper by Meenatchi et al. The authors asserted that an activated orange peel-derived hard carbon (AOPDHC) can be used as an anode for sodium-ion batteries, yet the evidence for this claim was lacking. This work therefore used ML to examine the claim by investigating the reversible capacities of AOPDHC at different initial coulombic efficiencies (ICE) and current densities. Data used to train the algorithms were mined from the literature and applied in a 4:1 training-to-testing split. Models that gave good correlations between experimental and predicted capacities for samples withheld as assumed unknowns were used to predict the reversible capacities of AOPDHC. The maximum capacity obtained for AOPDHC was 341.1 mA h g−1 at a current density of 100 mA g−1 and an ICE of 48%, and the minimum capacity was 170.3 mA h g−1 at the same current density and an ICE of 43%. Lastly, the modeling identified ICE as a very important factor influencing the reversible capacities of hard carbon anodes for sodium-ion batteries, which matches literature findings and supports the modeling procedure. This study is of particular importance since biomass-derived hard carbons are versatile, cost-effective, environmentally friendly and sustainable.
Sustainability spotlight: This study helps advance knowledge on how to computationally verify the potential of biomass-derived hard carbon anodes for battery applications, exemplified with an activated orange peel-derived hard carbon (AOPDHC). The modeling procedures described in this work not only accelerate the development of novel biomass-derived materials but also provide new insights for the development of sustainable energy storage technologies. This effort is in tune with UN Sustainable Development Goal 7: to provide affordable, reliable, sustainable, and modern energy for all. The paper mainly revolves around machine learning modeling and a few kinetic studies, which can eventually help create an energy-dominant circular bioeconomy and reduce the waste and exposure to hazardous substances generated by numerous trial-and-error experiments.
Hard carbons stand out among the various anode materials for sodium-ion batteries (SIBs) due to their structure, which facilitates the adsorption and desorption of sodium ions. They are prepared by heating thermosetting carbon-containing precursors under an inert atmosphere (pyrolysis).12 Hard carbons are non-graphitizable, disordered materials. Owing to their excellent sodium storage performance, alongside their low cost, low operating voltage and high capacity, hard carbon anodes appear to be the most likely anode to be commercialized.13 The cost of the anode can be further reduced by using biomass as the hard carbon precursor. Biomass is obtained from organic matter such as plants, animals, and their waste products. Compared to other hard carbon precursors, biomass-derived hard carbons are highly abundant, cheap, renewable, and mostly rich in heteroatoms, which is beneficial for battery applications.14 The preparation of hard carbons is often optimized by tuning the synthesis parameters or changing the biomass precursor. Biomass-derived hard carbons prepared at pyrolysis temperatures between 1200 °C and 1400 °C have been found to exhibit improved structural properties beneficial to sodium-ion storage performance.12,15
Traditional experimental methods of analyzing the electrochemical properties of hard carbons include measuring the surface area, the ratio of defective to graphitic carbon, the ICE, and other material properties, and correlating them with the reversible capacities at different current densities. However, these strategies are laborious, time-consuming and costly. It is therefore advantageous to use machine learning (ML), deep learning and data mining techniques to study the electrochemical properties of novel hard carbons. ML is a subdivision of artificial intelligence used to analyze large datasets. It has been extensively used in scientific research to establish structure–property–performance relationships, which can lead to significant advances in the fabrication of novel materials.16,17
The ML approach has been demonstrated in this work using AOPDHC as an anode for sodium-ion batteries. AOPDHC was chosen primarily to fill the research gap in the paper authored by Meenatchi et al.18 Additionally, orange peels are cheap, abundant, environmentally safe, and sustainable. Compared to other biochar-based materials such as those derived from rape straw pyrolysis,19 AOPDHC presents a significantly lower surface area (60.16 m2 g−1) than rape straw-derived HC (2046.92 m2 g−1) at the same pyrolysis temperature (700 °C). The high surface area of rape straw-derived HC could hinder its applicability in sodium-ion batteries, since a low surface area is required to limit solid electrolyte interphase (SEI) formation and improve the ICE.20
Seven ML algorithms have been used in this study to predict the reversible capacities of AOPDHC as an electrode material. Material characterization data served as the input features and reversible capacity values as the target (response). The features used to train the models were ICE, pyrolysis temperature, current density, surface area, pore volume, interlayer spacing (d002), crystallite sizes (La and Lc), annealing time, heating rate and the ratio of defective to graphitic carbon (ID/IG). The study combined statistical and mathematical analyses to evaluate which algorithm best fits the dataset in terms of prediction accuracy. The best models were then used to predict the reversible capacities of the unknown sample (AOPDHC) at different current densities and ICEs.
To date, only a few studies have used ML algorithms to investigate the performance of sodium-ion batteries, and this work presents significant differences and advances over them. Tianshuang et al. recently used ML to predict the discharge performance of hard carbon materials for sodium-ion batteries.21 Though it was an extensive study, they did not include the pyrolysis temperature and annealing time of the hard carbons as input features; current density and ICE were also missing from their modeling. This is a significant limitation since these factors are known to strongly influence battery capacity.20 Here, the limitation has been addressed by incorporating all of these features into the modeling of reversible capacities for AOPDHC. A different work by Yang et al. used ML models to predict the specific capacitance of biomass-derived carbon materials and compared the predicted results to their corresponding experimental values in the literature.22 Their work is similar to this one in that they did not perform laboratory experiments to augment the modeling results. However, it falls short of this work in its validation approach. In this study, the validation procedure is significantly improved by employing both random sampling and cross-validation techniques. Additionally, some known experimental values were withheld from the algorithms as assumed unknowns and predicted by the models for validation purposes. Furthermore, just like the study by Tianshuang et al., Yang et al.'s work omitted key features such as interlayer spacing (d002), crystallite sizes (La and Lc), ICE, current density, and ID/IG from the modeling, which limits its reliability and applicability to some extent. These limitations have been addressed in this study.
Another notable difference of this work is that models that gave good predictions for reversible capacities withheld from the algorithms were used to predict the actual unknowns (AOPDHC), in addition to the test-and-score results obtained from the modeling. Hence, this work provides new insights and approaches for testing the usage of biomass-derived anodes for sustainable battery technologies.
Finally, the feature datasets were investigated for their contribution and impact on the reversible capacities of AOPDHC. This was done through model accuracy analysis, illustrated by further visualizations using Shapley additive explanations (SHAP) analysis. The SHAP findings were further validated using feature ablation. The findings from this project will hopefully provide a better understanding of the relationship between the various factors that influence the reversible capacities of biomass-derived hard carbons and will rapidly and accurately guide future experimental or computational studies through quick optimizations of the hard carbon preparation and application process.
Data for 170 samples were used for all analyses in this work: 141 samples were used to train the models, and the remaining 29 were used to validate the electrochemical data obtained for AOPDHC. The data were mined from small experimental datasets in several publications, as shown in Table 1.
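For illustration, the 4:1 (80/20) random split used in this work can be sketched in stdlib Python. The records below are placeholders for the mined dataset, not the actual literature values.

```python
import random

def split_4_to_1(samples, seed=42):
    """Shuffle the dataset and hold out one part in five (~20%) for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = len(shuffled) // 5  # a 4:1 training-to-testing split
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder records: (ICE %, current density mA g-1, capacity mA h g-1)
dataset = [(48, 100, 341.1), (43, 100, 170.3)] * 85  # 170 mock samples
train, test = split_4_to_1(dataset)
```

With 170 samples this yields 136 training and 34 testing records.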
No. | Hard carbon precursor | Abbreviation | Reference | Samples in dataset |
---|---|---|---|---|
1 | Epoxy phenol novalac | EPNHC | 25 | 4 |
2 | Navel orange | NOHC | 26 | 3 |
3 | Natural cotton | HCT | 27 | 3 |
4 | Corn cobs | — | 28 | 3 |
5 | Banana peel | BPPG | 29 | 6 |
6 | Cedarwood bark | CBC | 30 | 3 |
7 | Orange peel | AOPDHC | 18 | 6 |
8 | Pomegranate peel | PGPC | 31 | 3 |
9 | Cellulose | CHC | 32 | 4 |
10 | Corn cobs | CC | 33 | 4 |
11 | Mangosteen shell | MHC | 34 | 19 |
12 | Rice husk | RHHC | 35 | 3 |
13 | Agar/urea/graphene oxide | CA | 36 | 4 |
14 | Cellulose nanocrystals | — | 37 | 4 |
15 | Polyurea-Si | PUA@Si | 20 | 4 |
16 | Sugarcane-bagasse | SCA | 38 | 1 |
17 | Phenolic resin | — | 39 | 2 |
18 | Resorcinol-formaldehyde | RFHC | 40 | 4 |
19 | Phenolic resin | — | 41 | 1 |
20 | Biomass starch | — | 42 | 6 |
21 | Biomass | — | 43 | 6 |
22 | Eucalyptus wood | EHC | 44 | 8 |
23 | Date palm | — | 45 | 10 |
24 | Camellia seed shell | TS | 46 | 6 |
25 | P-doped carbon nanofibers | CFs | 47 | 2 |
26 | Sycamore fruit seed | SFS | 48 | 4 |
27 | Maple tree | MAHC | 49 | 3 |
28 | F-doped hard carbon | F-HC | 50 | 3 |
29 | Corn straw piths | — | 51 | 4 |
30 | N-doped hard carbon | — | 52 | 3 |
31 | Walnut shell | WAHC | 53 | 6 |
32 | N-doped hard carbon | — | 24 | 3 |
33 | P-doped olive kernel | OHC | 54 | 3 |
34 | P-doped sisal fiber | PSHC | 55 | 1 |
35 | Asphalt/pecan shells | — | 56 | 2 |
36 | Bamboo | HCB | 57 | 4 |
37 | Corn starch | SCHC | 58 | 2 |
38 | Sawdust | HC | 59 | 2 |
39 | Petroleum asphalt | PHC | 60 | 3 |
40 | Shaddock peel | — | 61 | 4 |
41 | Phenolic resin | — | 62 | 3 |
42 | Hydroxymethylfurfural | — | 63 | 1 |
Variable | Median | Mean | Standard deviation |
---|---|---|---|
Reversible capacity (mA h g−1) | 262.80 | 252.52 | 93.47 |
ID/IG | 1.02 | 1.84 | 7.68 |
Surface area (m2 g−1) | 41.91 | 126.21 | 192.62 |
Current density (mA g−1) | 30.00 | 72.10 | 149.25 |
Pyrolysis temperature (°C) | 1100.00 | 1138.50 | 318.15 |
ICE (%) | 70.00 | 66.22 | 18.18 |
Interlayer spacing (Å) | 3.84 | 3.85 | 0.17 |
Crystallite size (Lc) | 1.58 | 2.75 | 3.04 |
Crystallite size (La) | 4.29 | 5.66 | 3.37 |
Annealing time (h) | 2.00 | 2.11 | 0.79 |
Heating rate (°C min−1) | 5.00 | 4.22 | 1.54 |
Pore volume (cm3 g−1) | 0.04 | 0.12 | 0.24 |
Though a few of the input features (e.g. ID/IG) have a high standard deviation relative to their means, heavily manipulating the data would compromise the accuracy of the study. Here, the accuracy of the data and its findings were prioritized over cosmetically pleasing statistics. Given this, all data were retained rather than replaced with values that differ from the actual experimental ones. Additionally, in dealing with outliers in a dataset for ML modeling, several factors must be considered, including their potential impact on the analysis. Since a factor such as ID/IG critically affects the performance of hard carbons in battery applications, it is better to retain such points. Furthermore, removing all outliers would significantly reduce the amount of data available for the study, which could affect the overall reliability of the modeling, since it is generally not acceptable to train ML algorithms on a very small dataset.
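For context, a common way to flag (though, as argued above, not necessarily remove) such outliers is the interquartile-range rule. A stdlib sketch, using made-up ID/IG values rather than the study's data:

```python
def iqr_outliers(values):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    s = sorted(values)

    def quartile(q):
        # Linear interpolation between the two nearest order statistics.
        pos = q * (len(s) - 1)
        lo_i = int(pos)
        frac = pos - lo_i
        hi_i = min(lo_i + 1, len(s) - 1)
        return s[lo_i] * (1 - frac) + s[hi_i] * frac

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo or v > hi]

# Made-up ID/IG ratios with one extreme value inflating the spread
ratios = [0.9, 1.0, 1.0, 1.1, 1.2, 9.5]
flagged = iqr_outliers(ratios)
```

Such flagging can inform which models to prefer without discarding physically meaningful measurements.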
Outliers do contribute to the weakness of some models in giving accurate predictions, while other models can handle outliers effectively and prevent them from significantly affecting the accuracy of the modeling. For this reason, some models performed better than others in the ML predictions. In the end, the best modeling results were used to draw the conclusions of this work.
The linear regression model captures the relationship between a dependent variable (target) and one or more independent variables (predictors). The model assumes that this relationship is linear. Owing to this assumed linearity, it is particularly useful and easy to interpret when analyzing a small dataset.66
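To illustrate the linearity assumption, a minimal one-feature least-squares fit (a sketch, not the multi-feature model used in this study) can be written as:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sxy / sxx          # slope
    b = my - a * mx        # intercept
    return a, b

# Hypothetical trend: reversible capacity rising with ICE
ice = [40, 45, 50, 55, 60]
cap = [180, 220, 260, 300, 340]
slope, intercept = fit_line(ice, cap)  # recovers cap = 8*ICE - 140 here
```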
Unlike the linear regression algorithm, a decision tree (DT) can fit complex datasets. It learns from layers of decisions to arrive at relevant conclusions. Its working principle is to take an entire dataset and divide it by the true or false outcome of a certain test condition; to obtain the best training process, the resulting subsets are divided by further true/false tests as many times as possible.67 Decision trees are well suited for prediction and classification since they are not affected by data scaling.68,69 For example, Cosgun et al. used a DT to optimize biomass growth and lipid yield conditions for producing renewable biofuel from microalgae, which guided their subsequent experimental work.70
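The true/false splitting principle can be illustrated with a one-level regression stump, a deliberately simplified sketch rather than a full tree; the data are hypothetical:

```python
def best_stump(xs, ys):
    """Find the single threshold that minimizes squared error when each
    side of the split is predicted by its own mean."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]   # test condition true
        right = [y for x, y in zip(xs, ys) if x > t]   # test condition false
        err = sse(left) + sse(right)
        if best is None or err < best[1]:
            best = (t, err)
    return best[0]

# Hypothetical data: capacity drops sharply above 100 mA g-1
current = [30, 50, 100, 500, 1000]
capacity = [300, 290, 280, 120, 100]
threshold = best_stump(current, capacity)
```

A full tree repeats this split recursively within each subset.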
Random forest (RF) also builds on decision trees: it improves on the results of a single tree by using multiple trees for its predictions. During modeling with RF, each tree is built from independently and randomly selected samples. This randomness reduces the possibility of overfitting, which leads to more accurate predictions.71 Feature selection is not required for this model, which makes it particularly useful for processing high-dimensional data. In addition, unbiased estimation is used in training this model, which gives it strong generalization ability.72
In the gradient boosting algorithm, simple decision trees are combined sequentially to make predictions. Since this model makes minimal assumptions about the data, it performs better than random forest when dealing with large and complex data; the minimal assumptions also give it a reduced mean squared error and an improved R2 value compared to random forest.73 AdaBoost uses an adaptive learning technique in which weak learners are reweighted to favor previously misclassified data. This makes AdaBoost less susceptible to overfitting than some other algorithms. Though the individual learners may be weak, they can combine into a stronger learner as long as each performs better than random guessing. As a result, AdaBoost has been reported to work well when handling noisy data containing outliers.74
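A minimal gradient-boosting sketch, assuming depth-1 trees (stumps, redefined here so the sketch is self-contained) fitted to residuals with a shrinkage factor; the data are hypothetical, not the study's dataset:

```python
def fit_stump(xs, ys):
    """Depth-1 regression tree: a threshold plus a mean prediction per side."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x, t=t, lm=lm, rm=rm: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, lr=0.3):
    """Fit stumps to residuals, shrinking each stage by the learning rate."""
    base = sum(ys) / len(ys)
    stumps = []
    resid = [y - base for y in ys]
    for _ in range(n_rounds):
        s = fit_stump(xs, resid)
        stumps.append(s)
        resid = [r - lr * s(x) for x, r in zip(xs, resid)]
    return lambda x: base + lr * sum(s(x) for s in stumps)

# Hypothetical current density (mA g-1) vs capacity (mA h g-1)
xs = [30, 50, 100, 500, 1000]
ys = [300, 290, 280, 120, 100]
model = gradient_boost(xs, ys)
```

Each round fits only what the previous ensemble failed to explain, which is why boosting can capture complex, non-linear relationships.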
The support vector machine (SVM) model makes decisions on the training data in a manner that maximizes the decision-boundary margins in feature space. By doing so, classification errors are minimized and a better generalization ability is obtained, making this model useful for both small and complex datasets.75
The k-nearest neighbor (KNN) algorithm classifies data points by finding the points most similar to them. It predicts the value of a data point from the values of its neighbors and is useful for small datasets. However, it tends to be overly sensitive to irrelevant data.76
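A minimal k-nearest-neighbor regression on a single feature (a toy sketch with hypothetical values, not the study's multi-feature setup):

```python
def knn_predict(train_x, train_y, query, k=3):
    """Predict by averaging the targets of the k closest training points."""
    by_distance = sorted(zip(train_x, train_y), key=lambda p: abs(p[0] - query))
    nearest = by_distance[:k]
    return sum(y for _, y in nearest) / k

# Hypothetical ICE (%) vs reversible capacity (mA h g-1)
ice = [40, 45, 50, 55, 60]
cap = [180, 220, 260, 300, 340]
pred = knn_predict(ice, cap, query=50, k=3)  # averages neighbors at 45, 50, 55
```

Because the distance treats every feature equally, an irrelevant feature distorts the neighborhood, which is the sensitivity noted above.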
This procedure was followed to obtain data for all samples with unknown reversible capacities. For all analyses, the data output from the algorithms is the average of a random sampling procedure set to twenty repeated runs.
This technique is useful since obtaining experimental data can be hindered by lack of resources, time, effort, funding, safety and many other factors. Computational data science fills the gaps where it is practically impossible to obtain actual experimental data. To further strengthen the validation approach of this work, both random sampling and cross-validation (K-fold = 10) were employed.
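The two validation schemes, twenty repeated random 80/20 splits and 10-fold cross-validation, can be sketched in stdlib Python as index generators (model fitting omitted; the sample count 170 matches the dataset, everything else is illustrative):

```python
import random

def random_splits(n, n_repeats=20, test_frac=0.2, seed=0):
    """Yield (train_idx, test_idx) pairs for repeated random sampling."""
    rng = random.Random(seed)
    n_test = int(n * test_frac)
    for _ in range(n_repeats):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[n_test:], idx[:n_test]

def kfold_splits(n, k=10):
    """Yield (train_idx, test_idx) pairs for K-fold cross-validation."""
    idx = list(range(n))
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = idx[start:start + size]
        train = idx[:start] + idx[start + size:]
        yield train, test
        start += size
```

Metrics would then be averaged over the twenty random splits, or over the ten folds, before being reported.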
The correlation between the predictions and actual values indicates that KNN and SVM are poor algorithms for this study, as their R2 values were below 0.5. This may result from the unbiased nature of KNN modeling, which makes it very sensitive to irrelevant dataset features.81 The non-suitability of SVM, on the other hand, may be because this model assumes balanced classes,82 which is not the case in this study. Comparatively, Fig. 3 shows a better correlation between the experimental and predicted values for the random forest, gradient boosting, AdaBoost, and linear regression models on the training dataset, with R2 values greater than 0.5. These analyses were based on a random sampling technique using an 80% training and 20% validation data split.
According to Table 3, which lists the coefficient of determination (R2) and root mean squared error (RMSE) for each model, gradient boosting, AdaBoost and random forest are best at predicting the reversible capacities of the training dataset.
Model | MSE | RMSE | MAE | MAPE | R2 |
---|---|---|---|---|---|
AdaBoost | 5252.123 | 72.472 | 50.370 | 0.263 | 0.402 |
Random forest | 5338.342 | 73.064 | 53.134 | 0.289 | 0.392 |
Gradient boosting | 4680.299 | 68.413 | 47.120 | 0.255 | 0.467 |
Tree | 7077.600 | 84.128 | 63.688 | 0.319 | 0.194 |
SVM | 6092.398 | 78.054 | 57.319 | 0.315 | 0.307 |
KNN | 6687.577 | 81.778 | 59.193 | 0.324 | 0.239 |
Linear regression | 7650.113 | 87.465 | 67.640 | 0.351 | 0.129 |
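The error metrics tabulated here follow their standard definitions; a stdlib sketch on hypothetical capacity values (not entries from the tables):

```python
import math

def regression_metrics(actual, predicted):
    """Compute MSE, RMSE, MAE, MAPE and R2 for a set of predictions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    mean_a = sum(actual) / n
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    r2 = 1 - sum(e ** 2 for e in errors) / ss_tot  # fraction of variance explained
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "MAPE": mape, "R2": r2}

# Hypothetical capacities (mA h g-1): actual vs model output
m = regression_metrics([200, 250, 300], [210, 240, 310])
```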
A cross-validation approach was also conducted on the predictions to ensure the accuracy of random sampling results, and its findings are given in Table 4.
Model | MSE | RMSE | MAE | MAPE | R2 |
---|---|---|---|---|---|
AdaBoost | 5554.037 | 74.525 | 49.970 | 0.241 | 0.361 |
Random forest | 5530.483 | 74.367 | 52.507 | 0.260 | 0.364 |
Gradient boosting | 4788.885 | 69.202 | 45.886 | 0.228 | 0.449 |
Tree | 6833.056 | 82.662 | 59.094 | 0.272 | 0.214 |
SVM | 6327.576 | 79.546 | 57.126 | 0.283 | 0.272 |
KNN | 6456.778 | 80.354 | 56.420 | 0.283 | 0.257 |
Linear regression | 7349.207 | 85.728 | 65.972 | 0.320 | 0.155 |
Comparing both validation techniques, cross-validation (K-fold = 10) gave outcomes similar to those of the random sampling technique.
To get a better view of how well each model predicts reversible capacity values close to those obtained experimentally, the data for each model were plotted independently and their correlation coefficients determined, as shown in Fig. 3.
Based on the R2 values obtained, Fig. 3 shows a better predictive potential for the unknown using linear regression compared to the remaining models. However, since it failed to give good predictions during model training, it could not be relied upon. As a result, the findings from Fig. 3 lead to the conclusion that gradient boosting is the best algorithm for this study.
As seen in Fig. 4, the maximum capacity obtained for AOPDHC was 341.1 mA h g−1 at a current density of 100 mA g−1 and an ICE of 48%, and the minimum capacity was 170.3 mA h g−1 at the same current density and an ICE of 43%. Even at a low current density (30 mA g−1), gradient boosting still predicted a capacity greater than 250 mA h g−1, which is comparable to the experimental results reported in most of the literature used here. Though the models' performance in predicting the training set was average, they performed better when predicting the reversible capacities of the actual samples that were assumed to be unknown. The data for AOPDHC were modeled simultaneously with the assumed unknowns, and since good correlations were obtained between the experimental and predicted data for the assumed unknowns, the data obtained for the actual unknown (AOPDHC) should also be reliable, regardless of the average performance of the models on the training dataset.
Computational modeling via machine learning in this study could therefore address the research gap left by Meenatchi et al. in their manuscript18 and supports the claim that AOPDHC can be used as an anode for sodium-ion batteries. Though there were a few outliers, Fig. 4 shows that, for each current density, an improvement in ICE resulted in an increase in the reversible capacities. This is expected since, for hard carbons, a higher ICE means less energy is lost to irreversible reactions during the initial charge cycle, which translates to more energy being stored in and retrieved from the battery. This trend mirrors observations made with other hard carbon systems for sodium-ion batteries.20,83 The linear relationship seen with the models further confirms the reliability of these predictions, since it is known from the literature that increasing the ICE increases the reversible capacity.36,84
Fig. 5 ranks the importance of each feature in predicting the reversible capacities of AOPDHC for four modeling techniques (gradient boosting, random forest, AdaBoost and linear regression) using the training dataset summarized in Table 2. These models were chosen because they gave good predictions of the assumed unknowns. Since some models performed better than others, it is useful to investigate how each model arrives at its predictions. The differences in the rankings likely arise because each model analyzes the data differently, so the rankings reveal which features each model prioritizes in arriving at its conclusions. To further validate the SHAP results, feature importance via ablation analysis was conducted as a follow-up, and only rankings that agree between the two techniques are used in drawing conclusions and analyzing how each model interprets feature relevance.
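The feature-ablation idea can be sketched generically: re-evaluate the model with each feature removed and record the increase in error. The 1-nearest-neighbor "model" and tiny dataset below are placeholders for illustration only, not the models or data of this study.

```python
def one_nn(train_X, train_y, x, use):
    """1-NN prediction using only the feature indices listed in `use`."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in use)
    best = min(zip(train_X, train_y), key=lambda p: dist(p[0], x))
    return best[1]

def ablation_importance(train_X, train_y, test_X, test_y):
    """Importance of each feature = increase in MAE when it is removed."""
    n_feat = len(train_X[0])

    def mae(use):
        errs = [abs(one_nn(train_X, train_y, x, use) - y)
                for x, y in zip(test_X, test_y)]
        return sum(errs) / len(errs)

    base = mae(list(range(n_feat)))
    return [mae([i for i in range(n_feat) if i != j]) - base
            for j in range(n_feat)]

# Placeholder data: feature 0 (e.g. ICE) drives capacity, feature 1 is noise
train_X = [[40, 7], [50, 3], [60, 9]]
train_y = [180, 260, 340]
test_X = [[41, 9], [59, 3]]
test_y = [188, 332]
imp = ablation_importance(train_X, train_y, test_X, test_y)
```

Removing the informative feature degrades the error sharply, while removing the noise feature leaves it unchanged, which is the signal the ablation ranking reads.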
It is seen from Fig. 5 that, except for AdaBoost, all models ranked ICE as the most important factor influencing the reversible capacity predictions. This is supported by the literature, as a low ICE, often caused by the poor reversibility of the sodiation/desodiation reactions and the decomposition of electrolytes to form the SEI in the first cycle, has a direct negative effect on the battery's capacity, and vice versa.83
The second most important factor per the rankings of all models except AdaBoost was current density; AdaBoost ranked this feature first. These observations signify the importance of current density to the reversible capacity predictions and are supported by the literature, which has experimentally shown that high current densities lead to low reversible capacities.85
The boosting algorithms (gradient boosting and AdaBoost) considered interlayer spacing an important factor, whilst linear regression ranked it among the least influential. Also, linear regression ranked crystallite size (Lc) above interlayer spacing, whereas all remaining models ranked it as the least influential factor in the capacity predictions. The rankings observed with linear regression could be because the interlayer spacing values of all the hard carbons were similar, with a low standard deviation relative to their mean, which may hinder the model's ability to find a linear relationship between the input (interlayer spacing) and the output (reversible capacity). This differs from crystallite size (Lc), whose values have a high standard deviation relative to their mean. The boosting algorithms, on the other hand, improve the performance of weak learners to build a better predictive model and are therefore powerful even at modeling complex, non-linear relationships.86
AdaBoost ranked ID/IG as the second most influential factor in predicting the reversible capacities, whereas it was only an average influencing factor for gradient boosting. This could be because, in improving weak learners to create better predictive models, AdaBoost starts by building short trees (stumps), whereas gradient boosting starts from a single leaf and builds deeper trees.77 As a result, the high standard deviation of the ID/IG data may have little influence on the AdaBoost predictions.
The two ranking techniques have different principles and hence give slightly different results. This difference could be because SHAP ranks features not only by their importance but also by how much each feature contributes to the models' predictions. Regardless, the overall findings are similar, and the effects of the key features are identified.
Overall, it is reported that surface area directly influences the ICE, which most of the models ranked as the most influential factor, while pyrolysis temperature directly affects the degree of graphitization of the carbon materials, as reflected in their ID/IG.25 Given the interrelationships between all the features considered in the reversible capacity predictions, it is recommended to take all of them into account when preparing hard carbons for sodium-ion batteries, regardless of their rank in this study.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5su00360a
This journal is © The Royal Society of Chemistry 2025 |