 Open Access Article
 Open Access Article
      
        
          
            Haruka 
            Tobita
          
        
      a, 
      
        
          
            Yuki 
            Namiuchi
          
        
      b, 
      
        
          
            Takumi 
            Komura
          
        
      a, 
      
        
          
            Hiroaki 
            Imai
          
        
       a, 
      
        
          
            Koki 
            Obinata
          
        
      c, 
      
        
          
            Masato 
            Okada
a, 
      
        
          
            Koki 
            Obinata
          
        
      c, 
      
        
          
            Masato 
            Okada
          
        
       c, 
      
        
          
            Yasuhiko 
            Igarashi
c, 
      
        
          
            Yasuhiko 
            Igarashi
          
        
       *b and 
      
        
          
            Yuya 
            Oaki
*b and 
      
        
          
            Yuya 
            Oaki
          
        
       *a
*a
      
aDepartment of Applied Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan. E-mail: oakiyuya@applc.keio.ac.jp
      
bFaculty of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, Tsukuba 305-8573, Japan. E-mail: igayasu1219@cs.tsukuba.ac.jp
      
cGraduate School of Frontier Sciences, The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa 277-8561, Japan
    
First published on 17th May 2023
Organic energy storage has attracted a lot of interest in enhancing performance and reducing the consumption of resources. If performance predictors are prepared, the exploration of new compounds can be accelerated without consumption of time, energy, and effort. In the present work, a new straightforward capacity predictor is constructed for the exploration of organic anode-active materials. Sparse modeling for small data (SpM-S) combining machine learning (ML) and our chemical insights was used to construct linear regression models of specific capacity. In our previous work, two predictors (models G1 and G2) were prepared using small datasets. However, the descriptors and prediction accuracy of these models were not validated. In the present work, a new improved model (model G3) has been constructed with the addition of new data. These three models were studied in terms of data science: namely, prediction accuracy, validity of the descriptors, amount of training data used, and effect of ML algorithms. The straightforward, generalizable, and interpretable model G3 can be applied to explore new organic anode-active materials. Moreover, these data-scientific approaches to model construction and validation can be used to explore new energy-related materials even with small data.
Organic anode-active materials exhibit high specific capacity compared with conventional graphite.1–9 In previous work, conductive polymers with redox reactions in the range of 2.0–0.5 V vs. Li/Li+ were studied as a classical organic anode-active material.15–18 Tarascon et al. reported a new scheme for the lithium alkoxylation of carbonyl groups in π-conjugated molecules.19 Sun et al. found the uptake of multiple lithium ions (Li+) in a π-conjugated framework,20i.e. superlithiation, drastically enhancing specific capacity.21–29 Although π-conjugated molecules have potential for superlithiation, not all such compounds show high specific capacity. In recent years, known compounds with high specific capacity were introduced into polymers and covalent organic frameworks to enhance performance.30–32 The exploration and discovery of new compounds depending only on professional experience encounter limitations. A more specific design strategy is required to discover new anode-active materials efficiently. If the correlations between molecular structure and capacity are elucidated, a predictor can be constructed to accelerate the exploration of new compounds. Redox potentials were calculated to design organic electrode-active materials by computational chemistry.33,34 The reactivity of organic anode-active materials with multiple lithium ions was studied by calculation.20,28 However, specific capacity is not easily predicted by computational chemistry alone because various factors, such as conductivity, size, and shape of the particles, are related to capacity. Therefore, we have focused on machine learning (ML) to extract the significant factors and construct the capacity predictors.
Data-driven approaches have been rapidly developed in recent materials science.35–40 ML has been used to predict the structures and functions of molecules and materials. In general, bigger training data is preferred to construct more accurate predictors. Big data sufficient for conventional ML algorithms is not easily prepared based on experimental studies in the laboratory. New ML schemes applicable to small data have been studied in recent years.10,41–43 In addition, automated, robotic, and combinatorial methods are used to obtain big training data efficiently.44–47 However, not all experiments including synthesis and characterization are integrated in an automated system. Although training data can be collected from the literature, the reported values include differences and errors depending on the experimental conditions of an individual research group. Experimental scientists need a new methodology to use ML for small data. Our group has developed SpM-S, a new scheme of ML for small data.10–14 Sparse modeling (SpM) is a general concept to describe whole high-dimensional data using a limited number of significant descriptors extracted by ML. In SpM-S, the extraction of the descriptors using ML is followed by further selection based on our chemical insight. Combination with our chemical insight contributes to avoiding overtraining caused by the small data and improving generalizability.10,14 Therefore, SpM-S provides straightforward, interpretable, and generalizable linear regression models using a limited number of descriptors. Our group constructed performance predictors for organic cathode-active and anode-active materials using SpM-S.48–50 Although two predictors (G1 and G2) for the specific capacity of an anode were prepared in our previous work (Fig. 1a–d),48,49 their prediction accuracy was not sufficient. Moreover, the validity of the predictors and extracted descriptors were not studied in terms of data science. In the present work, a new improved model G3 was prepared with the addition of new experimental data (Fig. 1c and d). The validity of the prediction models G1–G3 was studied in terms of prediction accuracy, extracted descriptors, amount of training data, and ML algorithm (Fig. 1e).
|  | ||
| Fig. 2 Relationship between the estimated and measured capacity in the training (black) and test (red) data for models G1 (a) and G2 (b). | ||
| No. | Specific capacity/mA h g−1 | G1 | G2 | G3 | Ref. | No. | Specific capacity/mA h g−1 | G1 | G2 | G3 | Ref. | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| a The molecular structures of 1–54 are displayed in Scheme S1 in the ESI. b The specific capacity refers to the training and test datasets in our previous work.48,49 c The specific capacity was measured in the present work (Fig. S1 in the ESI). d The differences in the specific capacity are caused by the differences in the current density. The former and latter values are the measured capacity in the datasets G1 and G2, respectively.48,49 | |||||||||||
| 1 | 0 | G1 | 48 | 28 | 490 | G2 | G3 | 49 | |||
| 2 | 0 | G1 | 48 | 29 | 6 | G2 | 49 | ||||
| 3 | 0 | G1 | 48 | 30 | 178 | G2 | G3 | 49 | |||
| 4 | 0 | G1 | 48 | 31 | 30 | G2 | G3 | 49 | |||
| 5 | 0 | G1 | 48 | 32 | 798 | G2 | G3 | 49 | |||
| 6 | 19 | G1 | 48 | 33 | 55 | G2 | 49 | ||||
| 7 | 732/0 | G1 | G2 | 48, 49 | 34 | 513 | G2 | G3 | 49 | ||
| 8 | 126/221 | G1 | G2 | G3 | 48, 49 | 35 | 109 | G2 | G3 | 49 | |
| 9 | 0 | G1 | 48 | 36 | 56 | G3 | # | ||||
| 10 | 478/28 | G1 | G2 | G3 | 48, 49 | 37 | 105 | G3 | # | ||
| 11 | 0 | G1 | 48 | 38 | 0 | G2 | 49 | ||||
| 12 | 0 | G1 | 48 | 39 | 277 | G2 | G3 | 49 | |||
| 13 | 84 | G1 | 48 | 40 | 0 | G2 | 49 | ||||
| 14 | 135/64 | G1 | G2 | G3 | 48, 49 | 41 | 201 | G3 | # | ||
| 15 | 178/1147 | G1 | G2 | G3 | 48, 49 | 42 | 277 | G2 | G3 | 49 | |
| 16 | 0 | G1 | 48 | 43 | 141 | G2 | G3 | 49 | |||
| 17 | 355 | G2 | G3 | 49 | 44 | 134 | G3 | 49 | |||
| 18 | 175 | G3 | 49 | 45 | 15 | G3 | 49 | ||||
| 19 | 24 | G3 | # | 46 | 267 | G3 | # | ||||
| 20 | 105 | G2 | G3 | 49 | 47 | 73 | G3 | 49 | |||
| 21 | 0 | G2 | 49 | 48 | 318 | G3 | 49 | ||||
| 22 | 142 | G2 | G3 | 49 | 49 | 63 | G3 | # | |||
| 23 | 405 | G3 | 49 | 50 | 229 | G3 | # | ||||
| 24 | 227 | G2 | G3 | 49 | 51 | 133 | G3 | 49 | |||
| 25 | 91 | G3 | # | 52 | 279 | G3 | 49 | ||||
| 26 | 310 | G2 | G3 | 49 | 53 | 23 | G3 | # | |||
| 27 | 0 | G2 | 49 | 54 | 273 | G3 | # | ||||
The explanatory variables (xn) were the potential descriptors related to capacity prepared based on our chemical insight (Table 2). The following parameters were used as xn (Table 2):48,49 the energy levels (E) of LUMO (x1: ELUMO0), four energy levels higher than the LUMO (x2–x5: ELUMOj, j = 1–4), the absolute values of the differences in the energy levels (x6–x15: ΔELUMOj–k: j, k = 0–4), molecular weight (x16), expected maximum (theoretical) capacity (x17), theoretical specific capacity for reaction with one Li+ (x18), the number of carboxy groups (x19), the number of carbonyl groups (x20), the number of conjugated carbons (x21), the number of occupied orbitals (Norb) lower than the work function of lithium and energy level (E) E = 0 (x22: Norb, ELUMO0 ≤ E < ΦLi, x23: Norb, ELUMO0 ≤ E < 0), the sum of absolute values of E for the orbitals in the range from ELUMO0 to E = 0 (Σ|E|, ELUMO0 ≤ E < 0, x24), Hansen solubility-(similarity-)parameter (HSP) distance between the target compound and electrolyte solution (x25), melting point (x26), the number of sulfur (S) atoms in the heteroaromatic rings (x27), dipole moment (x28), the minimum and maximum values of the partial charge density (x29, x30), HSP dispersion (δD), polarity (δP), and hydrogen-bonding (δH) terms (x31–33, respectively), the number of nitrogens and oxygens in the heteroaromatic rings (x34 and x35, respectively), the ratio of the number of heteroatoms (N, S, O) to the total number of carbon, N, S, and O (x36). In SpM-S, the significant descriptors were extracted by ML and then selected in combination with our experience and chemical insight. Linear regression models were constructed using the selected descriptors. After predictors G1 and G2 were introduced, the validity of predictor G3 and the advances in the processes of these predictors were studied in terms of data science.
| No. | Explanatory variable xn | Unit | G1c | G2c | G3c | 
|---|---|---|---|---|---|
| a DFT calculation values. b HSP calculation values. c x n values shown in bold and italics were used as descriptors in models G1–G3 with positive and negative correlations, respectively. | |||||
| 1 | E LUMO0 | eV | G1 | G2 | G3 | 
| 2 | E LUMO1 | eV | G1 | G2 | G3 | 
| 3 | E LUMO2 | eV | G1 | G2 | G3 | 
| 4 | E LUMO3 | eV | G1 | G2 | G3 | 
| 5 | E LUMO4 | eV | G1 | G2 | G3 | 
| 6 | E LUMO1-0 | eV | G1 | ||
| 7 | E LUMO2-0 | eV | G1 | ||
| 8 | E LUMO3-0 | eV | G1 | ||
| 9 | E LUMO4-0 | eV | G1 | ||
| 10 | E LUMO2-1 | eV | G1 | ||
| 11 | E LUMO3-2 | eV | G1 | ||
| 12 | E LUMO3-1 | eV | G1 | ||
| 13 | E LUMO4-3 | eV | G1 | ||
| 14 | E LUMO4-2 | eV | G1 | ||
| 15 | E LUMO4-1 | eV | G1 | ||
| 16 | Molecular weight | g mol−1 | G1 | G2 | G3 | 
| 17 | Expected maximum capacity | mA h g−1 | G1 | ||
| 18 | Capacity reacted with 1 Li+ | mA h g−1 | G2 | ||
| 19 | Number of carboxy groups | — | G1 | G2 | |
| 20 | Number of carbonyl groups | — | G3 | ||
| 21 | Number of conjugated carbons | — | G1 | G2 | G3 | 
| 22 | N orb, ELUMO0 ≤ E < ΦLi | — | G1 | G2 | |
| 23 | N orb, ELUMO0 ≤ E <0 | — | G1 | G2 | G3 | 
| 24 | Σ|E|, ELUMO0 ≤ E <0 | eV | G1 | G2 | |
| 25 | HSP distance | — | G1 | G2 | G3 | 
| 26 | Melting point | °C | G1 | G2 | |
| 27 | Number of S | — | G2 | ||
| 28 | Dipole moment | Debye | G2 | G3 | |
| 29 | Minimum of charge density | — | G2 | G3 | |
| 30 | Maximum of charge density | — | G3 | ||
| 31 | HSP-δD | — | G2 | G3 | |
| 32 | HSP-δP | — | G2 | G3 | |
| 33 | HSP-δH | — | G2 | G3 | |
| 34 | Number of N | — | G2 | ||
| 35 | Number of O | — | G2 | ||
| 36 | Ratio of heteroatoms | — | G2 | G3 | |
| y′ = 64.6x23 + 67.3x25 – 98.2x26 + 109.5 | (1) | 
The test data including compounds A–M was prepared using literature values (Table 3 and Scheme S2 and Table S2 in the ESI†).21–29 As predictor G1 needs the melting point (x26) to calculate y′, only nine compounds (A, B, C, E, F, G, H, L, M) with melting point data were used for the test (test dataset G1, Table S2 in the ESI†). Predictor G1 had an RMSE of 629 mA h g−1 for the test data (red circles in Fig. 2a). The black and red plots are not in the diagonal line of the y–y′ plots representing the relationship between the predicted and measured values. A couple of new potential compounds, such as benzodithiophene, were successfully found using predictor G1 in a limited number of experiments.48 However, the predictor needs improvement for the following reasons. The measured capacity is higher than the estimated value, as indicated by the red arrow (Fig. 2a). This fact means that the capacity is underestimated by model G1. The underestimation is caused by the unbalanced small training data because nine of the 16 compounds had a specific capacity of 0 (Table 1). In addition, the melting point (x26) is not always available for unknown new compounds. Therefore, model G1 is not easily applied to the practical exploration of new compounds.
| y′ = 20.4x4 – 307.6x16 + 303.2 x22 – 9.13x23 + 12.4 x25 + 40.3x35 +218.9 | (2) | 
The RMSE for the test data including 13 compounds A–M was 338 mA h g−1 (red circles in Fig. 2b and Table 3 and Scheme S2 and Table S4 (test dataset G2) in ESI†).21–29 The black and red plots approach the diagonal line of the y–y′ plots compared with those in model G1. A new potential active material with high specific capacity and cycle stability, namely 5-formylsarytilic acid, was found using predictor G2.49 However, predictor G2 still needs an improvement in accuracy.
We visually extracted seven xn (n = 2, 16, 21, 25, 33, 36) from the weight diagram (the left-hand axis in Fig. 3a) and then studied their validity as descriptors. The positive correlation of x25 (HSP distance) and negative correlation of x33 (HSP-δH) imply that rigid molecular frameworks with low solubility to the electrolyte enable a stable redox reaction leading to high specific capacity. The positive correlation of x36 (ratio of the heteroatoms) implies that charge localization in the molecules promotes the introduction of Li+. These xn (n = 2, 25, 33, 36) are consistent with our chemical insight. Although the positive correlation of x2 (ELUMO1) is not directly explained, the positive correlation of the LUMO levels was used as the results of ML in model G2.49 In the present work, x2 is also adopted as a descriptor in model G3. Further studies including a calculation study are needed to elucidate the correlation between the LUMO level and capacity. The positive correlation of x16 (molecular weight) and the negative correlation of x21 (number of conjugated carbons) are not simply consistent with our chemical insight. In principle, the correlation of these descriptors is inverse. A higher specific capacity (mA h g−1) is achieved by compounds with a lower molecular weight. More conjugated carbons enhance superlithiation, leading to an increase in specific capacity. Therefore, these xn (n = 16, 21) are not selected for model G3. On the other hand, two xn (n = 20, 28) are added as descriptors according to our chemical insight. The positive correlation of x20 (the number of carbonyl groups) means an increase in the reactivity of Li+. In addition, carbonyl groups were reported to be reaction sites in previous work.3–6 The positive correlation of x28 (dipole moment) means charge localization of the molecule enhances the introduction of Li+. In this manner, six xn (n = 2, 20, 25, 28, 33, 36) were selected as descriptors in combination with ES-LiR and our chemical insights. Predictor G3 was described by (eqn (3)) using six xn with RMSE 144 mA h g−1 for the training data (black circles in Fig. 3b).
| y′ = 164.6x2 + 58.0x20 + 116.8x25 + 98.5x28 – 280.1x33 + 296.9x36 + 229.9 | (3) | 
When five-fold cross validation was performed using (eqn (3)) in training dataset G3, the average RMSE values were 194 ± 12.6 mA h g−1 for the training and 218 ± 113 mA h g−1 for the test data. Model G3 showed an RMSE of 366 mA h g−1 for the test data, including compounds A–M (red circles in Fig. 3b, Table 3 and Scheme S2 in the ESI†). Although the RMSE value of model G3 is smaller than that of model G2 for the training dataset (the black circles in Fig. 2b and 3c), model G3 shows a larger RMSE value than model G2 for the test dataset (the red circles in Fig. 2b and 3c). The relationship between the estimated and measured capacity of model G3 more accurately represents the trend of high and low capacity compared with that of models G1 and G2 (Fig. 2b and 3b), because more plots are on the diagonal line in the true-error plots. The overall accuracy and generalizability of the prediction model are evaluated not only by the RMSE values but also by the true-error plots. These results imply that model G3 can be used for an exploration of new unknown compounds.
The validity of the extracted descriptors was studied in terms of data science. The averaged absolute values of the coefficients were calculated for each xn in 100 models with the smallest CVE values (Fig. 3c). The averages were larger than 35 for the visually extracted xn (n = 2, 16, 21, 25, 33, 36) from the weight diagram. The chart quantitatively supports the validity of the visually extracted xn from the weight diagram (Fig. 3a). However, the selected xn (n = 20, 28) based on our chemical insight were not supported by the chart (Fig. 3c). In addition, the chart indicates that xn (n = 16, 21) are potential descriptors. The validity of the six selected descriptors is not fully supported by ES-LiR alone. In general, ES-LiR has the following two problems which need to be solved. CVE is used to evaluate the prediction accuracy of the models. As this CVE-based model selection causes overfitting the training data, a true model is not always obtained.53 The other problem is the visual and qualitative extraction process of the descriptors from the weight diagram displaying the coefficients of the models in order of lowest CVE. The weight diagram represents not only a model with a specific CVE, such as the lowest one, but also multiple models with low CVE in the ranking. The visual effect of the weight diagram depends on the threshold of the CVE ranking defined by researcher. A more quantitative scheme including reliability is needed to extract the descriptors more appropriately.
Here reliability assessment and subsequent extraction of descriptors based on Bayesian model averaging (BMA) were carried out in the data.54 Bayesian inference was applied to a linear regression model in our previous work.55 In Bayesian inference,56 the likelihood of each linear regression model using various descriptors is expressed by a probability value, assuming that noise is added to each of the experimental data. This evaluation method based on probability value approaches a true model, avoiding overtraining in training data compared with that based on the CVE value.57 The model with the highest probability value can be selected to explain the experimental data. BMA is introduced in the selection process because the influence of the training data is significant. All possible models for each descriptor are integrated with weighting by the probability values explaining the experimental data. Then, the probability that each xn is a descriptor is calculated (Fig. 3d). This ES-BMA method provides more quantitative information in the extraction processes of the descriptors, whereas the descriptors are visually extracted from the weight diagram of ES-LiR (Fig. 2 and 3a–c). ES-BMA analysis indicates that the selected descriptors xn (n = 2, 20, 25, 28, 33, 36) have a probability higher than 0.8. Therefore, ES-BMA supports the validity of the descriptors in model G3.
In this manner, the appropriate descriptors were extracted and selected in model G3 by ES-LiR in combination with our chemical insight. The validity of the model and its descriptors is supported by ES-BMA. These results imply that a straightforward and interpretable linear predictor can be constructed in small data using ES-LiR and ES-BMA in combination with our chemical insight.
In training dataset G1′, six xn (n = 1, 21, 23, 25, 28, 36) were visualized and extractable based on the weight diagram of ES-LiR and a chart displaying the averaged absolute values of the coefficients in the 100 models with the smallest CVE (the left-hand axes in Fig. 4a and c). The probability from ES-BMA indicates the potential descriptors xn (n = 1, 23, 25, 28) (the left-hand axis in Fig. 4e). However, xn (n = 2, 20, 23) were not extractable by ES-LiR and/or ES-BMA in dataset G1′. In training dataset G2′, xn (n = 1, 2, 4, 16, 21, 28, 36) were extractable based on the weight diagram of ES-LiR and the averaged coefficients (the left-hand axis in Fig. 4b and d). The ES-BMA analysis indicates potential descriptors xn (n = 1, 2, 20, 28, 29) with a probability higher than 0.8 (Fig. 4f). However, xn (n = 25, 23) were not extractable by ES-LiR and/or ES-BMA in dataset G2′. These analyses imply that the data sizes in datasets G1 and G2 were insufficient to extract the descriptors.
The effect of the data was studied by another method (Table 4). Dataset G3 containing 36 y was reduced in six random patterns (Fig. S5–S9 in the ESI†). The weight diagrams were prepared by ES-LiR using the reduced datasets containing 35, 34, 33, 30, and 27 y to study the extractability of the descriptors. The number of extractable xn (Nx) in the six xn (n = 2, 20, 25, 28, 33, 36) of model G3 was counted in each weight diagram (Figs. S5–S9 in the ESI†). The average Nx (Nx,ave) of the six weight diagrams was calculated in the reduced datasets (Table 4). In addition, the numbers in the weight diagram (Nwd) satisfied with Nx = 6 and Nx ≥ 5 are summarized in Table 4. The extractability of xn distinctly decreases for y lower than 33 (Table 4). When the data size is y = 30 or 27, the extractable xn from the weight diagram depend on the datasets. The results support model G3 being constructed on a sufficient amount of training data y = 36.
| y in the reduced dataset | 35 | 34 | 33 | 30 | 27 | 
|---|---|---|---|---|---|
| N ave | 1 | 2 | 0 | 0 | 0 | 
| N wd|Nx = 6 | 3 | 4 | 3 | 2 | 0 | 
| N wd|Nx ≥ 5 | 4.17 | 4.83 | 4.17 | 3.50 | 3.00 | 
|  | ||
| Fig. 5 RMSE of the prediction models constructed with SpM-S, LASSO, and ML-R in the training (gray) and test (red) datasets. | ||
| Footnote | 
| † Electronic supplementary information (ESI) available: Methods, molecular structures, charge–discharge measurements, datasets, reference weight diagrams. See DOI: https://doi.org/10.1039/d3ya00161j | 
| This journal is © The Royal Society of Chemistry 2023 |