Open Access Article
Yuka Kitamura,^a Yuki Namiuchi,^b Hiroaki Imai,^a Yasuhiko Igarashi*^b and Yuya Oaki*^a
^a Department of Applied Chemistry, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223-8522, Japan. E-mail: oakiyuya@applc.keio.ac.jp
^b Institute of Engineering, Information and Systems, University of Tsukuba, 1-1-1 Tennodai, Tsukuba 305-8573, Japan. E-mail: igayasu1219@cs.tsukuba.ac.jp
First published on 4th June 2025
Exfoliated nanosheets have attracted considerable interest as two-dimensional (2D) building blocks. In general, the yield, size, and size distribution of exfoliated nanosheets cannot be easily controlled or predicted because of the complexity of the processes. Our group has studied prediction models for the yield, size, and size distribution based on the small experimental data available. Sparse modeling for small data (SpM-S), combining machine learning (ML) and chemical insight, was used for the construction of the predictors. In SpM-S, the weight diagram visualizing the significance of the explanatory variables plays an important role in the variable selection used to construct the models. However, the processes of variable selection had not been validated in a data-scientific manner. In the present work, the significance of the data size, visualization method, and chemical insight for variable selection was studied to validate the processes of model construction. The data size had a lower limit for extracting appropriate descriptors. The weight diagram had an appropriate visualization range for variable selection. Chemical insight as domain knowledge supplemented the limitation caused by the data size. These studies indicate that SpM-S can be applied to construct predictors, straightforward linear regression models, for the controlled syntheses of other 2D materials, even based on small data.
Data-driven approaches have been used in a broad range of areas in chemistry and materials science.19–27 For instance, the combination of big data, machine learning (ML), and robotic systems has been studied to develop fully automated AI chemists.28–32 These AI-oriented methods rely on the availability of a sufficient amount of data for ML. However, a sufficient amount of data is not always available for every experimental system. For example, big data is not efficiently collected from conventional experimental work, such as batch processes. Specific methods are thus required to apply ML to small data.
ML for small data has been increasingly studied in recent years.33–45 Specific approaches, such as transfer learning, have been developed for the use of small data.33–45 However, models based on complex algorithms tend to have lower interpretability and generalizability. Recent reports have indicated the significance of domain knowledge and the use of simple regression models.43–45 Our group has focused on sparse modeling (SpM), a method for describing whole high-dimensional data by a small number of significant descriptors.46–48 SpM has already been applied in a variety of fields, such as image compression and materials science.15–18,49–54 We have studied SpM for small data (SpM-S), combining ML and domain knowledge.10,45 The method was applied to the controlled synthesis of nanosheets and the exploration of electrode active materials based on small data.15–18,51–54 In SpM-S, the descriptors are extracted from a small training dataset using an ML algorithm, exhaustive search with linear regression (ES-LiR), as described later (Fig. 1c–e). The descriptors are then further selected based on our domain knowledge as chemists (Fig. 1f). A straightforward linear regression model is then constructed using the selected descriptors. In our previous work,45 the prediction results of SpM-S, combining linear regression and our chemical insight, were compared with those obtained from other linear and nonlinear algorithms, such as least absolute shrinkage and selection operator (LASSO) and neural network regression (NN-R), in terms of accuracy, interpretability, and generalizability, especially for small data. Although nonlinear algorithms, such as NN-R, generally exhibit high expressive power, they tend to overfit small chemical experimental datasets, resulting in insufficient generalizability.45 However, open problems remain in variable selection, one of the significant steps in modeling (problems (i)–(iii) in Fig. 1d–f): the required data size (problem (i)), the visualization method of the weight diagram (problem (ii)), and the significance of domain knowledge (problem (iii)). The present study aimed to solve these problems to improve the understanding of SpM-S. The results indicate that similar predictors can be constructed by SpM-S based on small experimental data for various other 2D materials.
The explanatory variables (xn: n = 1–41), such as the physicochemical parameters of the guests and media, were selected based on our chemical insights (Table 1). Among the total of 41 xn, the selected xn were used as the potential descriptors for y1–y3. The datasets contained the following numbers of y and xn (Table 1): 30 y1 and 11 xn (n = 2, 4, 5, 8, 10, 14, 16–18, 36, 40) (dataset I), 48 y2 and 18 xn (n = 1, 3–5, 14–21, 30–32, 34, 36, 40) (dataset II), and 54 y3 and 15 xn (n = 4, 8, 10, 13, 14, 16–18, 21, 30–32, 36, 40, 41) (dataset III) (Fig. 1c and d). In our previous works, the descriptors were extracted from these xn by SpM using ES-LiR (Fig. 1d and e). The descriptors were then further selected with the assistance of our chemical insights (Fig. 1e and f). The linear regression models eqn (1)–(3) were constructed using the selected two to eight xn.16–18
y1 = 35.00x3 − 32.33x5 + 34.07 (1)

y2 = −0.159x3 − 0.096x4 + 0.257x7 − 0.017x8 − 0.018x10 + 0.028x13 − 0.050x14 + 0.061x18 + 0.267 (2)

y3 = −0.0599x7 + 0.0802x9 + 0.0699x20 − 0.0681x28 − 0.0623x37 + 0.266 (3)
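As a minimal illustration of how such a linear predictor is applied, the short sketch below evaluates a model of the form of eqn (1) for one hypothetical dispersion medium; the descriptor values and any preprocessing (e.g., standardization) are assumptions for illustration only, not values from datasets I–III.

```python
# Minimal sketch: applying a linear predictor of the form of eqn (1).
# The descriptor values are hypothetical and any preprocessing (e.g.,
# standardization of x) is assumed; this only illustrates the model form.

def predict_y1(x3: float, x5: float) -> float:
    """Linear yield predictor with the coefficients of eqn (1)."""
    return 35.00 * x3 - 32.33 * x5 + 34.07

# Hypothetical (preprocessed) descriptor values for one dispersion medium.
print(predict_y1(x3=0.40, x5=-0.25))  # 35.00*0.40 + 32.33*0.25 + 34.07 = 56.1525
```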
| n | Parameters | xn for |
|---|---|---|
| Dispersion media | | |
| 1 | Molecular weight | y1, y2, y3 |
| 2 | Molecular length^b | y1 |
| 3 | Melting point^a | y1, y2, y3 |
| 4 | Boiling point^a | y1, y2, y3 |
| 5 | Density^a | y1, y2, y3 |
| 6 | Relative permittivity^a | y1, y2, y3 |
| 7 | Vapor pressure^a | y1, y2, y3 |
| 8 | Viscosity^a | y1, y2, y3 |
| 9 | Refractive index^a | y1, y2, y3 |
| 10 | Surface tension^a | y1, y2, y3 |
| 11 | Heat capacity^b | y1, y2, y3 |
| 12 | Entropy^b | y1, y2, y3 |
| 13 | Enthalpy^b | y1, y2, y3 |
| 14 | Dipole moment^b | y1, y2, y3 |
| 15 | Polarizability^b | y1, y2, y3 |
| 16 | HSP-dispersion^b | y1, y2, y3 |
| 17 | HSP-polarity^b | y1, y2, y3 |
| 18 | HSP-hydrogen bonding^b | y1, y2, y3 |
| Guest molecules | | |
| 19 | Molecular weight | y1, y2, y3 |
| 20 | Polarizability^b | y1, y2, y3 |
| 21 | Dipole moment^b | y1, y2, y3 |
| 22 | Heat capacity^b | y1, y2, y3 |
| 23 | Entropy^b | y1, y2, y3 |
| 24 | Enthalpy^b | y1, y2, y3 |
| 25 | Molecular length^b | y1 |
| 26 | Layer distance^c | y1, y2, y3 |
| 27 | Layer distance expansion^c | y3 |
| 28 | Composition (x)^c | y1, y2 |
| 29 | Interlayer density^c | y1, y2 |
| 30 | HSP-dispersion terms^b | y1, y2, y3 |
| 31 | HSP-polarity terms^b | y1, y2, y3 |
| 32 | HSP-hydrogen bonding terms^b | y1, y2, y3 |
| Guest-medium combinations | | |
| 33 | Δ polarizability (= x15 − x20)^b | y3 |
| 34 | Δ polarizability (= \|x33\|)^b | y1, y2, y3 |
| 35 | Δ dipole moment (= x14 − x21)^b | y3 |
| 36 | Δ dipole moment (= \|x35\|)^b | y1, y2, y3 |
| 37 | Product of dipole moment (= x14 × x21)^b | y3 |
| 38 | Δ heat capacity (= x11 − x22)^b | y3 |
| 39 | Δ heat capacity (= \|x38\|)^b | y1, y2, y3 |
| 40 | HSP distance^b | y1, y2, y3 |
| Host | | |
| 41 | Bulk size^c | y3 |

^a Literature data. ^b Calculation data. ^c Experimental data.
The extractability of the descriptors was then studied using the reduced datasets. Weight diagrams were prepared by ES-LiR: linear regression models were prepared for all possible combinations of xn (n = 1, 2, …, j), i.e., a total of 2^j − 1 combinations, on each dataset with five-fold cross-validation (Fig. 2b). As Hastie et al. pointed out, "ten-fold cross-validation achieves an acceptable trade-off between bias and variance,"55 and this has since become standard practice.56 However, both 5- and 10-fold cross-validation are generally recognized as appropriate choices because of their superior stability compared with leave-one-out cross-validation (LOOCV). Despite its theoretical appeal, LOOCV exhibits high variance in performance estimates and is thus less reliable for model selection.57 In our study, five-fold cross-validation was used for its computational efficiency and as standard practice. After the models were sorted in ascending order of cross-validation error (CVE), i.e., the CVE ranking, the coefficient values of each regression model were visualized in the weight diagram (Fig. 2c). Conventional ML algorithms require tuning of hyperparameters to optimize the models.58,59 In contrast, as ES-LiR simply prepares all possible linear regression models, the method has no hyperparameters to be tuned. The contribution of each xn was color-coded by the magnitude and sign of its coefficients. More deeply colored xn, in warmer and cooler colors, are potentially more significant descriptors with positive and negative correlations, respectively. More densely colored xn correspond to more frequently used descriptors, implying a significant contribution to y. Weight diagrams were prepared for all the reduced datasets (Fig. S1–S3 in the ESI†). Based on the weight diagrams, we extracted xn as descriptors with reference to the deepness and density of the color (Fig. 2c and d). Here, the xn in the already constructed models eqn (1)–(3) were assumed to be the true ones (Fig. 2e). If an xn visually extracted from the weight diagram was found in the already constructed models, it was regarded as correct; extracted xn not found in the constructed models were regarded as incorrect (Fig. 2d and e). After the weight diagrams were prepared for the six different reduced datasets at each data size (N) (Fig. S1 in the ESI†), the numbers of correctly and incorrectly extracted xn (nc and ni, respectively) were counted and are summarized in Fig. 2f–h. The mean and standard deviation of nc and ni were calculated over the six different datasets.
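As a point of reference, the following sketch outlines how an ES-LiR weight diagram of this kind can be assembled. The function names, the data layout (X as a samples × descriptors array), and the use of scikit-learn are our assumptions for illustration, not the authors' original implementation.

```python
# Sketch of ES-LiR: fit ordinary least squares on every non-empty subset of
# the j candidate descriptors, score each subset by 5-fold cross-validation
# error (CVE), and collect the coefficients into a (model x descriptor)
# matrix that can be rendered as a weight diagram. Assumes X (n_samples x j)
# and y are small enough for the 2^j - 1 enumeration to be tractable.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def es_lir(X: np.ndarray, y: np.ndarray):
    n_samples, j = X.shape
    records = []  # (CVE, full-length coefficient vector) per model
    for k in range(1, j + 1):
        for subset in combinations(range(j), k):
            Xs = X[:, list(subset)]
            model = LinearRegression()
            # 5-fold CV error (mean squared error) for this descriptor subset.
            cve = -cross_val_score(model, Xs, y, cv=5,
                                   scoring="neg_mean_squared_error").mean()
            model.fit(Xs, y)
            coef = np.zeros(j)
            coef[list(subset)] = model.coef_
            records.append((cve, coef))
    records.sort(key=lambda r: r[0])  # ascending CVE = the CVE ranking
    cve_sorted = np.array([r[0] for r in records])
    weights = np.vstack([r[1] for r in records])  # weight-diagram matrix
    return cve_sorted, weights

# weights[:top] can then be shown with, e.g., matplotlib's imshow using a
# diverging colormap so positive/negative coefficients appear warm/cool.
```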
When N was reduced, nc decreased and ni increased (Fig. 2f–h). Here, the threshold data size for extracting the correct descriptors (Nmin) is defined as follows: the average ni is less than two and the average nc is more than 80% of the true nc before the data reduction. The threshold data size to extract the correct xn, Nmin, was 20 for y1 (original data size N0 = 30), 45 for y2 (N0 = 48), and 45 for y3 (N0 = 54) (red-colored areas in Fig. 2f–h). These results indicate that the correct xn can be extracted from weight diagrams based on datasets with N > Nmin. Nmin can be regarded as the minimum data size required to extract the correct descriptors from the weight diagram. As Nmin < N0 was satisfied for y1, y2, and y3, the original datasets already had a sufficient data size for the model construction. Therefore, generalizable xn could be extracted from the weight diagrams even based on small datasets with N > Nmin. These results support that the prediction models for the yield, size, and lateral size, eqn (1)–(3), were constructed with a sufficient size of data. Moreover, this data-reduction method can be applied to validate the sufficiency of the data size for variable selection in small data.
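The bookkeeping behind Fig. 2f–h can be summarized in a few lines; the descriptor sets below are placeholders for illustration (only the correct xn for y1, x18 and x40, are taken from the text).

```python
# Sketch of the n_c / n_i bookkeeping in the data-reduction test: compare the
# descriptors read off a weight diagram of a reduced dataset with the "true"
# descriptors of the already constructed model.
true_descriptors = {18, 40}              # correct x_n reported for y1
extracted = {14, 18, 40}                 # hypothetical read-off for one reduced dataset

n_c = len(extracted & true_descriptors)  # correctly extracted descriptors
n_i = len(extracted - true_descriptors)  # incorrectly extracted descriptors

# Averaged over the six reduced datasets at each N, the threshold N_min is the
# smallest N with mean n_i < 2 and mean n_c > 0.8 * (n_c before reduction).
```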
The relationship between the CVE rank and the CVE value was determined to visualize the increasing trend of the CVE value (Fig. 3b). Then, weight diagrams were prepared within different CVE ranks as the thresholds (Fig. 3c). The CVE values gradually increased as the rank decreased and then jumped near the bottom (Fig. 3d, h and l). As the datasets contained 11, 18, and 15 xn for y1, y2, and y3, respectively, the total number of exhaustively constructed models (2^j − 1 combinations) was 2.0 × 10^3 for y1, 2.6 × 10^5 for y2, and 3.3 × 10^4 for y3. Whereas the correct xn (n = 18, 40) were clearly visible in the weight diagram within the CVE rank 1.0 × 10^2 for y1 (Fig. 3e), the weight diagrams became unclear within the ranks 1.0 × 10^3 and 2.0 × 10^3 (Fig. 3f and g). The correct xn (n = 18, 40) could not be extracted from these unclear weight diagrams. Clear weight diagrams were observed within the ranks 1.0 × 10^4 for y2 and 1.0 × 10^3 for y3 (Fig. 3i, j and m). The visibility was lowered in the weight diagrams within the CVE ranks lower than 1.0 × 10^5 for y2 and 1.0 × 10^4 for y3 (Fig. 3k, n and o).
Based on these results, the visibility of the weight diagrams and the extractability of the descriptors changed with the range of the CVE rank. The weight diagrams within approximately the top 10% of the CVE ranks, namely 10^2 for y1, 10^4 for y2, and 10^3 for y3, allowed clear extraction of the descriptors. The CVE ranks satisfying the one standard error rule were calculated to be 2.3 × 10^2 for y1, 4.6 × 10^4 for y2, and 2.0 × 10^3 for y3. In the present work, the top 10% of the CVE ranking coincided with the one standard error rule. Here, the one standard error rule was used to estimate the range of CVE ranks for visualization: all models having a CVE within one standard error of the minimum CVE were considered, resulting in the selection of approximately the top 10% of the CVE ranking. However, this coincidence is not necessarily general. In general, the one standard error rule is used to optimize the regularization parameter (λ) in ML.60 When a larger penalty term is preferred, λ is chosen by the one standard error rule instead of at the minimum CVE value. These facts imply that a similar scheme, based on the one standard error rule, can be applied to estimate the threshold rank before the CVE increases distinctly. In contrast, the visibility became unclear when the range was expanded beyond the top 50% of the CVE ranking. Whereas the correct descriptors could be extracted from the clear weight diagrams, the unclear weight diagrams caused extraction of wrong descriptors and oversight of the correct ones. As clear weight diagrams within the top 10% of the ranks were used in our previous works,16–18 appropriate descriptors were extracted for the construction of the models in eqn (1)–(3).
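A minimal sketch of this rank-threshold estimation is given below, assuming the per-fold CV errors of every model are retained from the exhaustive search; the exact spread used here (the standard error of the best model's fold errors) is our assumption about how the rule is applied.

```python
# Sketch of the one-standard-error rule applied to the CVE ranking: keep all
# models whose CVE lies within one standard error of the minimum CVE, and use
# that count as the rank threshold for drawing the weight diagram.
import numpy as np


def one_se_rank_threshold(fold_errors: np.ndarray) -> int:
    """fold_errors: (n_models, n_folds) per-fold CV errors from ES-LiR."""
    cve = fold_errors.mean(axis=1)                      # mean CVE per model
    best = cve.argmin()
    se = fold_errors[best].std(ddof=1) / np.sqrt(fold_errors.shape[1])
    threshold = cve[best] + se                          # one SE above the minimum
    return int(np.sum(cve <= threshold))                # number of retained models

# For the datasets in this work, this kind of cut-off reportedly retains roughly
# the top 10% of the CVE ranking.
```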
In ES-BMA, the probability (p) of each descriptor being a significant descriptor is 0.5 for all xn in the initial state (Fig. 4a). No descriptors can be extracted at this stage because p = 0.5. The probability p of each xn was then calculated using the prediction accuracy and coefficients of the 2^n − 1 models prepared by ES-LiR (Fig. 4b and c). In ES-BMA, the uncertainty over all 2^n combinations of variables is considered, and the confidence level of variable selection is evaluated quantitatively using a weighted average of the model posterior probabilities, which is called Bayesian model averaging (BMA) (Fig. 4b).61 This method enables evaluation of the confidence level of the variable selection and quantification of the importance of the features, whereas for ES-LiR these processes depend on the visibility of the weight diagram. Furthermore, this approach quantitatively assesses the plausibility of the descriptors under the assumption of uniform prior knowledge, without relying on the expertise of chemists. The summation over all combinations of indicator vectors can be calculated using the result of the exhaustive search, which is called ES-BMA.61
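A hedged sketch of this averaging step is given below; the model posterior is approximated here with BIC-based weights under a uniform prior over models, which may differ in detail from the ES-BMA formulation of ref. 61, and the function names are ours.

```python
# Sketch of Bayesian model averaging over an exhaustive search (ES-BMA-style):
# approximate each model's posterior probability with a BIC-based weight and
# sum the weights of all models containing a given descriptor to obtain its
# inclusion probability p. Illustrative approximation only.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def inclusion_probabilities(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    n, j = X.shape
    subsets, bics = [], []
    for k in range(1, j + 1):
        for subset in combinations(range(j), k):
            Xs = X[:, list(subset)]
            model = LinearRegression().fit(Xs, y)
            rss = np.sum((y - model.predict(Xs)) ** 2)
            # Gaussian-likelihood BIC with k slopes plus an intercept.
            bic = n * np.log(rss / n) + (k + 1) * np.log(n)
            subsets.append(subset)
            bics.append(bic)
    bics = np.array(bics)
    weights = np.exp(-0.5 * (bics - bics.min()))  # unnormalized posterior weights
    weights /= weights.sum()
    p = np.zeros(j)
    for subset, w in zip(subsets, weights):
        p[list(subset)] += w                      # inclusion probability per x_n
    return p
```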
Fig. 4 Variable selection using ES-BMA. (a) Probability (p) before ES-BMA. (b) BMA based on small data. (c) p after ES-BMA. (d–f) p of each xn for y1 (d), y2 (e), and y3 (f) after ES-BMA.
Fig. 4d–f show the p values of each xn. The descriptors for y1 and y2 could not be extracted from the probabilities because p was almost 0.5 for every xn (Fig. 4d and e). On the other hand, x10, x13, x21, x32, x36, x40, and x41 for y3 showed p > 0.5 (Fig. 4f). Four descriptors, x10, x21, x32, and x41, of the five correct xn in eqn (3) were extractable from the p values based on ES-BMA. These results imply that the data size was insufficient to extract the true descriptors using ES-BMA alone, without domain knowledge. The combination of ES-LiR and domain knowledge facilitated the extraction of the descriptors. In SpM-S, domain knowledge thus contributed to extracting the descriptors and constructing the models.
In SpM-S, professional experience and chemical insights, as domain knowledge, are mainly used in the variable selection based on the weight diagram. Although the weight diagram indicated a strong contribution of certain variables, some of these variables were not used as descriptors for modeling based on our chemical insight. For example, this scheme provided a more accurate prediction model for the specific capacity of organic anode active materials of lithium-ion batteries.53 On the other hand, some chemically significant descriptors could not be extracted from the weight diagram alone. In such cases, the descriptors were manually added to the model. For example, the yield prediction model was constructed by this scheme.16 However, it is not easy to quantify the physical significance, not just the correlation, of the variables based on chemical insights. In ES-BMA, such physical meaning is represented by the probability value (p). However, p could not be reliably estimated from the small data, as shown in Fig. 4d and e. A new quantitative method is thus required to extract and select the more significant descriptors.
The accuracy, generalizability, and interpretability of the sparse linear models based on small data were compared with those of other nonlinear algorithms in our previous work.45 The results imply that nonlinear models are prone to overfitting the training data and losing generalizability, particularly in the case of small data. In such cases, linear models are preferable for describing the overall trend of the data. We recognize the importance of frameworks such as the sure independence screening and sparsifying operator (SISSO),62 which integrates sure independence screening (SIS) with LASSO-based variable selection to efficiently manage ultra-high-dimensional descriptor spaces. However, in the current study, the dimensionality of the descriptor space was limited to several tens of dimensions, enabling an exhaustive search approach rather than requiring dimensionality reduction by SIS. Furthermore, as demonstrated in recent work,63 a Monte Carlo-based approximate exhaustive search method could be employed for moderately high-dimensional scenarios. Our ongoing research efforts are directed toward integrating SIS and exhaustive search strategies to enhance the efficiency and effectiveness of descriptor selection, particularly in high-dimensional and correlated descriptor spaces.
Footnote
† Electronic supplementary information (ESI) available: Datasets, weight diagrams. See DOI: https://doi.org/10.1039/d5na00215j
This journal is © The Royal Society of Chemistry 2025