Xiaolan
Fu‡
a,
Jiaqian
Wang‡
b,
Xiaojuan
Hu
*b,
Wenwu
Xu
*a,
Sergey V.
Levchenko
c and
Zhong-Kang
Han
*b
aDepartment of Physics, School of Physical Science and Technology, Ningbo University, Ningbo, 315211, China. E-mail: xuwenwu@nbu.edu.cn
bSchool of Materials Science and Engineering, Zhejiang University, Hangzhou, 310027, China. E-mail: xiaojuanhu@zju.edu.cn; hanzk@zju.edu.cn
cSkolkovo Institute of Science and Technology, Bolshoy Boulevard 30/1, Moscow, 121205, Russia
First published on 4th March 2025
We propose an interpretable AI approach integrating hybrid DFT, symbolic regression, and data mining to predict chalcopyrite (ABX2) bandgaps. Key factors, including atomic size, molar volume, and electron affinity, are identified, offering insights into bandgap-composition relationship and guiding high-performance materials design.
Bandgaps of compounds are typically obtained through experimental measurements,7,8 which often require significant time and material resources, especially for large-scale compounds. As a more efficient alternative, density functional theory (DFT) calculations have become effective means for determining the bandgaps of semiconductor materials.9–12 However, DFT calculations with standard approximations to the exchange–correlation functional (local density, generalized gradient, and meta-generalized-gradient approximations) often significantly underestimate bandgap values.13 On the other hand, the much more accurate many-body perturbation theory method GW is too computationally expensive for intermediate-throughput screening. To achieve good accuracy in bandgap calculations while keeping computational cost not too high, hybrid functionals, such as Heyd–Scuseria–Ernzerhof (HSE06) functional, have been proposed.14,15 Despite having moderate computational cost, these methods are still too expensive for high-throughput exploration of materials space containing thousands of candidates.16,17 The thriving development of AI has driven the integration of machine learning with theoretical calculations to accelerate the discovery of materials with desirable properties by extracting informative insights from large datasets.18,19 For instance, Gladkikh et al. trained an alternating conditional expectation (ACE) model to predict the bandgaps of ABX3 perovskites, and used a support vector machine (SVM) model to determine the gap type.20 Tawfi et al. employed feedforward neural network (FNN), SVM, relevance vector machines (RVM), and random forest (RF) models to predict the electronic properties of mixed 2D materials.21 However, these models lack interpretability and cannot explain the physical relationship between the predicted bandgap and the chemical composition.22–24 The accuracy and interpretability of machine learning models are both crucial for the rational design of materials.25,26 In this letter, we introduce an interpretable AI approach that combines data mining and compressed-sensing-based symbolic regression, incorporating feature importance assessment in relation to the target properties of materials. This approach not only facilitates the high-throughput screening of materials with targeted bandgap values, but also elucidates the underlying physical mechanisms driving bandgap variations. Using this strategy, we have pinpointed the key factors influencing the bandgap values of ternary chalcopyrite crystals.
To investigate the composition-bandgap relationship for ABX2 crystals, we calculated the bandgap values of 122 crystals (Table S1, ESI†) using the hybrid functional HSE06 including spin–orbit interaction. The data for ab initio calculations were selected via an active-learning cycle (see ESI† for computational details). This cycle continued until bandgap values predicted by the model aligned with those calculated by HSE06. The ABX2 ternary chalcopyrite crystals (Fig. S1, ESI†) are defined by their composition, where the A site is occupied by transition metals or alkaline-earth metals Be, Mg, Ca, Cu, Zn, Pd, Ag, Cd, and Au, the B site is filled by post-transition metals or metalloids including Al, Si, Ga, Ge, In, Sn, Sb, Tl, Pb, and Bi, and the X site consists of pnictogens or chalcogens N, P, S, As, Se, Sb, and Te. We allocated 80% of these 122 datasets for model training, reserving the remaining 20% exclusively for testing the predictive accuracy of the models. A complete list of materials features used for training AI models is provided in Table S2 (ESI†).
Among the features listed above, number of valence electrons (NVE) is determined by the number of electrons in an element's outermost shell. Thermal conductivity (TC) values are measured at 25 °C, and electronegativity (EN) is reported on the Pauling scale. Additionally, molar volume (MV) is measured under standard atmospheric pressure at 298 K.
We initially conducted principal component analysis (PCA) to evaluate the potential linear relationship between primary features and HSE06-calculated bandgap values (GHSE). Each principal component (PC) represents a linear combination of 36 primary features. Within the training dataset, the leading two PCs accounted for only 25.12% and 21.00% of the variance, respectively, as shown in Fig. 1(a). To capture the essence of the original dataset adequately, where cumulative explained variance exceeds 90%, including more than seven PCs becomes necessary. However, this contradicts the goal of efficient dimensionality reduction. Fig. 1(b) further highlights this challenge, showing a weak correlation between PC1, PC2, and GHSE. These results indicate that neither individual features nor their simple linear combinations correlate well with GHSE, emphasizing the need for models capable of unveiling nonlinear relationships. The complexity of the bandgap-composition relationship is reflected in the diversity of band structure types within the considered class, including deep- and shallow-impurity-like semiconducting structures, wide-gap semiconductors, and metals (Fig. S2 and S3, ESI†). Consequently, we employed the sure independence screening plus sparsifying operator (SISSO) approach to find a nonlinear relationship between primary features and GHSE.27–29 In SISSO, target property is expressed as a linear combination of derived features, which are obtained as (possibly) non-linear mathematical expressions involving primary features. In this study, we focus on prediction of the bandgap value. Considering the diversity of the band structure type, it would be interesting to develop models for predicting other features of the band structure describing its topology. However, this is beyond the scope of this paper.
A training dataset of 60 systems was initially selected from over 600 candidates for HSE06 calculations to determine their GHSE. The developed SISSO model was then employed to predict GHSE for the remaining candidates, with those exhibiting the smallest or largest predicted GHSE systematically added to the training dataset. This iterative refinement process continued until the distribution ranges of calculated and predicted GHSE closely matched, resulting in an optimized dataset of 122 systems. Fig. 2(a) and (b) show heatmap of the Pearson's correlation coefficient matrix among the 36 primary atomic features and the bandgap distribution within the training dataset. In SISSO, to prevent overfitting as model complexity increases, we employ twenty-fold cross-validation to determine the optimal model dimensionality (i.e., the number of derived features in the linear combination) (Table S3, ESI†). The average RMSE of the twenty-fold cross-validation as a function of model dimensionality is shown in Fig. 2(c). This reveals that the five-dimensional model possesses a balance between high accuracy and simplicity, thus identifying it as the optimal SISSO model. The distribution of deviations between the predicted and calculated GHSE values for this model is shown in Fig. 2(d), with most errors concentrated below 0.2 eV. The components of the optimal SISSO model, including its coefficients and importance scores, are collected in Table S4 (ESI†). These complex formulas of the model highlight the intricate relationships between the primary features and bandgap.
The most significant descriptor component, d1, has the highest importance score, calculated as the relative increase of fitting RMSE after this component is removed (see ESI,† for details). It indicates that B elements with larger molar volumes (MVb), and larger atomic radii of atoms at the X site (ARCc) typically correspond to smaller bandgaps in ABX2 ternary chalcopyrite crystals. A larger MVb results in the expansion of the crystal lattice, which reduces overlap of atomic orbitals, thereby decreasing the bandgap. Larger atomic radii at the X site have a similar effect. Conversely, thermal conductivity of A-site elemental crystal (TCa) results in the increase of band gap in ABX2. Since A elements are metals at normal conditions, their thermal conductivity is determined by the presence of nearly free electrons. With right ligands these electrons can make strong covalent or ionic bonds, which increases the bandgap. These insights demonstrate that the SISSO model can effectively unravel the underlying physical mechanisms governing the properties of chalcopyrite crystals.
SISSO aids in identifying critical descriptor components by selecting important primary features and feature combinations relevant to GHSE values. However, it does not directly elucidate which combinations of features are likely to result in too low or too high bandgap. To deepen our understanding of the underlying mechanisms and the significance of primary features, we employ a modified version of SGD approach.30 Here, only the primary features from the SISSO model and SISSO-predicted GHSE values are used for SGD analysis.
The best subgroups found are shown in Table S5 (ESI†), along with the degenerate propositions. The degeneracy is determined by the condition that the number of common data points in the original subgroup and the subgroup obtained by replacing a proposition with another one is greater than 99% of the larger of the two subgroups. The selector of the best subgroup that minimizes GHSE with cutoff 1 eV (i.e., under condition GHSE < 1 eV) is defined as follows: (ARCc ≥ 98 pm), (EAa ≤ 125.6 kJ mol−1), (EAa ≥ 53.7 kJ mol−1), and (MVb ≥ 11.93 cm3 mol−1). The subgroup contains ∼24% of the whole dataset samples (Fig. 3a). The selector is well consistent with the SISSO model, indicating that larger ARCc and MVb lead to smaller GHSE. This is explained by reduction of atomic orbital overlap when the lattice is expanded due to larger volume of the unit cell. On top of this, SGD reveals that the electron affinity of atom A is an important feature that controls whether bandgaps are lower than 1 eV. In particular, the electron affinity should not be too low (EAa ≥ 53.7 kJ mol−1) or too high (EAa ≤ 125.6 kJ mol−1). The latter condition is degenerate with AWa ≤ 112.41 a.m.u. and ENa ≤ 2.20. Lowering the cutoff to 0.5 eV results in the best subgroup (AWb ≥ 72.64 a.m.u.) AND (TCa < 120 W m−1 K−1) AND (IEc ≤ 999.6 kJ mol−1). The first proposition is degenerate with MVb ≥ 13.65 cm3 mol−1, which is similar to the proposition MVb ≥ 11.93 cm3 mol−1 in case of a 1 eV cutoff, indicating that a larger size of atom B consistently leads to a smaller bandgap in ABX2. The proposition TCa < 120 W m−1 K−1 is also consistent with a straightforward interpretation of the SISSO model. Lower thermal conductivity of elemental A metals implies more tightly bound and localized electrons that do not form strong bonds. The proposition IEc ≤ 999.6 kJ mol−1 rules out X = N and P, which could make strong covalent bonds with A, resulting in increased bandgap.
The selector for the best subgroup that maximizes bandgaps above 2 eV is defined as follows: (AWa > 112.41 a.m.u.), (HFb ≥ 6.30 kJ mol−1), (MVb ≤ 18.27 cm3 mol−1), and (IEc ≥ 941 kJ mol−1). Increasing the cutoff to 2.5 eV yields best subgroup (ARCc < 103 pm), (EAa > 125.6 kJ mol−1), (HFb ≥ 7 kJ mol−1), and (MVb ≤ 18.18 cm3 mol−1). Consistently, these subgroups contain conditions opposite to those for the subgroups with smaller gaps. A smaller atomic radius of X and a smaller molar volume of B result in shorter interatomic distances, improving orbital overlap and therefore increasing bandgap. The degenerate propositions AWa > 112.41 a.m.u. and EAa > 125.6 kJ mol−1 select only one element (Au), whose high electron affinity makes possible formation of strong covalent bonds with neighboring atoms. Similarly, degenerate propositions HFb ≥ 7 kJ mol−1 and EAb ≥ 42.5 kJ mol−1 select atoms that can form strong localized bonds, resulting in larger gap (Table S5, ESI†).
Relying solely on elemental and atomic features, the SISSO model enables us to predict the bandgap for a wide array of chalcopyrite crystals. We predicted the GHSE values for 506 new chalcopyrite crystals, and the results are shown in Fig. 4. Based on the high-throughput screening results, we have identified chalcopyrite materials with bandgaps in a range suitable for tandem solar cell and other applications.31,32 In Table S6 (ESI†) we list 43 materials with direct HSE06 bandgaps in the range 0.6–2 eV.
In summary, we have developed a unified model that quantitatively describes bandgap variations in chalcopyrite crystals by integrating HSE06 calculations with AI. This model not only accurately describes composition-bandgap relationship, but also elucidates the underlying physical mechanisms responsible for these variations. Using data mining approach SGD, we identified combinations of features that result in a reduction or an increase of bandgap. SGD finds that size of atom X, molar volume of elemental B crystal, and electron affinity of atoms A and B are important for maximizing or minimizing the gap. The geometric features describe orbital overlap, which directly affects the bandgap. The electron affinity describes the ability of atoms form stronger covalent bonds with neighboring atoms, which leads to higher bandgaps. These findings underscore the transformative potential of interpretable AI algorithms in paving new paths for the rational design of high-performance materials.
This work was financially supported by the National Key R&D Program of China (2023YFA1506902, 2022YFA1505500), National Natural Science Foundation of China (22302173), and the Fundamental Research Funds for the Central Universities. The development of the subgroup discovery method was supported by Russian Science Foundation, grant number 24-13-00317.
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5cc00259a |
‡ These authors contributed equally. |
This journal is © The Royal Society of Chemistry 2025 |