Tathagata Biswas*, Adway Gupta* and Arunima K. Singh*
Department of Physics, Arizona State University, Tempe, Arizona 85281, USA. E-mail: arunimasingh@asu.edu
First published on 17th March 2025
In recent years, GW-BSE has been proven to be extremely successful in studying the quasiparticle (QP) bandstructures and excitonic effects in the optical properties of materials. However, the massive computational cost associated with such calculations restricts their applicability in high-throughput material discovery studies. Recently, we developed a Python workflow package, pyGWBSE, to perform high-throughput GW-BSE simulations. In this work, using pyGWBSE we create a database of various QP properties and excitonic properties of over 350 chemically and structurally diverse materials. Despite the relatively small size of the dataset, we obtain highly accurate supervised machine learning (ML) models via the dataset. The models predict the quasiparticle gap with an RMSE of 0.36 eV, exciton binding energies of materials with an RMSE of 0.29 eV, and classify materials as high or low excitonic binding energy materials with classification accuracy of 90%. We exemplify the application of these ML models in the discovery of 159 visible-light and 203 ultraviolet-light photoabsorber materials utilizing the Materials Project database.
The Materials Genome Initiative (MGI)4,5 was proposed in 2011 to enable the discovery, manufacturing, and deployment of advanced materials twice as fast and at a fraction of the cost compared to traditional methods. To achieve the MGI objectives one of the key strategies adopted was to harness the power of data and computational tools jointly with experimental investigations.4,5 Since then a new paradigm for accelerated materials discovery has emerged by designing new compounds in silico using first-principles calculations and then performing experiments on the computationally designed candidates.6–8 Several open-source databases have been developed to aid the accelerated material discovery goal such as the Materials Project,9 Aflowlib,10 C2DB,11 ESP,12 NoMaD,13 OQMD14 etc. The availability of such large data has opened up an emerging paradigm, the application of machine learning (ML) and other data science methods for material discovery, thus, making material discovery essentially a big-data problem.15–20 ML accelerated material discovery has made a revolutionary impact in applications ranging from organic and solid-state LEDs, batteries, ferroelectric, high-κ dielectric, hydrogen storage, high-entropy alloys, and thermoplastics to shape memory alloys.16,19,21
The fundamental challenge of applying this well-established material discovery paradigm of combining first-principles computations and data science methods to applications where light–matter interaction is a key phenomenon is the unavailability of large datasets that are accurate enough in comparison to experimental observations. While some first-principles methods such as GW-BSE (Bethe–Salpeter equation) formalism for simulating excitonic effects can produce optical properties with sufficient accuracy they are computationally very expensive. Thus it is not surprising that the largest database of GW-BSE computed absorption spectra is that of ∼300 spectra of two-dimensional (2D) materials.11,22
The unavailability of large-scale data of first principles computed excited state properties can also be partially attributed to the lack of open-source computational tools to perform such high-throughput computations. Recently, the authors have developed pyGWBSE,23 a Python workflow package that enables automated high-throughput GW-BSE simulations using VASP, one of the most widely used first-principles atomistic modeling software packages, and made it available through open-source licensing. While there have been recent efforts in the development of similar workflow codes, they either suffer from issues such as the absence of BSE capabilities24 and database integration,25 or use other ab initio modeling software for the excited-states calculations.26
In this study, we demonstrate the applicability of ML models in predicting accurate excited state properties such as quasi-particle gap (QPG) and exciton binding energy (EBE). To accomplish this goal, we generated the largest database of first-principles computed QP and excitonic properties of bulk materials, computed and curated using pyGWBSE. This database contains static dielectric constants, effective masses, QP bandstructure, absorption spectra, and several other excited state properties such as EBE, integrated absorption coefficients, etc. of more than 350 bulk materials, and new materials are being added to the database continuously. We find that among the various ML regression algorithms, Random Forest Regression has the best performance, predicting QPG within an RMSE of 0.36 eV and EBE within an RMSE of 0.29 eV. We also find ML models that accurately classify materials as having an integrated absorption coefficient (IAC),2 anisotropy in absorption coefficient (AAC),2 and excitonic binding energy suited for a good photoabsorber with an accuracy of ∼90%. Lastly, we apply the ML models developed in this work to identify promising materials that can absorb visible and ultraviolet (UV) radiation for photovoltaic or photocatalytic applications based on their QP and excitonic properties from a list of ∼7000 materials for which only the ground state properties are available in the Materials Project database.
The diversity of our GW-BSE computed database is unique in terms of all the aforementioned characteristics. The only other database that hosts GW-BSE computed properties is that of Haastrup et al.,11 which is limited to two-dimensional (2D) materials and is therefore restrictive in terms of crystal systems and chemical compositions.
In the following two sections, Sections 2.1 and 2.2, we present the accuracy of various ML algorithms in predicting the QP and excitonic properties of materials. In particular, we focus on the QPG and the EBE of materials. We also discuss the most important features that were used in these models and their physical significance. In Section 2.3 we present ML models for classifying materials as high or low excitonic binding energy materials. Finally, Section 2.4 demonstrates how the ML models developed in this work allow the discovery of visible-light and UV-light photoabsorber materials by utilizing existing materials databases without the need for any explicit GW-BSE simulations.
Fig. 2 (a) Random forest regression model predicted QPG plotted against GW computed QPG. The training and test set data points are shown as + symbols and circle symbols, respectively. The distribution of the QPGs corresponding to the entire dataset is also shown in the background as a bar plot. To show the scale used for the histogram we have shown the highest value as reference in the figure. (b) The five most important features used to predict the QPG along with their percentage importance are shown as a bar plot. See Section II of the ESI† for a detailed description of the features used in this study. The training and test set data points in (a) are colored based on the DFT computed bandgaps, $E_g^{\mathrm{DFT}}$, which is the most important feature in the prediction. The $E_g^{\mathrm{DFT}}$ value is denoted by the color bar.
For the ML prediction of the QPG, we tested four regression methods – the kernel ridge regression (KRR), the random forest regression (RF), the support vector machine (SVM), and the multi-layer perceptron (MLP)27,28 methods. A 10-fold cross-validation was employed to ensure randomness in the training and test datasets. Note that while our dataset consists of GW-BSE calculations of nearly 350 materials, about 35 of them were not included in the ML model development as they had unphysical values for some of the features considered in this study. The model obtained from the RF method performed best for the QP bandgap prediction, with an R² score of 0.98 and an RMSE of only 0.36 eV. Fig. 2(a) compares the RF model predicted and GW computed QP gaps, where the training set is shown with the ‘+’ symbols and the test set with the circle symbols. The model obtained from the MLP also led to an R² score very similar to the one obtained from the RF method. Table S1 in the ESI† compares the performance of the models obtained from all four methods.
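The 10-fold cross-validation and the RMSE and R² scores quoted above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' pipeline: a one-feature least-squares line stands in for the RF regressor, the data are synthetic, and all function names (`k_fold_indices`, `fit_line`, `cross_validate`) are hypothetical.

```python
import math
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Least-squares fit y = a*x + b (a simple stand-in for the RF model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def cross_validate(xs, ys, k=10, seed=0):
    """Return (RMSE, R2) computed over the pooled held-out predictions."""
    preds = [0.0] * len(xs)
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        for i in fold:  # predict only samples the model has not seen
            preds[i] = a * xs[i] + b
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    mean_y = sum(ys) / len(ys)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return math.sqrt(ss_res / len(ys)), 1.0 - ss_res / ss_tot

# Synthetic "DFT gap -> QP gap" data, purely illustrative (not from the paper).
rng = random.Random(1)
dft_gaps = [rng.uniform(0.1, 6.0) for _ in range(100)]
qp_gaps = [1.4 * g + 0.5 + rng.gauss(0.0, 0.2) for g in dft_gaps]
rmse, r2 = cross_validate(dft_gaps, qp_gaps)
```

Pooling the held-out predictions from all folds gives every material exactly one out-of-sample prediction, so the reported scores do not depend on a single train/test split.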
It is noteworthy that most of the previous studies of ML-based bandgap predictions have been limited to a particular material class, such as the studies of MXenes by Rajan et al.29 or perovskites by Pilania et al.30 In contrast, in this work, we have a dataset that includes materials without any such restrictions on chemical compositions or materials classes. Despite working with a more diverse set of materials, our RF model achieves similar or better accuracy than the earlier studies.29,30 In practical applications, we seldom have perfectly crystalline materials; we often see intrinsic defects (vacancies, Frenkel defects, Schottky defects, etc.) as well as extrinsic defects (substitutions, interstitials, inclusions, etc.), and often polycrystalline materials with planar defects like grain boundaries. These factors are known to modulate the bandgaps of materials by 10–15% relative to the pure single-crystal materials that are typically studied through simulations. Thus, from a practical application point of view, an error in the bandgap of 0.3 eV for a material that has a bandgap of 2 eV or larger is accurate enough to develop and test these materials in the laboratory. However, for materials with a much smaller gap (closer to 0.5 eV), more advanced and rigorous studies, beyond our method as well as routine experiments, are required to achieve the desired bandgaps. Hence such materials need to be carefully evaluated computationally prior to experimental investigation.
Fig. 2(b) shows the most essential features for the QP gap predictions. The number of features used and their importance in the random forest algorithm are determined by calculating the Gini importance (see Methods Section 4.2 for more details). Fig. 2(b) shows that the DFT computed bandgap is the most important feature in the QP gap prediction, with 92% importance. This is expected, as the Kohn–Sham eigenvectors form the initial ansatz for the QP eigenvectors and DFT bandgaps are routinely used as a starting point in the determination of GW computed QP gaps. Furthermore, it is extremely promising that the macroscopic average dielectric constant (εavg) emerges as the second most important feature. While it is well known that the full frequency-dependent dielectric matrix in the plane wave basis, $\varepsilon_q(G,G',\omega)$, is an important ingredient in the GW calculations, its determination is expensive and unsuitable for high-throughput computation. Here, by showing that the QPG can be predicted with reasonable accuracy by using only the easily calculable static macroscopic dielectric constant, $\varepsilon_{\mathrm{avg}} = 1/\lim_{q\to 0}\varepsilon_q^{-1}(G=0,G'=0,\omega=0)$, our model emerges as an extremely useful tool for future high-throughput material discovery studies. Three of the five features shown in Fig. 2(b) have very low total importance, less than 1%. We examine the relevance of these features by comparing the RMSE values of RF models that include 1 to 9 of the most important features. An RF model obtained by including just the most important feature, i.e. the DFT gap, gives quite a large RMSE of 0.72 eV. Fig. S1 in the ESI† shows that the RMSE values for the QP gap prediction decrease from 0.44 eV to 0.36 eV when the number of features included in the model increases from 2 to 5, and thereafter remain almost constant for up to 9 features. Thus, these three other features are also important for accurate QP gap predictions, despite their low (<1%) contribution to the total feature importance.
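The feature-count selection described above (keep adding features until the RMSE stops improving) can be sketched as follows. The helper name `min_converged_feature_count` is hypothetical, and only the RMSE values for 1, 2 and 5 features come from the text; the values for 3, 4 and 6 to 9 features are assumed placeholders chosen to be consistent with the reported trend.

```python
def min_converged_feature_count(rmse_by_count, tol=0.01):
    """Return the smallest feature count n such that every model with
    n or more features has an RMSE within `tol` (eV) of the n-feature model."""
    counts = sorted(rmse_by_count)
    for n in counts:
        later = [rmse_by_count[m] for m in counts if m >= n]
        # Small epsilon guards against floating-point noise at the tolerance.
        if all(abs(r - rmse_by_count[n]) <= tol + 1e-12 for r in later):
            return n

# RMSE (eV) versus number of features. Entries 1, 2 and 5 are quoted in the
# text; the remaining entries are ASSUMED for illustration only.
rmse_vs_features = {1: 0.72, 2: 0.44, 3: 0.41, 4: 0.38, 5: 0.36,
                    6: 0.36, 7: 0.36, 8: 0.36, 9: 0.36}
n_features = min_converged_feature_count(rmse_vs_features)
```

With these placeholder numbers the criterion selects five features, matching the count shown in Fig. 2(b).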
Fig. 3 (a) RF model predicted EBE plotted against GW-BSE computed EBE in the training (+ symbols) and test set (circle symbols). The distribution of EBEs corresponding to the entire dataset is also shown in the background as a bar plot. To show the scale used for the histogram we have shown the highest value as reference in the figure. (b) The eight most important features according to their Gini importance and their % importances are shown as a bar plot. See Section II of the ESI† for a detailed description of all the features used in this study. The training and test set data points in (a) are colored based on the average dielectric constant, εavg, which is the most important feature in the EBE prediction. The εavg value is denoted by the color bar.
Fig. 3(b) shows the most important features and their % importance as computed using the Gini importance method. The most important features include properties like dielectric constants, effective masses of electrons and holes, and atomic packing fractions that are also considered in well-known physical theories of EBE. For instance, the average dielectric constant and the hole-effective mass features are also included in the Wannier–Mott (WM) model. In the WM model, the EBE is given (in Gaussian units) by $\mathrm{EBE}_{\mathrm{WM}} = \mu e^4/(2\hbar^2\varepsilon^2)$, where $\mu = m_e m_h/(m_e + m_h)$ is the reduced effective mass of an electron and a hole, e is the charge of an electron, ℏ is the reduced Planck's constant and ε is the dielectric constant of the material. $\mathrm{EBE}_{\mathrm{WM}}$ can thus be obtained from ground state properties without explicit BSE simulations.
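As a worked example of the hydrogen-like WM estimate, the sketch below evaluates $\mathrm{EBE}_{\mathrm{WM}} = \mu e^4/(2\hbar^2\varepsilon^2)$ by rescaling the hydrogen Rydberg energy (about 13.606 eV) with the reduced mass in units of the free-electron mass and the squared dielectric constant. The function names and the GaAs-like input values are illustrative assumptions, not data from this work.

```python
RYDBERG_EV = 13.605693  # hydrogen Rydberg energy in eV

def reduced_mass(m_e, m_h):
    """Reduced effective mass, all masses in units of the free-electron mass."""
    return m_e * m_h / (m_e + m_h)

def ebe_wannier_mott(m_e, m_h, eps):
    """Wannier-Mott exciton binding energy in eV: Ry * mu / eps**2."""
    return RYDBERG_EV * reduced_mass(m_e, m_h) / eps ** 2

# Illustrative, GaAs-like inputs (assumed textbook-style values).
ebe_example = ebe_wannier_mott(m_e=0.067, m_h=0.5, eps=12.9)
```

A light carrier mass combined with strong screening pushes the estimate down to a few meV, which is the low-EBE regime where the WM picture is expected to hold.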
In the case of Wannier–Mott (WM) excitons, the Coulomb attraction between e–h pairs is screened to a larger extent resulting in an exciton wavefunction that is spread over multiple unit cells and has low EBE. Since a majority of the materials in our dataset have a low (<1 eV) EBE, the Wannier–Mott model applies to them.
In Fig. 4 we compare the RMSE of both the WM model and the ML model as a function of the EBE. This RMSE as a function of EBE has been calculated by considering only materials with EBE in a 1 eV window around a certain EBE value. Our results show that the ML model has a much lower RMSE than the WM model. Furthermore, while consistently poorer than the ML model, the WM model works comparatively well at low EBE but fails dramatically in the high EBE region. This is not surprising, since the materials that have very high EBE, in the range of 4–5 eV, are expected to exhibit Frenkel or charge-transfer (CT) type excitons. CT excitons are more localized, with very strong Coulomb attraction between e–h pairs, and therefore have high EBE. The EBE of a CT exciton is given by the expression $\mathrm{EBE}_{\mathrm{CT}} = e^2/(\varepsilon r_{\mathrm{CT}})$, where $r_{\mathrm{CT}}$ is the separation between the electron and hole of an exciton, or the radius of the exciton wavefunction. Unlike the WM model, this model cannot be used directly to predict the EBE of solid-state materials from ground-state DFT computed properties, as $r_{\mathrm{CT}}$ cannot be computed without solving the BSE. However, CT excitons are usually localized on a length scale of the order of the size of a unit cell and are also expected to have smaller $r_{\mathrm{CT}}$ for materials with tighter packing efficiency. Thus one can assume that $r_{\mathrm{CT}} = fV^{1/3}$, where f is a dimensionless proportionality constant and V is the volume of the unit cell, allowing the estimation of $\mathrm{EBE}_{\mathrm{CT}}$ from ground state properties without the need for BSE simulations.
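A minimal sketch of the CT estimate follows, taking the screened Coulomb form $\mathrm{EBE}_{\mathrm{CT}} = e^2/(\varepsilon r_{\mathrm{CT}})$ with $r_{\mathrm{CT}} = fV^{1/3}$ so that the radius has units of length. The conversion constant $e^2 \approx 14.4$ eV Å is the standard Coulomb factor; the function name and the example inputs are assumptions for illustration.

```python
COULOMB_EV_ANGSTROM = 14.399645  # e^2 expressed in eV * Angstrom

def ebe_charge_transfer(cell_volume_A3, eps, f=0.5):
    """CT exciton binding energy in eV for a unit-cell volume in Angstrom^3.

    The exciton radius is approximated as r_CT = f * V**(1/3), with f a
    dimensionless constant (f = 0.5 is the value quoted for Fig. 4)."""
    r_ct = f * cell_volume_A3 ** (1.0 / 3.0)
    return COULOMB_EV_ANGSTROM / (eps * r_ct)

# Hypothetical wide-gap material: 64 A^3 cell, dielectric constant 2.
ebe_ct_example = ebe_charge_transfer(cell_volume_A3=64.0, eps=2.0)
```

Because the estimate scales as 1/ε rather than 1/ε², it falls off more slowly with screening than the WM expression, which is consistent with its better behaviour in the high-EBE region.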
In Fig. 4 we present the RMSE accuracy of EBE obtained from the CT model (f = 0.5) as a function of the EBE. In comparison to the ML model, the CT model is consistently poorer with high RMSE values. However, as expected, it performs better than the WM model in the high EBE region.
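The window-resolved RMSE used for Fig. 4 (only materials whose reference EBE lies within a 1 eV window around a chosen value contribute) can be sketched as below. The function name and the sample numbers are hypothetical.

```python
import math

def windowed_rmse(true_ebe, pred_ebe, center, width=1.0):
    """RMSE over materials whose reference EBE lies within `width` eV
    (total window size) centred on `center`; None if the window is empty."""
    half = width / 2.0
    sq_errors = [(t - p) ** 2
                 for t, p in zip(true_ebe, pred_ebe)
                 if abs(t - center) <= half]
    if not sq_errors:
        return None
    return math.sqrt(sum(sq_errors) / len(sq_errors))

# Hypothetical reference (BSE) and model EBEs in eV.
reference = [0.1, 0.4, 2.0, 4.5]
predicted = [0.2, 0.3, 1.0, 2.5]
low_window = windowed_rmse(reference, predicted, center=0.5)
high_window = windowed_rmse(reference, predicted, center=4.5)
```

Evaluating the same function for the WM, CT and ML predictions on a grid of window centres yields curves that can be compared directly.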
Overall, the ML model performs much better in any energy window than the WM or CT model. We think this superior predictive capability comes from the inclusion of additional material properties not present in the WM model, such as the packing fraction and the ranges of electron and hole effective masses and dielectric constants. By including such attributes, our ML model captures the physics of not only the low EBE excitons but also the higher EBE regime, where the CT model is perhaps more applicable than the WM model. Therefore, one can in principle build a more general empirical model for excitons based on the properties revealed by our ML model.
Among excitonic properties, low EBEs are preferred in applications where free e–h pairs are desired, for example in photocatalytic materials.2 In addition, two other parameters derived from BSE obtained absorption spectra, the integrated absorption coefficient (IAC) in the solar wavelength range of interest and the anisotropy in absorption coefficient (AAC), are useful to quantify the potential of a material for solar-energy absorption, for example in photovoltaics and photocatalysts. The methods section describes the calculation of IAC and AAC from the frequency-dependent absorption spectra. In a previous study, we established that low EBE materials are those that have an EBE smaller than 0.2 eV, high (visible/UV)-light IAC materials have an IAC larger than 10.5 × 10⁴ cm⁻¹ eV, and high AAC materials have an AAC ≥ 0.8.
Fig. 5 shows the results of classification obtained by the RF method in the form of confusion matrices. Fig. 5(a) shows the confusion matrix for classifying materials as low EBE, Fig. 5(b) for classifying materials as high UV-light IAC, and Fig. 5(c) for classifying materials as high AAC. For the EBE classification, only 21 of the 305 materials were classified incorrectly, resulting in a high classification accuracy of 93%. The classification models for the IAC and AAC resulted in accuracies of 94% and 83%, respectively, for the 355 materials in the dataset. Additionally, we employed the AdaBoost, stochastic gradient descent (SGD), and MLP27,28 methods for the classification. However, the RF performed best among the four methods. ESI Table S2† shows the comparison between the four methods.
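The link between a binary confusion matrix and the quoted accuracy can be checked with a short sketch. Only the totals (305 materials, 21 misclassified) come from the text; the class sizes and the split of the 21 errors into false positives and false negatives are assumed for illustration, and the function names are hypothetical.

```python
def confusion_matrix(y_true, y_pred):
    """Count true/false positives/negatives for boolean class labels."""
    cm = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            cm["tp"] += 1
        elif pred:          # predicted positive, actually negative
            cm["fp"] += 1
        elif truth:         # predicted negative, actually positive
            cm["fn"] += 1
        else:
            cm["tn"] += 1
    return cm

def accuracy(cm):
    """Fraction of correctly classified samples."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())

# ASSUMED split: 150 low-EBE and 155 high-EBE materials, 11 fn + 10 fp.
y_true = [True] * 150 + [False] * 155
y_pred = [True] * 139 + [False] * 11 + [True] * 10 + [False] * 145
cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)  # 284 / 305
```

Accuracy depends only on the diagonal of the matrix, so any split of the 21 errors between the two off-diagonal cells gives the same 93%.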
The feature set used for IAC and AAC classification was selected following a strategy similar to the one employed for the QP gap and EBE prediction described earlier. In the ESI Fig. S2 (IAC) and S3 (AAC)† we show the most important features along with their % importance. In the case of IAC prediction, the mean dielectric constant (67.1%) and the DFT computed bandgap (4.6%) emerge as the two most important features. The emergence of these two properties as the most important features can be understood from the fact that for high absorption in the visible spectrum, 1.7–3.5 eV, a material needs to have a QP gap in the same range and also a significant absorption coefficient (∝ε(ω)) in that energy range. Since, as seen in Fig. 2, the mean dielectric constant and the DFT computed bandgap are also the two most important features for the QP gap prediction, it is not surprising that they are important for the IAC predictions as well. Moreover, the high importance of the mean dielectric constant also signifies a high degree of correlation between the static dielectric constant of a material and its frequency-dependent dielectric function. In the prediction of anisotropy in visible light absorption (AAC) we find that the range of the dielectric constant (max{εx, εy, εz} − min{εx, εy, εz}) is the most important feature (58.8%). All the other important features in the prediction of AAC have importance <5%. Therefore, one can identify a material with a high degree of anisotropy in visible light absorption by looking at the anisotropy in the static dielectric constant, which once again highlights the importance of static dielectric constants in the excitonic properties. The Materials Project (MP)9 database currently holds ∼150000 materials, but only ∼7000 (4.7%) of them have computed static dielectric constants.
We believe that the static dielectric constant using DFT is quite inexpensive to calculate but is a crucial parameter to understand material applicability for a wide variety of electronic and optoelectronic applications and therefore it would be useful to compute and curate it for more materials in existing materials databases.
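The two dielectric features highlighted above can be derived from the diagonal components of the static dielectric tensor, as in the short sketch below. It assumes that the mean dielectric constant is the arithmetic mean of εx, εy and εz; the function and key names are hypothetical.

```python
def dielectric_features(eps_x, eps_y, eps_z):
    """Mean and range of the static dielectric constant from the three
    diagonal tensor components (arithmetic mean is an assumption)."""
    components = (eps_x, eps_y, eps_z)
    return {
        "eps_avg": sum(components) / 3.0,
        "eps_range": max(components) - min(components),
    }

# Hypothetical uniaxial material: two equal in-plane components.
features = dielectric_features(4.0, 4.0, 7.0)
```

An isotropic material gives a range of zero, so a large `eps_range` is a cheap ground-state indicator of anisotropic absorption.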
We perform such screening on ∼7083 materials that have static dielectric constants and DFT computed band structures available in the MP database. We find that out of these ∼7000 materials, only 159 pass the criteria of low EBE, high IAC, and high AAC in the visible-light region, 1.7–3.5 eV. For UV-light absorption, 3.5–4.2 eV, we find that 237 materials pass the aforementioned criteria.
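The three screening criteria can be combined into a simple filter, sketched below with the thresholds quoted earlier (EBE < 0.2 eV, IAC > 10.5 × 10⁴ cm⁻¹ eV, AAC ≥ 0.8). In this work the classifiers predict the class labels directly; applying thresholds to predicted values is a simplifying assumption, and the record layout and material identifiers are hypothetical.

```python
EBE_MAX_EV = 0.2   # low-EBE threshold (eV)
IAC_MIN = 10.5e4   # high-IAC threshold (cm^-1 eV)
AAC_MIN = 0.8      # high-AAC threshold (dimensionless)

def is_promising(material):
    """True if a material meets all three photoabsorber criteria."""
    return (material["ebe"] < EBE_MAX_EV
            and material["iac"] > IAC_MIN
            and material["aac"] >= AAC_MIN)

# Hypothetical candidates with predicted properties.
candidates = [
    {"id": "A", "ebe": 0.05, "iac": 2.0e5, "aac": 0.95},  # passes all
    {"id": "B", "ebe": 0.50, "iac": 2.0e5, "aac": 0.95},  # EBE too high
    {"id": "C", "ebe": 0.05, "iac": 5.0e4, "aac": 0.95},  # IAC too low
    {"id": "D", "ebe": 0.05, "iac": 2.0e5, "aac": 0.40},  # too anisotropic
]
shortlist = [m["id"] for m in candidates if is_promising(m)]
```

Requiring all three conditions at once is what makes the filter so selective: each candidate must clear every threshold.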
Fig. 6 examines the chemical compositions of the materials in the starting set of ∼7000 materials (outer rings) as well as of those that emerged as suitable for visible-light, Fig. 6(a), and UV-light, Fig. 6(b), applications (inner rings). We find that most of the visible-light materials are either pnictides or chalcogenides. Note that almost half of the starting set of materials, 47%, are oxides, but only 2 oxides pass through the screening. This is not surprising, since the bandgaps of most oxides lie above the visible-light region. More oxides, six, are found in the screened materials for the UV light region. Most of the materials for UV absorption also belong to pnictides or chalcogenides. Furthermore, we find that almost half of the selected chalcogenides are tellurides. In the ESI Tables S6 and S7† we list all the screened materials for visible and UV absorption, respectively, along with their ML-predicted QP gap and EBE values. In Tables S6 and S7,† we also report whether a screened material has already been synthesized and has an ICSD ID, together with the computed value of the energy above hull. We find that the majority of these materials, 193 out of 234, have already been synthesized. Moreover, we find that 168 of the 193 previously synthesized materials have a computed energy above hull of 0 eV. An examination of the screened materials shows that several of them have been studied in the context of photoabsorption-related applications, for example, GeTe,33 AlSb,34 SnSe,35 etc. Thus it is likely that the other screened materials are promising novel materials for photoabsorption-related applications.
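To quantify the fraction of incident light that can be absorbed by a material in a desired frequency range, we can compute the integrated absorption coefficient, IAC. The IAC, $\alpha^{\mathrm{int}}$, is obtained by integrating the BSE computed frequency-dependent absorption coefficient over the energy range of interest.23 In the case of light polarization along the x axis,
$\alpha^{\mathrm{int}}_x = \int_{E_{\min}}^{E_{\max}} \alpha_x(E)\,\mathrm{d}E$ (1)
where $\alpha_x(E)$ is the absorption coefficient for light polarized along x and $E_{\min}$ and $E_{\max}$ bound the energy range of interest.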
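To assess whether a material has a preference for absorbing light of a certain polarization, we calculate the anisotropy in absorption coefficient (AAC), $\alpha^{\mathrm{aniso}}_{\mathrm{int}}$. $\alpha^{\mathrm{aniso}}_{\mathrm{int}}$ is defined as the ratio of $\min(\alpha^{\mathrm{int}}_x,\alpha^{\mathrm{int}}_y,\alpha^{\mathrm{int}}_z)$ to $\max(\alpha^{\mathrm{int}}_x,\alpha^{\mathrm{int}}_y,\alpha^{\mathrm{int}}_z)$.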
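The IAC and AAC definitions can be illustrated numerically. The sketch below integrates a tabulated absorption spectrum over an energy window with the trapezoidal rule, using only grid points that fall inside the window (no interpolation at the edges, a simplifying assumption), and forms the min/max ratio of the three polarization-resolved IACs. The function names and the spectrum are hypothetical.

```python
def integrated_absorption(energies, alpha, e_min, e_max):
    """Trapezoidal integral of alpha(E) (cm^-1) over the grid points with
    e_min <= E <= e_max (eV); the result is in cm^-1 eV."""
    pts = [(e, a) for e, a in zip(energies, alpha) if e_min <= e <= e_max]
    return sum(0.5 * (a0 + a1) * (e1 - e0)
               for (e0, a0), (e1, a1) in zip(pts, pts[1:]))

def absorption_anisotropy(iac_x, iac_y, iac_z):
    """AAC: ratio of the smallest to the largest directional IAC (0 to 1)."""
    values = (iac_x, iac_y, iac_z)
    return min(values) / max(values)

# Hypothetical spectrum on a coarse energy grid (eV, cm^-1).
energies = [1.0, 1.7, 2.6, 3.5, 4.0]
alpha_x = [0.0, 1.0e5, 1.0e5, 1.0e5, 5.0e5]
iac_visible = integrated_absorption(energies, alpha_x, 1.7, 3.5)
aac = absorption_anisotropy(1.0, 2.0, 4.0)
```

An AAC close to 1 indicates nearly isotropic absorption, which is why the high-AAC threshold is set at 0.8.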
The most important features were determined using the Gini importance45 method. The Gini importance of each feature is calculated as the decrease in node impurity weighted by the probability of reaching that node. Furthermore, for the prediction of QPG and EBE, we chose the minimum number of features needed to obtain an RMSE value converged within 0.01 eV.
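The contribution of a single tree node to the Gini importance can be sketched as follows: the impurity decrease produced by a split, weighted by the fraction of samples that reach the node. This example uses the Gini impurity of class labels; for regression trees a variance-type impurity typically plays the same role. A feature's total importance is then obtained by summing such terms over all nodes that split on it and averaging over trees. The function names are hypothetical.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of a split, weighted by the probability
    (len(parent) / n_total) of a sample reaching the parent node."""
    n_node = len(parent)
    child_impurity = (len(left) / n_node) * gini_impurity(left) \
        + (len(right) / n_node) * gini_impurity(right)
    return (n_node / n_total) * (gini_impurity(parent) - child_impurity)

# A perfectly separating split at the root of a 4-sample tree.
decrease = weighted_impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=4)
```

The weighting means that a split deep in the tree, reached by few samples, contributes less than an equally clean split near the root.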
QP | Quasiparticle
EBE | Exciton binding energy
DFT | Density functional theory
Footnote
† Electronic supplementary information (ESI) available: The performance of various ML algorithms applied in this study, a detailed description of the features used in ML models, and a list of materials shortlisted for visible and UV light-based applications along with their ML-predicted properties. See DOI: https://doi.org/10.1039/d5ra01285f
This journal is © The Royal Society of Chemistry 2025