Open Access Article
Md Habibur Rahman
and
Arun Mannodi-Kanakkithodi
*
School of Materials Engineering, Purdue University, West Lafayette, IN 47907, USA. E-mail: amannodi@purdue.edu
First published on 19th May 2026
The remarkable tunability of chalcogenide semiconductors offers exciting opportunities for applications in solar cells, while also posing significant challenges in exploring their vast compositional space. Although many ternary and quaternary chalcogenides exhibit excellent optoelectronic properties, their performance is often limited by poor defect tolerance and unfavorable doping behavior. Composition engineering is as a promising strategy to simultaneously optimize bulk stability, electronic structure, and defect physics in these materials. In this work, we applied a data-driven framework integrating high-throughput density functional theory (DFT) computations with descriptor-based and structure-based machine learning models to design novel multinary chalcogenide semiconductor alloys ideal for solar absorption and other optoelectronic applications. Within a pre-defined chemical space of zincblende-derived A2BCX4 and ABX2 compounds, we performed hybrid HSE06 calculations with spin–orbit coupling to generate a large dataset of the optimized lattice parameters, decomposition energy, band gap, theoretical maximum PV efficiency, and point defect formation energies for thousands of compounds. Random forest regression models utilizing composition-weighted elemental features were trained and deployed to predict properties for nearly half a million possible compositions, leading to the identification of ∼1200 stable compounds with desired optoelectronic properties. Crystal graph-based machine learning force field (MLFF) models were additionally trained on the DFT dataset to enable rapid energy prediction and geometry optimization of new bulk and defect-containing configurations. This workflow led to the identification of several promising compounds, with a notable example being Cu2Ca0.5Cd0.5SnS4, which satisfies conditions of thermodynamic stability, photovoltaic-suitable band gap, and intrinsic defect tolerance. The entire computational workflow and dataset have been packaged and released as ChalcoDB, an online tool publicly available via the nanoHUB platform, thus facilitating community access and ready simulations and predictions for chalcogenide compounds.
Broader contextThe demand for sustainable photovoltaic materials has intensified research into multinary chalcogenide semiconductors, which offer tunable optoelectronic properties and potential for earth-abundant compositions. However, their vast compositional space and complex defect chemistry present significant exploration challenges. Recent advances integrating high-throughput density functional theory (DFT) with machine learning (ML) enable accelerated screening of candidate materials. ML-accelerated DFT provides a great avenue for engineering composition, structure, ionic ordering, and defect behavior in chalcogenide semiconductor alloys. Here, we present a framework powered by high-throughput simulations, descriptor-based regression models, and ML force fields, for discovering novel chalcogenide compounds that are stable, defect-tolerant, and show promising optoelectronic properties. We discuss essential data-driven insights and the prospect of new chalcogenide alloys for solar absorption. |
Although Cu2ZnSnS4 (CZTS) solar cells have been considered a potential alternative, their efficiency still lags behind that of CdTe and CIGS devices. This performance gap is largely attributed to the presence of deep intrinsic point defects that serve as nonradiative recombination centers, reducing both minority carrier lifetime and overall device efficiency, something that has been extensively explored and reported in the literature.15–27 Studies have identified defects such as S vacancy (VS) and cation anti-site substitution (e.g., SnZn) which show low formation energies and often introduce deep defect states detrimental to charge transport.28 Anion vacancies in CZTS, while sometimes electrically neutral, can still act as problematic defect centers by enabling electron capture and bipolaron formation, where Sn4+ is reduced to Sn2+. This stabilization of neutral VS contributes to recombination losses, lowering the open-circuit voltage (Voc) of CZTS solar cells.15,28–30
A study from Minbashi et al.31 demonstrated that controlling defect types and densities in Cu2ZnSn(S,Se)4 solar cells is crucial for improving PV performance. “Benign” defects with trap densities below 1016 cm−3 and “harmful” defects above this threshold have been identified from their work, showing that reducing harmful defects led to a record power conversion efficiency (PCE) of 19.0%. Strategies such as Na nanocrystal synthesis, stoichiometry control (e.g., Cu-poor and Zn-rich conditions), and double cation substitution (e.g., Cu with Ag and Zn with Cd) effectively minimized recombination losses and improved device efficiency. Additionally, the incorporation of elements such as Ag and Ga was shown to suppress stacking faults and grain boundaries, which otherwise act as electron traps and degrade device performance.31
By comparison, CIGS solar cells demonstrate lower deep-level defect densities, which contributes to their superior efficiencies.12 Incorporating Se in place of S within CIGS produces a reduced bandgap that suppresses harmful defect formation.13 Despite having comparable zincblende-based tetrahedral coordination structures, CdTe, CIGS, and CZTS display substantially different defect chemistry.11 Intrinsic point defects in CZTS and CIGS are further complicated by the emergence of secondary phases during fabrication, including ZnS or Cu2SnS3 in CZTS systems, which alter absorber layer electronic behavior.29 Therefore, controlling secondary phase development and refining chalcogenide absorber stoichiometry are essential for improved material performance.1,32,33 High-throughput experimental and computational strategies are required to screen and determine optimal compositions and processing parameters that reduce defect levels and enhance optoelectronic characteristics in ternary and quaternary chalcogenides.9
Composition engineering via alloying at the cation or anion sites plays a critical role in enhancing the efficiency and stability of CZTS-based solar cells, as demonstrated for example in the study on Cu2Zn1−xBaxSnS4 (CZBTS) quinary alloy thin films.34 Total substitution of Zn with Ba changes the bandgap from 1.48 eV to 1.92 eV. The incorporation of Ba not only induced a structural transition from the kesterite (tetragonal) phase to a trigonal phase, but also improved the crystallinity and reduced defect densities in the films. Ba substitution mitigated cation disorder by avoiding the formation of detrimental Cu–Zn antisite defects, which are known to limit the performance of CZTS solar cells. Overall, the tailored bandgap and enhanced structural properties render CZBTS a promising absorber layer in tandem solar cells, showcasing the potential of composition engineering to optimize properties of chalcogenides for efficient solar energy harvesting.34
In recent work from our group,35 we employed high-throughput density functional theory (DFT) computations to study A21+B2+C4+X42− quaternary chalcogenides, considering a set of monovalent cations A, divalent cations B, tetravalent cations C, and chalcogen anions X.35 Similarly, we also simulated A1+B3+X22− ternary compounds by considering monovalent cations A, trivalent cations B, and chalcogen anions X.35 A total of 540 compounds with two types of cation ordering were simulated using the hybrid HSE06 functional with spin–orbit coupling (SOC) to yield their optimized structures, stability, and optoelectronic properties. We identified 45 stable compounds with theoretical maximum PV efficiency exceeding 30%, indicating strong potential as single-junction solar cell absorbers. Further analysis of point defects in two promising candidates revealed susceptibility to harmful anti-site substitutional defects, a known challenge in multinary chalcogenide compounds. It was concluded that further composition engineering through targeted alloying at cation or anion sites is necessary to enable effective multi-objective optimization across a vast combinatorial compositional space and achieve the desired defect tolerance.
Here, we built upon our recent work and applied a combination of high-throughput DFT and different machine learning (ML) approaches to perform multi-objective composition engineering within the ABX2 and A2BCX4 chemical spaces. Via controlled mixing at the A, B, C, and X sites, we leveraged a variety of bandgap and stability “bowing”36–38 effects to achieve optimal thermodynamic stability and optoelectronic properties. For A2BCX4 compounds, we considered monovalent cations A ⊂ {Na, K, Rb, Cs, Cu, Ag}, divalent cations B ⊂ {Mg, Ca, Sr, Ba, Zn, Cd}, tetravalent cations C ⊂ {Sn, Ge, Zr}, and chalcogen anions X ⊂ {S, Se, Te}, allowing fractional occupancies to enable systematic mixing at all cation and anion sites. Through combinatorial analysis, this led to a vast space of 476
280 possible compounds, including 648 pure compounds and diverse mixed compositions across A, B, C, and X sites. Similarly, for the ABX2 series, we considered monovalent cations A ⊂ {Na, K, Rb, Cs, Cu, Ag}, trivalent cations B ⊂ {Al, Ga, In}, and chalcogen anions X ⊂ {S, Se, Te}, with fractional occupancies leading to 56
700 potential compositions, including 108 pure compounds. All pure compositions adopted two types of cation ordering, namely Kesterite-type and Stannite-type; upon alloying, there are many more possibilities for ionic order and disorder.
By performing DFT computations for a representative subset of this massive chemical space, predictive ML models39–41 could be trained for screening across all possible compounds at a fraction of the cost of full DFT.42–53 Such models utilize unique compositional or structural descriptors that uniquely characterize each material and feed them as the input to algorithms such as neural networks or random forest regression to yield the properties of interest as the output. We thus established a “DFT-ML” framework for predicting the structural, electronic, and optical properties of any given pure or alloyed A2BCX4 or ABX2 compounds. During data generation, a series of bulk and defect calculations were performed at the HSE06 level of theory, ensuring high training data fidelity and reliable subsequent ML predictions.
Composition-based property predictor models were first trained using random forest regression by employing descriptors derived from the weighted averages of readily available elemental properties of all cation and anion species that made up a given composition. Then, a machine learning force field (MLFF) model based on the Materials 3-body Graph Network (M3GNet)48 architecture was trained by sampling thousands of structures and the corresponding energies, atomic forces, and stresses from DFT geometry optimization runs, thus achieving the ability to efficiently optimize any new structures of screened compositions. Finally, some of the most promising compounds were studied using DFT accelerated by the MLFF, especially to simulate and understand their defect physics.
All ML training followed the standard best practices in data science, including rigorous hyperparameter optimization, training-test splits, and cross-validation to ensure robust model performance. Once trained, these models enabled rapid screening across thousands of candidates, effectively identifying the compounds with bulk thermodynamic stability, PV-suitable bandgaps, and strong optical absorption characteristics. This integrated DFT-ML framework offers the following key capabilities (Fig. 1):
1. A high-throughput computational workflow to generate an HSE06 + SOC (high-fidelity) dataset of semiconductor properties, along with data visualization tools for extracting qualitative trends.
2. On-demand prediction of the energy of decomposition, electronic bandgap, and theoretical maximum PV efficiency for any A2BCX4 or ABX2 chalcogenide composition (including alloyed variants).
3. An MLFF trained to predict atomic forces and energies at hybrid functional accuracy for both ordered and disordered chalcogenide crystals, including both bulk and defect configurations, enabling efficient geometry optimization via gradient-based methods.
4. Accelerated simulations of point defects in the most promising candidates using MLFF-accelerated DFT to evaluate their defect tolerance and dopability.
The overall computational workflow is as follows: 532
000 defined compounds → 6763 representative DFT calculations → thermodynamic, optoelectronic, and defect properties → train RF models and the MLFF → RF screening of 532
000 compounds → select top candidates → MLFF-accelerated DFT defect studies.
) and stannite (I
2m), which differ in terms of Cu and Zn ionic ordering relative to Sn. The kesterite phase is derived from the chalcopyrite structure of CuInS2 (CIGS), where In is replaced by Zn and Sn, whereas the stannite phase originates from a CuAu-like ordering.35 In prior work, we investigated the chalcopyrite CIGS structure, referred to as “Ordering 1”, and a stannite-type ternary structure denoted as “Ordering 2”, where In replaces Zn and Sn in the stannite CZTS. A comprehensive discussion of the effect of these structural orderings on the properties of interest can be found in our previous publication.35 Because cation alloying blurs the distinction between the prototypical kesterite and stannite frameworks, it is no longer meaningful to strictly label the resulting structures as one or the other. Therefore, for each composition we considered at least two nominal cation orderings, which we denote as Ordering I (kesterite) and Ordering II (stannite) for the A2BCX4 family, and similarly as Ordering I and Ordering II for the corresponding ABX2 variants.
In this work, we considered two large chemical spaces of ternary and quaternary chalcogenide semiconductors. For the quaternary A2BCX4 family, the full enumeration of compositions and cation orderings yields 476
280 possible compounds → from this space, we performed DFT calculations on 6146 carefully selected and representative compounds. Likewise, for the ternary ABX2 family, the total chemical space contains 56
700 compounds → from which 617 representative compounds were computed using DFT. The final DFT dataset consists of 6763 compounds. Importantly, the DFT-calculated subsets are designed to provide a balanced and representative coverage of the full chemical spaces in terms of elemental chemistry, site mixing, and cation ordering. The detailed construction of the chemical spaces and the sampling strategy are provided in the SI.
Geometry optimization with HSE06 was performed on top of the PBEsol-optimized configurations. Static HSE06 calculations with spin–orbit coupling (henceforth referred to as HSE + SOC) were finally performed on the HSE06-optimized geometries to accurately determine the decomposition energy, electronic bandgap, optical dielectric constant, and optical absorption spectrum of all 6763 compounds. Additionally, we simulated some selected compounds (314 in total) using a larger 3 × 3 × 2 supercell containing 288 atoms using Γ point only. For these calculations, geometry optimization was performed using PBEsol and then using static HSE06, but because of computational expense, no HSE + SOC calculations were performed. These structures were additionally used for performing selected native defect calculations, as will be explained in the following sections. A complete statistical overview of the dataset from different DFT functionals is presented in Table 1. For a comprehensive summary of all computational parameters, the reader is referred to Table S1.
| Functional | Compounds | Snapshots | System | Supercell | Properties |
|---|---|---|---|---|---|
| HSE | 6763 | ∼90 000 |
Bulk | 2 × 2 × 1 | ΔHform, ΔHdecomp |
| HSE | 314 | ∼314 | Bulk | 3 × 3 × 2 | ΔHform, ΔHdecomp, Egap |
| HSE + SOC | 6763 | N/A | Bulk | 2 × 2 × 1 | ΔHform, ΔHdecomp, Egap, εoptical, SLME |
| HSE | 314 | 961 (q = +2) | Native defects | 3 × 3 × 2 | EfD(q), CTLs (q = +2 to −2) |
| 970 (q = +1) | |||||
| 981 (q = 0) | |||||
| 971 (q = −1) | |||||
| 965 (q = −2) |
The sequential PBEsol → HSE relaxation protocol (2 × 2 × 2) is justified by the small geometry differences between the two levels of theory: across all compounds, lattice constants differ by a mean of only 0.27% (std. dev. 0.76%) (see Fig. S2), and internal atomic coordinates show correspondingly small displacements (<0.02 Å), consistent with the known behavior of these functionals for zincblende-derived chalcogenides. As a representative benchmark, detailed comparison of the PBEsol and HSE relaxed structures for Ag1Al0.5Ga0.5S2 (Table S2) shows that lattice constants agree to within 1% and mean bond lengths differ by 0.03 Å. The slight structural differences also show that ultimately, it is meaningful to perform the higher-fidelity HSE geometry optimization for accurate results, with the initial PBEsol optimization providing notable acceleration. SOC effects on equilibrium geometry are negligible (<0.1% change in lattice parameters) for this class of materials, while the electronic structure corrections are significant. The mean SOC bandgap correction is −0.18 eV, with Te- and Se-containing compounds showing larger corrections (−0.19 eV) than sulfides (−0.15 eV). The static HSE + SOC calculation on the HSE-optimized geometry therefore accurately captures both structural and electronic properties while remaining computationally tractable.
ΔHform and ΔHdecomp respectively show the energy required for any compound to decompose to elemental species A/B/C/X and to binary A–X, B–X, and C–X phases, and are calculated using the equations below:
![]() | (1) |
![]() | (2) |
In eqn (1), species i refer to the ions at A/B/C/X sites and xi represents the fractions in which they appear in the compound, whereas in eqn (2), Ei are the energies of all binary phases (A2X, BX, etc.) and xi account for mixing fractions. The final term in eqn (2) is the mixing entropy in alloyed compositions at a particular temperature T. Natoms is the number of atoms in the compound so that the formation energy is per atom, and Nfu is the number of formula units in the compound, where one formula unit is defined as ABX2 for ternary compounds and AB0.5C0.5X2 for quaternary compounds. As an example, for the mixed composition Ag2Ca0.5Sr0.5Sn0.5Ge0.5S2Te2, the ΔHform and ΔHdecomp will be given by the following (E(X) is the DFT energy per formula unit or per atom of any system X):
| ΔHform = (E(Ag2Ca0.5Sr0.5Sn0.5Ge0.5S2Te2) − 2E(Ag) − 0.5E(Ca) − 0.5E(Sr) − 0.5E(Sn) − 0.5E(Ge) − 2E(S) − 2E(Te))/8 |
Edecomp = E(Ag2Ca0.5Sr0.5Sn0.5Ge0.5S2Te2) − [0.5 (E(Ag2S) + E(Ag2Te)) + 0.25 (E(CaS) + E(CaTe) + E(SrS) + E(SrTe) + E(SnS2) + E(SnTe2) + E(GeS2) + E(GeTe2))] + kBT [2 × (0.5 ln 0.5) + 8 × (0.25 ln 0.25)]. Additional details are provided in the SI. |
To extend our analysis from stability and optoelectronic properties to defect physics, we selected a total of 314 compounds spanning the A2BCX4 and ABX2 chemistries and generated thousands of possible native point defects in them (namely, vacancies, self-interstitials, and anti-site substitutions), resulting in over 30
000 unique defect configurations. Out of these, approximately 1000 well representative defect structures—comprising 194 vacancies, 152 self-interstitials, and 654 anti-site substitutions—were selected for geometry optimization using PBEsol followed by a static HSE06 calculation. Table S3 lists all the defects that were simulated in different compounds. Defect simulations used a 288-atom 3 × 3 × 2 supercell for any given pure or alloyed compound. Geometry optimization was performed using Gamma-point only, and symmetry-breaking atomic displacements were systematically introduced to avoid energy constraints during relaxation.62–64 Each point defect was simulated in five distinct charge states (q = +2, +1, 0, −1, −2) to ultimately compute charge-dependent defect formation energy (Ef) and charge transition levels (CTLs). The Ef and CTLs were calculated using the following equations:
| Ef(Dq,EF) = E(Dq, A2BCX4/ABX2) − E(A2BCX4/ABX2) + µ + q(EF + EVBM) + Ecorr | (3) |
![]() | (4) |
To determine the accessible chemical potential ranges for each element during defect calculations, we used the Doped package to construct chemical potential phase diagrams.62 This method rigorously considers the thermodynamic stability of the target compound against all known competing phases in the corresponding chemical space. For instance, in the case of Cu2ZnSnS4, we included Cu2S, ZnS, SnS2, Cu, Zn, Sn, and S as competing phases to enforce realistic stability conditions. The Doped package solves the resulting set of linear inequalities to ensure that the chosen chemical potentials maintain phase stability while satisfying the necessary stoichiometry constraints.
A multi-fidelity pre-screening strategy could further reduce the overall computational cost. Analysis of a subset (∼1600 compounds) where both PBEsol and HSE + SOC ΔHdecomp are available shows that a PBEsol-based filter (ΔHPBEsoldecomp > 0.20 eV per f.u.) eliminates ∼470 compounds—94% confirmed unstable at HSE + SOC—with a false-negative rate of only 6.2%. A tiered PBEsol → HSE → HSE + SOC approach could reduce the HSE + SOC calculation burden by ∼25–30% in future high-throughput campaigns.
From Fig. 2(a), it is seen that only a minority of all compounds occur in the region of bulk stability (ΔHdecomp < 0 eV) and PV-suitable bandgaps. We find that out of the 6763 total compounds, 4464 exhibit ΔHdecomp > 0 eV, indicating that they are not stable against decomposition to competing phases. 974 compounds in total (occupying the shaded region in Fig. 2(a)) show both ΔHdecomp < 0 eV and 1.0 < Egap < 2.0 eV, which accounts for 15% of the entire DFT dataset. In addition to meeting the Egap and stability conditions, we find that a total of 588 compounds exhibit SLME > 30%. Furthermore, it was ensured during all geometry optimization runs that the crystal structure remains orthogonal (α ≈ β ≈ γ ≈ 90°), especially during alloying where strain effects and symmetry breaking could lead to significant structural distortions even when ΔHdecomp < 0 eV. To address this, we carefully analyzed the lattice parameters and filtered out distorted structures that deviated significantly from orthogonality. After this refinement, we identified 541 stable compounds with the desired bandgaps and SLME > 30%, some of which are listed in Table 2.
| Semiconductor (ordering) | SLME (%) |
|---|---|
| Na0.5Cu1.5CdGeSe4 (Ordering I) | 32.68 |
| Cu2ZnGeSSe3 (Ordering I) | 32.66 |
| Cu2MgSnS2Se2 (Ordering I) | 32.65 |
| Ag2SrSnSe2Te2 (Ordering II) | 32.65 |
| AgAl0.5Ga0.5Te2 (Ordering I) | 32.65 |
| Na0.5Ag0.5InTe2 (Ordering I) | 32.65 |
| Cu2CaSnSSe2Te (Ordering II) | 32.65 |
| CuAlSeTe (Ordering I) | 32.65 |
| K0.5Cs0.5Cu0.5Ag0.5BaSnTe4 (Ordering II) | 32.65 |
| K0.5CuAg0.5ZnSnS4 (Ordering II) | 32.65 |
| CuGaTe2 (Ordering I) | 32.64 |
| Ag2CaSnTe4 (Ordering II) | 32.64 |
| KCuMgSnSe4 (Ordering II) | 32.64 |
| Ag2MgSnSSe3 (Ordering I) | 32.64 |
| Ag2CaSnS2Te2 (Ordering II) | 32.64 |
| Cu2CdGeS2Se2 (Ordering I) | 32.64 |
| Cu0.5Ag1.5MgGeSe4 (Ordering I) | 32.63 |
| Cu2Ba0.5Cd0.5SnS4 (Ordering I) | 32.63 |
| K0.5Cu1.5ZnSnS4 (Ordering II) | 32.63 |
| KCu0.5Ag0.5BaSnTe4 (Ordering II) | 32.62 |
![]() | ||
| Fig. 3 Defect formation energy (Ef) vs. Fermi level (EF) under Ag-rich conditions for (a) AgAl0.5Ga0.5Se2 in Ordering I, and (b) Ag2Ca0.5Cd0.5ZrSe4 in Ordering II computed using the HSE06 functional. | ||
The initial random forest training set of 1650 compounds was selected via stratified sampling across all cation and anion site occupancies, ensuring balanced coverage of the chemical space.35 This set included all 540 pure (non-alloyed) compositions plus ∼1050 systematically chosen alloys spanning single- and multi-site mixing. Active learning67 was subsequently used to expand the DFT dataset to 6763 compounds. The MLFF (M3GNet48) was trained on snapshots from the HSE geometry optimization trajectories of all 6763 bulk compounds plus defect configurations across 314 compounds, including intermediate relaxation steps to capture the full potential energy surface.
Random forest models68,69 utilized an 80
:
20 train–test split and 5-fold cross-validation. Grid-based search was used to tune the hyperparameters such as the number of trees, maximum depth, features per split, and the minimum sample counts required for node splitting and leaf formation. This procedure balanced generalization and interpretability while allowing us to extract feature importance rankings from the fitted trees. Parity plots in Fig. S5 compare random forest predictions with the HSE + SOC ground truth for the three properties of interest, also showing uncertainties in prediction based on standard deviations across different trees. Test set root-mean-squared errors (RMSEs) are 0.31 eV for ΔHdecomp, 0.25 eV for Egap, and 4.27% for SLME, corresponding to roughly 90–95% accuracy across the observed ranges of the dataset. The larger SLME error is consistent with earlier work53 that relied solely on composition-based descriptors.
To further broaden the training data in regions where the model exhibited large errors and high prediction uncertainty, we implemented an active learning (AL) loop as illustrated in Fig. S6(a). After each iteration, the random forest ensemble was retrained on the expanded dataset and used to rank unexplored compositions based on an upper-confidence-bound acquisition function, UCB(X) = µ(x) + κσ(x), where µ(x) is the predicted SLME and σ(x) the ensemble variance. The top-ranked candidates were subjected to additional HSE06 + SOC calculations, the new results were incorporated into the dataset for model retraining, and the cycle was repeated. In total, approximately 20 AL rounds contributed an additional 5113 targeted HSE06 + SOC calculations, as pictured in Fig. S6(b), finally resulting in the grand dataset of 6763 HSE + SOC data points presented and discussed earlier. The AL optimization was performed using the SLME value as the objective because of two reasons: (a) among all the models, the SLME predictor initially showed the weakest performance compared to those trained on ΔHdecomp and Egap, and (b) the SLME is ultimately the most important property for designing novel solar absorber materials. Multi-site mixing introduced structural distortions in some compositions, producing outliers that proved problematic for the ML models. To ensure robust training, we removed structures that were heavily distorted after DFT optimization, along with data points exhibiting very high ΔHdecomp > 3 eV per p.f.u., large Egap > 4 eV, and very low SLME (<1%). Incorporating all the new data points reduced the SLME test error to 3.75%, while the test errors for ΔHdecomp and Egap fell to 0.25 eV and 0.22 eV, respectively. The final random forest models trained on the expanded dataset are presented in Fig. 4(a–c). The complete curated dataset is provided in the SI.
:
20
:
20 train–validation–test split. All training protocols were performed on a single NVIDIA A100 GPU (80 GB Memory).
Fig. 5(a) and (b) present violin plots of the formation-energy distribution, ΔHform, respectively for the entire datasets of bulk and defect configurations used for training the model. Across the total set of 22
000 bulk structures, ΔHform ranges from +100 to −1800 meV per atom, whereas for the set of 1000 defect structures, it spans from −700 to −1500 meV per atom. A few extreme high-energy outliers were removed, but moderately unstable configurations were retained to improve the robustness of the MLFF model. The neutral charge-state (q = 0) model was trained on (1) snapshots from the 2 × 2 × 1 supercell bulk dataset; (2) snapshots from the 3 × 3 × 2 supercell bulk dataset; and (3) snapshots from the 3 × 3 × 2 supercell defect dataset. Fig. 5(c) and (d) show parity plots comparing DFT and MLFF predictions of ΔHform and atomic forces respectively (parity plots for the remaining charge states are provided in the SI). The RMSE corresponding to ΔHform is below 7 meV per atom, while the force RMSE is below 0.45 eV Å−1. Using this model, thousands of unexplored chalcogenide compounds can now be optimized with near-HSE accuracy in minutes. Geometry optimizations with the trained MLFF employed the FIRE algorithm in ASE.71 Following the MLFF-accelerated defect workflow of Mosquera-Lois et al.,70 MLFF geometry optimizations were run for a maximum of 100 ionic steps with a target mean force tolerance of 10−5 eV Å−1. In practice, all relaxations reached the 100-step limit without satisfying this strict force criterion, with the maximum residual forces converging to approximately 0.01 eV Å−1. However, as shown in Fig. S11, the total energy converges within the first 20–30 steps and remains effectively unchanged thereafter (<0.1 meV per atom variation), indicating that the structures are energetically well-converged despite not meeting the formal force threshold.
To handle charged defects, separate M3GNet MLFF models were trained for each charge state (q = +2 to −2), with each model learning the energy-force-stress mapping from DFT data computed at that specific charge state. The charge-state dependence is implicitly captured through the training data rather than an explicit charge encoding in the model architecture. As shown in Fig. S7, all charge-state models achieve comparable accuracy (RMSE of 7–9 meV per atom). Benchmark MLFF relaxation trajectories for representative defects in Cu2Ca0.5Cd0.5SnS4 (Fig. S11) confirm that the MLFF-relaxed energies converge to within ∼0.5–0.7 eV of DFT reference values across all charge states, with no systematic bias in the sign of the deviation, validating the charge-state-resolved training approach.
To evaluate the accuracy and utility of the MLFF model, we performed the following tests:
1. Energy optimization of bulk structures: we applied the MLFF to optimize the geometries of 50 selected compounds from the test set, and compared the predicted formation energies with the DFT optimized reference energies. Fig. 6(a) shows DFT-optimized vs. MLFF-optimized ΔHform, yielding an RMSE of ∼6 meV per atom.
2. Lattice parameter prediction: for the same set of 50 compounds, Fig. 6(b) shows a parity plot for the DFT-optimized and MLFF-optimized lattice constant values, demonstrating agreement within ±1.5% (RMSE <0.1 Å).
3. Computing defect formation energies: Fig. 6(c) presents the defect formation energy diagrams for VAl and AgSe under Ag-rich conditions in AgAl0.5Ga0.5Se2 (Ordering I). The solid lines correspond to full HSE06 calculations while the dashed lines represent optimized energies from MLFF predictions. The MLFF successfully reproduces the overall EF dependence and relative energetics of different charge states. This consistency indicates that the MLFF captures the essential defect thermodynamics, including the correct donor or acceptor character reflected in the slope of each segment.
000 possible compounds. Fig. 7(a) presents a visualization of the predicted ΔHdecomp vs. the Egap, while Fig. 7(b) shows the SLME plotted against Egap. 127
816 compounds were predicted to have ΔHdecomp < 0, indicating thermodynamic stability. Among these, 1201 compounds simultaneously exhibit ΔHdecomp < 0 and SLME > 30%, some of which are presented in Table 4. Fig. 8(a) shows the screening hierarchy used in this work to identify compounds that satisfy multiple property objectives, culminating in selected defect simulations to evaluate defect tolerance and dopability. An examination of the relative occurrences of different chemical species across the 1201 compounds with optimal properties is pictured in Fig. 8(b).
Cu is the dominant A-site cation in both ternary and quaternary systems, while Sn predominates at the C-site in quaternary compounds. Selenides occur more frequently than sulfides or tellurides. The infrequent presence of group I (alkali) and group II (alkaline earth) metals suggests that these elements may be more effective as dopants for property enhancement, particularly for defect tolerance, rather than as primary cation site occupants. Random forest predictions of ΔHdecomp, Egap, and PV efficiency for all 532
000 chalcogenide compounds are provided in the SI as CSV files, along with available DFT (HSE + SOC) data—including lattice parameters, ΔHdecomp, Egap, εoptical, and optimized structures.
We next used the MLFF models for different charge states to optimize the geometries of native point defects in one of the top candidate compounds, Cu2Ca0.5Cd0.5SnS4, which shows EDFTgap = 1.32 eV compared to EMLgap = 1.38 eV, and SLMEDFT = 31.05% compared to SLMEML = 30.92%. To generate defect configurations, we used an automated workflow based on the ShakeNBreak protocol64 to introduce systematic bond distortions. The Doped package was used to automatically determine the chemical potential limits for each defect using the CompetingPhasesAnalyzer module. All native point defects (vacancies, interstitials, and antisites) were structurally optimized across five charge states (−2, −1, 0, +1, and +2) using the MLFF models. Subsequently, single-point HSE06 calculations were performed on the MLFF-optimized structures to obtain accurate final defect formation energies.
Fig. 9(a) shows the optimized crystal structure of Cu2Ca0.5Cd0.5SnS4 and Fig. 9(b) shows the calculated formation energies for the lowest-energy native defects in this compound under S-rich conditions. VCu exhibits the lowest formation energy across most of the bandgap region, indicating that a Cu vacancy is the dominant intrinsic acceptor, consistent with the p-type conductivity commonly observed in Cu-based chalcogenides. The VCa and VCd defects show higher formation energies closer to the valence band. Among donor-type defects, the CaCu cation anti-site substitution defect has the lowest formation energy (Ef ≈ 0.8 eV at the VBM), and does not show any charge transition level inside the bandgap region. The dominance of low-energy VCu acceptors and the absence of shallow donor defects suggest that Cu2Ca0.5Cd0.5SnS4 will exhibit moderately p-type conductivity, with n-type doping likely challenging to achieve.
![]() | ||
| Fig. 9 (a) Optimized crystal structure of Cu2Ca0.5Cd0.5SnS4. (b) Defect formation energies as a function of Fermi level, computed for Cu2Ca0.5Cd0.5SnS4 under S-rich chemical potential conditions. | ||
Since the MLFF-optimized geometries serve as input for final HSE + SOC single-point calculations, consistency between the MLFF potential energy surface and the HSE reference is critical. PBEsol-trained models would yield relaxed geometries that may not correspond to minima on the HSE surface, particularly for defect structures where exchange-driven electron localization effects are significant. To directly illustrate this, Fig. S9 presents parity plots of MLFF-predicted vs. DFT-computed atomic forces for both PBEsol- and HSE-trained models on neutral defect configurations. The HSE-trained MLFF achieves a force RMSE of 0.030 eV Å−1, lower than that of the PBEsol-trained model (RMSE = 0.051 eV Å−1), demonstrating that the HSE potential energy surface is more accurately captured by the MLFF and yields superior force predictions for defect geometry optimization.
While detailed defect formation energy diagrams are presented only for selected case studies for the sake of brevity, the entire defect dataset spans over 1000 native point defect configurations across 314 compounds, each computed using a 3 × 3 × 2 supercell with PBEsol geometry optimization followed by static HSE06 calculations. Representative defect formation energy diagrams for 18 randomly selected compounds under cation-rich chemical potential conditions are provided in Fig. S10. General trends include (i) Cu/Ag vacancies are consistently the lowest-energy acceptors, supporting p-type conductivity; (ii) anion-related defect energetics are sensitive to the anion species, with Te-containing compounds showing higher formation energy defects; (iii) anti-site defects involving B/C-site cations most frequently introduce deep levels in the Egap; and (iv) several compounds exhibit high minimum defect formation energies (>3 eV) across the entire Egap, indicating strong defect tolerance.
To complement the optoelectronic screening, we computed electronic band structures from HSE + SOC for four selected top-ranked candidates: Cu2ZnGeSSe3, Cu2MgSnS2Se2, AgAl0.5Ga0.5Te2, and Cu2Ca0.5Cd0.5SnS4 (Fig. S8). These compounds span Cu- and Ag-containing systems with diverse anion chemistries and represent candidates with favorable Egap (1.3–1.5 eV), negative ΔHdecomp, and high SLME values. The band structures reveal dispersive band edges near the Fermi level, suggesting moderate effective masses conducive to efficient carrier transport. Table 3 presents the computed electron and hole effective masses of these compounds alongside other properties.
| Semiconductor (ordering) | EMLgap (eV) | SLMEML (%) |
|---|---|---|
| Cu2ZnGeSSe3 (Ordering II) | 1.21 | 32.32 |
| KCu0.5Ag0.5MgGeSe4 (Ordering I) | 1.22 | 32.26 |
| Ag2MgSnSSe3 (Ordering II) | 1.31 | 32.26 |
| Cu2Ba0.5Cd0.5SnS4 (Ordering II) | 1.40 | 32.17 |
| K0.5Cu1.5ZnSnS4 (Ordering I) | 1.34 | 32.15 |
| Ag2SrSnSe2Te2 (Ordering I) | 1.28 | 32.11 |
| K0.5Cu0.5AgZnSnS4 (Ordering I) | 1.32 | 32.02 |
| KCuMgSnSe4 (Ordering I) | 1.26 | 32.00 |
| K0.5Cu1.5Mg0.5Cd0.5SnS4 (Ordering I) | 1.32 | 31.96 |
| K0.5Cu0.5AgZn0.5Cd0.5SnS4 (Ordering I) | 1.29 | 31.94 |
| Cu2ZnGeS2Se2 (Ordering II) | 1.44 | 31.93 |
| Cu2CdGeS2Se2 (Ordering II) | 1.37 | 31.93 |
| Ag2Sr0.5Cd0.5SnS4 (Ordering I) | 1.40 | 31.92 |
| K0.5Rb0.5Cu0.5Ag0.5MgGeSe4 (Ordering I) | 1.17 | 31.92 |
| K0.5Cu0.5AgCdSnS4 (Ordering I) | 1.27 | 31.91 |
| Cu2Zn0.5Cd0.5SnS4 (Ordering II) | 1.23 | 31.90 |
| RbCu0.5Ag0.5MgGeSe4 (Ordering I) | 1.20 | 31.90 |
| KCu0.5Ag0.5MgSnSe4 (Ordering I) | 1.16 | 31.89 |
| K0.5Cu0.5AgMg0.5Cd0.5SnS4 (Ordering I) | 1.34 | 31.89 |
| Cu2Sr0.5Cd0.5SnS4 (Ordering II) | 1.40 | 31.87 |
Notably, the hole effective masses of all four candidates (0.113–0.162 m0) are significantly lighter than those of established photovoltaic absorbers such as CdTe
72 and Cu2ZnSnS4
.73 These low hole effective masses indicate highly dispersive valence band edges, which are favorable for efficient hole transport and reduced recombination losses in photovoltaic devices. Among the four compounds, AgAl0.5Ga0.5Te2 exhibits the lightest hole mass (0.113 m0), while Cu2Ca0.5Cd0.5SnS4 has the heaviest (0.162 m0), yet both remain substantially lighter than CdTe and CZTS.
While the 1201 candidates identified in this work satisfy thermodynamic stability criteria (ΔHdecomp < 0 eV) computed against relevant competing binary phases, we acknowledge that experimental synthesis feasibility depends on additional factors not captured by ground-state DFT calculations alone, including kinetic barriers to nucleation and growth, competition with metastable polymorphs, and non-equilibrium processing conditions. It has been realized that configurational mixing entropy (mean contribution of −0.032 eV) stabilizes mixed compositions further, highlighting the thermodynamic favorability of cation mixing in these structures. Many base compounds in our chemical space (e.g., CZTS,74,75 CuInS2, and CuGaSe2 (ref. 12)) have been experimentally realized, and the alloyed compositions we predict are derived from these via systematic cation/anion substitution, which is a well-established experimental strategy.24,76 In a related recent publication,77 we developed models to predict the “synthesizability” of perovskite-type compounds by combining experimental and computational data, and found that ΔHdecomp has a very clear correlation with the synthesis likelihood, thus reinforcing our belief that a negative ΔHdecomp value is a good indicator of thermodynamic feasibility and compound formability. Future work will include developing similar synthesis prediction models for multi-nary chalcogenides as well driving collaborative synthesis and characterization of the most promising candidates to validate our computational predictions. We anticipate that the publicly available ChalcoDB platform and dataset will help guide and prioritize such experimental efforts.
| This journal is © The Royal Society of Chemistry 2026 |