Matthew Walker and Keith T. Butler*
Department of Chemistry, University College London, 20 Gordon Street, London WC1H 0AJ, UK. E-mail: matthew.walker.21@ucl.ac.uk; k.t.butler@ucl.ac.uk
First published on 5th December 2025
Computational screening has become a powerful complement to experimental efforts in the discovery of high-performance photovoltaic (PV) materials. Most workflows rely on density functional theory (DFT) to estimate electronic and optical properties relevant to solar energy conversion. Although more efficient than laboratory-based methods, DFT calculations still entail substantial computational and environmental costs. Machine learning (ML) models have recently gained attention as surrogates for DFT, offering drastic reductions in resource use with competitive predictive performance. In this study, we reproduce a canonical DFT-based workflow to estimate the maximum efficiency limit and progressively replace its components with ML surrogates. By quantifying the CO2 emissions associated with each computational strategy, we evaluate the trade-offs between predictive efficacy and environmental cost. Our results reveal multiple hybrid ML/DFT strategies that optimize different points along the accuracy–emissions front. We find that direct prediction of scalar quantities, such as maximum efficiency, is significantly more tractable than using predicted absorption spectra as an intermediate step. Interestingly, ML models trained on DFT data can outperform DFT workflows using alternative exchange–correlation functionals in screening applications, highlighting the consistency and utility of data-driven approaches. We also assess strategies to improve ML-driven screening through expanded datasets and improved model architectures tailored to PV-relevant features. This work provides a quantitative framework for building low-emission, high-throughput discovery pipelines.
New concepts: This work introduces a novel framework for evaluating materials discovery strategies that explicitly balances predictive performance with environmental impact—specifically carbon emissions. While machine learning (ML) has been widely heralded as a route to accelerate computational screening, our approach is the first to rigorously benchmark the carbon cost of ML-augmented workflows against traditional density functional theory (DFT) pipelines. By treating emissions as a quantifiable design parameter, we reveal trade-offs and “sweet spots” along an accuracy–emissions Pareto front, challenging the prevailing assumption that model accuracy alone should guide methodological choice. Our framework enables materials scientists to make evidence-based decisions about when and how to incorporate ML into discovery campaigns. This concept represents a shift in how computational efficiency is defined—from a purely time- or resource-based metric to one that incorporates sustainability. The additional insight is twofold: (1) we show that certain ML surrogates not only reduce emissions but also outperform higher-fidelity DFT calculations in screening contexts, and (2) we identify clear priorities for future model and dataset development to maximize impact per unit of carbon emitted. As AI becomes increasingly integrated into materials research, our contribution lays the groundwork for responsible, low-emission innovation in computational materials science.
Global PV capacity reached approximately 1.6 TW in 2023,2 and a future push toward 30–70 TW by 2050 could see PVs meeting most of the world's energy requirements.3 Achieving this target requires the development of new materials as well as the optimization of existing ones.4 While crystalline and multi-crystalline Si modules remain the industrial standard,5 alternative materials such as amorphous Si,6 CIGS,7 CdTe,8 organic photovoltaics,9 and dye-sensitized solar cells10 have been commercialized to varying degrees of success. A number of perovskites have also emerged as promising candidates in the last decade.11 However, established technologies often rely on critical raw materials, toxic elements, or suffer from long-term stability issues, conversion efficiency limitations, or low technological flexibility; overcoming these challenges is essential for reaching TW-level production of PV energy.12
New inorganic materials offer significant promise as future PV absorbers due to their potential for low-cost fabrication, defect tolerance, earth abundance, and facile synthesis via various techniques such as sol–gel processing or sputtering.13–16 These materials exhibit stability across a wide range of thermal, chemical, and mechanical conditions and are compatible with device architectures that may offer lower capital costs, enabling rapid scale-up.13
Computational modelling has played an important role in the development of new inorganic photovoltaic materials such as CZTS,17,18 SnS,19 BiSI,20 Sb2Se3,21 CdTe22 and many others. Typically, these studies have relied on DFT calculations, allowing accurate estimation of optical absorption, carrier transport and defect properties.23 Although such DFT calculations are more efficient than experimental synthesis and characterisation, they nonetheless carry a non-negligible energy cost. In recent years there has been a trend towards replacing some of the costly DFT calculations with ML surrogate models. However, the questions previously raised about the veracity of these models remain largely unanswered.
To address these questions, we have developed a framework that enables the joint assessment of both predictive accuracy and carbon emissions associated with different computational approaches for estimating PV performance in novel inorganic crystalline materials. These approaches span from hybrid-functional DFT (the most computationally expensive) to direct ML estimation of maximum PV efficiency (the least expensive), and include intermediate strategies such as predicting optical absorption profiles or applying corrections to low-fidelity DFT calculations based on the generalized gradient approximation (GGA).24 The paper begins with a detailed outline of our evaluation methodology, covering both PV property estimation and carbon emission quantification. We then compare these approaches in terms of predictive efficacy and environmental cost. Our analysis allows us to propose optimal trade-offs, highlight important limitations, and suggest promising directions for future research aimed at improving the effectiveness and sustainability of computational PV screening. More broadly, our framework offers a template for evaluating computational discovery pipelines in which resource intensity is considered alongside predictive performance—a consideration we believe will be increasingly important across many areas of energy research.
Instead, the efficiency of potential PV absorber materials can be estimated using the spectroscopic limited maximum efficiency (SLME).27 The theory and practical details of calculating SLMEs are discussed in the following section. To distinguish between these methods (since both use detailed balance), we shall henceforth refer to methods using a step-function approximation of the absorptance spectrum as ‘step-function methods’ and those that use the calculated/predicted spectrum as ‘SLME methods’.
The SLME of a material requires an absorption profile α(E), usually in units of cm⁻¹, and an ‘offset’ (in our taxonomy):

Δ = E_g^da − E_g,

where E_g^da is the direct allowed band gap and E_g the fundamental band gap. For a film of thickness d, the absorptance is then

A(E) = 1 − exp(−2d·α(E)).
This absorptance is used, together with the incident solar photon flux, to calculate the short-circuit current density, J_sc.
Detailed balance requires that the rate of radiative emission equals that of photon absorption from the surroundings, which can be quantified using the black-body spectrum at the temperature, T, of the solar device. This gives the reverse saturation current density (or recombination current density), J_0.
The voltage-dependent total current density, J(V), is then multiplied by the voltage to give the power density, P(V).
The maximum value of this power, P_max, is found at some balance of V and J, giving the maximum efficiency as the ratio of P_max to the incident solar power density, P_in.
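Since the typeset equations are not reproduced above, the block below sketches the corresponding detailed-balance expressions in the standard SLME formulation of Yu and Zunger;27 here Φ_sun is the AM1.5G photon flux, Φ_bb the black-body photon flux at temperature T, and the remaining symbols are as defined in the text.

```latex
% Sketch of the standard SLME detailed-balance expressions (after ref. 27);
% reconstructed for reference, not copied from the original typeset article.
\begin{align}
  J_{\mathrm{sc}} &= e \int_0^{\infty} A(E)\, \Phi_{\mathrm{sun}}(E)\, \mathrm{d}E, \\
  J_0 &= \frac{e\pi}{f_r} \int_0^{\infty} A(E)\, \Phi_{\mathrm{bb}}(E, T)\, \mathrm{d}E,
    \qquad f_r = \exp\!\left(-\frac{\Delta}{k_{\mathrm{B}} T}\right), \\
  J(V) &= J_{\mathrm{sc}} - J_0 \left[\exp\!\left(\frac{eV}{k_{\mathrm{B}} T}\right) - 1\right], \\
  P(V) &= J(V)\, V,
    \qquad \eta_{\mathrm{SLME}} = \frac{P_{\mathrm{max}}}{P_{\mathrm{in}}},
    \qquad P_{\mathrm{in}} = \int_0^{\infty} E\, \Phi_{\mathrm{sun}}(E)\, \mathrm{d}E.
\end{align}
```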
The scissor correction shifts the GGA-level absorption spectrum up in energy by the difference between the HSE and GGA band gaps,

ΔE = E_g^HSE − E_g^GGA,

so that the corrected profile is approximated as

α_HSE(E) ≈ α_GGA(E − ΔE).
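In practice this correction is a negligibly cheap post-processing step; a minimal sketch is given below (the function and variable names, grid, and example values are ours, not the authors' code).

```python
# Minimal sketch of a scissor correction applied to a GGA absorption
# spectrum: alpha_HSE(E) ~= alpha_GGA(E - dE), with dE = Eg(HSE) - Eg(GGA).
# Illustrative only; not the implementation used in the paper.
import numpy as np

def scissor_shift(energies, alpha_gga, delta_e):
    # Evaluate the GGA spectrum at E - delta_e; below the shifted onset the
    # absorption is taken as zero.
    return np.interp(energies - delta_e, energies, alpha_gga, left=0.0)

energies = np.linspace(0.0, 10.0, 2001)               # eV grid
alpha_gga = 1e5 * np.clip(energies - 1.0, 0.0, None)  # toy GGA spectrum, cm^-1
alpha_hse = scissor_shift(energies, alpha_gga, delta_e=0.6)
```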
In Table 1 we provide a list of potential workflows where electronic structure calculations are replaced with ML surrogates, where DFT calculations are indicated by their functional class, GGA or HSE. We have also provided a number of workflows that include step-function approaches and purely GGA-level properties to provide context for the accuracies and costs of the ML-based approaches. Note that in method II the model has been trained on scissor-corrected spectra, so a subsequent calculated or predicted scissor correction is not necessary. Those properties that are calculated using straightforward and negligibly expensive operations, in our case executed in Python, are represented as ‘Py’ in the table. For instance, step-function approaches use Python to estimate an absorptance profile from the material's band gap (a minimal sketch of the two absorptance constructions follows the table).
| Method | E_g | E_g^da | α(E) | A(E) | ΔE | Δ | SLME |
|---|---|---|---|---|---|---|---|
| I | — | — | — | — | — | — | ML |
| II | — | — | ML | Py | — | ML | Py |
| III | GGA | GGA | GGA | Py | ML | Py | Py |
| IV | ML | — | — | Py | — | — | Py |
| V | HSE | — | — | Py | — | — | Py |
| VI | GGA | — | — | Py | ML | — | Py |
| VII | GGA | GGA | GGA | Py | — | Py | Py |
| VIII | HSE | GGA | GGA | Py | Py | Py | Py |
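As referenced above, the two absorptance constructions can be written in a few lines of Python. The sketch below is illustrative only, under our own naming and an assumed 500 nm film thickness; it is not the code used in this work.

```python
# Illustrative sketch of the two absorptance constructions discussed above.
import numpy as np

def step_absorptance(energies, band_gap):
    # Step-function methods (IV-VI): every photon above the band gap is
    # absorbed, every photon below it is not.
    return np.where(energies >= band_gap, 1.0, 0.0)

def slme_absorptance(alpha, thickness=5e-5):
    # SLME methods: A(E) = 1 - exp(-2 d alpha(E)) for a film of thickness d
    # in cm (5e-5 cm = 500 nm here, an assumed illustrative value).
    return 1.0 - np.exp(-2.0 * thickness * alpha)
```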
We use GGA-level band gaps for the offset calculation because obtaining the offset requires an optics calculation, which inherently produces the data required for a GGA absorption spectrum, making it inefficient to predict one without the other. GGA offsets will introduce some error, though both gaps in the equation will be wrong by similar amounts, cancelling out some of this error. However, the test dataset also used GGA-level offsets, so this source of error was not examined in this work.
Fig. 1 shows the relative cost of the calculations and predictions used in this work. Note that the area of a circle is proportional to the natural logarithm of its relative carbon cost, so the difference is even more stark than it appears. The negligible Python calculations are given as crosses to emphasise their low cost. ML inferences are also extremely inexpensive, though they can be more meaningfully quantified as incurring a carbon cost around 1/2000 of that of a static GGA calculation, which is itself around an order of magnitude cheaper than a similar HSE calculation. In terms of energy, this single ML inference used around 1.9 × 10⁻³ Wh (around 7 J), which CodeCarbon25 estimates as producing 4.5 × 10⁻⁴ g of CO2: equivalent to driving a typical diesel transit bus 0.3 mm.33
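For readers unfamiliar with CodeCarbon,25 a minimal sketch of how such a per-inference footprint can be measured is shown below; `predict_slme` is a placeholder standing in for any trained surrogate model's inference call, not a function from this work.

```python
# Minimal sketch of measuring the footprint of a single ML inference with
# CodeCarbon. The predict_slme function is purely illustrative.
from codecarbon import EmissionsTracker

def predict_slme(structure):
    # placeholder for a trained surrogate (e.g. a graph neural network)
    return 0.25

tracker = EmissionsTracker(project_name="ml-inference")
tracker.start()
eta = predict_slme("placeholder-structure")
emissions_kg = tracker.stop()  # estimated kg CO2-eq for the tracked block
print(f"SLME = {eta:.2f}, emissions ~ {emissions_kg * 1e3:.2e} g CO2-eq")
```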
Fig. 1 Plot of the methods for estimating SLMEs outlined in Table 1, with crosses representing Python calculations and circles representing more costly calculations, whose area scales with the natural logarithm of the relative carbon cost C. The absorptance column from Table 1 has been excluded for brevity.
Finally, optics and band-structure calculations are more expensive than static calculations, making an accurate absorption-spectrum-prediction model all the more promising. The figure does in some ways under-represent the cost of machine learning approaches, since training (and hyperparameter tuning, though this was not performed in this work) is not included. Training model I on the 4.8k dataset for 300 epochs cost the equivalent of around 1.7 × 10⁵ single inferences. This is more indicative of how small the inference costs are than of how large the training costs are. Moreover, these are one-off costs that become negligible when the models are applied to vast datasets, and they would not be incurred by future users of these models.
Fig. 2(a) shows how the performance of the various ML models on a held-out test set evolves as the size of the training data increases. The dataset size is truncated at just under 5000: the number of materials in both the band gap and absorption spectra datasets (after a test/training split), since both are required to calculate a scissor-corrected SLME. From this plot, it is quite clear that all of the property models are still improving with more training data, and we have not reached data saturation. In the SI we show how the predicted absorption spectrum of GaAs (not in the training set) improves with more training data: in particular, point-to-point correlation is achieved at around 1k training data points, with the curve becoming smooth.
In the context of the final target (accurate SLMEs for PV screening), we show the effect of dataset size in Fig. 2(b). Here, the plotted quantity is the error in the final estimated SLME when a particular ML model is used in the workflow. The dotted red line shows a null hypothesis, where our “model” simply predicts the mean SLME of the training data. Clearly, with only a few training data points (≤100) all models exceed this baseline, even the model that predicts the high-dimensional absorption spectrum. The plot also demonstrates how, with ∼100 data points, all workflows incorporating ML perform favourably when compared to calculating an SLME from a low-cost, low-fidelity DFT optical absorption profile obtained from a GGA calculation (without a scissor correction).
Perhaps more important than the absolute values in Fig. 2(b) are the gradients, which indicate how the predictions may improve with additional data collection. The direct prediction of SLME (with no DFT intermediates) shows the steepest gradient, and extrapolation at the current rate of model improvement suggests that, with several tens of thousands of high-quality estimates of SLME, a model with negligible errors is possible.
If the absorption spectrum is known but the offset is not, Fig. 2 suggests that the inclusion of an ML-predicted offset is worthwhile (rather than a semi-SLME approach with f_r = 0), provided that the training dataset size exceeds ∼10³.
Predicting the absorption spectra and calculating the absorptance from them gives errors very similar to predicting the absorptance directly. This is perhaps surprising, as absorptance spectra are naturally scaled to be between 0 and 1 and are relatively featureless (all more or less sigmoid-shaped), whereas absorption values may be anywhere between 0 and 10⁷ and the overall profile is generally more irregular. One possible explanation is that, when the absorptance is calculated from the predicted absorption spectrum, small discrepancies are smoothed out by the exponential function, thereby reducing the propagated error in the final SLME, whereas direct absorptance prediction has no such advantage. Given the better performance with the full training set, the methods that include spectral prediction therefore predict absorption rather than absorptance.
Fig. 3 Violin plot comparing the success of the methods outlined in Table 1 in recreating the test set's SLMEs, in terms of raw accuracy (left-hand axis) and ranking order when the materials are ranked by their SLME (right-hand axis). Note that the numerical difference is η_pred − η_true, so a positive difference is an overestimate.
Comparing the seven methods considered (method VIII is how the test set is calculated), we see some common trends. Scalar properties (SLME and scissor correction, methods I and III) are easier to predict than high-dimensional properties (the absorption spectrum as part of method II). Method II also suffers from the combination of errors, using predictions for the offset (by itself rather well predicted, see Fig. 2) and the absorption spectrum. This inaccuracy leads the step function-based approaches (methods IV–VI) to outperform method II. Otherwise, these approaches struggle compared to direct SLME prediction. Method V, wherein the band gap is calculated at the HSE level, does the best of these approaches, but the cost of this calculation is significantly higher than that of the ML inference in method I, as discussed in Section III D.
Finally, method VII, based on all GGA-level calculations (without any kind of scissor correction), is the poorest-performing approach. Interestingly, this approach gives the most clearly systematic error, with the vast majority being overestimates. GGA is known to underestimate band gaps due to the self-interaction error, so the absorption profiles will have an earlier onset, and thus we would expect larger short-circuit currents, but not necessarily larger efficiencies due to the voltage–current trade-off: smaller band gaps mean each excited carrier has less energy. We also see some systematic behaviour in method II, where SLME overestimates are limited to around 5 percentage points, while underestimates can be much more significant. The step-function approaches also tend to overestimate SLMEs: this is likely because real absorptance spectra have more gradual onsets than step functions, especially for materials with indirect band gaps.
We can see from Fig. 3 that different ML interventions introduce errors with different degrees of systematicity. This is a reminder that training objectives and benchmarks commonly used to compare ML models are not always appropriate for a given task.39 More specifically for ML-driven PV screening, this shows that learning SLME directly is probably preferable to predicting an absorption profile and using it to calculate the SLME. The direct SLME prediction is both more likely to improve with more data and gives more systematic errors. Any effort to generate more high-quality absorption profiles could be trivially translated to SLMEs; therefore, this is the most promising path for the screening of PV materials.
Fig. 4 Pareto front for performance vs. cost for the methods outlined in Table 1, where performance is measured as (a) MAE in SLME and (b) MAE in rank when the test set and the predictions are sorted by their SLMEs.
Another consideration when comparing the accuracy of different machine learning approaches is interpretability: the direct SLME prediction is something of a black box, whereas predicting the absorption spectrum and offset gives better insight into why a given material is a good absorber. It also allows us to calculate properties like the short-circuit current and photovoltage of a material, extending the applicability of this approach beyond traditional solar cells. Moreover, calculated SLMEs have the temperature, material thickness, and incoming radiation profile (typically the AM1.5G spectrum) implicit in their value, whereas predicting the spectra allows the user to alter these parameters for their application. This could be particularly useful when looking for materials for solar cells used on satellites or in indoor lighting. However, the distance between this approach and the other five methods on the Pareto front makes it hard to justify. Method III is perhaps the best compromise between interpretability and accuracy.
There is also a large gulf in interpretability between all ML-based approaches and computational chemistry calculations. Even a static energy calculation provides a wealth of information compared to a single scalar from an ML model. This is an advantage of the computational methods that is difficult to quantify, but should be considered when deciding between methods. With this in mind, the scissor-correction approach, method III, is even more powerful, providing additional information (albeit at a GGA-level) compared to more ML-based approaches, while leveraging the low-cost, high-accuracy ML prediction of the scissor correction.
A final additional factor that could be considered is domain expertise. For instance, comparing V and VIII, we see that if a hybrid band gap is already being calculated, it is only slightly more expensive to also calculate the GGA absorption spectrum and offset, enabling an SLME rather than just a step-function detailed-balance calculation. However, this requires the user to have experience with optics calculations in, for instance, DFT. Packages like Atomate2,40 used for some example calculations in this report, make this very straightforward, while ML models like the atomistic line graph neural network (ALIGNN)41 used in this work are increasingly easy to use out-of-the-box.
Fig. 4(b) tells a similar story, although the difference between numerical accuracy and ranking accuracy is highlighted by method VI becoming part of the Pareto front. This seems to be a combination of method I being relatively poor at accurate ranking and method VI being relatively good. However, VI is only narrowly better than I and is over three orders of magnitude more expensive, while III is much more accurate at less than 10× the expense, making VI difficult to justify in most instances.
Fig. 5 Violin plot comparing the successes of the SLME-predicting ML model (method I) and the Choudhary et al.42 TB-mBJ dataset in reproducing the SLMEs of an external test set: the Fabini Δ-sol set.
Unsurprisingly, the model's performance on this external dataset is somewhat worse than on the internal test set sampled from the same DFT workflow as the training data. This degradation is expected, as discrepancies between the DFT methodologies used to generate training and test labels introduce additional sources of error, which compound with those from the ML model itself. Nevertheless, the model maintains a reasonable ability to rank materials by predicted SLME, as shown in the rank correlation plots (Fig. 4).
To contextualise these errors, we also compared SLME values for the same materials computed using two different DFT approaches: Δ-sol-corrected GGA (from Fabini et al.43) and the Tran–Blaha modified Becke–Johnson (TB-mBJ) potential45 (dataset from Choudhary et al.42). Interestingly, the absolute and ranking errors between these two DFT methods are comparable in magnitude to the errors observed between the ML predictions and the Δ-sol data. For example, the mean absolute error (MAE) in SLME values between TB-mBJ and Δ-sol is 7.2 percentage points, versus 6.8 percentage points for the ML predictions, and the ranking errors are of a similar scale.
These results highlight two important conclusions. First, they demonstrate that the predictive performance of ML models trained on high-fidelity data can approach the level of variability introduced by changes in the DFT methodology itself. Second, they emphasise that the generation of consistent, high-fidelity SLME datasets remains a major bottleneck in data-driven PV discovery. For SLME prediction tasks, our findings suggest that investing in better-quality training data may yield greater improvements than simply expanding the size of existing datasets. In contrast, for absorption spectrum prediction—where model errors remain large even on consistent data—improvements in model architecture and training volume may be the more effective path forward.
While these recent efforts show encouraging progress, there are still important limitations. For example, two recent studies55,56 have proposed neural network (NN) models for predicting absorption spectra, both demonstrating reasonable accuracy. However, these models were trained and tested on more constrained datasets than those used in this work, and their performance may degrade when applied to more chemically diverse materials such as those in the W-R dataset. Grunert et al.55 limited their materials to main-group elements from the first five rows of the periodic table, while Hung et al.56 allowed a broader range of elements but restricted their dataset to structures with nine atoms or fewer per unit cell. Such constraints significantly reduce the overlap with the datasets used here, particularly where both band gap and spectrum data are needed. When trained on the dataset used in this work, the GNNOpt model from Hung et al., based on the equivariant e3nn,46–50 predicts spectra that give better SLMEs than ALIGNN (see SI), but not enough to become a viable strategy, especially when the errors are confounded with those of the offsets. This suggests that developments in model architectures such as these will continue to drive improved predictions of PV-relevant properties.
Another key challenge lies in the availability of consistent, high-quality training data. Both of the recently proposed neural network models for spectral prediction were trained on data generated using generalized gradient approximation (GGA) functionals, which—as we have shown—can lead to suboptimal screening performance. The reliance on GGA is largely driven by its relative abundance compared to more accurate methods, such as hybrid-DFT. However, progress in data infrastructure and learning techniques offers promising ways forward. Initiatives such as the novel materials discovery (NOMAD) program57 and MPContribs (the platform for contributing to the Materials Project58) are enabling the sharing of curated, high-quality computational datasets in line with FAIR data principles.59
At the same time, recent advances in multi-fidelity machine learning60–62 allow models to be trained on datasets that combine varying levels of theoretical accuracy. By leveraging correlations between low- and high-fidelity data, these methods enable the use of larger training sets without sacrificing predictive reliability, thereby offering a practical route to more robust and generalizable ML models for materials discovery.
For traditional computational chemistry calculations, we note that plane-wave codes such as VASP63–65 are not the most efficient approach to hybrid DFT calculations. Atom-centred basis sets, such as those used in CRYSTAL66 and CP2K,67 are more efficient because many of the four-centre two-electron Hartree–Fock exchange integrals decay rapidly in real space, whereas their reciprocal-space equivalents (used in plane-wave DFT) do not.68 VASP was used in this study due to its widespread use in materials science (including for the generation of large datasets) and its ease of use via workflow managers like Atomate240 – we aim to replicate the most common workflows rather than necessarily the most efficient. However, we advise future hybrid DFT-based studies on photovoltaic materials to consider the more efficient atom-centred methods.
Another efficiency improvement could come from intermediate methods between GGA and HSE, such as those considered in a recent review by Janesko:69 DFT+U, self-interaction corrections, localized orbital scaling corrections, local hybrid functionals, real-space nondynamical correlation, and their Rung-3.5 approach. Several of these can reach near-hybrid accuracy at a fraction of the cost and are routinely used for systems where a full hybrid treatment would be prohibitively expensive.70,71 These methods have limitations of their own: DFT+U, for instance, requires optimisation on a case-by-case basis. The comparison between Δ-sol-corrected GGA and TB-mBJ in Section III E highlights the inconsistencies in these approaches.
Closely tied to the development of better data and models is the need for high-quality community benchmarks. As our results demonstrate, benchmarking efforts should not only assess predictive performance, but also account for the environmental cost of computation—such as carbon emissions—which can meaningfully influence the practicality of different approaches. Evaluation choices fundamentally shape not only our measurements but also research priorities and scientific progress. Ensuring transparency and reproducibility in benchmarking is therefore critical. Recent proposals, such as evaluation cards, offer a structured means of documenting the assumptions, metrics, and limitations that underpin model assessments.39,72 By adopting such practices in the context of materials discovery, the community can move toward more robust, equitable, and environmentally conscious progress in the development of machine learning for photovoltaics and beyond.
A final consideration for improvement is the SLME metric itself. The Blank selection metric73 has emerged as a more accurate computational characterisation of photovoltaic efficiency. However, it requires additional data such as the refractive index n(E), of which there are currently no large datasets. A more rigorous computational study of a candidate photovoltaic would go even further, considering factors such as defects, dopants, and stability under real operating conditions. However, as a heuristic for filtering large areas of chemical space for intrinsically good PV absorbers, the SLME should be sufficient, hence its use in this work. As we have emphasised with the ranking plots, exact numbers for efficiency are less important than identifying the best materials.
We have also identified clear pathways to improve ML surrogate models. Enhanced performance will likely require either substantially larger datasets of high-fidelity calculations than are presently available, or the implementation of transfer learning approaches that leverage extensive low-fidelity datasets alongside smaller, high-accuracy training sets.
More broadly, our study highlights the fundamental trade-off between computational cost and the efficacy of data-driven screening in materials design. We have outlined a blueprint for jointly evaluating the carbon cost and discovery performance of such campaigns. Embedding carbon cost reporting into computational discovery workflows is, we argue, a vital step toward ensuring that AI-, ML-, and simulation-driven approaches deliver truly beneficial and socially responsible innovation.
Z-score normalisation was used to scale labels for more stable gradient descent; spectral properties were normalised per energy bin. Each model was trained for 300 epochs with a batch size of 64, with the remaining hyperparameters kept in line with the model's original paper for consistency across the various properties predicted. A batch size of 2 was used for the learning curves, as this enabled every dataset size to be trained with the same batch size.
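As an illustration of the per-bin normalisation described above, a minimal sketch follows; the array shape and names are our own assumptions, not the authors' code.

```python
# Minimal sketch of per-bin z-score normalisation for an array of spectra
# with shape (n_materials, n_energy_bins). Illustrative only.
import numpy as np

def zscore_per_bin(spectra, eps=1e-12):
    mean = spectra.mean(axis=0)      # one mean per energy bin
    std = spectra.std(axis=0) + eps  # one std per energy bin (eps avoids /0)
    return (spectra - mean) / std, mean, std

def undo_zscore(scaled, mean, std):
    # invert the scaling to recover physical units for model predictions
    return scaled * std + mean
```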
Datasets from Woods-Robinson et al.,32 Kim et al.,76 Fabini et al.,43 Yu and Zunger,27 and Choudhary et al.42 were used, all accessed from freely available sources. The main dataset (the ∼5.3k overlapping materials from W-R and Kim) was split into an 80:10:10 ratio of training:validation:test data; the test materials were kept the same for all models for fairer comparison.
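A minimal sketch of such a fixed 80:10:10 split is shown below, assuming a list of dataset entries and using scikit-learn; the fixed seed is our illustrative choice, not necessarily the one used in this work.

```python
# Minimal sketch of an 80:10:10 train/validation/test split with a fixed
# held-out test set, so the same test materials are used for every model.
from sklearn.model_selection import train_test_split

def split_dataset(entries, seed=42):
    train_val, test = train_test_split(entries, test_size=0.10, random_state=seed)
    train, val = train_test_split(train_val, test_size=1 / 9, random_state=seed)
    return train, val, test  # 80%, 10%, 10% of the original entries
```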
Some examples of DFT calculations at GGA and HSE levels were performed using the projector augmented-wave (PAW) method77,78 within the Vienna ab initio Simulation Package (VASP),63–65 with CodeCarbon25 monitoring the energy (and thus carbon) cost of each calculation. Atomate240 was used to generate the input files for these calculations, with structure files from the Materials Project,58 to simulate a high-throughput workflow rather than bespoke calculations for each material. The raw numbers for these costs are available in the SI. CodeCarbon was also used for some ML training and inferences. Matplotlib was used for plotting.
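A minimal sketch of what such a high-throughput job specification might look like is given below; it assumes working installations of atomate2, jobflow, pymatgen and mp-api, a configured VASP installation, and a Materials Project API key (the mp-149 identifier and the key string are placeholders), and it is not the exact workflow used here.

```python
# Illustrative sketch: fetch a structure from the Materials Project and run
# a static GGA calculation through atomate2/jobflow. The energy and carbon
# cost of the run could be monitored by wrapping it with CodeCarbon's
# EmissionsTracker, as in the earlier inference sketch.
from mp_api.client import MPRester
from atomate2.vasp.jobs.core import StaticMaker
from jobflow import run_locally

MP_API_KEY = "your-api-key-here"  # placeholder

with MPRester(MP_API_KEY) as mpr:
    structure = mpr.get_structure_by_material_id("mp-149")  # Si, as an example

static_job = StaticMaker().make(structure)  # static GGA (PBE) job specification
responses = run_locally(static_job, create_folders=True)
```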
The data supporting this article have been included as part of the supplementary information (SI). The SI contains additional data to support the arguments made in the paper: parity plots for Model I predictions versus various out-of-distribution test sets; the effect of training set size on predicted spectrum smoothness; full information on the calculation time and carbon cost of the DFT calculations; and a comparison of how even quite good predictions of spectra can lead to large errors in derived SLMEs. See DOI: https://doi.org/10.1039/d5mh01404b.