Open Access Article

Yizhe Chen†a, Shomik Verma†b, Kevin P. Greenman†cde, Haoyu Yina, Zhihao Wanga, Lanjing Wangf, Jiali Li*g, Rafael Gómez-Bombarelli*h, Aron Walsh*i and Xiaonan Wang*a
aDepartment of Chemical Engineering, State Key Laboratory of Chemical Engineering and Low-carbon Technology, Tsinghua University, Beijing 100084, China. E-mail: wangxiaonan@tsinghua.edu.cn
bDepartment of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: skverma@mit.edu
cDepartment of Chemical Engineering, Catholic Institute of Technology, Cambridge, Massachusetts, USA. E-mail: kgreenman@catholic.tech
dDepartment of Chemistry, Catholic Institute of Technology, Cambridge, Massachusetts, USA
eDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
fInstitute of Flexible Electronics (IFE) & Frontiers Science Center for Flexible Electronics, Northwestern Polytechnical University, Xi'an, Shaanxi 710072, China. E-mail: wanglanjing@mail.nwpu.edu.cn
gResearch Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China. E-mail: jlli@rcees.ac.cn
hDepartment of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: rafagb@mit.edu
iDepartment of Materials, Imperial College London, London SW7 2AZ, UK. E-mail: a.walsh@imperial.ac.uk
First published on 10th November 2025
The design of high-performance photosensitizers for next-generation photovoltaic and clean energy applications remains a formidable challenge due to the vast chemical space, competing photophysical trade-offs, and the computational limitations of traditional quantum chemistry methods. While machine learning offers potential solutions, existing approaches suffer from data scarcity and inefficient exploration of molecular configurations. This work introduces a unified active learning framework that systematically integrates semi-empirical quantum calculations with adaptive molecular screening strategies to accelerate photosensitizer discovery. Our methodology combines three principal components: (1) a hybrid quantum mechanics/machine learning pipeline that generates a chemically diverse molecular dataset while maintaining quantum chemical accuracy at significantly reduced computational cost; (2) a graph neural network architecture with ensemble-based uncertainty quantification; and (3) novel acquisition strategies that dynamically balance broad chemical-space exploration with targeted optimization of photophysical objectives. The framework demonstrates superior performance in predicting critical energy levels (T1/S1) compared with conventional screening approaches, while effectively prioritizing synthetically feasible candidates. By open-sourcing both the curated molecular dataset and the implementation tools, this work establishes an extensible platform for data-driven discovery of optoelectronic materials, with immediate applications in solar energy conversion and beyond.
First, previous work has shown that combinatorial D–A assembly yields libraries containing more than one million PS candidates, far exceeding the capacity of conventional trial-and-error approaches.7–9 For example, subtle structural variations in porphyrin derivatives—such as peripheral substituent patterns—can shift absorption maxima by over 50 nm while altering quantum yields by orders of magnitude.7,10,11 Second, the intricate balance between competing photophysical properties creates a complex optimization landscape. A PS optimized for strong visible light absorption may suffer from rapid triplet–triplet annihilation, while molecules with ideal S1/T1 energy ratios often exhibit poor solubility or photostability.12,13 Third, computational screening methods such as time-dependent density-functional theory (TD-DFT), though theoretically rigorous, become prohibitively expensive for geometry optimization and large-scale exploration, requiring days of computation for a single medium-sized molecule (50+ atoms).14,15
Recent advances in machine learning (ML) offer promising solutions to these bottlenecks.16,17 By establishing quantitative structure–property relationships (QSPRs), ML models can predict key PS characteristics such as singlet–triplet energy gaps (ΔEST) with millisecond inference times.10 Nevertheless, existing ML approaches face two critical limitations: (1) public datasets contain less than 0.1% of the photophysical data required for PS design, creating severe data scarcity; and (2) conventional ML workflows prioritize passive learning from static datasets, inefficiently allocating computational resources to chemically redundant regions.12–14
Active learning (AL) is a machine learning approach in which the model selects the most informative data points for labeling, aiming to improve performance with fewer labeled examples.18 It addresses these limitations through iterative cycles of prediction and targeted data acquisition.15,19–21 Unlike traditional methods that treat all molecules equally, AL algorithms dynamically identify the most informative candidates for quantum chemical calculations—those with high prediction uncertainty or high potential to improve model performance.22–24 Recent demonstrations in catalyst discovery achieved 32× acceleration over random screening by prioritizing metal alloys with optimal d-band centers.15,25 For PS design specifically, AL's ability to navigate high-dimensional chemical spaces while respecting synthetic constraints could revolutionize molecular discovery pipelines.26–30
Despite these promising developments, existing AL applications in molecular and materials discovery remain limited in generality and methodological scope. Yoo et al. developed an iterative inverse-design workflow combining density-functional tight binding (DFTB) for property labeling, a graph convolutional neural network (GCNN) surrogate, and a masked-language-model (MLM) generator to optimize the HOMO–LUMO gap (HLG) of molecules.31 Their selection strategy primarily relies on property-value thresholds and periodic surrogate retraining, rather than a formal uncertainty-based acquisition loop. In addition, the target property (HLG) is not directly related to photosensitizer photophysics, and the implementation is available only upon request, which constrains reproducibility and extension.
By contrast, Dodds et al. integrated active learning with reinforcement learning (RL–AL) to enhance sample efficiency under multi-objective optimization oracles such as docking and ROCS, and released their implementation openly.32 While this framework effectively explores exploration–exploitation trade-offs, it is tailored to generic molecular design tasks and lacks photophysical objectives or physics-informed acquisition strategies specific to photosensitizer discovery.
Building upon these directions, our study establishes a unified active-learning (AL) framework that extends beyond task-specific implementations and explicitly targets photosensitizer discovery. The framework integrates three tightly coupled components: (i) a large-scale ML–xTB-calibrated dataset of 655,197 photosensitizer candidates labeled for T1/S1, achieving sub-0.08 eV mean absolute error (MAE) and reducing computational cost by 99% compared with TD-DFT;33 (ii) a hybrid acquisition strategy that combines ensemble-based uncertainty estimation with a physics-informed objective function and an early-cycle diversity schedule, enabling balanced exploration and exploitation during iterative sampling; and (iii) a fully open-source, reproducible implementation that facilitates transparent benchmarking and community reuse.
Experimental benchmarks demonstrate that the proposed sequential AL strategy, which first explores chemical diversity before focusing on target regions, consistently outperforms static baselines by 15–20% in test-set MAE. By integrating data calibration, uncertainty-driven acquisition, and open implementation in a single workflow, this framework overcomes the methodological limitations of previous AL systems and provides a generalizable, data-efficient paradigm for photosensitizer discovery, offering immediate applications in solar fuels, photocatalysis, and optoelectronic materials design.
We combined Simplified Molecular-Input Line-Entry System (SMILES) data from numerous public molecular datasets to construct a unified library of 655,197 candidate photosensitizer molecules. Each source dataset was chosen because it contributes molecules with relevant excited-state or optical properties, thereby ensuring that our merged collection covers a broad range of photophysical characteristics. Starting from an initial seed set of 50,000 molecules, we expanded the library by integrating many diverse data sources (computational, experimental, and even patent-derived). We then predicted the lowest singlet and triplet energies (S1 and T1) for all candidates using our ML-xTB workflow, achieving DFT-level accuracy at 1% of the typical cost. The analysis of the dataset is presented in Fig. 3. (Full details of the datasets, including their names, references, and selection criteria, are provided in the SI (Text S5).)
1. Initial seed generation: A diverse set of 50,000 molecules was curated from public databases (PubChemQC,36 QMspin37) and expert-designed scaffolds (porphyrins, phthalocyanines). SMILES strings were standardized using RDKit, with stereochemistry and tautomer states normalized via Morgan fingerprint clustering (radius = 2, 1024 bits).
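As a minimal sketch of this standardization and fingerprinting step (standard RDKit calls only; the helper names are ours, and the exact normalization settings of the released pipeline may differ):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def standardize_smiles(smiles):
    """Parse a SMILES string and return its RDKit-canonical form."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None  # unparseable entries are dropped from the library
    return Chem.MolToSmiles(mol, canonical=True)

def morgan_fingerprint(smiles):
    """Radius-2, 1024-bit Morgan fingerprint used for seed-set clustering."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)

seed = [standardize_smiles(s) for s in ["c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]]
fps = [morgan_fingerprint(s) for s in seed if s is not None]
```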
2. xTB-sTDA high-throughput calculations: Each molecule underwent geometry optimization with the GFN2-xTB38 method (the geometry, frequency, and noncovalent interactions tight-binding parameterization) followed by an excited-state calculation with the simplified Tamm–Dancoff approximation (sTDA),39 as implemented in the xtb program (GFN2-xTB/xtb-sTDA):
$$S_1 = E_{\mathrm{singlet}} - E_{\mathrm{ground}} \qquad (1)$$

$$T_1 = E_{\mathrm{triplet}} - E_{\mathrm{ground}} \qquad (2)$$

$$\Delta E_{\mathrm{ST}} = S_1 - T_1 \qquad (3)$$
For the initial seed set (50,000 molecules), additional TD-DFT calculations (B3LYP/6-31+G(d), Gaussian 16) were performed on the xTB-optimized geometries to provide accurate reference values (details provided in the SI).
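Equations (1)–(3) amount to simple differences of total state energies. A minimal helper, assuming the ground-, singlet- and triplet-state energies (in eV) have already been parsed from the xtb/sTDA output (the parsing itself is omitted):

```python
def excitation_energies(e_ground, e_singlet, e_triplet):
    """Compute S1, T1, and the singlet-triplet gap (eqs (1)-(3)), all in eV."""
    s1 = e_singlet - e_ground   # eq (1)
    t1 = e_triplet - e_ground   # eq (2)
    delta_est = s1 - t1         # eq (3)
    return s1, t1, delta_est
```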
3. Machine learning calibration: A 10-model ensemble of Chemprop message-passing neural networks (Chemprop-MPNN) was trained to correct systematic errors between the xTB-sTDA and TD-DFT calculations for the 50,000-molecule seed set, separately for the S1 and T1 excitations. Each network dynamically generated molecular representations from SMILES strings without static fingerprints, predicting state-specific errors:
$$\Delta_{S_1}(x) = S_1^{\mathrm{TD\text{-}DFT}}(x) - S_1^{\mathrm{xTB}}(x) \qquad (4)$$

$$\Delta_{T_1}(x) = T_1^{\mathrm{TD\text{-}DFT}}(x) - T_1^{\mathrm{xTB}}(x) \qquad (5)$$
The multitask loss function minimized during training was:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ \big(\hat{\Delta}_{S_1}(x_i) - \Delta_{S_1}(x_i)\big)^2 + \big(\hat{\Delta}_{T_1}(x_i) - \Delta_{T_1}(x_i)\big)^2 \right] \qquad (6)$$
The calibrated energies were then computed as:
$$S_1^{\mathrm{ML\text{-}xTB}}(x) = S_1^{\mathrm{xTB}}(x) + \hat{\Delta}_{S_1}(x) \qquad (7)$$

$$T_1^{\mathrm{ML\text{-}xTB}}(x) = T_1^{\mathrm{xTB}}(x) + \hat{\Delta}_{T_1}(x) \qquad (8)$$
The calibrated model was then applied to the full library (655,197 molecules). We performed only xTB-sTDA calculations for the remaining molecules beyond the seed set and then applied the ML correction model. This calibration approach reduced the mean absolute error (MAE) from 0.23 eV (raw xTB) to 0.08 eV (ML-corrected) with respect to TD-DFT for the 50,000 molecules in the calibration set.
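Schematically, the calibration of eqs (7) and (8) is a simple additive correction. The sketch below (our own naming, with the Chemprop ensemble inference abstracted away) illustrates the step:

```python
import numpy as np

def calibrate(xtb_values, predicted_errors):
    """Apply the learned state-specific correction (eqs (7)-(8)).

    xtb_values:       (N, 2) array of raw xTB-sTDA [S1, T1] energies in eV.
    predicted_errors: (N, 2) array of ensemble-mean predicted deltas.
    """
    return np.asarray(xtb_values) + np.asarray(predicted_errors)

# Example: correct the [S1, T1] values of two molecules.
raw = np.array([[3.10, 2.40], [2.75, 1.95]])
delta = np.array([[-0.15, 0.05], [0.08, -0.02]])
calibrated = calibrate(raw, delta)  # ML-xTB energies
```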
In each active learning round, 20,000 additional molecules were sampled from the remaining pool, with a total of 8 rounds conducted for each acquisition strategy. This protocol enables model development and tuning on one portion of the data, while reserving a representative set of promising candidates for unbiased final evaluation (Fig. 4).
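The acquisition protocol can be summarized as the following loop (a simplified skeleton under our reading of the protocol; `train_ensemble`, `acquisition_score`, and `run_xtb_stda` are placeholders for the components described in this section):

```python
import numpy as np

ROUNDS = 8        # acquisition rounds per strategy
BATCH = 20_000    # molecules labeled per round

def active_learning_loop(pool, seed_labeled, train_ensemble,
                         acquisition_score, run_xtb_stda):
    """Iteratively grow the training set by acquiring high-scoring candidates."""
    labeled = list(seed_labeled)
    for _ in range(ROUNDS):
        models = train_ensemble(labeled)                 # retrain on all labels so far
        scores = np.array([acquisition_score(models, x) for x in pool])
        picked = set(np.argsort(scores)[-BATCH:].tolist())  # top-scoring candidates
        batch = [x for i, x in enumerate(pool) if i in picked]
        labeled += [(x, run_xtb_stda(x)) for x in batch]     # label with xTB-sTDA
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return labeled
```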
This choice was driven by two key advantages: (1) Its explicit modeling of bond directionality (e.g., single, double, conjugated) reduces noise from undirected representations, which is critical for capturing electronic transitions in photoactive molecules; and (2) its native ensemble support enables simultaneous quantification of model and data uncertainties, making it highly suitable for active learning strategies.
For an ensemble of M models with individual predictions ŷm(x), the mean and variance used throughout the acquisition functions are:

$$\mu(x) = \frac{1}{M} \sum_{m=1}^{M} \hat{y}_m(x) \qquad (9)$$

$$\sigma^2(x) = \frac{1}{M} \sum_{m=1}^{M} \big(\hat{y}_m(x) - \mu(x)\big)^2 \qquad (10)$$
During the active learning process, as we incrementally increased the training set size from 5k to 165k samples, we conducted independent Bayesian hyperparameter optimizations at each data scale to determine the optimal model architectures. The hyperparameter search employed Root Mean Square Error (RMSE), the default evaluation metric in Chemprop, as the criterion for selecting the best configuration.
Our analysis showed that the optimal architecture did not converge to a single universal configuration; instead, the best-performing hyperparameters varied distinctly with the dataset size, including model depths (ranging from 3 to 6), hidden layer widths (hidden_size ranging from 1600 to 1900), and dropout rates (0.05–0.25). In medium-to-large data regimes (65–165k), shallow architectures (depth = 3), wider hidden layers (hidden_size = 1900), and lower dropout rates (0.05) consistently outperformed other configurations. Adopting these individually optimized hyperparameters at each training set size reduced the test RMSE by approximately 8–15% compared to an untuned baseline model (e.g., depth = 3, hidden_size = 1200, dropout = 0.10). This stage-specific optimization strategy ensured optimal architecture selection at each phase of the active learning cycle, thereby eliminating potential bias from a fixed architecture and enabling fair comparisons across different acquisition strategies.
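A sketch of such a per-scale search using the hyperopt library (an illustration only; the search space mirrors the ranges reported above, and the objective body is a stub standing in for a full Chemprop training run):

```python
from hyperopt import fmin, tpe, hp

# Search space mirroring the ranges explored at each data scale.
space = {
    "depth": hp.quniform("depth", 3, 6, 1),
    "hidden_size": hp.quniform("hidden_size", 1600, 1900, 100),
    "dropout": hp.uniform("dropout", 0.05, 0.25),
}

def objective(params):
    """Train a Chemprop model with `params` on the current training split and
    return validation RMSE. Stubbed with a constant so the sketch executes."""
    return 0.0

best = fmin(objective, space, algo=tpe.suggest, max_evals=30)
```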
The uncertainty-based acquisition score sums the ensemble variances of the two target excitation energies:

$$A_{\mathrm{score}}(x) = \sigma_{S_1}^2(x) + \sigma_{T_1}^2(x) \qquad (11)$$
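In numpy terms (a minimal sketch assuming per-model predictions are stacked into arrays), eqs (9)–(11) reduce to:

```python
import numpy as np

def uncertainty_score(preds_s1, preds_t1):
    """Ensemble-variance acquisition score (eq (11)).

    preds_s1, preds_t1: (M, N) arrays of per-model predictions
    for N molecules from an M-model ensemble.
    """
    var_s1 = np.asarray(preds_s1).var(axis=0)   # eq (10) applied to S1
    var_t1 = np.asarray(preds_t1).var(axis=0)
    return var_s1 + var_t1                      # eq (11)
```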
To promote structural diversity within each acquired batch, uncertainty selection is combined with a batch-diversity constraint. Writing B for the set of molecules already accepted into the current batch and 𝒰 for the unlabeled pool, the filter

$$x^{*} = \arg\max_{x \in \mathcal{U}} A_{\mathrm{score}}(x) \quad \text{s.t.} \quad \max_{b \in B} T\big(f(x), f(b)\big) \le 0.6 \qquad (12)$$

discards a candidate if its radius-2, 2048-bit Morgan fingerprint yields a Tanimoto similarity greater than 0.6 to any member of B, thereby guaranteeing sufficient structural diversity while still prioritizing high-uncertainty points. The Tanimoto similarity between two molecules x and x′, based on their binary fingerprint vectors f(x) and f(x′), is defined as:
$$T\big(f(x), f(x')\big) = \frac{c}{a + b - c} \qquad (13)$$

where a and b are the numbers of 'on' bits in each fingerprint, and c is the number of bits shared by both fingerprints. Here, f(x) denotes the Morgan fingerprint of molecule x; the Tanimoto similarity measures the overlap between the binary features of two molecules.
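The batch-diversity filter can be sketched as a greedy pass over score-ranked candidates (standard RDKit calls; the function names are ours):

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def fingerprint(smiles):
    """Radius-2, 2048-bit Morgan fingerprint, as used in the batch filter."""
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

def diverse_batch(candidates_by_score, batch_size, threshold=0.6):
    """Greedy selection: walk candidates in descending acquisition score and
    reject any molecule whose Tanimoto similarity to an already-accepted
    batch member exceeds the threshold (eq (12))."""
    batch, batch_fps = [], []
    for smiles in candidates_by_score:
        fp = fingerprint(smiles)
        if all(DataStructs.TanimotoSimilarity(fp, b) <= threshold for b in batch_fps):
            batch.append(smiles)
            batch_fps.append(fp)
        if len(batch) == batch_size:
            break
    return batch
```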
v1 — Probability-within-band kernel
$$A_{v1}(x) = \sum_{j \in \{e,\,s\}} w_j \left[ \Phi\!\left(\frac{\tau_j + \varepsilon - \mu(x)}{\sigma(x)}\right) - \Phi\!\left(\frac{\tau_j - \varepsilon - \mu(x)}{\sigma(x)}\right) \right] \qquad (14)$$

where Φ is the standard normal cumulative distribution function.
v2 — Exponential alignment kernel
$$A_{v2}(x) = \sum_{j \in \{e,\,s\}} w_j \exp\!\left(-\eta \big(\mu(x) - \tau_j\big)^2\right) \qquad (15)$$
v3 — Expected-improvement kernel. An EI-style variant that accounts for correlation between T1 and S1 is derived in Text S3; its numerical results are reported in Fig. S3 and therefore omitted here for brevity.
Detailed derivations, hyperparameter sweeps, and a head-to-head comparison of v1–v3 are provided in SI (Text S1–S3).
• µ(x), σ²(x) — mean and variance of the ensemble prediction of the target ratio m(x) = T1/S1;
• τe = 0.5, τs = 1.0 — emitter-like and sensitizer-like targets;44,45
• we, ws — weights for the two targets (0.5, 0.5);
• ɛ — half-width of the tolerance band (0.05);
• η — selectivity parameter in v2 (5).
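A minimal numerical sketch of the v1 kernel, assuming the predicted ratio m(x) is treated as Gaussian with mean µ(x) and variance σ²(x) (function and constant names are ours; the exact functional forms and hyperparameters are given in Text S1–S3):

```python
import numpy as np
from scipy.stats import norm

TAU = {"emitter": 0.5, "sensitizer": 1.0}      # target T1/S1 ratios
WEIGHT = {"emitter": 0.5, "sensitizer": 0.5}   # target weights w_e, w_s
EPS = 0.05                                     # tolerance-band half-width

def v1_score(mu, sigma):
    """Probability-within-band kernel: weighted probability that the predicted
    ratio m(x) ~ N(mu, sigma^2) falls within +/- EPS of each target."""
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    score = np.zeros_like(mu, dtype=float)
    for key in TAU:
        upper = norm.cdf((TAU[key] + EPS - mu) / sigma)
        lower = norm.cdf((TAU[key] - EPS - mu) / sigma)
        score += WEIGHT[key] * (upper - lower)
    return score
```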
2.7 × 10⁵ emitter, sensitizer and analogue structures were subjected to AiZynthFinder. A molecule was labeled synthesizable (y = 1) if the planner discovered at least one complete route to commercially available precursors within a user-defined search limit (a maximum of nstep retrosynthetic steps or tmax seconds). Molecules for which no route was found under these restrictions were labeled non-synthesizable (y = 0). The resulting binary dataset was used to train a Chemprop message-passing neural network that outputs a continuous probability Psynth(x) ∈ [0, 1]. We refer to this domain-adapted score as the PhotoSynthScore.
During each AL iteration, we down-weight candidates that are unlikely to be synthesizable:
$$A'_{\mathrm{score}}(x) = A_{\mathrm{score}}(x)\,\mathbb{1}\big[P_{\mathrm{synth}}(x) \ge 0.6\big] \qquad (16)$$

where 𝟙[·] is the indicator function. Only molecules with Psynth ≥ 0.6 pass the filter, focusing computational effort on candidates that are both high-performing and practically attainable. Empirically, incorporating the PhotoSynthScore improves search efficiency and maintains chemical realism throughout the discovery campaign.
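In code, this gating is a one-line mask (a sketch with our own function name):

```python
import numpy as np

def synth_filtered_score(a_score, p_synth, threshold=0.6):
    """Zero out the acquisition score of candidates whose predicted
    synthesizability falls below the threshold (eq (16))."""
    return np.asarray(a_score) * (np.asarray(p_synth) >= threshold)
```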
Uncertainty sampling: Reduced MAE from 0.091 eV to 0.077 eV within 4 rounds, with performance plateauing at Round 8. Deepening hidden layers further improved accuracy across all test sets, particularly for large molecules (>20 non-H atoms).
Diversity-enhanced sampling: Introducing Tanimoto similarity thresholds (<0.6) increased heterocyclic system coverage by 25%, but aggressive thresholds (<0.4) raised RMSE by 15% due to oversampling of chemically irrelevant regions.
Target-property optimization: Focused screening of T1/S1 ratios (1 for sensitizers, 0.5 for emitters) reduced emitter MAE from 0.065 eV to 0.061 eV relative to uncertainty sampling.
Synthetic feasibility integration: Threshold filtering removed candidates with Psynth < 0.6, eliminating over 20% of all candidates as impractical for synthesis.
Most predictions achieve high accuracy within acceptable chemical precision thresholds. However, a small fraction of compounds exhibit notably larger errors, likely arising from unique or rare structural motifs not adequately represented in the training set. This highlights areas for future model refinement, particularly addressing the “long-tail” of challenging cases.
Uncertainty sampling prioritizes molecules that the model finds the most ambiguous, thus rapidly improving the model precision by identifying informative samples. Diversity sampling aims to maximize coverage of chemical space, ensuring structural heterogeneity and preventing redundancy. Target-property optimization selects molecules with extreme property values, facilitating efficient exploration of high-performance regions, but potentially neglecting broader chemical diversity. Lastly, synthesizability-guided sampling incorporates synthetic feasibility, prioritizing practical compounds that are experimentally accessible.
These differences directly affect model training dynamics and generalization performance. Uncertainty sampling yields rapid initial reductions in prediction errors, demonstrating superior efficiency. Diversity sampling ensures broader chemical generalization but shows moderate accuracy improvements. Target-property sampling initially contributes less to global error reductions but significantly enhances predictions for extreme property cases. Synthesizability-guided strategies moderately slow accuracy gains due to conservative selections but ensure higher practical relevance.
We further summarize the efficiency and error metrics for each strategy in Table 1, providing comprehensive comparisons of test MAE, rounds to convergence, and chemical space coverage. This analysis facilitates the selection of strategies that align with research priorities, balancing prediction accuracy, scope of study, and practical applicability.
| Strategy | Initial MAE (eV) | Final MAE (eV) | Rounds to convergence | Chemical space coverage |
|---|---|---|---|---|
| Pure random (baseline) | 0.149 | 0.083 ± 0.001 | — | — |
| Pure uncertainty | 0.149 | 0.077 | 8 | Moderate |
| Uncertainty + batch diversity | 0.149 | 0.078 | 9 | High |
| Target property + batch diversity | 0.149 | 0.062 | 9 | Moderate (target-focused) |
| Portfolio strategy | 0.149 | 0.059 | 9 | High (balanced) |
| Portfolio + synthesizability | 0.149 | 0.074 | 9 | High (practically feasible) |
Specifically, the filtered portfolio strategy shows a slight uptick or oscillation in MAE during the final rounds. After the switch to target-driven selection, the model begins choosing molecules that it predicts have extreme target values—some of these are likely unusual, highly complex structures that the model is less familiar with (and which might be chemically difficult to synthesize in reality). Incorporating such exotic compounds can momentarily worsen the model's overall performance: the newly added data might lie outside the domain where the model has predictive strength, leading to higher prediction errors for those points (and possibly disrupting the model's previously learned correlations). In other words, without the synthesizability filter, the model sometimes ventures into out-of-distribution regions in pursuit of high target property, and as a result the global MAE stops decreasing and even increases slightly in that phase. This phenomenon can be interpreted as a form of model extrapolation or overfitting issue—the model is essentially overextending into chemical space where its predictions are not reliable, analogous to how an overly aggressive exploitative strategy can mislead the model.
One notable advantage of the hybrid strategy is that it avoids an abrupt transition that might destabilize the model. In the sequential strategies, we sometimes see a kink or change in the MAE trend at the point of switching from uncertainty to target mode—if the switch happens too early, the model could struggle (as discussed, a slight MAE rise can occur if the model isn't ready to accurately handle the exploitation picks). The continuous strategy softens this by always maintaining a mix; effectively, it performs a dynamic rebalancing: as the model's confidence grows, more of the selected batch inherently contributes to exploitation (since fewer points will be at high uncertainty, the focus naturally shifts to high predicted property, without ever entirely ignoring uncertainty). This dynamic adjustability is a strong point of the hybrid method. The only slight drawback observed is that the hybrid strategy can be a bit slower to find the very top-performing molecules compared to a full-on exploitation in later rounds. Since it's never selecting only target-optimal candidates, it might miss a few opportunities to immediately test the absolute top predicted molecule in favor of an uncertain one. However, in practice this seems to be a minor penalty—the hybrid still discovers high-performing molecules throughout, just interspersed with exploratory picks. In exchange, it maintains the lowest MAE curve among the methods, indicating it never compromises the model's learning too much in pursuit of the objective.
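One plausible realization of this continuous mix is a convex blend of the exploration and exploitation scores with an annealed weight (a sketch of the idea, not the exact scheme used in this work):

```python
def hybrid_score(uncertainty, target_kernel, round_idx, total_rounds, floor=0.2):
    """Blend exploration (ensemble variance) with exploitation (target kernel).
    The exploration weight decays linearly but never below `floor`, so batches
    shift toward exploitation without ever ignoring uncertainty."""
    alpha = max(floor, 1.0 - round_idx / total_rounds)
    return alpha * uncertainty + (1.0 - alpha) * target_kernel
```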
• We developed a hybrid pipeline that combines fast semi-empirical quantum calculations with machine learning correction, enabling us to generate a large dataset of 655,197 diverse candidate molecules. This approach keeps computation fast while preserving quantum-chemical accuracy.
• We designed a sequential active learning strategy that separates the exploration and exploitation phases. First, the model explores chemical space broadly using uncertainty-driven sampling to find new promising areas, and then focuses on finding molecules with the desired T1/S1 energy ratios (1 for sensitizers, 0.5 for emitters).
Our open dataset and modular workflow offer a useful resource for the research community to speed up the development of new energy materials. Through detailed analysis, we show that an active learning approach which combines uncertainty sampling, target-driven optimization, diversity filters, and synthesizability constraints is most effective. This balanced strategy helps us find high-performing molecules faster, while also making sure the predictive model works well on different types of molecules. Adding diversity filters prevents the model from overfitting to very similar compounds, and considering synthesizability ensures the candidates found are likely to be made in real experiments.
While our model works well for small and moderate-sized molecules,48 it is less accurate for larger molecules with highly delocalized electronic structures or strong long-range interactions. This is mainly because the current graph neural networks are better at capturing local structure, and we only use 2D information rather than 3D shapes. For molecules whose properties depend on their 3D arrangement, this can limit our model's performance.
To overcome these issues, future work could explore hybrid architectures that insert multi-head transformer self-attention directly into the message-passing layers of a GNN, going beyond the readout functions currently used in Chemprop, thus capturing both local chemical environments and long-range interatomic dependencies. Adding 3D geometric features, such as distances and angles between atoms, will also help the model better predict properties that depend on molecular shape.49
We also see value in combining multiple types of molecular information, such as electron density and orbital properties, and predicting a range of photophysical properties at once. This multi-task and multimodal approach can make the model stronger and easier to interpret.
Additionally, looking forward, computational methods for molecular discovery have evolved significantly, moving from simple virtual screenings to fully automated experimental platforms. Initially, high-throughput virtual screening (HTVS) allowed researchers to computationally evaluate thousands of candidate molecules from static libraries to quickly identify promising leads.9 Next, active learning methods greatly improved efficiency by selectively and iteratively choosing the most informative candidates, thereby reducing computational cost and speeding up discovery. More recently, a new generation of closed-loop discovery systems has emerged, combining active learning with fully automated laboratories that autonomously synthesize, test, and validate molecular candidates in continuous cycles.50 By incorporating these successive advancements into photosensitizer discovery, future workflows could efficiently and automatically discover effective molecules for practical applications in energy and medicine.
In parallel, a promising direction for future work is to extend our framework from selecting candidates in fixed libraries to generating new molecular structures from scratch. This can be achieved by combining the trained surrogate model with generative techniques such as variational autoencoders, graph-based molecular generators, language-model-driven design, or fragment-based human-in-the-loop strategies. These methods would allow the model to propose novel compounds tailored to target singlet–triplet energy ratios and other design constraints, while exploring chemical space beyond the initial dataset.
For active learning specifically, adaptive strategies that dynamically adjust sampling rules throughout different learning stages can further improve efficiency and generalization. By moving smoothly between uncertainty-, diversity-, and target-based sampling, the model can learn more from fewer experiments.
In summary, by applying smart active learning, we were able to further increase the accuracy and usefulness of our method. These steps helped bridge the gap between fast ML predictions and high-accuracy quantum methods, making our framework a reliable tool for discovering new photoactive molecules. Our work highlights the importance of balancing exploration and exploitation, including chemical diversity and synthesizability, and adjusting learning strategies as the model grows, laying a practical foundation for future AI-driven molecular discovery.
The curated dataset of 655,197 photosensitizer candidates and the complete active-learning code are available at https://github.com/jiali1025/A_General_Active_learning_framework_for_MoleDesign. The dataset is released under a CC-BY-4.0 licence and the code under an MIT licence. No further restrictions apply.
Supplementary information (SI): additional methodological details, acquisition-function formulations, computational protocols, and a description of the unified dataset supporting this work. See DOI: https://doi.org/10.1039/d5sc05749c.
Footnote

† These authors contributed equally to this work.

This journal is © The Royal Society of Chemistry 2025