Open Access Article
JuHyun Lee†(a), HyoJin Ban†(a,d), HyunIl Seo(b), HangKen Lee(c), Fiza Arshad(c,e) and DongWook Kim*(a)
(a) Digital Chemistry Research Center, Korea Research Institute of Chemical Technology, South Korea. E-mail: dongwkim@krict.re.kr
(b) QuantumSoft Co., Inc., South Korea
(c) Photoenergy Research Center, Korea Research Institute of Chemical Technology, South Korea
(d) Department of Computer Science and Engineering, Chungnam National University, Daejeon, South Korea
(e) Advanced Materials and Chemical Engineering, University of Science and Technology (UST), Daejeon, South Korea
First published on 27th February 2026
The efficiency of organic photovoltaics was estimated using a machine learning (ML) approach. We used the organic photovoltaics database built in-house by the Korea Research Institute of Chemical Technology. The dataset comprises reliable and representative experimental results for 1010 ternary organic solar cells (D1:D2:A), obtained through repeated measurements. The data included 67 donors and 24 non-fullerene acceptors, device structures, donor/acceptor structures, donor-to-acceptor ratios, active-layer thicknesses, experimental conditions, and local symmetry. We fragmented the donors and acceptors using a self-developed method. A dataset was created by generating descriptors of the fragmented molecules and used to train various ML algorithms, including random forest, XGBoost, LightGBM, support vector regression, and multilayer perceptron. Model performance was evaluated using the coefficient of determination (R2). XGBoost showed the highest R2 of 0.849. The contributions of key features were interpreted using SHAP analysis. This paper presents an ML framework that combines molecular fragmentation and data-driven modeling.
Since the first demonstration of organic photovoltaic (OPV) devices in 1986, significant progress has been achieved through advances in active-layer materials and device architectures.1 In particular, the introduction of donor–acceptor (D–A) bulk heterojunction (BHJ) structures enabled efficient exciton dissociation and charge transport, leading to substantial improvements in power conversion efficiency (PCE).
Early efforts to enhance OPV performance relied heavily on the synthesis of novel donor and acceptor materials, as well as device-level optimization. Parallel to experimental developments, computational chemistry approaches based on density functional theory (DFT)2,3 and semi-empirical methods (e.g., AM1,4 PM6,5 and PM7 (ref. 6)) were employed to screen candidate materials and understand structure–property relationships. However, such approaches become computationally prohibitive when applied to large molecular libraries, limiting their applicability to high-throughput OPV material discovery.7–10
To overcome these limitations, data-driven approaches and machine learning (ML) techniques have increasingly been adopted to predict OPV performance. Early studies employed quantitative structure–property relationship (QSPR)11,12 models, followed by more advanced ML frameworks using experimentally measured or DFT13-derived descriptors. While these approaches demonstrated encouraging predictive capabilities, they often suffered from limited dataset sizes, computational bias, or reliance on idealized molecular representations.
More recently, fragmentation-based molecular representations have been introduced to reduce molecular complexity and improve model scalability. Wu et al.14 employed literature-derived fragmentation schemes to construct fingerprint-based ML models, whereas Kim et al.15 utilized synthesis-oriented functional group indexing to achieve improved prediction accuracy. Despite these advances, most existing studies remain focused on relatively simple donor–acceptor systems and are primarily based on literature-reported data,13,16–26 which are often biased toward high-efficiency devices and do not adequately reflect the diverse experimental conditions encountered in practical laboratories.
Machine-learning studies on ternary OPV systems have also emerged, with particular emphasis on D:A1:A2 configurations27 composed of small-molecule donors and acceptors. While these studies achieved meaningful success, their molecular representations are less suitable for polymer-rich systems, where repeating units, structural heterogeneity, and multiple donor components introduce additional complexity that is difficult to capture using conventional descriptors.
In this work, we focus on ternary OPV systems of the D1:D2:A type, in which two polymeric donors are blended with a single acceptor. To the best of our knowledge, this study represents the first systematic ML investigation of ternary OPV devices involving multiple polymer donors. By leveraging a curated experimental database that includes both high- and low-efficiency devices, we mitigate literature bias and construct a more balanced representation of real experimental conditions.
To accurately describe such complex systems, we propose a fragment-based molecular representation based on chemically meaningful fragmentation. Fragment-level physicochemical descriptors are generated using RDKit and Stk, and local symmetry information is incorporated as a physically motivated proxy for molecular packing tendencies without explicitly modeling solid-state morphology. Redundant and irrelevant features are further removed through feature selection based on recursive feature elimination (RFE), enabling the construction of robust and transferable machine-learning models for ternary OPV systems.
Using this representation, we develop machine-learning models to predict power conversion efficiency and key device parameters. Interpretability analyses provide insights into structure–process–property relationships, highlighting the contributions of specific molecular fragments and device parameters to device performance (Fig. 1).
The dataset consists of two main categories: materials and devices. The materials category contains chemical and structural information on donor and acceptor compounds, while the device category records OPV device architectures and electrode-related experimental results. ChemAxon's Marvin JS28 and JChem Microservices were used to convert compounds into MolV2000 format for structural representation, and CDK 2.8 was used to convert them into SMILES.29
In the device section, the platform stores detailed information on all OPV layers, including active-layer materials, additives, solvents, and processing conditions, which can be batch uploaded using standardized templates. Multiple experimental files can be uploaded simultaneously, enabling automated performance calculations and visualization. This automated workflow supports efficient integration and management of large-scale experimental datasets, thereby facilitating the application of artificial intelligence (AI) and machine learning (ML) techniques.
The dataset focuses on ternary organic photovoltaic systems composed of two donors and one acceptor. To accurately describe active-layer compositions, both the Donor1:Donor2 ratio and the overall (Donor1 + Donor2):Acceptor ratio are recorded, enabling precise and reproducible representation of multicomponent formulations (Fig. 2).
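As an illustration of how the two recorded ratios determine the full composition, the following minimal sketch converts them into per-component fractions. The function name `ternary_fractions` and the example ratios are hypothetical and are not taken from the database.

```python
def ternary_fractions(d1_to_d2, donor_to_acceptor):
    """Convert the D1:D2 and (D1 + D2):A ratios into component fractions.

    Both arguments are (left, right) tuples, e.g. (0.8, 0.2) and (1.0, 1.2).
    """
    d1_part, d2_part = d1_to_d2
    donor_part, acceptor_part = donor_to_acceptor
    if min(d1_part, d2_part, donor_part, acceptor_part) < 0:
        raise ValueError("ratio parts must be non-negative")
    if d1_part + d2_part == 0 or donor_part + acceptor_part == 0:
        raise ValueError("each ratio needs a non-zero total")

    # Share of the blend that is donor (D1 + D2 together).
    donor_fraction = donor_part / (donor_part + acceptor_part)
    # Share of the donor portion that is D1.
    d1_share = d1_part / (d1_part + d2_part)
    return {
        "D1": donor_fraction * d1_share,
        "D2": donor_fraction * (1.0 - d1_share),
        "A": 1.0 - donor_fraction,
    }


# Hypothetical formulation: D1:D2 = 0.8:0.2 and (D1 + D2):A = 1:1.2
fractions = ternary_fractions((0.8, 0.2), (1.0, 1.2))
```

Storing the two ratios separately, rather than three absolute amounts, keeps the donor blend and the donor-to-acceptor balance independently readable; the fractions above always sum to one.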
Fig. 2 Schematic representation of donor–donor–acceptor (D1:D2:A) composition in a ternary organic photovoltaic system.
From the 13 430 devices stored on the established platform, we retained those that use non-fullerene acceptors, have no missing values, and were chosen by the experimenter as the representative result among repeated experiments. The resulting dataset of 67 donors and 24 acceptors, totaling 1010 D–A combinations (Table S1), was used for machine learning. The data included the OPV device architectures, donor and acceptor material structures, D:A ratios, device thicknesses, annealing conditions, and experimental PCE values (Fig. 3).
In this study, we fragmented large organic molecules into structural subunits and extracted the molecular descriptors for each fragment. Calculations were performed for each fragment to further understand the chemical properties of the polymer material. The device information, process conditions, and polymer ratio information (D:A) were added to the dataset. Furthermore, incorporating molecular symmetry, which has not been considered in previous studies, enables the extraction of structural characteristics that are not captured by conventional molecular descriptors. This approach enables the generation of diverse features beyond device information and facilitates the processing of large molecules.
Because categorical features describing device structure and solvent type cannot be directly used in conventional machine learning models, they were converted into numerical values using label encoding. Integer labels were assigned as nominal identifiers without implying ordinal or physical meaning (Table S2). Missing values were imputed using the mode or median, depending on the feature.
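The encoding and imputation steps described above can be sketched as follows. This is a minimal standard-library illustration, not the preprocessing code used in this study; the column names and values are hypothetical.

```python
from statistics import median, mode


def label_encode(values):
    """Map each distinct category to an integer ID (nominal, no order implied)."""
    mapping = {}
    for value in values:
        if value is not None and value not in mapping:
            mapping[value] = len(mapping)
    return mapping


def impute(values, strategy):
    """Fill None entries with the median (numeric) or mode (categorical)."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    return [fill if v is None else v for v in values]


# Hypothetical columns with one missing entry each.
solvents = ["chlorobenzene", "chloroform", None, "chlorobenzene"]
thickness_nm = [100.0, None, 120.0, 90.0]

solvents_filled = impute(solvents, "mode")
solvent_ids = label_encode(solvents_filled)
solvent_codes = [solvent_ids[s] for s in solvents_filled]
thickness_filled = impute(thickness_nm, "median")
```

Because the integer labels are arbitrary identifiers, they are best suited to tree-based models, which split on individual values; distance-based models may read an ordering into them that is not intended.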
To digitally represent the donor and acceptor molecules in OPVs effectively, we developed our own fragmentation protocol. The fragmentation scheme was designed to simplify the polymer structures while retaining their essential chemical features, thereby producing independent fragments that contribute meaningfully to the overall molecular properties (Fig. 4).
Method 1: the backbone and side chains were explicitly separated and molecular descriptors were calculated independently for each structural unit.
Method 2 (scaffold-based fragmentation): the ring units that constitute the backbone are defined as scaffolds, and each scaffold represents a ring system connected by a single bond. Typically, a donor consists of two to eight scaffolds. For consistency, we standardized this number to four scaffolds. Each scaffold was further divided into a backbone and side chains.
(1) Ring center (RC): the center of symmetry of the fused-ring (FR) system.
(2) Ring linker (RL): the ring units adjacent to the RC.
(3) Ring π-bridge (RP): the π-bridge ring bonded to the outer ring; in some cases, the RP is absent.
(4) π-Bridge (PB): the π-conjugated structure connecting the FR system and the end group; in some cases, the PB is considered a part of the FR system.
(5) End group (EG): terminal substituents located at the outermost positions of the acceptor molecule.
Through this fragmentation process, the acceptor molecule structure was defined as RC–RL–RP–PB–EG, centered on the central fused-ring system.
Accordingly, descriptors for each fragment unit were extracted using the RDKit and Stk libraries. For both Method 1 and Method 2, 217 physicochemical descriptors were computed using RDKit, whereas seven additional energy-related properties were obtained from Stk. These descriptors describe a wide range of molecular characteristics, ranging from basic molecular properties to complex structural characteristics. The complete list and aggregation methods are provided in Table S3.
Some molecules did not contain certain fragments and were therefore zero-coded. To mitigate this gap effect, pairwise computations between descriptors were performed.
This allowed us to maintain the effectiveness of each fragment-specific descriptor while compensating for the gap effect in fragment-free regions, resulting in molecular descriptors that utilize partial structures. The detailed computational operations for each descriptor type are provided in SI Table S3.
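To make the zero-coding of absent fragments concrete, the sketch below flattens per-fragment descriptors into one fixed-width row using the RC–RL–RP–PB–EG slots. It illustrates zero-coding only and does not reproduce the pairwise operations listed in Table S3; the function name `flatten_fragments` and all numerical values are hypothetical placeholders.

```python
ACCEPTOR_SLOTS = ["RC", "RL", "RP", "PB", "EG"]


def flatten_fragments(fragment_descriptors, descriptor_names, slots=ACCEPTOR_SLOTS):
    """Build one flat feature row with a fixed column layout.

    fragment_descriptors maps a slot name to {descriptor: value}; slots that
    are missing (or mapped to None) are zero-coded.
    """
    row = {}
    for slot in slots:
        values = fragment_descriptors.get(slot)
        for name in descriptor_names:
            row[f"{slot}_{name}"] = 0.0 if values is None else float(values[name])
    return row


# Hypothetical acceptor without a ring pi-bridge (RP) fragment.
acceptor = {
    "RC": {"MolWt": 310.4, "FpDensityMorgan1": 0.9},
    "RL": {"MolWt": 164.2, "FpDensityMorgan1": 1.1},
    "PB": {"MolWt": 82.1, "FpDensityMorgan1": 1.3},
    "EG": {"MolWt": 230.2, "FpDensityMorgan1": 1.0},
}
row = flatten_fragments(acceptor, ["MolWt", "FpDensityMorgan1"])
```

The slot prefix yields column names of the form `RC_FpDensityMorgan1`, so every molecule produces the same columns regardless of which fragments it contains.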
In this paper, local symmetry is defined in a restricted sense as either two-fold rotational (C2) or mirror-plane (σv) symmetry, as illustrated in Fig. 5. For donor molecules, local symmetry describes the symmetry of side-chain groups with respect to the molecular backbone. Donors are categorized into four scaffolds, yielding four symmetry states per donor; when two donor components are present, this results in a total of eight donor symmetry states. For acceptor molecules, local symmetry characterizes the symmetry of the end groups with respect to the ring center, with an additional symmetry state assigned at the ring center, leading to twenty symmetry states in total (Table S4).
The symmetry descriptors employed in this study are molecular-level features derived from specific structural motifs, rather than explicit representations of solid-state packing. Although thin-film packing is governed by processing conditions and intermolecular interactions, these symmetry states capture structural characteristics that may indirectly relate to packing-relevant tendencies, while remaining general and scalable for machine-learning models.
The dataset comprised 1010 experimentally reported OPV devices and was split into training (80%) and test (20%) sets. To assess model robustness and minimize bias from a single data split, eight independent random seeds were used for repeated training and evaluation.
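The repeated 80/20 partitioning can be sketched with the standard library as shown below. The seed values are the random states listed in the results tables; the helper `split_indices` is a hypothetical stand-in, since the splitting routine actually used is not specified here.

```python
import random

# Random states as listed in the results tables.
SEEDS = [0, 42, 150, 500, 790, 1000, 7500, 10000]


def split_indices(n_samples, test_fraction, seed):
    """Return (train_indices, test_indices) for one seeded shuffle."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


splits = {seed: split_indices(1010, 0.2, seed) for seed in SEEDS}
train_idx, test_idx = splits[42]
```

Each seed gives 808 training and 202 test samples, and training and evaluating once per seed yields the per-seed R2 values that are later averaged.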
The initial feature space consisted of 1054 molecular descriptors, resulting in a high feature-to-sample ratio. To reduce overfitting and computational cost, recursive feature elimination (RFE) was applied, with feature selection guided by cross-validated R2. Feature removal was terminated once performance stabilized over consecutive iterations. Performance convergence was observed when approximately 90 descriptors remained (Fig. 6a). Within this region, the final feature set was selected by identifying the subset that achieved the highest cross-validated R2 while maintaining consistent descriptor composition across successive RFE steps. Based on this criterion, 70 descriptors were retained for the final models.
Fig. 6 (a) Recursive feature elimination (RFE); (b) train and test dataset distribution in OPV prediction.
This reduction in feature dimensionality led to a substantial decrease in training time (from 101.7 s to 10.5 s) while maintaining comparable predictive performance (R2 of 0.848 and 0.839 before and after feature selection, respectively). The final descriptor sets for Methods 1 and 2 are summarized in Tables S5 and S6.
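The elimination loop behind RFE can be sketched generically as follows. This is a simplified illustration rather than the procedure used in this study: it removes one feature per step down to a minimum size and then returns the best-scoring subset, instead of stopping once performance stabilizes. The callbacks `importance_fn` and `score_fn` are hypothetical stand-ins for model feature importances and cross-validated R2, and the toy signal values are invented.

```python
def recursive_feature_elimination(features, importance_fn, score_fn, min_features=1):
    """Drop the least important feature each round; keep the best-scoring subset."""
    current = list(features)
    history = [(score_fn(current), list(current))]
    while len(current) > min_features:
        importances = importance_fn(current)  # maps feature name -> importance
        weakest = min(current, key=lambda name: importances[name])
        current.remove(weakest)
        history.append((score_fn(current), list(current)))
    best_score, best_subset = max(history, key=lambda entry: entry[0])
    return best_subset, best_score, history


# Toy problem: features "c" and "d" carry no signal and only add a size penalty.
SIGNAL = {"a": 0.5, "b": 0.3, "c": 0.0, "d": 0.0, "e": 0.2}


def toy_importance(subset):
    return {name: SIGNAL[name] for name in subset}


def toy_score(subset):
    return sum(SIGNAL[name] for name in subset) - 0.01 * len(subset)


best_subset, best_score, history = recursive_feature_elimination(
    list(SIGNAL), toy_importance, toy_score
)
```

In the toy problem the score rises while uninformative features are removed and falls once informative ones are dropped, which is the qualitative shape expected of an RFE curve.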
A Y-scrambling test was performed to assess dataset reliability. The original models achieved an average R2 of 0.81, whereas the scrambled models yielded R2 values close to zero (Fig. S8), indicating that the predictions arose from genuine structure–property relationships rather than spurious correlations. Kernel density estimation (KDE) analysis showed similar PCE distributions for the training (n = 808) and test (n = 202) sets, confirming statistically balanced data partitioning. The small positive density near PCE = 0 originates from the smoothing nature of KDE rather than the presence of zero-valued samples (Fig. 6b).
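The logic of a Y-scrambling test can be illustrated with a one-feature least-squares model on synthetic data. This is a minimal sketch: the data are generated here, the model is a plain linear fit rather than any of the models in this study, and R2 is computed in-sample for brevity rather than on a held-out test set.

```python
import random


def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r2_score(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_and_score(x, y):
    slope, intercept = fit_line(x, y)
    return r2_score(y, [slope * xi + intercept for xi in x])


rng = random.Random(0)
x = [rng.uniform(0.0, 10.0) for _ in range(200)]
y = [1.5 * xi + rng.gauss(0.0, 1.0) for xi in x]  # synthetic target

r2_original = fit_and_score(x, y)

# Shuffle the targets to break the x-y link, then refit and rescore.
r2_scrambled = []
for _ in range(20):
    y_shuffled = y[:]
    rng.shuffle(y_shuffled)
    r2_scrambled.append(fit_and_score(x, y_shuffled))
```

A model that has learned a real relationship scores high on the original targets and near zero on the shuffled ones; comparable scores in both cases would point to chance correlation or leakage.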
To assess multicollinearity among the selected descriptors, Pearson correlation analysis was performed on the final 70-feature set. The mean and median absolute correlation coefficients (|r|) were 0.344 and 0.295, respectively, indicating moderate overall correlations. Of the 2415 possible descriptor pairs, only 25 pairs (≈1.0%) exhibited very high correlation (|r| > 0.9), suggesting limited redundancy and minimal multicollinearity in the final feature set (Fig. 7).
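The pair count and the high-correlation screen can be reproduced in a few lines. The sketch below checks the arithmetic for 70 descriptors and applies the |r| > 0.9 threshold to three hypothetical toy columns; the column names and values are illustrative only.

```python
from itertools import combinations
from math import comb, sqrt


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def high_correlation_pairs(columns, threshold=0.9):
    """List descriptor pairs whose absolute correlation exceeds the threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(columns.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


n_pairs = comb(70, 2)          # unique pairs among 70 descriptors
share_high = 25 / n_pairs      # fraction of pairs reported with |r| > 0.9

toy_columns = {
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "f2": [2.0, 4.0, 6.0, 8.0, 11.0],  # nearly proportional to f1
    "f3": [5.0, 1.0, 4.0, 2.0, 3.0],
}
flagged = high_correlation_pairs(toy_columns)
```

With 70 descriptors there are 70 × 69 / 2 = 2415 unique pairs, so 25 highly correlated pairs correspond to roughly 1% of all pairs.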
| Random state | Method 1 | Method 2 |
|---|---|---|
| 0 | 0.879 | 0.841 |
| 42 | 0.854 | 0.885 |
| 150 | 0.782 | 0.842 |
| 500 | 0.892 | 0.841 |
| 790 | 0.847 | 0.758 |
| 1000 | 0.826 | 0.854 |
| 7500 | 0.881 | 0.830 |
| 10 000 | 0.829 | 0.820 |
| Average R2 | 0.849 ± 0.034 | 0.834 ± 0.034 |
The similar performance of the two strategies suggests that, for large donor–acceptor molecules where conventional RDKit descriptors are insufficient, fragmentation into smaller chemically meaningful units followed by recombination yields informative representations that enhance machine-learning training. Model robustness and generalization were further assessed using 5-fold cross-validation (Tables S8 and S9).
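The per-method averages in the table above can be recomputed from the per-seed R2 values, as in the short sketch below. The population standard deviation (dividing by n) rounds to the reported ± 0.034 for both methods, whereas the sample standard deviation (dividing by n − 1) gives roughly 0.036; which convention was used is an inference from the numbers, not a stated fact.

```python
from statistics import mean, pstdev

# Per-seed test R2 values copied from the table above.
r2_by_seed = {
    "Method 1": [0.879, 0.854, 0.782, 0.892, 0.847, 0.826, 0.881, 0.829],
    "Method 2": [0.841, 0.885, 0.842, 0.841, 0.758, 0.854, 0.830, 0.820],
}

# (mean, population standard deviation) per method.
summary = {name: (mean(values), pstdev(values)) for name, values in r2_by_seed.items()}

for name, (avg, spread) in summary.items():
    print(f"{name}: {avg:.3f} ± {spread:.3f}")
```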
Table 4 summarizes the performance of the five machine-learning models using Method 1. XGBoost and RF achieved the highest average R2 values (0.849), followed by LGBM (0.812) and MLP (0.785), which exhibited moderate performance. SVR showed comparatively lower performance (0.718), consistent with its limited scalability in high-dimensional feature spaces.
| Random state | XGB | MLP | RF | SVR | LGBM |
|---|---|---|---|---|---|
| 0 | 0.879 | 0.760 | 0.877 | 0.713 | 0.831 |
| 42 | 0.854 | 0.811 | 0.869 | 0.725 | 0.828 |
| 150 | 0.782 | 0.770 | 0.796 | 0.657 | 0.800 |
| 500 | 0.892 | 0.845 | 0.881 | 0.818 | 0.841 |
| 790 | 0.847 | 0.735 | 0.838 | 0.657 | 0.812 |
| 1000 | 0.824 | 0.752 | 0.823 | 0.692 | 0.780 |
| 7500 | 0.881 | 0.866 | 0.874 | 0.813 | 0.853 |
| 10 000 | 0.829 | 0.738 | 0.829 | 0.650 | 0.748 |
| Average R2 | 0.849 ± 0.034 | 0.785 ± 0.047 | 0.849 ± 0.029 | 0.670 ± 0.054 | 0.812 ± 0.032 |
Repeated experiments across eight random states yielded low standard deviations (0.029–0.054), indicating stable predictions with minimal sensitivity to data partitioning and a low risk of overfitting. These results are visualized using violin plots in Fig. 8.
Using the experimental dataset, we further evaluated the model predictions for PCE and additionally explored the applicability of the model to Jsc and FF as auxiliary targets. The distribution of PCE prediction errors approximated a Gaussian profile, indicating minimal systematic bias (Fig. 9a). Scatter plots comparing experimental and predicted values further illustrate the predictive capability of the model, yielding R2 values of 0.69 for Jsc and 0.71 for FF (Fig. 9b–d). The corresponding prediction errors, quantified by MAE and RMSE, are summarized in Table S10 for the XGBoost model evaluated on the test set.
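For reference, the error metrics quoted in this section follow their standard definitions, which the sketch below implements. The helper `regression_metrics` and the PCE values are hypothetical and serve only to show the calculation; the mean signed error is included as a simple check for systematic bias.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, RMSE, and mean signed error (bias) for paired values."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": sqrt(ss_res / n),
        "bias": sum(errors) / n,
    }


# Hypothetical measured and predicted PCE values (%).
pce_measured = [10.0, 12.0, 14.0, 16.0]
pce_predicted = [11.0, 11.0, 15.0, 15.0]
metrics = regression_metrics(pce_measured, pce_predicted)
```

MAE and RMSE carry the units of the target, so for PCE they are read in percentage points, while R2 is dimensionless and depends on the spread of the target in the evaluated set.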
In contrast, predictions for the open-circuit voltage (Voc) showed relatively low accuracy (R2 = 0.2–0.4). This limitation32–34 is attributed to the descriptor set used in this study, which primarily captures molecular and electronic structure information but does not explicitly account for interfacial disorder, charge–transfer state energetics, or nonradiative recombination processes.
Finally, a comparison with two previous studies on fragmentation methods highlights the methodological novelty of our approach (Table 5). Rather than directly comparing identical datasets, this analysis focuses on differences in molecular representation and feature construction strategies. Kim et al.15 used predefined, synthesis-oriented molecular fragments that efficiently organize structural diversity but treat fragments largely independently, whereas Wu et al.14 employed a function-driven fragmentation scheme based on the electron push–pull principle, emphasizing electronic roles with limited structural uniformity across molecular classes.
In contrast, our study adopts a chemically meaningful fragmentation framework that incorporates explicit local symmetry information and integrates fragment-level features into a unified molecular representation. Combined with experimentally measured device parameters and polymer-specific chemical descriptors, this approach simultaneously captures electronic, structural, and processing effects, leading to improved predictive accuracy and enhanced generalizability. Our model was evaluated using a held-out test set from an in-house experimental database, while the results from Kim et al.15 and Wu et al.14 were taken from their respective publications, each based on independently constructed datasets and evaluation protocols.
To further assess the generalization ability of the model, 130 data points35–64 independently collected from the literature and not used during model training were extracted and evaluated using the same XGBoost model. Despite the inherent heterogeneity of literature data, including variations in material combinations and experimental conditions, the model achieved an R2 of 0.62, an MAE of 2.11, and an RMSE of 2.63, indicating reasonable predictive performance on unseen data (Fig. S9).
The SHAP analysis (Fig. 10) identifies key features governing PCE prediction, which are primarily associated with molecular stability, local electronic structure, and processing conditions, highlighting the multiscale nature of performance-determining factors in organic photovoltaic devices.
Among the top five features, RC_FpDensityMorgan1 shows the strongest impact, indicating that dense and electronically coherent core environments favor efficient charge transport and reduced energetic disorder. 1_fr_halogen further highlights the positive role of halogen substitution, consistent with enhanced intermolecular interactions and frontier orbital tuning. PB_GasteigerChargeMedian emphasizes the importance of balanced local charge distribution in promoting charge separation while suppressing recombination. The prominence of solution concentration (wt%) reflects its role as a key processing parameter that mediates thin-film formation and structure–property coupling, with its influence becoming significant in specific molecular contexts. RC_VSA_EState6 indicates that exposed electronic environments at the molecular core critically affect interfacial charge–transfer processes.
Beyond intrinsic molecular descriptors, several processing parameters including concentration, thermal annealing temperature, solvent type, device structure, and active-layer thickness also appear among the influential features. Although their average SHAP magnitudes are smaller, these parameters can become decisive depending on molecular structure, suggesting that they act as conditional modifiers of morphology, crystallinity, and charge–transport pathways. Notably, symmetry-related descriptors also contribute non-negligibly, implying that fragment-level local symmetry influences molecular packing regularity and orientational degeneracy.
Consistent with the SHAP results, the feature-importance rankings obtained from the XGBoost, RF, and LGBM models also highlight RC_FpDensityMorgan1 and 1_fr_halogen as major contributors (Fig. S10 and S11), supporting the reliability of the SHAP-based interpretation across different learning algorithms.
Overall, SHAP analysis demonstrates that PCE is governed by a synergistic interplay between molecular electronic structure, topology, symmetry, and processing conditions. The fragment-based, symmetry-aware descriptor framework effectively captures these coupled effects, providing both high predictive accuracy and physically interpretable structure–process–property relationships.
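SHAP attributions are estimates of Shapley values, and the underlying idea can be shown exactly for a very small model. The sketch below enumerates all feature coalitions for a hypothetical three-feature function, replacing absent features with a baseline value. It is a brute-force illustration of the concept, not the SHAP library or the estimator applied to the XGBoost model in this study, and the toy model and inputs are invented.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(subset):
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_model(features):
    """Hypothetical model with two additive terms and one interaction."""
    a, b, c = features
    return 2.0 * a + 1.0 * b + a * c


x = [1.0, 3.0, 2.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
```

The attributions sum to the difference between the prediction and the baseline prediction, and the interaction term is shared equally between the two features involved. The same additivity is what allows per-feature contributions to be ranked for a trained model.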
Among the five machine-learning models evaluated, XGBoost showed the best performance with an R2 of 0.849. Robust validation using five-fold cross-validation and multiple random states confirmed the reliability of the models. SHAP analysis identified local molecular symmetry as a key factor influencing PCE, highlighting its important role in OPV performance.
Future work will extend this framework by refining symmetry definitions and integrating external literature data to explore new high-efficiency donor–acceptor combinations.
The computational codes developed and used for this study are openly accessible via GitHub at https://github.com/juhyun7749/OPV-ML-2025.
To ensure long-term preservation and reproducibility, a fixed version of the repository corresponding to the manuscript has been archived in Zenodo (DOI: https://doi.org/10.5281/zenodo.18688051).
All relevant datasets required to reproduce the analyses and figures presented in this article are included in the Zenodo archive.
The archived materials are provided under an open license to facilitate reuse and future research based on this work. The calculated molecular descriptors for the OPV donor and acceptor materials used in this study are openly available.
Supplementary information (SI): additional figures, model details, descriptor information, and dataset descriptions used in this study. See DOI: https://doi.org/10.1039/d5dd00496a.
Footnote
† These authors contributed equally to this work.
This journal is © The Royal Society of Chemistry 2026