Martin Seifrid‡*ab, Stanley Lo‡b, Dylan G. Choic, Gary Tombde, My Linh Lef, Kunyu Li§c, Rahul Sankar§c, Hoai-Thanh Vuong§c, Hiba Wakidi§c, Ahra Yi§c, Ziyue Zhu§c, Nora Schoppc, Aaron Pengc, Benjamin R. Luginbuhlc, Thuc-Quyen Nguyen*c and Alán Aspuru-Guzik*bdeghij
aDepartment of Materials Science and Engineering, North Carolina State University, Raleigh, North Carolina 27695, USA. E-mail: m_seifrid@ncsu.edu
bDepartment of Chemistry, Chemical Physics Theory Group, University of Toronto, 80 St. George St., Ontario M5S 3H6, Canada. E-mail: aspuru@utoronto.ca
cDepartment of Chemistry and Biochemistry, Center for Polymers and Organic Solids, University of California Santa Barbara, Santa Barbara, California 93106, USA. E-mail: quyen@chem.ucsb.edu
dDepartment of Computer Science, University of Toronto, 40 St George St, Toronto, ON M5S 2E4, Canada
eVector Institute for Artificial Intelligence, 661 University Ave. Suite 710, Toronto, Ontario M5G 1M1, Canada
fMaterials Department, University of California Santa Barbara, Santa Barbara, California 93106, USA
gDepartment of Chemical Engineering & Applied Chemistry, University of Toronto, 200 College St., Ontario M5S 3E5, Canada
hDepartment of Materials Science & Engineering, University of Toronto, 184 College St., Ontario M5S 3E4, Canada
iLebovic Fellow, Canadian Institute for Advanced Research (CIFAR), 661 University Ave., Toronto, Ontario M5G 1M1, Canada
jAcceleration Consortium, University of Toronto, 80 St. George St, Toronto, ON M5S 3H6, Canada
First published on 20th May 2024
Our study explores the current state of machine learning (ML) as applied to predicting and designing organic photovoltaic (OPV) devices. We outline key considerations for selecting the method of encoding a molecular structure and selecting the algorithm, while also emphasizing important aspects of training and rigorously evaluating ML models. This work presents the first comprehensive dataset of OPV device fabrication data mined from the literature. The top models achieve state-of-the-art predictive performance. In particular, we identify an algorithm that is used less frequently, but may be particularly well suited to similar datasets. However, predictive performance remains modest (R2 ≅ 0.6) overall. An in-depth analysis of the dataset attributes this limitation to challenges relating to the size of the dataset, as well as data quality and sparsity. These aspects are directly tied to difficulties imposed by current reporting and publication practices. Advocating for standardized reporting of OPV device fabrication data in publications emerges as crucial to streamline literature mining and foster ML adoption. This comprehensive investigation emphasizes the critical role of both data quantity and quality, and highlights the need for collective efforts to unlock ML's potential to drive advancements in OPV.
Given the many complex relationships and processes that ultimately determine OPV device performance, the potential of machine learning (ML) to accelerate the development of OPV materials and devices has emerged as a tantalizing promise. ML techniques could allow scientists to quickly screen potential combinations of donor (D) and acceptor (A) materials, or suggest the most appropriate device fabrication parameters for a given combination of materials. Previous efforts in using ML to predict the PCE of OPV devices have focused exclusively on the donor and acceptor materials. Notably, the Harvard Clean Energy Project10–12 and others13–19 have used DFT-computed molecular descriptors to predict PCEs. A similar approach has recently incorporated genetic algorithms in an effort to design high-performing non-fullerene acceptor-based (NFA) devices.20–22 Others have generated one-hot encodings based on human intuition regarding molecular substructures.23 However, the most accurate models to date are those that have used high-throughput domain-specific descriptors.17,24–30 Notably, the only device fabrication parameter to have been incorporated so far is the donor:acceptor (D:A) ratio.17 Other parameters, which – as discussed above – are known to play an important role in determining PCE, have not yet been explored.
In this work, we critically assess the state of ML for predicting PCE of OPV devices. We have curated a dataset of molecular structures, experimental device fabrication parameters and device performance, and evaluate the effect of various structural representations, ML algorithms, and device fabrication parameters on model performance. We also discuss difficulties related to gathering the dataset, as well as the considerations related to using experimental data from the literature, which is applicable to many materials domains outside of OPV.
To compile our dataset, we began with the dataset of 565 devices in Wu et al.,23 which only contained donor and acceptor names, short circuit current density (JSC), open circuit voltage (VOC), fill factor (FF), and PCE (Fig. 1). We first evaluated the quality of the data by comparing reported PCE values with values calculated from the product of JSC, VOC, and FF:
PCEcalculated = JSC × VOC × FF (1)
|PCEreported − PCEcalculated| > 0.01 × PCEcalculated (2)
The threshold of 0.01 was chosen such that, if the reported and calculated PCE values agree to within 1%, any remaining inconsistency can be attributed to minor rounding errors. This allowed us to quickly identify data points with incorrect or nonsensical entries. To gather the fabrication data, we reviewed the 277 original reports from which the dataset was initially constructed and extracted information for each of the 565 individual devices. To accurately extract the desired parameters, each paper was checked by multiple scientists. We found a number of duplicates, unreliable reports, and incorrect references in the original dataset, which were eliminated, leading to the 558 devices in the final dataset.
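As a minimal sketch, the consistency check of eqn (1) and (2) can be implemented in a few lines; the column names here are hypothetical, not the dataset's actual schema:

```python
import pandas as pd

# Toy rows standing in for mined device data (hypothetical column names)
df = pd.DataFrame({
    "Jsc": [17.5, 20.1],          # mA cm^-2
    "Voc": [0.80, 0.85],          # V
    "FF":  [0.70, 0.65],          # dimensionless
    "PCE_reported": [9.8, 12.5],  # %
})

# Eqn (1): recompute PCE from its components
df["PCE_calculated"] = df["Jsc"] * df["Voc"] * df["FF"]

# Eqn (2): flag entries deviating by more than 1% of the calculated value
df["inconsistent"] = (
    (df["PCE_reported"] - df["PCE_calculated"]).abs()
    > 0.01 * df["PCE_calculated"]
)
print(df[["PCE_calculated", "inconsistent"]])
```

The second toy row is flagged because its reported PCE exceeds the recomputed value by well over 1%.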
We associated every donor and acceptor name with a molecular structure in the form of a SMILES (Simplified Molecular Input Line Entry System) string. This proved particularly challenging because of the diversity of polymer and NFA names, as well as frequent overlaps between material names. Due to the lack of a standardized naming convention for conjugated molecules and polymers, there are cases where numerous entries were ambiguously labeled as “P1”, despite having markedly distinct structures. On the other hand, research groups will sometimes use different names for the same molecular structure. For example, IT-F appears as F-ITIC, ITVfIC, ITIC3, and ITIC-F. In these instances, duplicate labels were replaced with a consistent name for the purpose of clarity. Finally, representing molecules with SMILES strings has inherent limitations. Polymers cannot be easily represented as the repeating, statistical entities that they actually are. While this is an active area of research,31–34 we simply encoded the structure of the monomer. Additionally, regioregularity – which is a factor for both polymers and NFAs – is not easily represented in SMILES notation. For example, IT-F and other NFAs with mono-substituted end groups are usually obtained as a mixture of three isomers, the populations of which play a role in determining PCE.35–38 However, SMILES forces us to choose a single isomer.
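One way to catch different spellings of the same structure is SMILES canonicalization, sketched here with RDKit (one common toolkit for this, shown on thiophene as a toy example rather than an actual donor or acceptor):

```python
from rdkit import Chem

# Two different spellings of thiophene resolve to one canonical SMILES,
# so duplicate structures can be detected regardless of how they were drawn
a = Chem.CanonSmiles("C1=CC=CS1")  # Kekulé spelling
b = Chem.CanonSmiles("c1ccsc1")    # aromatic spelling
print(a == b)
```

Grouping entries by canonical SMILES rather than by material name sidesteps the "P1" ambiguity in one direction, though it cannot resolve cases where one name hides several distinct structures.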
To select film and device fabrication parameters, we considered fundamental processes in OPV, such as light absorption and exciton formation, exciton migration, charge separation, and charge migration to the electrodes. Each of these processes is influenced by specific parameters related to the donor, acceptor, and active layer casting conditions that can be directly controlled by the researcher. The selection of parameters aimed to capture the relevant information for these processes. These parameters included donor and acceptor material properties (HOMO, LUMO, Eoptg), D:A ratio, solvent, total solids concentration, additive (and concentration), active layer thickness, thermal annealing temperature, hole transport layer (HTL), electron transport layer (ETL), hole mobility, and electron mobility. Other well-studied device performance characteristics such as effective mobility, Langevin prefactor, and VOC vs. ln(I) slope were not reported with enough consistency to be included in the dataset.
In extracting material properties (HOMO, LUMO, Eoptg), we observed that multiple values were often reported for the same material. This may arise from a variety of factors: differences in purity, dispersity (in the case of polymers), measurement method, or measurement error.39 Using the reported values directly would result in data points with the same molecular representation (i.e. molecular structure) having multiple different property values. Accounting for such variation in a ML model is nontrivial, and disentangling experimental noise from genuine differences in material composition was not possible. For materials with variation in the reported property values, the mean was taken as the corresponding feature value. Some materials had identical values reported many times; upon further investigation, this repetition was due to multiple papers citing a single source (often the material property values reported in the paper describing the initial synthesis). To avoid biasing the mean, if a value was repeated more than ten times, only one of the identical entries was included in the calculation.
Extracting active layer and device fabrication details often required digging through the paper's supporting information or references. D:A ratios were not consistently reported, requiring reviewers to refer to the paper's Methods section in order to calculate them from the total solids concentrations of the individual (D and A) solutions. Additionally, some papers reported either active layer thickness or spin coating speed, as the two are related; thus, the reporting of these parameters was inconsistent across papers. For optional fabrication features (solvent additive and concentration, thermal annealing temperature and time), it was assumed that if they were not reported, they were not used.
Fig. 2 Schematic representation of different structural encodings of the common donor polymer PTB7-Th.
Alternatively, molecules can be decomposed into substructures, or fragments, which are then encoded as arrays. Extended-connectivity fingerprints (ECFP) effectively capture local structural information and are a common cheminformatics tool for modeling structure–activity relationships and calculating structural similarity. ECFP breaks down molecular structures into circular topological fingerprints around each atom up to a set maximum radius and hashes the presence or absence of the substructure into a bit vector of set length – similar to OHE but for substructures. ECFP includes additional atomic characteristics including “heavy” atom count, valence minus the number of hydrogens, atomic number, atomic mass, atomic charge, the number of implicit and explicit hydrogens, and whether the atom is part of a ring.42 BRICS decomposes molecules into fragments of varying size based on common retrosynthetic and drug-like substructures, which are then encoded as arrays of varying length.43 This representation makes it possible to identify common substructures and analyze their impact on the overall molecular properties. Recently, the polymer-unit fingerprint (PUFp) has also been introduced as a way to provide a more explainable fingerprint-based representation for conjugated molecules.44
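As a sketch of how such circular fingerprints are generated in practice (using RDKit's Morgan fingerprint, the usual ECFP implementation; the molecule and parameters here are illustrative, not the paper's exact settings):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Toy molecule: thiophene, a common building block of conjugated materials
mol = Chem.MolFromSmiles("c1ccsc1")

# Circular substructures up to radius 3, hashed into a 4096-bit vector
fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=4096)

print(fp.GetNumBits(), fp.GetNumOnBits())
```

Each set bit records that a particular atom-centered substructure is present; two molecules can then be compared bitwise, which is also what enables the Tanimoto kernel discussed later.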
Molecular structure can naturally be represented by a graph where atoms are the nodes and bonds are the edges. However, encoding the graph is also non-trivial. Several line notations (string-based representations) have been developed for this purpose, wherein atoms are represented by specific characters and the bonds can be either explicit or implicit. SMILES captures both the topology and connectivity of atoms, enabling programmatic comparison and manipulation of molecular structures.45 SELFIES (Self-Referencing Embedded Strings) is a new system aimed at generative models in particular.46 Unlike SMILES which can have many invalid strings, every string of SELFIES characters is inherently valid. Recently, graph neural networks (GNN) have demonstrated state-of-the-art performance on chemical regression tasks.47–50 Molecules are represented as graphs; node and edge features are vectors associated with each atom or bond, describing properties of the node (i.e. atomic number, hybridization, aromaticity etc.) or edge (i.e. bond order, conjugation, stereochemistry). Global features contain information about the entire molecule. The GNN then aggregates the features for each node based on the graph connections into messages, which are then used to learn updated nodes and edges. The global node allows for message-passing between all the nodes in the graph, pooling the edge and node features into global features, and can be thought of as a node that is connected to all other nodes.51 A final aggregation layer takes the updated graph and outputs the final prediction. In addition, the embeddings generated by GNNs can be fed into other model architectures. To generate the graph embeddings, the GNN is trained to predict a relevant set of properties. During training, the GNN generates a learned representation (latent space) that captures information about the relationship between molecular structure and the target property. 
This representation, which is extracted from the embedding layer of the network, can then be used by another ML model to predict a separate target.48,52,53
Model performance can be quantified with a number of different metrics, including the R2 score (coefficient of determination), root mean square error (RMSE), and mean absolute error (MAE). These metrics all capture different aspects of model performance. However, we have observed that the correlation coefficient (R) is occasionally reported instead of the R2 score, and as the sole performance metric.23,24,28–30 This is not a robust metric, since R only captures the linear correlation between true and predicted values, while the R2 score quantifies the deviation of the model prediction from the ground truth. That said, while R is less strict in evaluating model performance, a model with a high R score may indicate modest success in training, and can still provide predictions that approximate the ground truth up to some correction term. In order to gain a better statistical understanding of the variability in model performance, cross-validation should be performed, which provides information on the variability of the model with different train and test sets. We achieve this by carrying out 5-fold cross-validation with seven independent random seeds, providing 35 different train and test set combinations. Given that the random seeds are independent, we can determine variability based on the standard error of the mean. Finally, it is also important to carefully apply feature scaling so as not to overestimate the model's performance. The scaling should be fitted on the training set features only, ensuring a standardized feature space that the models can learn from, and then applied to the test set features. If all features are scaled together, this can result in data leakage between the train and test sets, implicitly encoding information about the train set into the test set and yielding overestimates of model performance.54,55
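The leakage-free scaling and repeated cross-validation described above can be sketched with scikit-learn; the data are random toy features and the random forest is a stand-in model, not the paper's exact configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.random((120, 8))
y = X @ rng.random(8) + 0.1 * rng.standard_normal(120)

# Putting the scaler inside the pipeline means it is re-fitted on each
# training fold only, so no test-fold statistics leak into training
pipe = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=50, random_state=0),
)

scores = []
for seed in range(7):  # seven independent random seeds
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    scores.extend(cross_val_score(pipe, X, y, cv=cv, scoring="r2"))

scores = np.asarray(scores)  # 35 R2 values (7 seeds x 5 folds)
sem = scores.std(ddof=1) / np.sqrt(len(scores))
print(f"R2 = {scores.mean():.2f} +/- {sem:.2f}")
```

Fitting `StandardScaler` on the full dataset before splitting, by contrast, is exactly the leakage pattern warned against above.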
Linear methods are the simplest: they assume that there is some linear relationship between the input features and the target value, which can be used to predict the target value from a linear combination of the input features. One of the greatest advantages of linear regression is the interpretability of the model: linear effects are easy to quantify and describe. As expected, multiple linear regression (MLR) is only usable with relatively low-dimensional input data (material properties), where it achieves an R2 score of 0.35 ± 0.03. Linear models are not able to effectively handle binary (OHE, ECFP) or sequence-based (SMILES, SELFIES) features.
Non-parametric models such as k-nearest neighbors (KNN) make no assumptions about the distribution from which the data are drawn. The algorithm finds the k nearest training points to a new datapoint (i.e. a test point) and predicts its property from them. The distance metric and k can be user-defined or vary based on the local density of points. KNN is simple and versatile but can be computationally expensive for large datasets.56 While KNN is most often used for classification tasks, it can also be employed for regression, where it predicts the average target value of the k nearest points in the training set. Because the KNN algorithm simply predicts the average of the nearest data points, it performs remarkably well across many different representations. In fact, its performance is comparable to or better than that of much more complex models in the cases of OHE (R2 = 0.31 ± 0.03), material properties (R2 = 0.43 ± 0.03), SMILES and SELFIES (R2 = 0.41 ± 0.03 and 0.41 ± 0.04, respectively), and even graph embeddings (R2 = 0.43 ± 0.03).
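A toy illustration of KNN regression predicting the average of the k nearest targets (one synthetic feature, not OPV data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Five training points on a line
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)

# The two nearest neighbors of 2.5 are x=2.0 and x=3.0,
# so the prediction is their average target value
print(knn.predict([[2.5]]))  # → [2.5]
```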
Kernel methods apply a non-linear transformation (e.g. radial basis function, polynomial, sigmoid, etc.) to the input features – known as the kernel trick – which can then be used in many different algorithms. Support vector regression (SVR) uses this approach to find an optimal hyperplane in the same or a higher dimension that maximizes the margin (distance) between the hyperplane and the nearest data points while also minimizing the prediction error. SVR is effective for handling high-dimensional or non-linear data and outliers. However, it does not scale efficiently to very large datasets. Kernel ridge regression (KRR) is a regularization technique that takes advantage of the kernel trick in order to learn a linear relationship in the kernel space, and thus a non-linear relationship in the original space, similar to SVR. The two differ in their loss functions: KRR uses a squared-error loss, while SVR penalizes only predictions that fall outside a user-defined margin (i.e. epsilon) around the regression hyperplane. Gaussian process regression (GPR) also uses the kernel trick, but takes a Bayesian approach. GPs are probabilistic models that learn a probability distribution (posterior) over functions fitting the data points. A prior distribution, typically Gaussian, over all possible functions to model the data must be specified, with the covariance of the distribution determined by the kernel function. The posterior distribution incorporates information from the prior distribution and the dataset using Bayes' theorem, and is then used to make predictions on new, unseen data points. GPR is particularly useful for small datasets and can capture complex relationships while providing uncertainty estimates. The biggest disadvantage of GPs is their high computational demand, which remains a challenge for high-dimensional and large datasets.
In general, KRR and GPR are out-performed by other methods on all representations except for ECFP, where the Tanimoto kernel57 can be employed, leading to good predictive performance (R2 scores of 0.52 ± 0.02 and 0.56 ± 0.02, respectively). On the other hand, SVR is comparable to the top methods for most representations, including R2 scores of 0.53 ± 0.03 with ECFP and 0.56 ± 0.03 with Mordred descriptors.
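A minimal sketch of how a Tanimoto kernel can be paired with KRR via a precomputed kernel matrix; the binary "fingerprints" here are random toy vectors, not ECFPs of real materials:

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def tanimoto_kernel(X, Y):
    """Tanimoto similarity between rows of two binary fingerprint matrices."""
    inter = X @ Y.T                              # shared on-bits
    return inter / (X.sum(1)[:, None] + Y.sum(1)[None, :] - inter)

rng = np.random.default_rng(0)
X_train = (rng.random((40, 64)) > 0.7).astype(float)  # toy 64-bit fingerprints
y_train = X_train[:, :8].sum(1)                       # toy target
X_test = (rng.random((5, 64)) > 0.7).astype(float)

# scikit-learn accepts a precomputed Gram matrix in place of raw features
model = KernelRidge(kernel="precomputed", alpha=1e-3)
model.fit(tanimoto_kernel(X_train, X_train), y_train)
pred = model.predict(tanimoto_kernel(X_test, X_train))
print(pred.shape)
```

The same precomputed-kernel route works for GPR; the point is that the kernel operates on whole fingerprints at once, which is also why mixing it with per-feature fabrication distances is non-trivial, as discussed later.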
Decision trees are commonly used for both classification and regression. Decision trees are interpretable and versatile, making them valuable for a wide range of applications in data analysis and prediction. They work by recursively partitioning the data into subsets based on feature values, ultimately leading to a tree-like structure where each leaf node represents a predicted outcome. However, they are prone to overfitting since every data point can be explained with sufficient tree depth. This can be mitigated by using ensemble methods and carefully selecting the model parameters such as tree depth. We evaluate a number of different ensemble tree-based methods, including random forest (RF), gradient boosted trees (XGBoost),58 natural gradient boosting with uncertainty estimation (NGBoost),59 and histogram gradient boosting (HGB), the implementation of which is inspired by LightGBM.60,61 Tree-based methods are able to handle non-linear relationships within the feature space, perform automatic feature selection,62 and can be interpreted by analyzing the decision boundaries established by the trees. RF is an ensemble of decision trees trained on various bootstrapped subsets of the entire dataset, with features randomly selected for branches within the trees. A final prediction and uncertainty estimate are obtained by consensus of the ensemble. XGBoost is an optimized implementation of gradient boosting that sequentially combines many weak individual predictive models to form a strong final ensemble predictive model, with each subsequent model trained to correct the errors of the previous models. However, this comes with trade-offs in the form of lack of interpretability, trial-and-error hyperparameter tuning, and higher computational cost. NGBoost is similar to XGBoost, also learning the errors of previous trees through boosting, but predicts the parameters of a distribution rather than the regression value directly.
NGBoost is useful when uncertainty estimation is desired and offers better interpretability through the analysis of the predicted distributions. HGB regression is much more efficient by virtue of first binning the input features and building the trees off of bins rather than floating point values. As a result, the number of split candidates and the computational cost are significantly reduced. As an added benefit, one of the bins is reserved for missing values, which is very useful for incomplete datasets as will be discussed below. The tree-based methods discussed above are almost indistinguishable in terms of predictive performance, except for XGBoost which performs slightly worse. Most importantly, tree-based methods perform best with ECFP (between R2 = 0.58 ± 0.02 and 0.59 ± 0.02), and Mordred descriptors (between R2 = 0.56 ± 0.03 and 0.6 ± 0.03) because ECFP and Mordred descriptors extract relevant chemical and molecular features unlike other molecular representations such as SMILES or one-hot encodings.
We also evaluated two common DL model architectures: multi-layer perceptron (MLP) neural networks (NN) and GNNs. MLPs are versatile and powerful models that have been shown to be capable of learning complex relationships between the input features and target variables. These models are made up of multiple interconnected layers of neurons with non-linear activation functions (e.g. ReLU, sigmoid, tanh). The weights and biases of the neurons are optimized over many iterations using gradient descent and backpropagation via the connections between neurons to minimize the loss function (mean squared error). However, they require careful architecture design, parameter tuning, and large quantities of training data. GNNs are neural network architectures designed to handle data that can be represented in the form of a graph. GNNs can effectively learn relationships and patterns in molecular structures because local features can be influenced by distant atoms and bonds through message-passing.47,50,63 Here we utilize the GraphNets architecture for GNN prediction, and the ChemProp message passing networks for generating the molecular embeddings.48–50,64,65
The tree-based models (RF, HGB, XGB, NGB) significantly outperform the DL models (i.e. MLP and GNN) because they are insensitive to irrelevant features and can learn irregular functions.62 Given the small size of our dataset, the limitations of DL models are pronounced. DL models can easily overfit to a small training set by learning from spurious correlations of irrelevant features and predict inaccurate output values.
The best DL model (MLP trained on ECFP, R2 = 0.46 ± 0.04) performs well because the ECFP representation extracts relevant features from the molecular structure,42 making the inherently challenging task of learning the most relevant features from sparse and low quantity data less difficult for the MLP algorithm. In addition, DL models are biased towards smooth functions which make it challenging for the model to learn from tabular data and discrete data, as is the case with this dataset.62 Interestingly, while the GNN regressor struggled due to the size of the dataset, the graph embeddings extracted from the GNN embedder combined with a tree-based or kernel-based prediction model outperformed the direct prediction of PCE from the graph.
Independent of model choice, the best representations are found to be material properties, ECFP, Mordred descriptors, and graph embeddings (model performance in Table S1†). OHE is a useful baseline representation: if a model performs no better with a new representation than with OHE, the new representation is likely uninformative. OHE performs poorly across all of the models because none of the chemical, substructural, or atomic characteristics are captured. We find that most representations perform significantly better than OHE. While the PUFp representation provides model performance comparable to the material properties representation (Table S2†), it suffers from a lack of generalizability, since the PUFp workflow creates a fingerprint that is specific to the dataset.44 However, when only including fabrication conditions, the nature of which will be detailed in Section 2.4, the difference is very small. This suggests that fabrication conditions alone are not very informative. None of the models that were evaluated were able to take advantage of the sequence information in SMILES and SELFIES, explaining their poor performance. Models such as long short-term memory (LSTM) networks may be useful here. However, the dataset is likely too small for an LSTM network to learn the syntax and grammar of string-based representations in addition to the complex relationships between representation and properties.66,67
Despite the modest R2 scores, we find that our top-performing models (RF and HGB) either match or surpass the state of the art for other similar OPV datasets (Fig. S2†).22,29 In particular, we achieve better predictive performance (R2 = 0.5 ± 0.03) using only ECFP and the material properties available in the dataset in comparison to the performance (R2 = 0.4) achieved when using computed descriptors.22
We also explored the use of multi-output models for predicting PCE. While the most common approach is to predict PCE as a single output, PCE can also be predicted as the product of its components – JSC, VOC, and FF (eqn (1)). We selected three models: two that natively support multi-output regression in scikit-learn (RF, HGB), and an artificial neural network (ANN) built in PyTorch, similar to the MLP described above (further details available in Methods). The models were trained concurrently on PCE, JSC, VOC, and FF. Model scores for PCE prediction were calculated using eqn (1) from the predicted JSC, VOC, and FF (Fig. S3†). Ultimately, model performance was not significantly different between directly predicting PCE (Fig. 3) and predicting PCE from JSC, VOC, and FF. However, we observed that all models were able to predict JSC much better than VOC, despite heuristics suggesting a strong correlation between VOC and LUMOA – HOMOD,68 which suggests that this problem is worth further investigation.
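A minimal sketch of the multi-output approach with a natively multi-output scikit-learn model; the features and the stand-ins for JSC, VOC and FF are toy data, not the actual dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((150, 6))

# Toy stand-ins for the three PCE components (plausible ranges only)
Y = np.column_stack([
    X[:, 0] * 20,          # "Jsc"-like, mA cm^-2
    0.6 + 0.4 * X[:, 1],   # "Voc"-like, V
    0.5 + 0.3 * X[:, 2],   # "FF"-like
])

# RandomForestRegressor accepts a 2-D target and fits all outputs jointly
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, Y)

jsc, voc, ff = model.predict(X[:5]).T
pce = jsc * voc * ff  # recombine via eqn (1)
print(pce.shape)
```

Scoring the recombined `pce` against ground-truth PCE, rather than scoring each component separately, is how the multi-output models are compared to the direct single-output models.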
Fig. 3 Heatmap of model performance for predicting PCE from the molecular structure of the donor and acceptor materials as measured by the R2 score. Tree-based models (RF, XGBoost, HGB, and NGBoost) perform best on a dataset of this size (558 points). Structural representations based on ECFP and Mordred descriptors perform best across the board. Numbers in each cell correspond to the average R2 score of the model over seven independent five-fold cross-validations ± the standard error of the mean. Gray cells correspond to models that were not tested or failed to yield reasonable results (0 ≤ R2 ≤ 1). Heatmaps for the RMSE, MAE and R (for comparison to other published works only) scores are presented in Fig. S1.†
Encoding scalar device fabrication features such as energy levels, annealing temperature or the D:A ratio is straightforward. However, encoding categorical features is more challenging. In the case of OPV device fabrication, these are (i) solvent, (ii) solvent additive (if used), (iii) the hole transport layer (HTL), and (iv) the electron transport layer (ETL). Similarly to the donor and acceptor materials, these categorical features could be encoded using one-hot encoding or as labels. However, as seen with OHE of the active layer materials, these encodings are not informative for a ML model. The HTL and ETL were instead encoded with their energy levels, as determined from literature reports (Table S3†). The solvent and solvent additive were encoded with physicochemical descriptors obtained from the HSPiP database (Table S4†).69
The features beyond molecular structure were split into four broad categories: material properties (as in Fig. 3), film fabrication parameters (fabrication), device architecture, and electrical characterization. As discussed in greater detail below, a small number of features (total solids concentration and spin coating speed in particular) were missing from many of the data points (Fig. 4a). These features were excluded in order to maximize the size of the dataset on which the models were trained (Fig. 4b and S4†). Data points with missing values were only provided to the HGBR models, while those data points were removed for all other models. Consequently, the datasets provided to all algorithms except HGBR are ca. 95% of the size of the HGBR model's dataset (Fig. 4b), which we do not regard as being a significant difference when comparing the HGBR models to all others.
Fig. 4 Top: percentage (left) and amount of data (right) missing from key categories. Bottom: relative (left) and absolute (right) dataset sizes of subsets of processing parameters.
The correlation between each set of features and the output variables was evaluated using both Pearson's correlation coefficient (R) and Spearman's rank correlation coefficient (ρ). R is a measure of the extent to which two variables are linearly correlated, while ρ is a measure of rank correlation (i.e. the extent to which two variables are correlated through some monotonic function). Feature correlations (R and ρ) with PCE are relatively low (Fig. S5–S8†). Only some of the material property features show moderate correlation with PCE (|R| > 0.3 or |ρ| > 0.3): donor HOMO and LUMO, and all acceptor properties (HOMO, LUMO, HOMO–LUMO gap, and optical gap). These stronger correlations are expected from what is known about the function and design of OPV devices. Of the processing and device architecture features, only active layer thickness and spin coating speed show weak correlation with PCE (|R| > 0.15 or |ρ| > 0.15). There is also only moderate correlation at best (|R| ≤ 0.28 and |ρ| ≤ 0.25) between the selection of solvent and solvent additive and PCE. This is notable for two reasons. First, the identity and physical properties of solvents and solvent additives are known to strongly influence the performance of OPV devices through a number of factors (e.g., material solubility and rate of evaporation),4,5 so a stronger correlation might have been expected. Second, it is possible that there is a degree of spurious correlation between solvent or solvent additive and PCE, because higher-performing materials are more often dissolved in particular solvents.
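The difference between the two coefficients can be illustrated with a monotonic but non-linear toy relationship, for which ρ is exactly 1 while R falls short of 1:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3  # perfectly monotonic, but not linear in x

r, _ = pearsonr(x, y)      # linear correlation: below 1
rho, _ = spearmanr(x, y)   # rank correlation: exactly 1
print(round(r, 2), round(rho, 2))
```

Reporting both, as done here, distinguishes features that track PCE linearly from those with a more general monotonic relationship.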
Although the KRR and GPR models with ECFP structural representations, which used the Tanimoto kernel, performed well, we were not able to evaluate their performance when device fabrication features are included. While mixing different kernels is possible, it is unclear whether doing so would be meaningful and how the weights would be learned.52,57 The Tanimoto kernel calculates distance based on all 4096 ECFP features, while device fabrication feature distances would be calculated one at a time. As such, assigning correct weights would be non-trivial.
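For reference, the Tanimoto kernel over binary fingerprints can be computed in a few lines; this vectorized sketch assumes dense 0/1 arrays rather than RDKit bit-vector objects.

```python
import numpy as np

def tanimoto_kernel(A, B):
    """Pairwise Tanimoto (Jaccard) similarity between rows of two binary matrices."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    shared = A @ B.T                  # bits "on" in both fingerprints
    a = A.sum(axis=1)[:, None]        # bits on in each row of A
    b = B.sum(axis=1)[None, :]        # bits on in each row of B
    return shared / (a + b - shared)

# One shared bit out of three total "on" bits -> similarity 1/3
K = tanimoto_kernel([[1, 1, 0, 0]], [[1, 0, 1, 0]])
```

Because the kernel treats the entire fingerprint as one object, there is no per-feature length scale to reweight, which is the crux of the mixing problem described above.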
Instead, we evaluate top-performing algorithms with ECFP and Mordred representations (the best molecular representations from the previous step): SVR and the tree-based algorithms (RF, HGB, XGBoost, NGBoost). When combined with device fabrication parameters, Mordred descriptors slightly outperform ECFP for encoding the molecular structure of the donor and acceptor materials on average (Fig. 5); however, the difference is not statistically significant. For example, the performance of RF models increases from R2 = 0.58 ± 0.03 to R2 = 0.60 ± 0.02.
Fig. 5 Heatmap of model performance as measured by the R2 score for predicting PCE from molecular structure and various subsets of OPV device fabrication parameters. Numbers in each cell correspond to the average R2 score of the model over seven independent five-fold cross-validations ± the standard error of the mean. Heatmaps for the RMSE and MAE scores are presented in Fig. S9.†
Additional features do not significantly increase performance over the models trained simply on molecular structure. Performance of all tree-based models except HGB degrades by ca. 0.01–0.02 when moving from models trained on molecular structure and material properties to models trained on molecular structure and the fabrication subset of features. This can be attributed to the fact that the material property features contain no missing data and are closely tied to molecular structure; that is not the case for the other features in the fabrication and device architecture subsets. Furthermore, adding the device architecture subset of features significantly decreases model performance for all but HGB across both ECFP and Mordred molecular descriptors. This may be related to the imbalanced selection of HTLs and ETLs in OPV device fabrication.
Empirically, device fabrication parameters are known to be important factors in OPV device performance. The lack of improvement in model performance when including these features suggests that the limitation lies in the quality of the training data. The top-performing model is HGB trained on Mordred descriptors and any of the device fabrication feature subsets (R2 = 0.61 ± 0.03). This may be because HGB can handle missing data points, and is therefore capable of learning input feature–target relationships even from data points with missing features (e.g., total solids concentration and spin coating speed).
The top ML model architecture, HGB, using Mordred descriptors for molecular structure and trained on all available features achieves an R2 score of 0.61 ± 0.03. The model predictions are poor for data points where the true PCE is below 5%. Predictions in this range are as high as ca. 12%. This is to be expected because this range is under-represented in the dataset and therefore the model sees fewer examples on average. A similar phenomenon can be observed in the high-PCE tail of the distribution. In both cases, predictive errors tend toward the points with the highest density in the dataset (ca. 10% PCE). Additionally, there are no predictions near the maximum PCE (15.71%). The same phenomena are observed for models with different molecular representations and data (Fig. S10†).
Notably, a model trained on Mordred descriptors and the device architecture subset of features (Fig. 6) is not significantly better than the same model trained on only Mordred descriptors, which achieves a R2 score of 0.6 ± 0.03 (Fig. 3). This suggests that the models' performance is limited by the quality of the underlying data.
Finally, we evaluated the performance of the HGB model when including features related to charge carrier mobility: log(hole mobility) and log(electron mobility) from space charge-limited current (SCLC) diode measurements, as well as the ratio of the two. The resulting model performance (R2 = 0.68 ± 0.02, Fig. S11†) is significantly better than anything else reported in the literature. However, including these values is of limited practical utility. In order to predict the PCE of a device, one would already need to have measured the hole and electron mobilities. This involves fabricating a device with a modified electrode architecture, and is therefore less efficient than simply measuring the device's performance experimentally.
To address the issue of missing values, we can try to impute data. Imputation fills in missing values based on various statistical treatments and assumptions about the data's distribution. We explored some of the most common imputation techniques: mean, median, most-frequent, uniform- and distance-weighted KNN, and iterative imputation.

The simplest strategies are mean, median, and most-frequent, which replace missing values with the mean, median, and mode of the available data, respectively. Mean imputation is most useful when the standard deviation of the distribution is relatively small and data is missing at random. Median imputation is more suitable for distributions with outliers because it is less sensitive to them. Most-frequent (or mode) imputation is not suitable for imbalanced distributions, as is the case for some of the features in our dataset.

In KNN-based strategies, missing values are completed using the mean of the k (usually five) nearest samples in the dataset. These strategies leverage the similarity between instances and are effective when data points with similar attributes tend to have similar outcomes. Uniform KNN imputation assigns equal weights to all neighboring points, whereas distance-weighted KNN imputation weighs neighbors by the inverse of their distance, so closer points have a greater influence than more distant ones.

Iterative imputation, also known as multiple imputation, is an advanced technique that models the relationships between features and iteratively refines the imputed values. Multiple imputations are combined to provide a more robust estimate of the missing values, accounting for the uncertainty associated with the imputation. The downside of iterative imputation is that it is computationally expensive and does not scale well.
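These strategies map directly onto Scikit-Learn's imputation classes; the small matrix below is illustrative, with columns standing in for, e.g., solids concentration and spin speed.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (exposes IterativeImputer)
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[20.0, 2000.0],
              [25.0, np.nan],
              [np.nan, 3000.0],
              [15.0, 2500.0]])

imputers = {
    "median": SimpleImputer(strategy="median"),
    "knn_distance": KNNImputer(n_neighbors=2, weights="distance"),
    "iterative": IterativeImputer(random_state=0),
}
filled = {name: imp.fit_transform(X) for name, imp in imputers.items()}
# e.g. the median imputer fills the missing concentration with median(20, 25, 15) = 20.0
```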
While imputing data can be a valuable technique to enhance the quality and completeness of a dataset, it can also be unproductive (no change in model performance) or even counterproductive (degraded model performance). If the proportion of missing data is large (30–40%), imputation may lead to inaccurate results. It may also be counterproductive when the missing data follow a pattern, or when missingness is associated with specific characteristics or factors, because the ML model can potentially learn from such correlations. Finally, imputation may be challenging in datasets with intricate dependencies between variables, as is likely the case for OPV materials and devices. Ultimately, imputing data did not change model performance in any measurable way (Fig. S12†). This suggests that the models are limited not only by the quantity of data, but also by its quality.
Donor material energy levels (HOMO, LUMO) and energies (Eoptg, EHLg) do not fit cleanly within a distribution (Fig. S15†), primarily due to the low number of unique molecular structures (143), while the same features for acceptors form smoother distributions due to the larger number of unique molecular structures (261). In both cases, over-represented materials are clearly visible as large counts at particular material property values. Additionally, it should be noted that measuring the energy levels of materials is often imprecise, and that there are known issues with the most common technique, cyclic voltammetry (CV).39 Polymer properties are much less frequently reported. We were only able to find 202 values of Đ (36% of the dataset), 185 for Mn (33%), and 106 for Mw (19%). Few papers even mention polymer molecular weight or dispersity, and many that do simply refer to a previous paper where the value was initially reported. The molecular weights and dispersities of commercial materials are rarely reported at all.
Total solids concentration is most often 20 mg mL−1, even though there are reports of concentrations between 6 mg mL−1 and 37.5 mg mL−1 (Fig. S15 and Table S5†). Additionally, data points are concentrated around round numbers such as 10, 15, 20 and 25 mg mL−1. Unfortunately, 115 entries did not have any information about total solids concentration (or individual solution concentrations from which we could calculate it). This hole in the dataset is particularly egregious since it is one of the easiest numbers to report. The fact that 21% of the entries lack this data suggests that we as a community of editors and reviewers are not paying close enough attention to the reproducibility of published data. The same applies to another parameter that can be recorded with ease: spin coating speed. The two most common values are 2000 rpm and 3000 rpm. However, only 48% of entries (266) reported spin coating speeds, even though the entries in this dataset are restricted to devices fabricated by spin coating. The same biases are also evident in the distributions of annealing temperature and time, although these data are reported more regularly. Finally, although the distribution of active layer thickness values is quite smooth and uniform, the values are centered at 100 nm, with the mode being exactly 100 nm. This is statistically unlikely, and is characteristic of rounding due to human bias, inaccurate measurements, or both.
We found the HGB model, a gradient-boosted decision tree-based algorithm, to be best at predicting PCE from molecular structure, and from molecular structure and device fabrication features. Although R2 scores of ca. 0.6 are modest, they represent state-of-the-art predictive performance for PCE. The same HGB algorithm outperformed models trained on a dataset almost twice the size (ca. 1000 data points) of ours,22 and matched the performance of models trained on a dataset more than double the size of ours (ca. 1300 data points).29 The performance of the HGB algorithm may be in part attributed to its ability to extract maximal information by incorporating missing values – of which there are many – with its binning strategy.60
There are a number of ways in which model performance could be improved. Firstly, the “big p, small n” problem – where the number of variables or features (p) is much larger than the number of observations or data points (n) – might be overcome by simply adding additional data points. The quality of the dataset could also be improved by including more varied device fabrication procedures for each material combination. The current iteration of the dataset primarily includes only the top-performing device from each donor:acceptor combination found in a paper. However, there are often others that may provide additional information despite lower PCE values. Furthermore, most reported PCE values correspond to the best single device, and not an average of multiple replicates, due to current reporting practices.
Our findings underscore the challenges stemming from mining literature data and inherent data quality issues. Although recent advances in literature mining with large language models are promising, the ultimate problem is one of data quality and not quantity.73–75 Standardizing data reporting in OPV-related publications becomes crucial. Establishing minimum reporting standards will streamline literature mining and enable the adoption of ML methodologies within the OPV domain. There has long been discussion in the literature about standards for reporting PCE measurements.76–80 While there are currently no standards or requirements for reporting how OPV devices are made, similar steps forward have recently been proposed for inorganic phosphors81 and perovskite photovoltaics.82
There is much that can be learned from ongoing discussions in the organic chemistry literature: recognizing the similarities between predicting OPV device performance and predicting reaction yields sheds light on shared complexities. Both domains face challenges involving two molecular structures as input features, processing variables, sparse data, non-smooth response surfaces, and biases in reported data, notably the underreporting of negative results.70,83–86 Acknowledging these parallels emphasizes the broader challenges in predictive modeling within chemistry-driven domains. Addressing these challenges necessitates enhanced dataset quality and size, and standardized data reporting.85,87,88
This comprehensive understanding highlights the pivotal role of data quality, underscores the importance of standardized reporting practices, and emphasizes the collective effort required to enhance dataset quality and modeling techniques, ultimately unlocking the full potential of ML in driving innovation within the OPV and broader chemistry domains.
In the file, sidechains were shortened to numbered R groups (e.g. R1, R2), each of which corresponded to a different structure. Automatic R group replacement as implemented in RDKit's ReplaceSubstructs was not possible because R-group labels are not valid SMILES or SMARTS. To automate the sidechain assignment process, we developed a method that uses placeholder metal atoms. Each R group was mapped to a placeholder metal atom and to the corresponding sidechain SMILES string. First, the R groups in the SMILES strings were replaced with the corresponding metal atoms. Then, the metal atoms were replaced with the full sidechain in the resulting RDKit Mol object using RDKit's ReplaceSubstructs function. Finally, the cleaned structure was returned as a canonicalized SMILES string.
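A minimal sketch of this placeholder-metal approach is below; the R-group table, placeholder metals, and sidechains are hypothetical examples, and real entries would map each label in the source file to its drawn structure.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical mapping: R-group label -> (placeholder metal, sidechain SMILES)
R_GROUPS = {
    "[R1]": ("[Au]", "CCCCCCCC"),    # e.g. n-octyl
    "[R2]": ("[Cu]", "CC(CC)CCCC"),  # e.g. 2-ethylhexyl
}

def expand_sidechains(smiles: str) -> str:
    # Step 1: textual substitution, since R-group labels are not valid SMILES
    for label, (metal, _) in R_GROUPS.items():
        smiles = smiles.replace(label, metal)
    mol = Chem.MolFromSmiles(smiles)
    # Step 2: structural substitution of each metal atom with the full sidechain
    for metal, sidechain in R_GROUPS.values():
        mol = AllChem.ReplaceSubstructs(
            mol, Chem.MolFromSmiles(metal), Chem.MolFromSmiles(sidechain),
            replaceAll=True)[0]
    Chem.SanitizeMol(mol)
    # Step 3: return the canonicalized SMILES
    return Chem.MolToSmiles(mol)
```

Metal atoms make convenient placeholders because RDKit parses them without valence complaints and they are guaranteed not to occur in organic semiconductor backbones.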
• Certain features (e.g., spin speed and thickness) were often simply reported as ranges. In these instances, we entered the average of the minimum and maximum values.
• When LUMO energies were not directly reported, HOMO + Eoptg was accepted.
• EHLg was calculated from the difference between the fitted HOMO and LUMO energy values.
• HOMO/LUMO: identical donor and acceptor molecules had variable HOMO/LUMO levels from paper to paper. Thus, either previous papers from the same research group or widely accepted universal UPS measurements were used in place of the omitted/variable values. Furthermore, when values for Eoptg were not reported, values were obtained by extracting the absorption onset from figures using WebPlotDigitizer (https://automeris.io/WebPlotDigitizer/).89
• Information about polymer molecular weight was very challenging to obtain. Mw, Mn and Đ are not consistently reported. Often, we were required to follow a trail of references through a group's publications to find the original synthesis, and assumed that the molecular weight distribution of the polymer was the same in subsequent reports.
• If two of Mw, Mn and Đ were reported, the third value was calculated.
• When solvents were presented as a mixture, they were often reported as a percentage by volume (% v/v). From this, a ratio of the two solvents presented was calculated (e.g., 15% v/v = 6.667:1).
• When recording reported solvent additives, some papers would include the additive, but not provide the additive percentage. In these cases, the solvent additive was included, while the concentration (%) was omitted, resulting in an incomplete data point.
• If a solvent system was reported as a ternary mixture (e.g., o-DCB:CB:DIO 1:1:0.3), the minority solvent was recorded as an additive and its concentration was calculated from the ratio.
• If active layer thickness for a particular device was not reported in a paper, we substituted the thickness of a similar device in the same paper if the total solids concentration and spin coating speed were the same. Otherwise, the data point was omitted.
• Reports of charge extraction by linearly increasing voltage (CELIV)90 measurements for the “faster carrier” were excluded, as this did not specify whether hole or electron mobility was recorded.
• From each set of parameters VOC, JSC, FF, we calculated the PCE with the formula PCE = VOC × JSC × FF/Pin, where Pin is the incident power density (100 mW cm−2 under standard AM1.5G illumination). When comparing these with reported values within the dataset, we looked for discrepancies > 1% of the PCE values. In these cases, we compared the values in the dataset and those in the relevant paper. We often found that values had simply been misreported. The most common errors were: (i) inconsistent reporting of FF as either a percentage or a fraction (i.e. a 100-fold discrepancy in FF), (ii) values reported for a different device, or (iii) rounding errors in the reported PCE. Because of these discrepancies, we chose to use our calculated PCE instead of the reported PCE values.
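This consistency check is easily automated; the helper below assumes standard AM1.5G illumination (Pin = 100 mW cm−2) and the 1%-absolute tolerance described above, and treats any FF above 1 as a percentage.

```python
def pce_from_jv(voc_v, jsc_ma_cm2, ff, p_in_mw_cm2=100.0):
    """PCE (%) from VOC (V), JSC (mA cm^-2), and FF (fraction or percentage)."""
    if ff > 1.0:           # FF reported as a percentage rather than a fraction
        ff /= 100.0
    return voc_v * jsc_ma_cm2 * ff / p_in_mw_cm2 * 100.0

def is_misreported(reported_pce, voc_v, jsc_ma_cm2, ff, tol=1.0):
    """Flag entries whose reported PCE deviates by more than tol (absolute %)."""
    return abs(pce_from_jv(voc_v, jsc_ma_cm2, ff) - reported_pce) > tol
```

For example, a device with VOC = 0.8 V, JSC = 20 mA cm−2, and FF = 0.7 yields PCE = 11.2%, whether the FF is recorded as 0.7 or 70.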
To reduce the number of incomplete data points, we replaced annealing temperatures and times, as well as solvent additives with default values. Devices that were not annealed were assigned default annealing temperatures of 25 °C and times of 0 minutes. Formulations that did not use a solvent additive were assigned a solvent additive label of “_” and an additive concentration of 0%.
Common donor and acceptor materials were assigned many different energy level values in the dataset. However, our models have no way of accounting for that variability, which can arise from a number of possible sources.39 To standardize the reported energy levels, we fit Gaussian distributions to any material with two or more unique energy level values, and replaced them with the fitted mean. However, certain materials were reported to have exactly the same energy level in multiple papers. Upon further investigation, we found that this is due to works simply citing a value from an earlier paper. In instances where the same energy level value occurred more than ten times for a given material, we only counted it once when performing the Gaussian fitting.
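Because the maximum-likelihood estimate of a Gaussian mean is simply the sample mean, this standardization reduces to averaging the de-duplicated values; the sketch below applies the more-than-ten-repeats rule described above (the example HOMO values are illustrative).

```python
import numpy as np

def standardized_level(values, max_repeats=10):
    """Replace a material's reported energy levels with the fitted Gaussian mean,
    counting any value that occurs more than max_repeats times only once."""
    vals = np.asarray(values, dtype=float)
    uniques, counts = np.unique(vals, return_counts=True)
    kept = np.concatenate(
        [np.repeat(u, 1 if c > max_repeats else c) for u, c in zip(uniques, counts)]
    )
    return kept.mean()   # MLE of the mean of the fitted normal distribution

# A value copied across 15 papers is counted once, so it does not dominate
homo = [-5.2] * 15 + [-5.3, -5.1]
```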
SMILES, SELFIES, and BRICS are alphanumeric representations of variable length, which cannot be used directly as ML model inputs. Therefore, they must be tokenized (i.e., converted into numeric representations) and “padded” so that all vectors are of the same size. Tokenizing creates a mapping between the unique symbols in a representation and integer values. The mapping is used as a reference for encoding each individual molecule. In order to achieve a consistent length across the dataset, the tokenized sequences were adjusted to the length of the longest sequence by post-padding with the appropriate number of zeros.
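A character-level sketch of this tokenize-and-pad step is below; note that real SMILES tokenizers typically use regex rules for multi-character tokens such as Cl and Br, which this simplification ignores.

```python
def tokenize_and_pad(strings):
    # Map each unique character to an integer; 0 is reserved for padding
    vocab = {ch: i + 1 for i, ch in enumerate(sorted({c for s in strings for c in s}))}
    max_len = max(len(s) for s in strings)
    # Encode each string, then post-pad with zeros to the longest length
    encoded = [[vocab[c] for c in s] + [0] * (max_len - len(s)) for s in strings]
    return encoded, vocab

seqs, vocab = tokenize_and_pad(["CCO", "C(=O)O", "C"])
```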
When training exclusively on molecular structures represented by ECFP, the Tanimoto kernel was used for KRR and GP models.
As described in Fig. 4b and S4,† models trained on molecular structure and processing parameters were supplied with specific subsets of the features, and data points with any missing values were dropped. Data points with missing values were not removed from the training set of the HGB models. When descriptors were used for solvent additives, devices in which solvent additives were not used were assigned zeroes, except for the HGB models, in which case they were left as missing values.
Certain models such as MLP are sensitive to the scale of feature values. To mitigate this, we applied min–max and/or standard scaling using Scikit-Learn's MinMaxScaler and StandardScaler. In the case of GPs, all features were scaled using the uniform QuantileTransformer. Scaling was also applied to the target values. Typically, target values were subjected to standard scaling, except in the case of MLP model architectures where min–max was also used. Crucially, the features were scaled based on the values in the training set. Scaling training and test sets together introduces data leakage into the model, which results in overestimation of model performance.54,55
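To avoid the leakage described above, the scaler must be fit on the training split only; a minimal Scikit-Learn sketch (the thickness values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[100.0], [150.0], [200.0]])   # e.g. active layer thickness (nm)
X_test = np.array([[175.0]])

scaler = StandardScaler().fit(X_train)  # statistics come from the training set ONLY
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)     # test data reuses the training-set mean/std
```

Calling `fit` (or `fit_transform`) on the combined train and test data would let test-set statistics leak into the model's inputs.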
Model performance was evaluated with 5-fold cross-validation.54 In addition, we used seven different random seeds for 5-fold splitting. In this way, we generated 35 different training and test sets.
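The 35 train/test sets come from re-shuffled 5-fold splits under seven random seeds, which can be reproduced as follows (the data array is a stand-in):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(100).reshape(-1, 1)   # placeholder dataset
splits = []
for seed in range(7):               # seven independent random seeds
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    splits.extend(kf.split(X))      # five train/test splits per seed
# 7 seeds x 5 folds = 35 distinct train/test partitions
```

Scikit-Learn's RepeatedKFold(n_splits=5, n_repeats=7) provides the same behavior in one object.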
Models that were deemed to be more sensitive to hyperparameters (KNN, MLP) were subjected to an inner loop of Bayesian hyperparameter optimization using the BayesSearchCV class of Scikit-Optimize before being trained on the full training set.95 The Bayesian cross-validation search performs 5-fold cross-validation on the training fold and Bayesian optimization to estimate the ideal model hyperparameters efficiently. The hyperparameters from the best model in the inner loop were used for the model trained on the full training set. Ranges for hyperparameter optimization were selected based on common values and expert intuition.
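The inner-loop structure can be sketched with GridSearchCV standing in for Scikit-Optimize's BayesSearchCV, which exposes the same fit/best_estimator_ interface; the synthetic data and hyperparameter grid here are illustrative.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 3))
y = X[:, 0] + 0.1 * rng.normal(size=150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Inner loop: 5-fold CV over the training fold only, never touching the test fold
search = GridSearchCV(KNeighborsRegressor(), {"n_neighbors": [3, 5, 7, 11]}, cv=5)
search.fit(X_tr, y_tr)

best = search.best_estimator_   # refit on the full training fold with the chosen hyperparameters
test_score = best.score(X_te, y_te)
```

Swapping in BayesSearchCV with a search space and n_iter budget replaces the exhaustive grid with Bayesian-optimized sampling.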
To assess overall model performance, we evaluated average R2, root-mean-square error (RMSE), mean absolute error (MAE), and Pearson correlation coefficient (R) over the 35 test sets. Variation in model performance was assessed based on the standard error over the 35 test sets with a sample size of 7 (the number of independent seeds). Predictions, scores and parity plots for all models can be found in the GitHub repository.
Footnotes
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4ta01942c
‡ Authors contributed equally.
§ Authors contributed equally.
This journal is © The Royal Society of Chemistry 2024