Asif Mahmood and Jin-Liang Wang*
Key Laboratory of Cluster Science of Ministry of Education, Beijing Key Laboratory of Photoelectronic/Electrophotonic Conversion Materials, School of Chemistry and Chemical Engineering, Beijing Institute of Technology, Beijing, 100081, China. E-mail: jinlwang@bit.edu.cn
First published on 26th November 2020
Machine learning (ML) is a field of computer science that uses algorithms and techniques to automate solutions to complex problems that are hard to solve with conventional programming. Owing to the chemical versatility of organic building blocks, a large number of organic semiconductors have been used in organic solar cells, and selecting a suitable one is like searching for a needle in a haystack. Data-driven science, the fourth paradigm of science, has the potential to guide experimentalists in discovering and developing new high-performance materials. The last decade has seen impressive progress in materials informatics and data science; however, data-driven molecular design of organic solar cell materials remains challenging. The data-analysis capability of machine learning methods is well known, and this review surveys their use in organic solar cell research. We first outline the basics of machine learning and the common procedure for applying it, and briefly introduce the main classes of machine learning algorithms along with related software and tools. We then review the current research status of machine learning in organic solar cells and discuss the challenges facing data-driven material design, such as the complexity of organic solar cell performance metrics, the diversity of chemical structures and the programming skills required. Finally, we propose suggestions that can enhance the usefulness of machine learning for organic solar cell research.
Broader context
To provide an environment-friendly solution to the energy crisis, extensive research is under way across the various fields of solar cells, and organic solar cells are among the most promising candidates. After the tremendous success of machine learning in language translation and image recognition, it has gained significant attention from materials scientists. Finding efficient materials for organic solar cells is the prime task for further improvement. Because the chemical versatility of organic building blocks makes an enormous number of organic semiconductors possible, an innovative approach is required to replace the tedious and expensive trial-and-error method. In this review, we discuss the current state of machine-learning-assisted material design for organic solar cells. Serious challenges remain for the scientific community in accelerating this data-driven research paradigm, but the field has the potential to overcome these obstacles, and we propose suggestions for doing so. Machine-learning-based design can be a cost-effective way to foster innovation in organic solar cell research.
Scharber's model was developed to predict the photovoltaic efficiency of organic solar cells.18 It uses a few electronic parameters of the active layer materials (donor and acceptor) for prediction. Its scope is limited by the assumptions on which it is based, and it is hard to extend, i.e., it cannot easily include other descriptors such as structural, topological and thermodynamic ones. The predictive performance of this model is quite poor.19–21
Data-driven research has started a new paradigm and shown promising results within materials science. It can provide an understanding of the fundamental factors that govern the performance of materials for specific applications, and it makes efficient and effective use of the relevant information for material discovery.22,23 The systematic way of doing this is known as machine learning, which learns from past data and helps to screen candidates for laboratory work. The discovery of excellent candidate materials for organic solar cells can be made faster and cheaper through multidimensional design that combines machine learning (ML), DFT calculations and surveys of available experimental data (Fig. 1).
The use of machine learning (ML) should be encouraged because it offers a chance to uncover hidden information and trends. It uses digitized forms of chemical structures together with theoretically calculated and experimentally measured properties.24–26 Further development in the explainability and interpretability of ML models will increase their usefulness. Compared with other fields of materials science, the use of ML for organic photovoltaics is limited; however, the number of published papers has increased significantly over the last year. To the best of our knowledge, this is the first review article on the use of ML in organic solar cell research.
This review article is written to provide a comprehensive view of the machine learning methods used for organic photovoltaic materials. In Section 2, a brief description of machine learning and its classes is provided. In Section 3, basic steps to train ML models are given. In Section 4, the research status with respect to the applications of machine learning in organic solar cells is discussed. Section 5 discusses the limitations and drawbacks of machine learning and possible solutions.
The selection of an appropriate ML algorithm plays a key role, since it significantly affects prediction performance. There is no single best method for all cases, and a large number of machine learning algorithms exist (Fig. 3). Selecting an effective algorithm is therefore crucial and is mostly done by trial and error; a detailed literature search and sound knowledge of the algorithms themselves both help in making this choice. A range of learning algorithms can be applied (Fig. 3), depending on the type of data and the question posed. The scope of this review does not allow discussion of all the algorithms; readers interested in a more comprehensive understanding can consult the following review articles.27–30 Information about popular machine learning frameworks and libraries is summarized in Table 1.
Name | Description | Website |
---|---|---|
Scikit-learn | Free python-based library of machine learning algorithms | https://scikit-learn.org/stable/index.html |
TensorFlow | Open-source software library for dataflow and programming | https://www.tensorflow.org |
CRAN-machine learning | The Comprehensive R Archive Network (CRAN) is based on the R programming language and contains a variety of machine learning algorithms | https://cran.rstudio.com/web/views/MachineLearning.html |
Keras | Open-source Python library that supports data mining using deep learning algorithms | https://keras.io |
Weka | Waikato Environment for Knowledge Analysis (Weka) offers a variety of ML algorithms with a graphical user interface | https://www.cs.waikato.ac.nz/ml/weka |
Amazon Web Services machine learning | Offers fully managed solutions for machine learning through various libraries such as Keras, TensorFlow, and Apache Spark ML | https://aws.amazon.com/machine-learning |
IBM Watson machine learning | Efficient and fast machine learning facilities based on Apache Spark libraries. | https://www.ibm.com/cloud/machine-learning |
Google artificial intelligence | Cloud-based artificial intelligence machine learning framework. | https://ai.google/ |
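Because algorithm choice is largely a trial-and-error process, a small benchmarking loop is often the most practical starting point. The sketch below is illustrative only: it uses scikit-learn (listed in Table 1) with a randomly generated regression problem standing in for a real descriptor matrix, and compares several common regressors by cross-validated R2.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for a descriptor matrix (rows = molecules, columns = descriptors)
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

candidates = {
    "linear regression": LinearRegression(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "SVR (RBF kernel)": SVR(kernel="rbf", C=10.0),
    "k-NN": KNeighborsRegressor(n_neighbors=5),
}

# Score every candidate with 5-fold cross-validated R^2 and keep the best one
scores = {name: cross_val_score(est, X, y, cv=5, scoring="r2").mean()
          for name, est in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```

On real photovoltaic data the ranking can easily differ, which is exactly why such a comparison should be repeated for each new dataset.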
The required quantity of data varies from model to model, but a general rule of thumb is that at least 50 data points are necessary for a reasonable ML model; some models, such as neural networks, require much larger quantities. Data quantity and quality are perhaps the main challenges in the application of ML in materials science. A large fraction of the data is available in journal publications.
The values of different descriptors fall on different scales, so it is better to normalize them; this makes it easier to compare observations and to use them within a single algorithm. If the number of features (descriptors) is larger than the number of observations, or if features correlate strongly with each other, dimensionality reduction tools are used. Principal component analysis (PCA), linear discriminant analysis (LDA) and independent component analysis (ICA) are the best known. They help to reduce the dimension of the feature space and to identify the most relevant features, and they can also be used to visualize the data.
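As a concrete illustration of these preprocessing steps, the following scikit-learn sketch (with a randomly generated descriptor matrix standing in for real data) normalizes the features and then applies PCA:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Toy descriptor matrix: 100 molecules x 12 descriptors on very different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12)) * np.array([1e3, 1.0, 1e-2] * 4)

# Normalize every descriptor to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough principal components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape, round(float(pca.explained_variance_ratio_.sum()), 2))
```

Passing a float to `n_components` tells PCA to retain the smallest number of components whose cumulative explained variance reaches that fraction.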
An alternative approach for estimating a model's performance is to split the available data into training/test sets several times, each time with a different group of observations in each set. This checks the performance several times, and the results are then aggregated over the train/test splits. The process also provides a measure of the variability and stability of the model performance. It can be executed using cross-validation (CV) and bootstrapping procedures.
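Both procedures can be sketched in a few lines of scikit-learn; the dataset below is synthetic and stands in for any descriptor/property table:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=1)
model = RandomForestRegressor(n_estimators=50, random_state=1)

# Cross-validation: five different train/test splits, one R^2 score per split
cv_scores = cross_val_score(model, X, y,
                            cv=KFold(n_splits=5, shuffle=True, random_state=1))
print("CV mean/std:", round(cv_scores.mean(), 2), round(cv_scores.std(), 2))

# Bootstrapping: resample with replacement, test on the out-of-bag observations
rng = np.random.default_rng(1)
boot_scores = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X), replace=True)
    oob = np.setdiff1d(np.arange(len(X)), idx)
    boot_scores.append(model.fit(X[idx], y[idx]).score(X[oob], y[oob]))
print("bootstrap mean:", round(float(np.mean(boot_scores)), 2))
```

The spread of the split scores (not just their mean) is what indicates the stability of the model.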
Generally, a material property depends strongly on specific factors. The selection of suitable descriptors for a given property is therefore a crucial step before applying ML, especially for microscopic descriptors that are experimentally and computationally expensive to determine. A good material descriptor should at least meet three criteria: it should be (i) a unique characterization of the material, (ii) sensitive to the target property, and (iii) easy to calculate.
We have grouped the reported studies on the basis of the input data. This grouping is not strict, however, since some studies use multiple types of input.
Fig. 4 Different types of molecular representations applied to one molecule. Adapted with permission from ref. 34. Copyright 2018, AAAS.
Name | Description | Website | Ref. |
---|---|---|---|
DRAGON | Software that can calculate 5270 descriptors | http://www.talete.mi.it | 36 |
E-DRAGON | Web version of DRAGON; can calculate over 3000 descriptors for molecules with up to 150 atoms | http://www.vcclab.org/lab/edragon/ | 37 |
Mold2 | Software to calculate 779 descriptors | https://www.fda.gov/science-research/ | 38 |
Mordred | Software that can calculate over 1800 2D and 3D descriptors | https://github.com/mordred-descriptor/ | 39 |
PaDEL-Descriptor | Tool that can calculate 1444 1D and 2D descriptors, 431 3D descriptors, and 12 types of fingerprints | http://www.yapcwsoft.com/dd/ | 40 |
MOE | Tool to calculate over 300 topological, physical and structural descriptors | http://www.chemcomp.com/ | |
MOLGEN QSPR | Software to calculate 708 arithmetical, topological, and geometrical descriptors | http://molgen.de/ | 41 |
ChemoPY | Python package to calculate 1135 2D and 3D descriptors | http://code.google.com/p/pychem/ | 42 |
BlueDesc | Software to calculate 174 descriptors | http://www.ra.cs.uni-tuebingen.de/software/bluedesc/ | 43 |
PowerMV | PC software for calculating 1000 descriptors | https://www.niss.org/research/software/ | 44 |
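In practice one of the packages above (or a cheminformatics toolkit such as RDKit with its Morgan/ECFP fingerprints) would be used to turn structures into fixed-length vectors. The toy function below is only meant to convey the hashing idea behind such fingerprints: fragments of a SMILES string (here naively taken as character substrings rather than chemically perceived atom environments) are hashed into positions of a bit vector.

```python
import hashlib
import numpy as np

def toy_fingerprint(smiles: str, n_bits: int = 64, max_len: int = 3) -> np.ndarray:
    """Hash every substring of a SMILES string (up to max_len characters) into a
    fixed-length bit vector -- a crude, purely illustrative stand-in for
    chemically aware fingerprints such as ECFP or MACCS."""
    bits = np.zeros(n_bits, dtype=np.uint8)
    for size in range(1, max_len + 1):
        for i in range(len(smiles) - size + 1):
            fragment = smiles[i:i + size]
            h = int(hashlib.md5(fragment.encode()).hexdigest(), 16)
            bits[h % n_bits] = 1  # set the bit this fragment hashes to
    return bits

fp = toy_fingerprint("c1ccc2c(c1)cc(s2)C=O")  # a small thiophene-fused motif
print(int(fp.sum()), "of", fp.size, "bits set")
```

Real circular fingerprints hash atom environments rather than text, but the resulting representation, a sparse fixed-length bit vector, is the same kind of object fed to the ML models discussed below.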
In 2011, Aspuru-Guzik's group used ML for the discovery of promising OPV donor materials.45 Current–voltage properties of 2.6 million molecular motifs were modelled using linear regression. On the basis of feature predictions, they identified benzothiadiazole, pyridinethiadazole and thienopyrrole as the top candidates.
Zhang et al. established a data set of 111000 molecules and trained an ML model using random forest (RF).46 With this model, they predicted the LUMO and HOMO levels with errors below 0.16 eV without any DFT calculations, which can speed up high-throughput screening of organic semiconductors for solar cells.
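A minimal sketch of this kind of surrogate model, using random synthetic descriptors and invented linear HOMO/LUMO targets in place of the real 111000-molecule dataset, could look as follows (random forests handle both targets at once via multi-output regression):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: random descriptors with invented linear HOMO/LUMO targets (eV)
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 30))
homo = -5.2 + 0.1 * X[:, :5].sum(axis=1) + rng.normal(scale=0.05, size=500)
lumo = -3.6 + 0.1 * X[:, 5:10].sum(axis=1) + rng.normal(scale=0.05, size=500)
Y = np.column_stack([homo, lumo])

# Fit one forest that predicts both frontier-orbital energies simultaneously
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, random_state=42)
model = RandomForestRegressor(n_estimators=200, random_state=42).fit(X_tr, Y_tr)
mae = mean_absolute_error(Y_te, model.predict(X_te))
print("test-set MAE (eV):", round(mae, 3))
```

Once trained, predictions take microseconds per molecule, which is the whole point of replacing DFT in a screening loop.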
Su et al. designed a series of novel acceptors based on multi-conformational bistricyclic aromatic ene (BAE) derivatives.47 They predicted the PCE of these acceptors using an ML model built on experimental data with a cascaded support vector machine (CasSVM). The CasSVM model is a novel two-level network (Fig. 5): the first level consists of three subset SVM models taking JSC, VOC, and FF, respectively, as outputs, and the second level establishes the relationship between the first-level outputs and the ultimate endpoint, PCE. The best CasSVM model predicted the PCE of OPVs with a mean absolute error (MAE) of 0.35%, approximately 10% of the mean PCE (3.89%), and an R2 of 0.96. This approach can be very useful for experimental chemists in screening potential candidates before synthesis.
Fig. 5 The structure of the cascaded SVM QSAR model. Sub-1–3 are the input descriptors for JSC, VOC, and FF, respectively. SVM1–4 are the subset SVM models used for the prediction of JSC, VOC, FF and PCE, respectively. Reprinted with permission from ref. 47. Copyright 2018, Wiley-VCH Verlag GmbH & Co. KGaA.
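The two-level idea can be sketched with scikit-learn's SVR on synthetic data; the descriptor-to-parameter relationships below are invented for illustration, and the published model additionally tuned kernels and hyperparameters:

```python
import numpy as np
from sklearn.svm import SVR

# Invented descriptors and device parameters, for illustration only
rng = np.random.default_rng(7)
X = rng.normal(size=(200, 8))                                   # toy descriptors
jsc = 15 + 2.0 * X[:, 0] + rng.normal(scale=0.3, size=200)      # mA cm^-2
voc = 0.9 + 0.05 * X[:, 1] + rng.normal(scale=0.01, size=200)   # V
ff = 0.65 + 0.03 * X[:, 2] + rng.normal(scale=0.01, size=200)
pce = jsc * voc * ff                    # % under 100 mW cm^-2 illumination

# Level 1: one SVR per intermediate device parameter
level1 = {name: SVR(C=10.0).fit(X, t)
          for name, t in [("JSC", jsc), ("VOC", voc), ("FF", ff)]}
Z = np.column_stack([m.predict(X) for m in level1.values()])

# Level 2: SVR mapping the predicted (JSC, VOC, FF) triplet to PCE
level2 = SVR(C=10.0).fit(Z, pce)
print("PCE prediction for first molecule:", round(float(level2.predict(Z[:1])[0]), 2))
```

Note that, for brevity, the second level here is trained on in-sample first-level predictions; a faithful implementation would use held-out predictions to avoid leaking training information between the two levels.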
In recent years, non-fullerene acceptors have been extensively used in organic photovoltaics (OPVs).3–5,51–55 In 2017, Aspuru-Guzik et al. collected a data set of over 51000 non-fullerene acceptors based on benzothiadiazole (BT), diketopyrrolopyrroles (DPPs), perylene diimides (PDIs), tetraazabenzodifluoranthenes (BFIs) and fluoranthene-fused imides from the Harvard Clean Energy Project (HCEP).20 A data set of 94 experimentally reported molecules was used to calibrate the DFT methods for calculating the HOMO and LUMO values of the new non-fullerene acceptors. Gaussian process regression was used instead of the widely used linear regression because of the absence of a linear trend. They used the Scharber model to calculate the PCE of organic solar cells based on the designed non-fullerene acceptors and poly[N-9′-heptadecanyl-2,7-carbazole-alt-5,5-(4′,7′-di-2-thienyl-2′,1′,3′-benzothiadiazole)] (PCDTBT), a standard electron-donor material. The DFT-calculated HOMO and LUMO values of the acceptors and the experimentally reported HOMO and LUMO values of PCDTBT were used as input for the Scharber model. To check the PCE prediction ability of the Scharber model, they compared its predictions with a set of 49 reported experimental values; only a weak correlation was found (r = 0.43 and R2 = 0.11).
Prediction of PCE and of specific device properties are equally important. To improve a specific property, it is important to find the relationship between that property and the molecular descriptors.56 For example, most high-performing OSC devices show a low open-circuit voltage (VOC). In BHJ OSCs, charge separation is typically associated with large voltage losses because of the extra energy required to split excitons into free carriers; this voltage loss in high-performance OSCs is usually around 0.6 V, which is 0.2–0.3 V higher than that of c-Si and GaAs-based solar cells.57 Non-fullerene acceptors with extended thin-film absorption and suitable energy levels can help to achieve a balanced trade-off between VOC and JSC,58 and their structural versatility allows highly tunable absorption and molecular energy levels. Machine learning can speed up the screening of suitable materials, and prediction of specific parameters will help to further enhance the PCE. Aspuru-Guzik et al. calibrated the VOC and JSC values calculated from the Scharber model against available experimental data using structural similarity.21 Information on the molecular graph was extracted with extended connectivity fingerprints and exploited using a Gaussian process. This calibration reduced the functional dependence of the calculated properties and will help to ease high-throughput virtual screening.
In 2019, Sun et al. collected a dataset of 1719 donor materials.59 They tested different inputs, including seven types of molecular fingerprints, two types of descriptors, ASCII strings and images. Donor materials were classified into two categories, "low" and "high" PCE. Models developed using fingerprints showed the best performance in predicting the PCE class (86.76% accuracy). They verified the ML results by synthesizing 10 donor materials; the model classified eight of these molecules into the correct category, so the experimental results were in good agreement with the predictions. However, classification into just two categories (0–2.9% and 3–14.6%) is very easy compared with predicting the PCE of individual semiconductors, and the second category is too wide for the results to have much practical value.
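A stripped-down version of such a fingerprint-based "low"/"high" classifier, using random bit vectors with a planted structure–activity rule instead of real donor fingerprints, can be sketched as:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy fingerprint matrix (random bits) with a planted structure-activity rule
rng = np.random.default_rng(3)
X = rng.integers(0, 2, size=(400, 64))
# Label "high PCE" (1) when two particular bits co-occur, otherwise "low" (0)
y = (X[:, 10] & X[:, 42]).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=3)
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
print("cross-validated classification accuracy:", round(float(acc), 3))
```

Because the planted rule is deterministic, the forest recovers it easily; with real, noisy PCE labels the accuracy would of course be lower, as in the study above.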
In the same year, Saeki et al. extracted 2.3 million molecules from the Harvard Clean Energy Project database.60 1000 molecules were selected on the basis of calculated PCE. They used molecular access system (MACCS) fingerprints and extended connectivity fingerprint (ECFP6) keys to train the ML model, and 149 molecules were selected by random forest (RF) screening (Fig. 6). The RF method showed a PCE prediction accuracy of 48%. They selected one polymer on the basis of synthetic feasibility, but a solar cell device based on the new polymer showed a PCE of 0.53%, much lower than the RF prediction (5.0–5.8%). There are two reasons behind this failure. Firstly, the PCE used to train the RF model was calculated from the Scharber model, whose performance is very poor. Secondly, the structures of polymer donors reported in the literature are more complex than those of the semiconductors in the HCEP database. Even if we ignore these factors, the PCE prediction accuracy of the RF model is very low. Therefore, the ML model should be more accurate, and multiple materials should be synthesized for experimental validation.
Fig. 6 Scheme of polymer design by combining RF screening and manual screening/modification. Adapted with permission from ref. 60. Copyright 2018, American Chemical Society.
Schmidt and co-workers collected a dataset of 3989 monomers and trained a model using a grammar variational autoencoder (GVA).61 The trained model can predict the lowest optical transition energy and the lowest unoccupied molecular orbital (LUMO) energy without knowledge of the atomic positions. Moreover, it can generate new molecular structures with desired LUMO and optical gap energies. The prediction accuracy of a deep neural network (DNN) was higher than that of the GVA; however, the DNN requires DFT calculations to find the atomic positions needed to predict the LUMO, so in the case of the DNN model it is impossible to skip the DFT calculations.
Paul et al. used extremely randomized trees to predict the HOMO values of donor compounds.62 Their proposed models showed better results than neural networks trained on molecular fingerprints, SMILES, Chemception and molecular graphs.
Peng and Zhao used convolutional neural networks (CNNs) to construct generative and prediction models for the design and analysis of non-fullerene acceptors.63 Different molecular representations were used, such as extended-connectivity fingerprints, the Coulomb matrix, molecular graphs, bag-of-bonds, and SMILES strings. The depth of the convolutional layers influences the diversity of the generated NFAs, and quantum chemistry calculations were performed to verify the predicted molecules. In the prediction model, dilated convolution layers are adopted for feature extraction, and an attention mechanism is used as an interpretable module. The authors concluded that a graph representation is better than a string representation.
In most experimental studies, donor and acceptor materials are optimized separately, but optimizing only one of the two components of the cell results in a limited exploration of the space of combinations. Troisi used ML to answer the question of whether the components can be optimized separately or should be optimized simultaneously.64 Using molecular fingerprints as input, combinations of 262 donors (D) and 76 acceptors (A) were collected from the literature, and the PCE of BHJ solar cells based on these donor–acceptor combinations was predicted. A high accuracy (r = 0.78) was obtained even though the data set was small, and the best combinations were proposed for experimental investigation. Min et al. reported excellent work along the same lines. They trained five ML models using linear regression (LR), multiple linear regression (MLR), boosted regression trees (BRT), RF, and ANN algorithms, with 565 D/A combinations collected from the literature as the training data set.65 For polymer–NFA OSC devices, the correlation between D/A pairs and the PCE prediction was validated. The BRT and RF models showed the highest prediction ability, with r values of 0.71 and 0.70, respectively, and were used to predict the PCE of >32 million D/A combinations. Six D/A pairs were selected and incorporated into OSC devices, and the experimental PCEs were close to the predicted ones. All the synthesized non-fullerene acceptors belong to the high-performing Y6 series. The workflow of the whole study is summarized in Fig. 7.
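The key representational choice in these studies, encoding each device as the concatenation of its donor and acceptor fingerprints, can be sketched as follows; random bit vectors and an invented PCE rule stand in for the literature data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
n_pairs = 250
donor_fp = rng.integers(0, 2, size=(n_pairs, 64))      # toy donor fingerprints
acceptor_fp = rng.integers(0, 2, size=(n_pairs, 64))   # toy acceptor fingerprints

# Represent each device by the concatenated donor + acceptor fingerprint
X = np.hstack([donor_fp, acceptor_fp])

# Invented PCE rule that depends on both components, mimicking D/A interplay
pce = (5.0 + 2.0 * donor_fp[:, 0] + 3.0 * acceptor_fp[:, 1]
       + 2.0 * donor_fp[:, 2] * acceptor_fp[:, 3]
       + rng.normal(scale=0.3, size=n_pairs))

r2 = cross_val_score(GradientBoostingRegressor(random_state=5), X, pce, cv=5).mean()
print("cross-validated R2 for D/A-pair model:", round(float(r2), 2))
```

Because the model sees both halves of the pair at once, it can in principle learn interaction effects (the product term above) that separate donor-only or acceptor-only models would miss, which is exactly the motivation for simultaneous optimization.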
Fig. 7 Workflow of building, application, and evaluation of machine learning methods. (a) Scheme of collecting experimental data and converting chemical structures to digitized data. (b) Scheme of machine training, predicting, and method evaluation. Adapted from ref. 65 under a Creative Commons Attribution 4.0 International License.
Fig. 8 Structure of the convolutional neural network (CNN). Reprinted with permission from ref. 66. Copyright 2018, Wiley-VCH Verlag GmbH & Co. KGaA.
Data set | Source | Input | Method | ML models | Performance* | Experimental validation | Ref. |
---|---|---|---|---|---|---|---|
*PCE unless mentioned otherwise. **Mean absolute errors. ***Between ML-predicted and Scharber-model-calculated PCE. AOC = accuracy of classification. |
2.3 million | HCEP | Descriptors | Regression | Linear regression | 0.84 R2 | No | 45 |
111000 | Literature + database | Descriptors | Regression | RF | HOMO(0.85), LUMO (0.94) R2 | No | 46 |
161 | Literature | Descriptors | Regression | CasSVM | 0.96 R2 | No | 47 |
51000 | HCEP | Fingerprints | Regression | GPR | 0.43 (r) | No | 20 |
2.3 million | HCEP | Fingerprints | Regression | GPR | 0.65 (r) | No | 21 |
1719 | Literature | Fingerprints | Classification (two groups) | RF | 86.67% AOC | Yes (successful) | 59 |
1000 | HCEP | Fingerprints | Classification (four groups) | RF | 48% AOC | Yes (failed) | 60 |
3989 | Quantum-Machine.org | Fingerprints | Regression | Deep tensor network | HOMO (45), LUMO (31)** | No | 61 |
350 | HOPV15 | Fingerprints | Regression | Decision trees | HOMO (0.74) R2 | No | 62 |
51 000 | HCEP | Fingerprints | Regression | CNN | 0.91 (r)*** | No | 63 |
320 | Literature | Fingerprints | Regression | KRR | 0.78 (r) | No | 64 |
565 | Literature | Fingerprints | Regression | BRT | 0.71 (r) | Yes (successful) | 65 |
5000 | HCEP | Image | Classification (two groups) | CNN | 91.02% AOC *** | No | 66 |
270 | Literature | Microscopic properties | Regression | GB | 0.79 (r) | No | 67 |
300 | Literature | Microscopic properties | Regression | GBRT | 0.78 (r) | No | 68 |
566 | Literature | Microscopic properties | Regression | k-NN | 0.72 (r) | No | 69 |
290 | Literature | Microscopic properties | Regression | GBRT | 0.80 (r) | No | 70 |
249 | Literature | Microscopic properties | Regression | KRR | 0.68 (r) | No | 19 |
2.3 million | HCEP | Energy level | Scharber's model | No | 71 | ||
380 | Auto-generation | Energy level | Scharber's model | No | 72 | ||
135 | Literature | Energy level | Regression | RF | 0.80 R2 | No | 73 |
124 | Literature | Energy level | Regression | RF | 0.77 R2 | No | 74 |
121 | Literature | Energy level | Regression | RF | 0.77 (VOC) R2 | No | 75 |
70 | Literature | Energy level | Regression | RF | 0.69 R2 | No | 76 |
1800 | Simulation | Simulated properties | No | 77 | |||
65000 | Simulation | Simulated properties | Classification | CNN | 95.80%(JSC) AOC | No | 78 |
20000 | Simulation | Simulated properties | ANN | Yes | 79 |
Ma et al. used RF and gradient boosting regression tree (GBRT) algorithms to predict device characteristics (VOC, JSC, and FF) from microscopic properties.68 JSC (r = 0.78) and FF (r = 0.73) showed strong correlations with PCE, whereas VOC (r = 0.15) showed a very weak correlation with PCE; these results are consistent with recently reported ones.60 JSC and FF were found to be poorly correlated with each other (r = 0.33), with almost no correlation between VOC and JSC (r = −0.18) or between VOC and FF (r = −0.09).
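The correlation coefficients quoted above are plain Pearson r values between device parameters. The snippet below shows the computation on synthetic device records; the numbers are illustrative and are not the data of ref. 68:

```python
import numpy as np

# Synthetic device records; the numbers are illustrative only
rng = np.random.default_rng(11)
n = 270
jsc = rng.normal(18.0, 3.0, n)      # mA cm^-2
voc = rng.normal(0.85, 0.05, n)     # V
ff = rng.normal(0.65, 0.05, n)
pce = jsc * voc * ff                # % under 100 mW cm^-2 illumination

def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

for name, param in [("JSC", jsc), ("VOC", voc), ("FF", ff)]:
    print(f"r(PCE, {name}) = {pearson_r(pce, param):+.2f}")
```

In this toy model JSC dominates the correlation with PCE simply because its relative spread is the largest, which mirrors the qualitative pattern reported for real devices.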
To analyze the effect of descriptor type on the prediction ability of ML models, Troisi et al. trained k-nearest neighbors (k-NN), kernel ridge regression (KRR) and support vector regression (SVR) models on data for 566 donor/acceptor pairs collected from the literature.69 Both structural (topological) and physical descriptors (energy levels, molecular size, light absorption and mixing properties) were used. The structural descriptors contributed most to the ML models. Some physical properties showed a high correlation with PCE but did not improve the models' prediction ability, because the information they carry was already encoded in the structural descriptors.
To achieve push–pull conjugated systems, different types of building blocks, such as electron-deficient, electron-rich and π-spacer units, are used to design organic semiconductors. Ma et al. performed ML modelling to screen 10000 molecules designed from 32 building blocks,70 with the aim of understanding the effect of the nature and arrangement of the building blocks. Descriptors were calculated from the ground and excited states of the candidate molecules, and 126 potential candidates with predicted efficiencies ≥8% were selected on the basis of GBRT and ANN models. This approach is efficient for screening potential candidates for organic solar cells.
Troisi et al. used electronic and geometrical properties to train ML models to predict device parameters; the models performed much better than the Scharber model (Pearson's coefficient (r) = 0.68).19
In organic solar cells, the thermodynamics of mixing of the active layer materials controls the evolution of the film morphology and, consequently, the charge transport and light harvesting, and thus the overall performance and stability of the final device.80,81 It is therefore important to study the relationship between molecular interaction parameters and the phase behaviour of thin films. For this purpose, Perea et al. used an ANN model together with Flory–Huggins solution theory to study the phase evolution of polymers and fullerenes.82 The ANN model was used to predict solubility parameters from the surface charge distribution. Combined with the solubility parameters, a figure of merit was established to describe the stability of polymer–fullerene blends (Fig. 9).
Fig. 9 Computational flowchart describing the routine for determining the relative stability capable of describing the microstructure of polymer:fullerene blends. (i) Creation of the σ-profile from the conductor-like screening model (COSMO); (ii) σ-moments as extracted from COSMO are fed into an artificial neural network (ANN) to determine Hansen solubility parameters (HSPs); (iii) HSPs are used to calculate the qualitative Flory–Huggins interaction parameters (χ1,2); (iv) implementation of moiety-monomer-structure properties (reduced molar volumes/weights); (v) spinodal demixing diagrams resulting from polymer blend theory; and (vi) figure of merit (FoM) defined as the ratio of the Flory–Huggins intermolecular parameter and the spinodal diagram forms the basis of a relative stability metric. Adapted with permission from ref. 82. Copyright 2017, American Chemical Society.
In 2017, Aspuru-Guzik's group investigated millions of molecular motifs using 150 million DFT calculations.71 The PCE was predicted using Scharber's model,18 with the calculated energy levels as input, and candidates with a PCE of more than 10% were identified. In the same year, Imamura et al. reported the automatic generation of thiophene-based polymers from donor and acceptor units, estimation of the orbital levels by Hückel-based models and an evaluation of the photovoltaic characteristics.72 The PCE was again calculated using Scharber's model, whose performance is very poor,19–21 and the molecular descriptors and microscopic properties of the semiconductors were totally ignored.
Min-Hsuan Lee performed random forest (RF) modelling on a database of >100 bulk heterojunction solar cells and achieved high prediction accuracy (R2 of 0.85 and 0.80 for the training and testing sets, respectively).73 In this study, the number of descriptors was small (HOMO, LUMO and band gap); including parameters such as solubility parameters, interaction parameters and surface energies as inputs could further enhance the usefulness of similar studies.
Various examples of ML applied to binary solar cells are discussed above. Generally, ternary OSCs show higher performance than binary OSCs, which suffer from insufficient light harvesting due to the narrow absorption range of organic semiconductors. In ternary OSCs, the third component, which can be a donor or an acceptor, not only works as an additional absorber to enhance photon harvesting but also helps to achieve a favourable morphology.84 Because ternary solar cells operate in a more complex way than binary ones,85,86 finding ideal third components for them is a challenging task. Min-Hsuan Lee trained ML models for ternary solar cells using random forest, gradient boosting, k-nearest neighbors (k-NN), linear regression and support vector regression. The LUMO value of the donor (D1) showed a noticeable linear correlation with PCE (r = −0.55), while the correlations of the other indicators with PCE were weak.74 The value of VOC correlated strongly with both the HOMO of the donor (r = −0.54) and the LUMO of the donor (r = −0.54), suggesting that the energy levels of the donor need further consideration to find the origin of VOC in ternary OSCs. The random forest model showed the highest R2 (0.77 on the test set) among all the ML methods. In another study, he trained an ML model to predict the VOC of fullerene-derivative-based ternary organic solar cells, using the same descriptors as in the previous study;75 here, too, the random forest model showed an R2 value of 0.77. In both studies, only the energy levels of the organic semiconductors were used as descriptors, and other molecular descriptors and the effect of thin-film morphology were ignored. Therefore, a hybrid modelling framework is required that includes thin-film characteristics (e.g., the appropriate ratio of the three components) and fabrication conditions (e.g., annealing temperature and solvent additives).
Optimization of all the factors can enhance charge generation and reduce voltage loss, and resultantly can improve device efficiency.87 Theoretical analysis of the morphology of the three components is much more complex than that of two components.
Tandem organic solar cells have shown superior power conversion efficiency (PCE). A tandem organic solar cell consists of two sub-cells, and the major purposes of this device architecture are to widen the photon response range and to suppress transmission and thermalization losses.88 Developing a relationship between efficiency and the physical properties of the active layer materials is more challenging here, because the great diversity of organic materials leads to many more candidate combinations. To address this problem, Min-Hsuan Lee used ML algorithms to predict the efficiency of tandem OSCs and to identify suitable bandgap combinations.76 Random forest regression with energy levels as input was used to predict the efficiency. The results indicate that the energy offset between the LUMO levels of the donor and acceptor materials should be optimized to improve electron transfer and device performance.
Fig. 10 (a) Simple sketch of the CNN architecture and (b) confusion matrix. Adapted from ref. 78 under a Creative Commons Attribution 4.0 International License.
MacKenzie et al. used a Shockley–Read–Hall based drift-diffusion model to simulate current–voltage (JV) curves.79 They generated a set of 20 000 devices and calculated electrical parameters such as carrier trapping rates, energetic disorder, trap densities, recombination time constants, and parasitic resistances. The simulated data were used to train a neural network, and the trained model was then used to study the effect of surfactant choice and annealing temperature on the charge carrier dynamics of some well-known OSC devices.
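The surrogate-model idea, a neural network trained on simulated device data, can be sketched roughly as follows. The toy generator below stands in for the drift-diffusion solver; the input names and the target function are assumptions for illustration, not the actual dataset of ref. 79.

```python
# Minimal sketch: fit a small neural network as a surrogate for a device simulator.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 2000
# Toy "physical" inputs (normalized): trap density, recombination time
# constant, series resistance, shunt resistance
X = rng.uniform(0.0, 1.0, size=(n, 4))
# Toy target: a smooth nonlinear function standing in for a simulated
# figure of merit extracted from JV curves
y = np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] - 0.5 * X[:, 3]

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(64, 64),
                                   max_iter=2000, random_state=0))
model.fit(X[:1500], y[:1500])
r2 = model.score(X[1500:], y[1500:])  # held-out accuracy of the surrogate
```

The appeal of this design is speed: once trained, the network evaluates in microseconds where each drift-diffusion simulation would take seconds or minutes.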
The solubility of the active layer materials in a specific solvent controls the film morphology and thus affects device performance. Risko et al. calculated the free energy of mixing using molecular dynamics (MD) simulations,95 and also used Bayesian statistics to estimate the same quantity. This approach is an effective and fast way to screen a large number of solvents and solvent additives.
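For orientation, the classical Flory–Huggins expression gives a quick back-of-the-envelope estimate of the free energy of mixing. This textbook formula is only a stand-in for the MD/Bayesian workflow of ref. 95, and the chain length and interaction parameter below are assumed values, not taken from that work.

```python
# Flory-Huggins free energy of mixing per lattice site (dimensionless, /RT).
import math

def delta_g_mix(phi1, n1, n2, chi):
    """DG_mix/RT = (phi1/n1)ln(phi1) + (phi2/n2)ln(phi2) + chi*phi1*phi2."""
    phi2 = 1.0 - phi1
    return (phi1 / n1) * math.log(phi1) + (phi2 / n2) * math.log(phi2) \
        + chi * phi1 * phi2

# Example: a polymer of 100 repeat units (n1) in a small-molecule solvent (n2 = 1),
# at 20% polymer volume fraction with an assumed chi of 0.4
g = delta_g_mix(phi1=0.2, n1=100, n2=1, chi=0.4)
# g < 0 here, i.e. mixing is thermodynamically favoured for these toy parameters
```

A negative value indicates favourable mixing; scanning chi across candidate solvents is the cheap screening step that the MD/Bayesian approach refines.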
Effective training of an ML model requires a large amount of data. Data availability is not a problem for fields such as image recognition, where millions of samples exist; in organic solar cells, however, datasets contain only hundreds or thousands of entries. Accuracy has been reported to increase with the number of data points (molecules).31,66,76 For ML models trained on descriptors related to power conversion processes, it is hard to assemble large datasets because the DFT calculations are time consuming. For limited datasets, meta-learning, whereby knowledge is learned within and across problems, is a promising solution; a Bayesian framework is also a good option. A dual strategy is required to maintain a balance between data availability and the predictive capability of models. Rather than affecting model precision directly, the effect of data size can be mediated by the degrees of freedom (DoF) of the model, giving rise to an association between precision and DoF. This concept originates from the statistical bias–variance trade-off. In this regard, Zhang and Ling proposed a strategy for applying machine learning to small datasets.96
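One of the small-data options mentioned above, a Bayesian approach in the form of Gaussian process regression, can be sketched as follows on synthetic data. Beyond a point prediction, it returns a predictive uncertainty, which is particularly valuable when only tens of molecules are available.

```python
# Gaussian process regression on a deliberately tiny, synthetic dataset.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(30, 1))             # only 30 "molecules"
y = np.sin(2 * X[:, 0]) + rng.normal(0, 0.05, 30)  # toy property with noise

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(),
                               normalize_y=True,
                               random_state=0).fit(X, y)
X_new = np.array([[0.5]])
mean, std = gpr.predict(X_new, return_std=True)  # prediction plus uncertainty
```

The predicted standard deviation can flag which candidate molecules the model is guessing about, which is exactly the information needed to direct the next round of experiments or DFT calculations.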
In this route, the first step is the collection of images, which will be manual because several images may exist for one compound under different experimental conditions; automatic image extraction and sorting would be very difficult, so human assistance is essential. The second step is task specification and analysis, for example choosing the data label or target property. Since the morphology of the active layer strongly influences the FF,99 it is more realistic to select FF as the target rather than PCE; FF, together with the other factors, can then be correlated with PCE. Another decision is between classification and regression: for small datasets classification is preferable, whereas regression suits larger ones. The third step is training the model and extracting patterns from the data to make predictions. The last step is experimental validation. Connecting images to performance will be a hard journey, but a fruitful one.
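The classification choice recommended above for small datasets might look like the following sketch, where FF is binned into "high" and "low" classes. The image-derived features and the labelling rule are synthetic placeholders, since no such experimental dataset exists yet.

```python
# Casting FF prediction as classification on a small, synthetic dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 120                                  # small dataset, typical of OSC studies
# Toy image-derived descriptors: mean domain size, domain purity, roughness
X = rng.uniform(0, 1, size=(n, 3))
# Toy labelling rule: finer, purer domains -> high FF (class 1)
y = ((X[:, 0] < 0.5) & (X[:, 1] > 0.4)).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
acc = cross_val_score(clf, X, y, cv=5).mean()   # cross-validated accuracy
```

Cross-validation rather than a single train/test split is the sensible scoring choice here, because with ~100 samples any one split gives a noisy estimate.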
Although no study has yet been reported in which experimental morphologies were used as input for machine learning analysis of organic solar cells, several studies have used simulated images and properties to train ML models; some are discussed in Section 4.6 and others are cited. These studies will pave the way toward the utilization of experimental morphologies in machine learning models.100
ML | Machine learning |
BHJ | Bulk heterojunction |
GBRT | Gradient boosting regression tree |
RF | Random forest |
SVM | Support vector machine |
ANN | Artificial neural network |
DNN | Deep neural network |
CNN | Convolutional neural network |
KRR | Kernel ridge regression |
SVR | Support vector regression |
GPR | Gaussian process regression |
SMILES | Simplified molecular-input line-entry system |
PCE | Power conversion efficiency |
VOC | Open-circuit voltage |
JSC | Short-circuit current density |
FF | Fill factor |
HOMO | Highest occupied molecular orbital |
LUMO | Lowest unoccupied molecular orbital |
DFT | Density functional theory |
r | Correlation coefficient |
R2 | Coefficient of determination |
HCEP | Harvard Clean Energy Project |
HOPV15 | The Harvard Organic Photovoltaic Dataset |
This journal is © The Royal Society of Chemistry 2021 |