Jun
Yan‡
a,
Xabier
Rodríguez-Martínez‡
*bc,
Drew
Pearce
a,
Hana
Douglas
a,
Danai
Bili
a,
Mohammed
Azzouzi
a,
Flurin
Eisner
a,
Alise
Virbule
a,
Elham
Rezasoltani
a,
Valentina
Belova
c,
Bernhard
Dörling
c,
Sheridan
Few
af,
Anna A.
Szumska
a,
Xueyan
Hou
a,
Guichuan
Zhang
d,
Hin-Lap
Yip
de,
Mariano
Campoy-Quiles
*c and
Jenny
Nelson
*a
aDepartment of Physics, Imperial College London, SW7 2AZ, London, UK. E-mail: jenny.nelson@imperial.ac.uk
bElectronic and Photonic Materials (EFM), Department of Physics, Chemistry and Biology (IFM), Linköping University, Linköping, SE 581 83, Sweden. E-mail: xabier.rodriguez.martinez@liu.se
cInstituto de Ciencia de Materiales de Barcelona, ICMAB-CSIC, Campus UAB, Bellaterra 08193, Spain. E-mail: mcampoy@icmab.es
dInstitute of Polymer Optoelectronic Materials and Devices, State Key Laboratory of Luminescent Materials and Devices, South China University of Technology, Guangzhou 510640, P. R. China
eDepartment of Materials Science and Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong
fSustainability Research Institute, School of Earth and Environment, University of Leeds, LS2 9JT, Leeds, UK
First published on 20th May 2022
Non-fullerene acceptors (NFAs) are excellent light harvesters, yet the origin of their high optical extinction is not well understood. In this work, we investigate the absorption strength of NFAs by building a database of time-dependent density functional theory (TDDFT) calculations of ∼500 π-conjugated molecules. The calculations are first validated by comparison with experimental measurements in solution and solid state using common fullerene and non-fullerene acceptors. We find that the molar extinction coefficient (εd,max) shows reasonable agreement between calculation in vacuum and experiment for molecules in solution, highlighting the effectiveness of TDDFT for predicting optical properties of organic π-conjugated molecules. We then perform a statistical analysis based on molecular descriptors to identify which features are important in defining the absorption strength. This allows us to identify structural features that are correlated with high absorption strength in NFAs and could be used to guide molecular design: highly absorbing NFAs should possess a planar, linear, and fully conjugated molecular backbone with highly polarisable heteroatoms. We then exploit a random decision forest algorithm to draw predictions for εd,max using a computational framework based on extended tight-binding Hamiltonians, which shows reasonable predicting accuracy with lower computational cost than TDDFT. This work provides a general understanding of the relationship between molecular structure and absorption strength in π-conjugated organic molecules, including NFAs, while introducing predictive machine-learning models of low computational cost.
Broader contextOrganic π-conjugated semiconductors (OSCs) work as the main light harvesters in organic photovoltaics (OPVs). Their synthetic versatility converts them onto suitable candidates for rational molecular design based on high-throughput screening techniques. Significant advances in the efficiency of OPVs (exceeding 19% in single junctions under 1 sun) have been made by trial-and-error with new but increasingly diverse materials, primarily non-fullerene acceptors (NFAs) and mostly owing to their high absorption strength. However, the reasons for that superior light harvesting performance remain elusive, thus preventing the molecular tailoring of NFAs with further enhanced light harvesting capabilities toward breakthrough OPV efficiencies. A statistical analysis of time-dependent density functional theory calculations and machine learning (ML) models reveal that molecular linearity, planarity, polarizability, and number of π-conjugated carbon atoms correlate strongly with the absorption strength of OSCs. A structure–absorption strength relationship is established to introduce design rules for highly absorbing OSCs. ML models, in combination with extended tight-binding Hamiltonians, are shown to predict the absorption strength of OSCs. As a result, this work contributes to an improved understanding of the absorption strength of π-conjugated organic molecules in general while suggesting ways to design highly absorbing NFAs that maximize the light harvesting capabilities for solar energy conversion. |
Fig. 1 (a) Molecular structures of typical organic acceptors, including PC61BM, PC71BM, O-IDFBR, O-IDTBR, ITIC, IT-4F, IDIC, IEICO, IEICO-4F, Y5, Y6, and Y7. (b) Refractive index and (c) extinction coefficient of a larger set of typical organic acceptor thin films measured using VASE. (d) Experimental εd,max in solution versus calculated εd,max in vacuum using TDDFT of a set of ∼80 π-conjugated molecules. (e) Estimated experimental εd,max in film (solid state) versus that in solution using eqn (3). Panel (d) contains a subset of well-known NFA molecules that are highlighted in colour. All TDDFT results in panel (d) were performed using the functional B3LYP and basis set 6-311+G(d,p), except for the ones (grey squares) taken from ref. 69 that are based on the LRC-wPBEh functional and 6-311+G(d) basis set. We also note here that the side chains of molecules are replaced by H atoms or methyl groups in the calculations as they are computationally expensive and do not contribute to the π-conjugation, hence electronic transitions.29 The experimental data of εd,max in film are converted from maximum values of extinction coefficients shown in panel (c) using eqn (3), while solution data are collected from literature, noting that different values may be present for the same material as retrieved from different sources. Grey dashed lines indicate the perfect match between x and y axis. The data required for generating panels (d) and (e) in this figure are presented in the ESI.† |
Excited state calculations based on quantum chemistry methods, such as time-dependent density functional theory (TDDFT),8,29,31–33 Hartree–Fock method,34ab initio Monte Carlo method,35 second order Møllier–Plesset theory (MP2),36 and coupled cluster method,37 have been applied to predict the electronic and optical properties of molecules. Among them, TDDFT is the most widely applied method for excited state calculations, and has shown reasonable accuracy in calculating and predicting the trends in absorption strength of organic molecules,29,31 as also demonstrated in this work. However, the rapid scaling of computation time with molecular size has been the real obstacle limiting the applicability of TDDFT for excited state calculations on molecules with hundreds of atoms. Given the size and diverse structure of modern NFAs, faster and more efficient methods are therefore needed to establish the relationship between excited-state and molecular properties in NFAs.
The emergence of artificial intelligence (AI) has made it possible to study quantitative structure–property relationships (QSPRs) in molecules with massively improved computational efficiency. As the most popular branch of AI, machine-learning (ML) has attracted much attention in materials science over the last decade, and has been widely applied for material property prediction and material discovery.38–41 Recently, ML has also gained popularity in OPV scenarios,42–52 yet existing ML studies related to OPVs have been primarily focused either on the energetics42,43,53–56 or directly on PCE,42,48,57–64 with little attention paid to the absorption strength of the photoactive materials.65,66 Moreover, there are no ML studies explicitly focused on the absorption strength of NFAs beyond the identification of moieties of frequent appearance in highly absorbing molecules.42 However, QSPR and ML models have been successfully applied to investigate the absorption strength of fluorophores or dyes typically employed in bioimaging, showing encouraging results.30,67,68 Therefore, it is appealing to apply ML methods in combination with QSPR models to investigate the origin of the large absorption strength in state-of-the-art NFAs.
Here, we present an experimental, TDDFT, QSPR, statistical and ML study of the absorption strength of NFAs to identify the key chemical and structural features that lead to high optical absorption in state-of-the-art NFAs. We exploit a database of nearly 500 unique organic molecules (or 3500 calculations) generated using DFT and TDDFT over several years. We obtain good quantitative agreement between TDDFT calculations of absorption strength and experimental values for state-of-the-art NFAs and fullerenes, which supports the use of TDDFT results for further statistical and QSPR modelling. Accordingly, we extract molecular information from the DFT-optimized geometries by computing nearly 6000 molecular descriptors and first looking for correlations with the absorption strength. The strongest correlations are found between experimentally measured maximum molar extinction coefficient (εd,max) and two main molecular descriptors from calculations: λ1,p and C2SP2, which describe the size of the molecule in the direction of maximal atomic polarizability, and the number of sp2 hybridized carbon atoms that are bound to two other carbons (C2), respectively. These quantities can be related to a few key material features leading to high absorption strength: linearity, planarity, and extension of the π-conjugation in the form of fused and closed-ring moieties, in good agreement with previous ML reports on fluorophores and dyes.30 We further identify several moieties and paired combinations thereof that are frequently found in highly absorbing NFAs, corresponding to thieno[3,2-b]thiophene (TT), thiophene (T), 2-(5,6-difluoro-3-oxo-2,3-dihydro-1H-inden-1-ylidene)malononitrile (2FIC), 2-(3-oxo-2,3-dihydro-1H-inden-1-ylidene)malononitrile (IC) and indaceno[1,2-b:5,6-b′]dithiophene (IDT). These form a catalogue of molecular design rules to further enhance the absorption strength of organic π-conjugated molecules, such as next-generation NFAs. We then train and test an ensemble learning method, namely a random decision forest (RF), to predict εd,max and provide further information about the most important features in the modelling of absorption strength in organic π-conjugated molecules. Finally, we explore the possibility to predict εd,max while using a cheaper molecular geometry optimization method based on semiempirical extended tight-binding (xTB) Hamiltonians instead of the expensive DFT approach. We do so by training a RF with our TDDFT database and proving its predictive properties in terms of εd,max when interpolated using xTB-optimized geometries. This approach shows application potential in high-throughput screening studies in combination with generative molecular models.
As a metric for absorption strength, we initially consider several candidates such as the oscillator strength (fosc), the absorption coefficient (α) or the imaginary part of the dielectric function (ε2). In this work, we eventually focus on the maximum molar extinction coefficient (εd,max, M−1 cm−1) of NFAs as it shows the best agreement between experimental and theoretical data, as we demonstrate below. εd,max constitutes a typical experimental measurement in solution that can also be accessed from myriad literature references. Note that the usual calculations based on single molecules using TDDFT cannot account for solid state effects as they are performed for isolated molecules in vacuum or surrounded by an isotropic medium (such as a solvent using the polarizable-continuum-solvent-model, PCM, Fig. S2, ESI†). The derivation of the theoretical εd is provided in the Methods section, which results in a mathematical expression for εd,max as
(1) |
The experimental εd,max from solution can be obtained using the optical density (OD) measurements performed using UV-visible spectroscopy, via
(2) |
(3) |
Fig. 1d presents the results of the comparison between experimental εd,max in solution and theoretical εd,max calculated from single molecules using TDDFT in vacuum. A brief discussion of the solvent effect on the absorption strength and the reasons why we choose a vacuum medium are provided in Fig. S2 (ESI†). Despite the scattering of data points, we observe the occurrence of a monotonic relationship between solution and calculated εd,max with a Pearson correlation coefficient (r) of 0.77. Interestingly, such correlation is no longer observed when quantifying the absorption strength in terms of αmax neither when adding further data points from literature on π-conjugated fluorophores to our statistical analysis (Fig. S3a (ESI†), r = 0.30), which is believed to be caused by the differences in molecular weight; in that case, only εd,max is found to follow a monotonic trend (Fig. S3b, ESI†). Some of the material assumptions on refractive index and density required to obtain αmax values might be responsible for the observed mismatch. It is worth noting that, expectedly, the correlation between solid state (film) and solution (r = 0.66, Fig. 1e) or calculated εd,max (r = 0.61, Fig. 1e) is not as good as that from solution data versus calculated εd,max (r = 0.77, Fig. 1d, neither for αmax as shown in Fig. S4, ESI†). Such discrepancy is attributed to solid-state effects such as aggregation effects,15 intermolecular orientation,70,71 and side chain interactions,72 which are not considered in single molecule excited state calculations.29 The observed trend that a highly absorbing material in solution will produce highly absorbing films is, nonetheless, generally valid and thus solution data are relevant for devices. Since the NFAs analysed here have a rather similar number of π-electrons (nπ), the corresponding εd,max per π-electron (Fig. S6, ESI†) shows a similar trend as that in Fig. 1d, e, and Fig. S4 (ESI†). Despite the simplicity of single molecule excited state calculations, these data show that using TDDFT calculations of the excited state to deliver εd,max can provide a reasonably good approximation to experimental measurements. Moreover, dealing with TDDFT calculations gives us room to correlate key molecular properties, such as molecular size and shape (aspect ratio), linearity, planarity, grafted side chain positions, or functional groups, to the absorption strength using molecular descriptors. These observations provide a foundation from molecular structures to identify the origin and further extend the high optical extinction of NFAs through chemical design rules, as we show in the upcoming sections.
Then, we introduce several target features related with absorption strength, starting from the maximum oscillator strength of any calculated transition (fmax); the maximum oscillator strength of any transition in the visible electromagnetic window (herein constrained between 300–1200 nm or 1–4 eV for its relevance in solar energy harvesting applications) (fmax,vis); and the sum of oscillator strengths of all transitions in the visible window, fsum,vis. These three features are also evaluated per nπ for the molecule, i.e., fmax/nπ, fmax,vis/nπ and fsum,vis/nπ. We then consider the maximum absorption coefficient (αmax) obtained using eqn (1) and (3); the maximum of the imaginary part of the dielectric function (ε2,max);29 and εd,max. Finally, we compute the spectral overlap between the OD (dα(E), where d is set to a typical film thickness value of 100 nm and α(E) derives from the Gaussian-broadened spectrum of f in the visible spectral range taking a standard deviation of 0.1 eV) and the AM1.5G solar photon flux spectrum (ΦAM1.5G), namely .
These features, together with their corresponding histograms (Fig. S14, ESI†) in terms of Spearman's rank correlation coefficients (ρ), are explained in more detail in Note S2 (ESI†). Molecular descriptors are calculated using up to four different open-source packages73–76 (Note S2, ESI†) to generate a (curated) collection of 3239 entries (including 40 electronic descriptors derived from the TDDFT calculations, namely the energy of the molecular orbitals ranging from HOMO−19 to LUMO+19). Then, we scan for statistical correlations between those descriptors and all target features introduced above, from which we consider as highly correlated descriptors those showing ρ ≥ 0.7 as threshold. However, since some descriptors are calculated in groups or families where weighting factors are varied among atomic masses, van der Waals volumes, electronegativities, ionization potentials or polarizabilities, we usually encounter sets of multicollinear descriptors that show very similar trends with respect to the target feature. Accordingly, to drop redundant (collinear) descriptors we classify them into clusters to select the most representative candidate of each bundle (i.e., cluster). This serves us to simplify the identification of characteristic and well-correlated descriptors families. The clustering algorithm applied to analyse multicollinear descriptors based on ρ and r values is further described in Note S3 (ESI†).
After running the clusterization of descriptors on all target features, we identify strong correlations with molecular descriptors for fmax, fsum,vis and εd,max (i.e., implicitly fmax,vis). For the remaining target variables (foverlap, αmax, ε2,max, fmax/nπ, fmax,vis/nπ and fsum,vis/nπ), we do not identify molecular descriptors with ρ above the threshold value (0.7) and they are generally below 0.6 units, see Fig. S14 (ESI†). The lack of correlation for foverlap could be justified by the existence of a gas-to-solid shift in the corresponding absorption spectrum, which prevents proper matching of the Gaussian-broadened absorption features with the solar photon flux. Regarding αmax and ε2,max, the estimation of these values from TDDFT calculations requires taking generalized assumptions on several materials properties (such as density or refractive index) that might be enough to disturb the underlying trends in our heterogeneous material database. For the quantities normalised by the number of pi electrons, i.e. fmax/nπ, fmax,vis/nπ and fsum,vis/nπ, the weak correlation is expected since normalization tends to deviate from linear correlations depending on the straightness of the molecule.29 Due to the strong correlation between size of the molecule and oscillator strength as discussed below based on C2SP2, the normalised quantity is believed to be a secondary factor, therefore not clear correlations are observed. In the successful correlation cases (i.e. fmax, fsum,vis and εd,max) and with the given thresholds of 0.7 units for ρ and r, we identify a single feature cluster lead by the λ1,p descriptor in the case of fmax and εd,max (Fig. 2a). For fsum,vis, a threshold ρ of 0.68 reveals C2SP2 as a rather descriptive molecular feature (Fig. 2b). Interestingly, C2SP2 is also found in the main cluster represented by λ1,p in fmax and εd,max, and we could not identify any strong correlations between the absorption strength (in any of its proposed metrics) and electronic descriptors (from HOMO−19 to LUMO+19 energy levels). Note that εd,max values in excess of 2.5 × 105 M−1 cm−1 in Fig. 2a and b are mostly attributed to artificially straight conjugated oligomers with >10 monomers contained in our database, for which the straightness, hence high εd,max, are unlikely to be maintained in the experimental solid state scenario. In fact, only the exemplary and asymmetric NFA known as BDTP-4F (inset of Fig. 2a)77,78 surpasses that threshold with a record εd,max in our NFA dataset (2.7 × 105 M−1 cm−1, and 2.4 × 105 M−1 cm−1 measured in CHCl3 solution).77
λ 1,p is part of a bundle of three-dimensional molecular size and shape descriptors known as weighted holistic invariant molecular (WHIM) descriptors.79–81 These can be interpreted as a generalized search for the principal axes with respect to a defined atomic property.82 In this particular case, λ1,p is obtained by performing a principal component analysis (PCA) on the centred atomic coordinates of the molecule using a covariance matrix (sjk) that is weighted by the atomic polarizabilities (pi):
(4) |
To further interpret these two magnitudes (λ1,p and C2SP2) as the main correlated descriptors with εd,max and fsum,vis, we inspect the DFT-optimized geometries of archetypal NFAs (Fig. 2c). The observed trend suggests that optical extinction monotonically increases with λ1,p (Fig. 2a) in molecules having most of their polarizable atoms arranged along a main axis, i.e., linear molecules. While CBM shows large torsion angles mainly affecting the 2,1,3-benzothiadiazole (BT) moieties (thus making the molecule non-planar and increasing λ3,p, see Fig. S15, ESI†), Y6 shows a characteristic curved geometry that limits its εd,max despite showing improved planarity. The NFA with the highest λ1,p (NIBT) shows both linearity and planarity, with most of the more polarizable atoms (mainly C and S) lying along the principal polarizable axis of the molecule. Thus, in terms of molecular geometry, the absorption strength of NFAs could be further enhanced by distributing most of the atomic polarizability along a main axis while keeping good planarity and minimizing curvature. However, λ1,p is not the sole molecular descriptor governing absorption strength, as BDTP-4F shows ca. 40% lower λ1,p (0.75 nm2) yet ca. 40% higher εd,max than NIBT (Fig. 2a), which suggests that the molecular symmetry of NFAs could be another important factor affecting εd,max. Our preliminary investigations on this issue indicate that molecular asymmetry, as quantified by the WHIM symmetry index Gu, might drive absorption strength higher (Fig. S16a, ESI†), yet we require a larger NFA database including more asymmetric molecules to further explore such an observation. Also, we acknowledge that this observation might be biased by the systematic omission of side chains in the TDDFT calculations. By comparing λ1,p in a selection of small molecule acceptors geometrically optimized with and without side chains (Fig. S17a, ESI†), we observe that in most cases the addition of side chains either decreases λ1,p slightly or keeps it invariant. Still, the positive correlation of λ1,p with respect to εd,max is maintained (Fig. S17b, ESI†). Furthermore, the presence of naphthalene imide derivatives in the molecular structure of NIBT could be hindering further increase of the absorption strength with λ1,p, as suggested by our statistical analysis of frequent moieties in the selection of good light harvesters (presented in the next section). On the other hand, an increase of nπ in the molecule in the form of closed-ring conjugated moieties will systematically increase C2SP2 and accordingly fsum,vis. These findings support the previously known design rules in terms of molecular linearity and π-conjugation enabling large oscillator strength in organic small molecules and polymers, and are consistent with a recent study on chromophores.30 In particular, trans-conjugated polymer stereoisomers are known to possess higher optical extinction due to their increased straightness and persistence length,29 which agrees with our observations on exemplary curved (Y6) and more linear (NIBT) NFAs.
The energy of the first optical transition (E1) is also of practical importance in light harvesters such as NFAs as the lower energy part of the solar spectrum, down to ∼1 eV, contains a higher photon flux density. Our results show the number of heteroatoms in the molecule as the most correlated feature with (ρ = −0.72, Fig. S18a, ESI†) while forming a single feature cluster, yet neither λ1,p nor C2SP2 show strong correlations with E1. This fact prevents the introduction of molecular design rules targeted at E1 using λ1,p or C2SP2. However, we acknowledge a negative correlation between E1 and fosc,max among common NFAs that suggests further room for absorption strength increase as E1 is reduced (Fig. S19, ESI†).
We further study the existing correlation between pairs of moieties to understand in which way the different molecular fragments should (or should not) be combined to retrieve highly-absorbing molecules. Our analysis starts by creating molecular subsets determined by the presence of a given moiety, which acts as source node (coloured in black) in the network graph shown in Fig. 3b. Within that subset, we identify the high-absorbing molecules (fosc,max > 2.5) and compute the Z-scores of their moieties (child nodes, coloured in grey in Fig. 3b) with respect to the molecules of the entire molecular subset. As per the network shown in Fig. 3b, the absolute Z-scores will determine the width of the edges connecting the nodes (moieties) and its sign the colour of the edge (green for positive Z-score [overrepresentation] and red for negative Z-score [underrepresentation]). Therefore, green and thick edges connect pairs of molecules that are more frequently found in high-absorbing molecules whereas thick and red edges indicate combinations of moieties that lead to less absorbing molecules. In this analysis, we set up 8 different source nodes corresponding to the most overrepresented moieties observed in Fig. 3a. As a result, Fig. 3b can be interpreted as a catalogue of design rules relating pairs of moieties with high oscillator strength in π-conjugated small molecules.
Accordingly, we deploy a state-of-the-art ML method, namely a RF, to aid in both aspects: feature selection and building of εd,max models of higher accuracy. RFs constitute one of the simplest and most widely applied ML methods in molecular screening and data mining studies.30,43,46,86 They are particularly appealing for their straightforward implementation through open-source Python libraries such as Scikit-Learn,87 and also for their inherent robustness against overfitting and fast optimization. RFs are formed by an ensemble of decision trees (estimators) that are executed in parallel and independently from each other. Decision trees serve to classify data by starting from a single root node that is subsequently divided into child nodes, the latter being chosen randomly among the input features. At every node splitting step (i.e., decision making), the algorithm selects the pathway that minimizes the mean square error (MSE). Eventually, when every tree reaches its maximum extension (which is set arbitrarily via model hyperparameters), the predictions of all trees are averaged (ensembled), hence constituting the final predicted value of the RF. At this stage, myriad cross-validation (CV) techniques exist to evaluate the quality of the model and help in the tuning of hyperparameters. CV methods can estimate the ML model performance, evaluate potential over- or underfitting, and quantify how accurate the model is on drawing predictions on unseen data. In this work, we adopt two common cross-validation schemes, namely a repeated holdout CV; and a leave-one-out cross-validation (LOOCV). On the one hand, in a repeated holdout CV the pristine dataset is randomly split onto two distinct subsets, namely the training (here gathering 70% of the data) and testing (the remaining fraction of data, i.e. 30%) subsets. The model is trained and tested on the respective subsets, and the corresponding statistical metrics (R2, r, MSE, etc.) annotated. Eventually, the process is repeated k times (10-fold in this work), and all metrics are averaged to evaluate the ML model performance (its CV score). On the other hand, in a LOOCV the holdout process is taken to the extreme as the testing subset consists of a single data point while the remaining data is used in the training step. The process runs recursively for all data, thus eventually all data points are used for training and testing in the LOOCV protocol. Yet being computationally expensive, a LOOCV results in a more accurate estimate of model performance.
Table S1 (ESI†) includes the performance of an out-of-the-box RF model trained and cross-validated using 300 trees (estimators). Exemplary comparisons between the two previous baseline models (1-nearest neighbor and linear regression) and the out-of-the-box RF model are found in Fig. S20 (ESI†). The RF models indicate that scoring functions (R2, r) well above 0.6–0.8 are feasible upon careful feature selection and further optimization of the RF regressor. Feature selection in RFs is usually performed by filtering variables based on their feature importance, which is a metric that accounts for how much a feature decreases the weighted variance in the node splitting steps of the decision trees. This property enables feature ranking to then apply myriad algorithms to filter out the least important variables as seen by the RF regressor. In this work, we perform a recursive feature elimination (RFE) procedure to the initial library of 3239 descriptors as described in Note S2 (ESI†). In a RFE protocol, a significant fraction of the initial population of features is dropped in successive training steps of the RF ensemble. Features are dropped based on their corresponding feature importance until reaching an arbitrarily low number of input variables, hence simplifying the original model. Our RFE analysis shows that a threshold average R2 of 0.70 is achieved using a 12-variable model (R2 = 0.70 ± 0.05, r = 0.84 ± 0.03), which outperforms the RF model presented earlier while including a drastic reduction in the number of variables (from 3239 to 12). The sweet spot in model accuracy and number of degrees of freedom is found for the 10-variable model, which shows the maximum average Radj2 (0.67 ± 0.06).
Notably, a threshold R2 of 0.60 is already achieved training a 3-parameter RF model (R2 = 0.63 ± 0.06, Radj2 = 0.62 ± 0.06, r = 0.80 ± 0.03), which is particularly appealing given its simplicity. The resulting three-variable model includes one three-dimensional descriptor (λ1,v or WHIM_45, as computed by the RDKit library, Fig. 4a), one two-dimensional descriptor (CIC3, as computed by PaDEL software, Fig. 4b) and one electronic descriptor, in this case the energy level of the second molecular orbital below the frontier HOMO (HOMO−2, Fig. 4c). λ1,v refers to the first eigenvalue of the covariance matrix weighted by the atomic van der Waals volumes; thus, λ1,v is included in the multicollinear feature cluster represented by λ1,p that we previously and statistically identified, showing nearly perfect correlation (r = 0.99) with λ1,p. Accordingly, λ1,v can be exchanged by λ1,p without loss of performance in the RF model. This finding confirms that the linearity of the molecule (either quantified in terms of polarizabilities or van der Waals volumes) plays a key role in determining its absorption strength in the form of εd,max. On the other hand, CIC3 is a graph-based, third-order neighbourhood symmetry index82 which lacks a straightforward interpretation due to its mathematical complexity. We observe, however, that it linearly scales as log2A, with A being the total number of vertices (atoms) in the graph (molecule)82 thus likely reflecting the size of π-conjugation as per the characteristics of our dataset. The interpretation of HOMO−2 as an important descriptor is more challenging, and it is not possible to substitute it by a different descriptor without a noticeable drop in the model performance (excepting HOMO−1, which shows r = 0.96).
Interestingly, electronic descriptors (in particular) are required for the RF models to achieve their highest potential and scoring despite we have not observed strong correlations in our earlier statistical analysis. To probe it, we have performed the same RFE protocol yet skipping the set of electronic descriptors among the input features. Our results show that the top performing RF models (selecting 29 variables and getting R2 = 0.58 ± 0.06, Radj2 = 0.48 ± 0.07, r = 0.78 ± 0.04; or selecting 9 variables to obtain Radj2 = 0.52 ± 0.06, see Fig. S13, ESI†) are yet behind the scorings recorded when the electronic descriptors are included in the list of features. Note that the performance without electronic descriptors is lower than the 3-parameter model that includes HOMO−2 as descriptor, highlighting its positive effect on the performance of the RF regressor.
Molecular fingerprints have also been extensively exploited as input vectorial descriptors in statistical and ML models focused on feature prediction.42,88–90 Molecular fingerprints are usually represented as bit activation vectors of arbitrary length and degree of complexity, representing the absence or presence of certain molecular (bonding) pattern, moiety, functional group, or atom. In this work, we exploit the RDKit library to generate moiety fingerprints, MACCS keys, Morgan fingerprints, path-based or topological fingerprints, E-state fingerprints, and Coulomb vectors. These fingerprints are quickly computed and serve to complement and improve the learning process of the ML models employed herein.
To better analyse the influence of the different fingerprint vectors in improving the RF scoring, we trained and cross-validated the 3-parameter RF model previously found in combination with all fingerprint vectors generated. The results shown in Table S2 (ESI†) indicate that by adding a Morgan fingerprint vector of 64 bits to the initial set of input features the model performance can be substantially improved: R2 increases by 10% (relative), and r by another (relative) 5% (see Fig. 4d). Therefore, Morgan fingerprints are particularly suitable to fine-tune the training and prediction accuracy of εd,max in RF models although lacking of a straightforward physical interpretation. Additional refinement of the RF hyperparameters results in further improved models. We performed this optimization through a randomized search (in 350 iterations) of the hyperparameters controlling the number of estimators in the RF, the minimum number of samples per leaf node and the minimum number of samples required to split an internal node, which constitute the main adjustable hyperparameters of the RF algorithm. These results are shown in Table S3 (ESI†), together with the scoring obtained in a rigorous LOOCV of the optimized RF model (Fig. 4e). As an alternative ensemble of decision trees, we have also tested and optimized an Extra Trees (ET) regressor in Scikit-Learn. Its performance is, however, very close to that attained in the workhorse RF regressor (Table S3 and Fig. S21, ESI†).
Fig. 5 (a) ML workflow used in this work to draw εd,max predictions. A RF model is trained on TDDFT data and interpolated (validated) on xTB geometries, including also their corresponding molecular descriptors. To improve the accuracy of the model, energy levels obtained using the GFN2-xTB Hamiltonian require calibration with TDDFT values (Fig. S23, ESI†). (b) Leave-one-out interpolation of the resulting RF model using three input molecular descriptors (including calibrated energy levels) and a 64-bit Morgan fingerprint vector. |
Nevertheless, the dissimilarity between xTB- and DFT-optimized molecular geometries might have a direct impact on the value of the (three-dimensional) molecular descriptors, and hence on the final accuracy of the interpolated ML model if some of those are included. Accordingly, we have first quantitatively compared both sets of molecular (non-electronic) descriptors by computing r in all of them and found that the median of their distributions is very close to unity in all cases (Fig. S22, ESI†). Based on this finding, we proceed by training the RF model with TDDFT-derived descriptors and exploring how well the model interpolates when fed with xTB-derived descriptors. Fig. S23a (ESI†) shows a leave-one-out interpolation of a RF model trained using TDDFT data and interpolated on GFN2-xTB-optimized molecules, descriptors and energy levels.84,91,92 In this kind of model validation, all TDDFT data is used in the training step excepting that for a single molecule, for which we retrieve its corresponding xTB-optimized geometry and descriptors as the sole interpolation (testing) dataset; this procedure is subsequently repeated for all molecules. Thus, the model performance is assessed by comparing the actual TDDFT-derived εd,max of the molecules (x-axis in Fig. 5b) with that predicted by a RF model trained with TDDFT data and interpolated using xTB-derived descriptors (y-axis in Fig. 5b). This is useful to evaluate whether such RF model fed with TDDFT data could be exploited to predict εd,max in unseen molecules that are geometrically optimized through xTB Hamiltonians.
Our first model takes as inputs the three molecular descriptors found previously to be the most important features in the RF model together with their corresponding (64-bit) Morgan fingerprints. The scoring of the LOOCV in this preliminary model (R2 = 0.53, r = 0.74) is limited due to the existence of a mismatch between the absolute energy levels retrieved by either DFT (B3LYP) or GFN2-xTB methods (Fig. S23b, ESI†). Thus, the RF model trained on TDDFT data needs proper calibration of the energy levels obtained through GFN2-xTB, which we perform using either a linear regression, a support vector regressor (SVR) or an additional RF model (Fig. S23c, ESI†). By applying such calibration on the HOMO−2 energy levels, we obtain the champion RF model (R2 = 0.61, r = 0.78) shown in Fig. 5b using three molecular descriptors and a 64-bit Morgan fingerprint vector. Hence, Fig. 5b shows that molecular databases of xTB-optimized geometries could be exploited in combination with TDDFT-trained ML models to predict the absorption strength (εd,max) at significantly lower computational cost and with reasonable accuracy. The statistical analysis and ML modelling framework introduced here is thus expected to show large potential in the high-throughput screening of highly absorbing molecular candidates in combination with generative models (autoencoders and neural networks) as part of future work in the group.
Nevertheless, understanding how light harvesting alone can be maximized by smart molecular design is significant for improving several different aspects of OPV performance. Light absorption is the primary step towards charge generation and is therefore strongly related to the macroscopic short-circuit current density of the device. According to the reciprocity relation between absorption and emission,20 high absorption should in principle lead to strong emission, therefore reducing the nonradiative energy losses, and benefitting the open-circuit voltage. In addition, high absorption allows the fabrication of thin devices, therefore facilitating charge extraction and enhancing fill factor.93 Moreover, based on the causality principle, high absorption strength would lead to higher refractive index, which takes the first interference maximum of electric field to lower thicknesses, resulting in large light harvesting potential in thinner devices. Therefore, designing highly absorbing organic π-conjugated molecules has the potential to enhance different aspects relating to the performance of OPVs in conjunction with the proposed predictive ML model.
A separate aspect for future work is the impact of solid-state molecular interactions on light absorption. This paper concerns the optical absorption of isolated molecules while applications normally require thin films of molecules. Although intermolecular interactions can strongly impact the strength as well as the spectrum of thin film absorption,94 this has been neglected in the present study due to the lack of a suitable database of computations and the lack of solid state packing information. In the future, ML approaches could be used to better understand and predict how solid state interactions affect optical absorption, and thereby improve molecular design rules. Such advances may be enabled by the growing capability in computational structure prediction as well as improved understanding of the impact of intermolecular interactions on excited state properties.
Theoretical description of molar extinction coefficient (εd): to calculate the molar extinction coefficient εd, let us start with defining the absorption coefficient α in a quantum picture (we stay with SI units for the moment). The absorption coefficient for transition from state 1 to state 2 can be defined as19,96
(5) |
(6) |
(7) |
(8) |
(9) |
(10) |
(11) |
(12) |
I(d) = I0e−αd = I010−OD | (13) |
OD = ρεdd, | (14) |
(15) |
(16) |
(17) |
Converting complex refractive index from solid state ellipsometry measurements to εd: using ellipsometry measurements from film (solid state), we can extract the complex refractive index, η
η = n + iκ | (18) |
(19) |
(20) |
Footnotes |
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d2ee00887d |
‡ J. Y. and X. R.-M. contributed equally to this work. |
This journal is © The Royal Society of Chemistry 2022 |