Ke
Chen
ab,
Christian
Kunkel
ab,
Karsten
Reuter
ab and
Johannes T.
Margraf
*ab
aChair for Theoretical Chemistry, Catalysis Research Center, Technische Universität München, Lichtenbergstraße 4, D-85747 Garching, Germany. E-mail: margraf@fhi-berlin.mpg.de
bFritz-Haber-Institut der Max-Planck-Gesellschaft, Faradayweg 4-6, D-14195 Berlin, Germany
First published on 8th February 2022
The molecular reorganization energy λ strongly influences the charge carrier mobility of organic semiconductors and is therefore an important target for molecular design. Machine learning (ML) models generally have the potential to strongly accelerate this design process (e.g. in virtual screening studies) by providing fast and accurate estimates of molecular properties. While such models are well established for simple properties (e.g. the atomization energy), λ poses a significant challenge in this context. In this paper, we address the questions of how ML models for λ can be improved and what their benefit is in high-throughput virtual screening (HTVS) studies. We find that, while improved predictive accuracy can be obtained relative to a semiempirical baseline model, the improvement in molecular discovery is somewhat marginal. In particular, the ML enhanced screenings are more effective in identifying promising candidates but lead to a less diverse sample. We further use substructure analysis to derive a general design rule for organic molecules with low λ from the HTVS results.
Despite this success, there remains a gap between the small, rigid molecules in QM9 and technologically or pharmaceutically relevant compounds, which are often larger and much more flexible. Furthermore, the target properties of molecular discovery are in practice seldom simple electronic properties that are directly accessible through single-point DFT calculations. Instead, complex properties like the bulk electronic conductivity, pharmacological or catalytic activity of a molecule are ultimately of interest.10 Unfortunately, these are extremely complicated to rigorously simulate even for a single molecule. In high-throughput virtual screening (HTVS) studies, it has therefore become common to focus on simplified descriptors that are known to correlate with the property of interest.11–13 Such descriptors include, e.g., the binding energy of a key intermediate in catalysis or the internal reorganization energy (λ) in molecular electronics.
Measuring the energetic cost for charge-carriers to move between molecular sites,14,15λ provides an important contribution to the charge-carrier mobility in crystalline and amorphous organic semiconductors.16,17 While computational screening for low-λ molecular structures has successfully guided discovery,18 its sensitivity to small variations in molecular structure19 renders a targeted molecular design challenging. Fragment19–21 or rule-based22,23 design strategies have been proposed to tackle this problem, while virtual screening24–29 or data-efficient30,31 discovery were used to assess large molecular candidate spaces, albeit without fully capturing the underlying structure–property relationships.
A reliable ML-based prediction of λ could fill exactly this gap—providing significant speed-ups for the assessment of thousands of molecules while potentially allowing for the extraction of robust chemical rules by explainable AI.32 ML-based approaches were indeed recently successful for the prediction of λ for rigid molecules,33 while flexible molecules still pose a significant challenge,34 likely because λ simultaneously depends on two potential energy surfaces (see Fig. 1).
Fig. 1 Illustration of the adiabatic potential energy surfaces of neutral and cationic molecular states. The reorganization energy λ is here calculated from the four indicated points38 as λ = E0(R+) − E0(R0) + E+(R0) − E+(R+). Focusing on holes as charge carriers, E0 and E+ are the total energies of the neutral and cationic molecular states, evaluated at the equilibrium geometries R0 and R+ of the respective states. In practice, two equilibrium geometries thus need to be obtained. |
In this contribution we therefore critically study the ML prediction of λ (specifically for hole conduction) as a challenging problem for chemical machine learning. To this end, we present a new dataset of hybrid DFT-level reorganization energies for 10900 carbon and hydrogen containing molecules consisting of up to sixty atoms and five rotatable bonds. A series of Gaussian Process Regression (GPR)35,36 models are developed for this dataset, both for straightforward structure/property mapping and Δ-ML37 using a semiempirical baseline. We find that the conformational freedom of these molecules can introduce significant noise to this inference task, so that the performance of the models is strongly influenced by the conformer sampling method. We further show that significant improvements in the predictive performance are achieved by adopting the Δ-learning strategy. Finally, we critically evaluate the usefulness of the obtained ML methods for the discovery of low-λ structures in a diverse chemical space and for deducing molecular design rules.
While these molecules thus purposely cover a diverse molecular and conformational space, we note that—as with any enumerated chemical dataset—unstable and reactive systems could be contained and synthesizability should be assessed separately. All chemoinformatics-related tasks were carried out using RDKit 2019.09.03.39
For the lowest-energy conformers, reorganization energies were computed at the GFN1-xTB level (λGFN1). Note that GFN1-xTB was chosen instead of its successor (GFN2-xTB) because we found the former to be slightly more reliable in terms of predicting λ and molecular geometries for the systems considered herein (see Fig. S2 and S3†). Electronic descriptor values entering property-based ML models (as detailed in the Results section) were also extracted from results of these calculations. These include frontier orbital energies and their gaps, Fermi levels, total energies and vertical energy differences. Final target λDFT values were calculated at the B3LYP42–44 level of theory using the FHI-AIMS45 code, including the TS dispersion correction.46 Electronic wave functions were expanded in an extended “tier 1” basis set using “light” integration settings. Note that this level of theory is commonly employed for characterizing organic semiconductors, thus forming a good reference method for this study.19,25,28,47
ȳ(X*) = αK(X*, X), | (1) |
α = (K(X, X) + σn21)−1y | (2) |
In all models reported herein, the commonly used radial basis function (RBF) kernel is employed:
(3) |
A series of GPR models are presented herein, which differ in the type of representation and in how the covariance matrix is constructed. The most straightforward of these uses a representation of the molecular geometry of the lowest-energy conformer in the neutral charge state. This representation x(i)s is constructed in two steps. First, each atomic environment is encoded into a rotationally invariant local representation using the smooth overlap of atomic positions (SOAP)48 as implemented in Dscribe49 (see Fig. S4† for details). These atomic representations are then combined into molecular representations using the auto-bag method,50 which partitions the local feature vectors into kmax clusters using the k-means algorithm.51 Each molecular structure can then be encoded by a kmax-dimensional global feature vector that counts the occurrence of local environments that are assigned to each cluster. The effect of the hyperparameter kmax on the predictive performance is shown in Fig. S5,† arriving at a converged value of 500. Here, SOAP is only one of the possible choices for representing atomic environments. In fact, there is a range of modern many-body representations, which are closely related to each other and typically display comparable accuracy.52 To illustrate this we also considered the Many-Body Tensor Representation of Huo and Rupp.53 This indeed yields very similar predictive performance for structure based models (see Fig. S6†).
Note that above we introduced the subscript s to refer to the use of structure-based molecular representations and the corresponding baseline ML model is denoted with Ks. Furthermore, a model termed Kp based on electronic properties computed at the semiempirical GFN1-xTB level was developed, with the corresponding representation x(i)p (see below for details). Finally, a model Ksp is explored, that combines the two kernel functions as Ksp(i,j) = Ks(x(i)s,x(j)s) + Kp(x(i)p,x(j)p).
The hyperparameters θs = (σfs, ls, σn), θp = (σfp, lp, σn), and θsp = (σfs, σfp, ls, lp, σn) for the respective models are determined by maximizing their log-marginal likelihood over D using the L-BFGS algorithm with randomly sampled initial values. Our custom GPR model is based on respective code from the scikit-learn54 implementation.
It should be noted that the choice of the ML method can in principle have a strong influence on the predictive accuracy. For the case of molecular reorganization energies, Abarbanel and Hutchison therefore performed an extensive comparison of different regression approaches (e.g. using kernel, decision tree and neural network based methods), finding little difference between different ML approaches.34 To confirm this insensitivity, we also trained a decision tree based AdaBoost55 model on the current data set and indeed found little difference to the GPR approach used herein (see Fig. S7†).
This flexibility can influence the ML predictions of λ in two ways. First, the reference λ values may depend on the conformer, and flexible molecules display much larger conformational variety. Second, the ML prediction of λ is based on a representation derived from a 3D molecular geometry. For highly flexible molecules, we can expect significantly larger deviations between the geometries predicted with more approximate levels of theory and high-level references. This is known to impact the accuracy of ML models adversely.56 To arrive at an internally consistent procedure when comparing among different molecular systems, we therefore focus on the lowest energy conformers that we can identify for each molecular system.
Unfortunately, a full conformer search at the DFT level is prohibitively expensive. This means that we require a robust and efficient protocol for the search of low-energy conformers. To this end we rely on semiempirical and force-field methods from the GFN family, which have recently been established for this purpose. These are used in combination with CREST, which implements a purpose-built workflow for conformational search.41 Depending on the underlying energy function, the accuracy and computational cost of this search can vary significantly, however. We therefore tested three different workflows, denoted as conf1-3.
In our reference method (conf1), we employ CREST in combination with the density functional tight-binding method GFN1-xTB.40 Performing conformer searches for the 10 molecules of Fig. 2a, we find that between 3 and 90 conformers are identified within the default energy window of 6 kcal mol−1 (260 meV) above the lowest energy one, underscoring the conformational flexibility of molecules in our dataset. For these conformer ensembles, we show the wide range of encountered λDFT values in Fig. 2b. Importantly, there is little variation between the values of λDFT calculated for the lowest-energy conformers at the GFN1-xTB and DFT level, which suggests that GFN1-xTB conformers are a reliable proxy for the true first-principles ground state geometry. Note that the excellent agreement in Fig. 2b only reflects the quality of GFN1-xTB conformers, while all reorganization energies in this subfigure were calculated at the DFT level. Unfortunately, performing the full conformer search at the GFN1-xTB level is still computationally prohibitive for hundreds of thousands of molecules, however.
Alternatively, the significantly more efficient force-field method GFN-FF57 can be used, and the conformer search be accelerated using the ‘quick’ setting in CREST (herein termed conf2). For 100 randomly selected molecules, Fig. 2c shows a comparison of λGFN1 values for the lowest-energy conformers obtained with conf1 and conf2. While the bulk of the predictions falls within the error margins of ±20 meV, we also find 16 outliers – marked in orange. These can be attributed to an incomplete coverage of conformational space in the conf2 ensemble and to differences in the energetic ranking between GFN1-xTB and GFN-FF.
To address the latter point, in conf3 we therefore combine the higher accuracy of GFN1-xTB and the computational speed of GFN-FF: a conformer ensemble is generated with CREST at the GFN-FF level, while a subsequent local relaxation and energetic re-ranking is carried out using GFN1-xTB. Comparing again to conf1, we see a significantly better agreement between the methods (see Fig. 2d), with 5 remaining outliers falling beyond the error margins of ±20 meV. It should be noted, that conformer searches are in general a difficult global optimization problem, which cannot be solved deterministically in an efficient manner. Therefore, some amount of uncertainty is unavoidable and will affect the ML models in all cases. As discussed in the following, achieving lower uncertainty at this stage leads to significantly lower predictive errors, however.
In part, this can be explained by the smaller range of λGFN1 values (see next section). However, a fundamental difference between the two targets also exists: While we predict λGFN1 on the basis of the corresponding neutral state molecular equilibrium structures, this does not hold for λDFT. In, the latter case, the differing neutral state equilibrium geometries (between GFN1-xTB and DFT) further complicate the learning task.
It should be noted here that learning λGFN1 is itself only of methodological interest, however. Indeed, the conf3 search requires GFN1-xTB for energy ranking, which has a similar computational effort to calculating λGFN1. In the following, we therefore exclusively focus on predicting λDFT, using conf3 for structure generation. To this end we extended our DFT annotated dataset to cover in total 10900 molecules, randomly drawn from the full hydrocarbon database. The distribution of obtained λDFT values is shown in Fig. S8.† 1000 molecules served as an external test set for model validation, while at maximum 9600 of the remaining 9900 entered the respective training sets.
Since robust models already require the use of GFN1-xTB for conformer ranking, it is natural to ask whether electronic properties at the GFN1-xTB level could be used to improve them. The most straightforward way to do this is via a Δ-learning37 strategy, i.e. by learning a correction to λGFN1. To this end, we first use a simple linear regression to describe systematic differences between λDFT and λGFN1:
λlin = aλGFN1 + b | (4) |
This linear model alone yields a stable MAE of 40 meV, independent of the training set size. It thus outperforms the structure based Ks models for all but the largest training sets (see Fig. 4). This means that, contrary to the findings of ref. 34 we find a reasonably good correlation between GFN and DFT based reorganization energies (R2 = 0.54, see Fig. S10†). This is likely due to the different class of molecules (thiophene oligomers) considered therein. Defining as a new target property:
λΔ = λDFT − λlin, | (5) |
Fig. 4 Learning curves for various ML models. Comparison of Ks models with various Δ-learning approaches. Shadings analogous to Fig. 3. The Ks model corresponds to the curve labeled λconf3DFT in that figure. |
The GFN1-xTB calculations required for obtaining λGFN1 can also be exploited in a different way. One challenge for the structure-based models is the indirect relationship between the neutral GFN1-xTB geometry and λDFT. We therefore also explored property-based models (termed Kp) which use frontier orbital energies and gaps, Fermi levels, total energies and vertical energy differences of the neutral and cationic system to construct a representation, as fully detailed in Table S2.† The respective ΔKp model is actually slightly better than the corresponding structure-based model ΔKs, despite not including any structural information. Finally, a combined model incorporating the structural and property kernels (termed ΔKsp), performs better still, reaching an MAE of 25 meV at the largest training set size.
Please note that no optimization of the feature selection was performed for the property based models, other than checking that there were no strong linear dependencies between different properties. However, a more systematic feature selection procedure can provide physical insight and potentially improve the models. To explore this, we performed permutational feature importance (PFI) analysis for the ΔKp model (see Fig. S11†).59 This indicates that some features are particularly relevant for the model, e.g. the HOMO energy of the cationic state in the neutral geometry, the Fermi energy of the neutral state in the cation geometry and the individual contributions to the GFN1 reorganization energy. Based on this, we constructed additional models which only used subsets of the most important features. However, these sparse models displayed somewhat worse performance than the full model, indicating that all features ultimately contribute to the prediction accuracy. Nonetheless, more sophisticated feature engineering (e.g. using recursive selection or nonlinear transformations) may be able to achieve better performance with sparse models.
As illustrated in Fig. 5a, all three methods are quite successful in identifying promising candidates: from the 500 selected systems, GFN1-xTB identifies 436 molecules that display λDFT < 200 meV, compared to the somewhat higher numbers for the ΔKs and the ΔKsp models (where 487 and 492 are respectively identified). Narrowing the range to λDFT < 140 meV, the ΔKsp still performs best and identifies 251 structures, while the ΔKs and the GFN1-xTB identify 217 and 118 such cases, respectively.
Fig. 5 Results of the targeted identification of low-λ structures. (a) Distribution of λDFT values in the final selections derived from three different methods (see text). We only consider compounds that satisfy λDFT < 200 meV. (b) Kernel principal component analysis map of the identified structures (generated with the ASAP58 code). Kernel-density estimates are shown along the principal components. |
The 20 lowest-λ structures from all three screenings are shown in Fig. 6. Interestingly, 15 compounds in this subset were identified by the GFN1-xTB screening, while the ΔKs and ΔKsp models identified 9 and 11, falling slightly behind. In other words, the GFN1-xTB model actually has an edge over the ML model when considering the extreme low end of the distribution, although it is in general less effective in identifying low-λ structures. It is also notable that, although some overlap between the methods is observed (i.e. from the 1500 molecules selected by the three screenings only 1131 are unique candidates), many structures are exclusively identified by one method, in particular by GFN1-xTB. This is illustrated by the Kernel principal component analysis map58 shown in Fig. 5b, which places similar molecular structures close to each other. Clearly, the semiempirical GFN1-xTB model overall exhibits the highest diversity, while the candidates selected by the data-driven models appear somewhat more concentrated. This reflects the fact that GPR models use metrics of molecular similarity in their predictions.
Fig. 6 Lowest-λDFT candidates. Shown are the best candidates identified among 120k molecules in the three virtual screening campaigns. The corresponding λDFT values are listed below. |
However, this is not primarily just a problem of the chosen models, since other ML approaches also (implicitly) work with feature similarity. It is rather that ML models are by definition most strongly influenced by those types of molecules which occur most frequently in the dataset. The HTVS setting does not necessarily require a good description of an average molecule, however. Instead, it requires a good description of the small percentage of unusual molecules that we are interested in. This implies that a non-uniform sampling strategy for training set construction might be helpful in this context. This will be explored in future work.
At the suggestion of a reviewer, the virtual screening was also performed with the ΔKp approach (see Fig. S12 and S13†). This model shows comparable performance to ΔKs for systems with λ < 140 meV, but is considerably worse for the range 140 meV < λ < 200 meV. This indicates that the structural information in ΔKs and ΔKsp helps the models to reliably identify systems that are structurally similar to low-λ training set molecules, thus increasing their screening accuracy.
A more quantitative understanding of this can be obtained from a substructure analysis. To this end, we analysed whether certain structural motifs are significantly more likely to be found in the low-λ subset than in the full dataset. This can be quantified via the enrichment of a given substructure, defined as
(6) |
fi = (ni,all/Nall). | (7) |
To obtain a general design rule, we search for substructures with both high enrichment and reasonably high frequency. This allows balancing between overly specific substructures that only occur in very few molecules to begin with (high enrichment/low frequency) and overly simple motifs that occur in many molecules, independent of λ (low enrichment/high frequency).
As a preliminary screening, potential substructures were defined via Morgan-fingerprints60 of different bond-radii (see Fig. 8). As illustrated in Fig. S14,† this revealed a number of highly enriched substructures, which confirmed the initial impression that acetylene-bridged and cyclopentadiene containing structures are highly favourable. However, the substructures obtained in this fashion are often redundant and chemically unintuitive (i.e. by only containing parts of aromatic rings). We therefore manually derived a number of reasonable substructures from this analysis, in order to elucidate a robust and general design rule for low-λ molecules (see Fig. 7). Here, we focused on acetylene-bridged benzene rings, as cyclopentadiene is prone to dimerize in Diels–Alder reactions, pointing to potential stability issues with these molecules.
In Fig. 7a, we plot the enrichment and frequency of each substructure. This reveals a contravening trend: The simplest structure (1) is very common in the full dataset, but also displays very low enrichment in the low-λ set. In contrast, the more elaborate structures (8) and (9) are highly enriched, but very rare overall. Meanwhile substructure (5) (two meta-substituted acetylene-bridged benzene rings) features a quite high enrichment and is also fairly common in the database. As a consequence, ten further molecules with this motif can be found in the previously computed set of 10900 λDFT-values. This allows us to confirm that the corresponding molecules indeed display significantly lower reorganization energies than the full training set (Fig. 7c).
The distributions of λDFT-values for all substructures are shown in Fig. 7d. This confirms the impression obtained from the enrichment plots. Simple substructures like (1) are generally unspecific and can be found in both high- and low-λ molecules. Meanwhile, highly enriched substructures indeed robustly predict high quality candidates, and can thus be used to define general design rules.
It should be noted that the above analysis is ultimately limited by the biases of the underlying dataset. For example, heteroatomic substituents could affect the suitability of certain motifs quite strongly due to electronic push–pull effects, which are largely absent in the hydrocarbon dataset used herein. Nonetheless, the methodology we apply could of course also be applied to other datasets.
While this leads to a significant improvement of the predictive performance compared to the baseline, we find that the benefits of this are actually somewhat marginal in the context of virtual screening. Specifically, ML enhanced screening is more effective in identifying promising candidates, but the semiempirical model actually has some advantages in terms of candidate diversity. This calls into question whether the cost of building the ML models (in particular the generation of training data) is actually justified. In particular, computing λDFT for a single molecule takes on average 28 CPU hours on our hardware. In contrast, the generation of conformer ensembles (ca. 1 CPU hour per molecule) and the training of the ML models (one-time cost of 20 CPU hours for the largest training sets) are reasonably affordable. To obtain a clear advantage, more accurate and/or data-efficient ML models are thus required.
One way to achieve this would be to work with full conformer ensembles rather than single conformers to construct the representations.61 It should also be noted that packing and contact effects occurring in molecular crystals or amorphous structures are known to influence the encountered solid-state conformation and flexibility for geometrical relaxation.26,62,63 Potentially, generative ML models trained on condensed phase data could therefore help producing more realistic conformer ensembles.
Footnote |
† Electronic supplementary information (ESI) available: Details on structure generation, electronic properties, hyperparameters and substructure analysis. See DOI: 10.1039/d1dd00038a |
This journal is © The Royal Society of Chemistry 2022 |