Open Access Article
Alexandra Wahab a and Renana Gershoni-Poranne *b
aThe Laboratory for Organic Chemistry, Department of Chemistry and Applied Biosciences, ETH Zurich, 8093 Zurich, Switzerland
bThe Schulich Faculty of Chemistry and the Resnick Sustainability Center for Catalysis, Technion - Israel Institute of Technology, Haifa 32000, Israel. E-mail: rporanne@technion.ac.il
First published on 14th May 2024
We introduce the third installment of the COMPAS Project – a COMputational database of Polycyclic Aromatic Systems, focused on peri-condensed polybenzenoid hydrocarbons. In this installment, we develop two datasets containing the optimized ground-state structures and a selection of molecular properties of ∼39k and ∼9k peri-condensed polybenzenoid hydrocarbons (at the GFN2-xTB and CAM-B3LYP-D3BJ/aug-cc-pVDZ//CAM-B3LYP-D3BJ/def2-SVP levels, respectively). The manuscript details the enumeration and data generation processes and describes the information available within the datasets. An in-depth comparison between the two types of computation is performed, and it is found that the geometrical disagreement is maximal for slightly-distorted molecules. In addition, a data-driven analysis of the structure–property trends of peri-condensed PBHs is performed, highlighting the effect of the size of peri-condensed islands and linearly annulated rings on the HOMO–LUMO gap. The insights described herein are important for rational design of novel functional aromatic molecules for use in, e.g., organic electronics. The generated datasets provide a basis for additional data-driven machine- and deep-learning studies in chemistry.
Fig. 1 Representative examples of PBHs on a background of a hexagonal grid, cc-PBHs are colored in blue and pc-PBHs are colored in purple.
Thanks to the decades of intensive computational and experimental research into PBHs, a great deal has already been discovered about them (e.g., edge effects)23–25 and several models have been developed to understand and predict their behavior (e.g., Clar's sextet theory,26–28 the Y-rule,29–31 annellation theory,32 and our own additivity approach).33,34 Nonetheless, certain aspects of their structure–property relationships remain poorly understood, which impedes rational design of improved PBH-based candidates. Recent reports on the synthesis35–38 and characterization of challenging PBHs and on computational developments39–42 aimed at further elucidation of their properties underline the ongoing interest in PBH systems and the importance of obtaining reliable and useful data for them.
Data-driven investigations, which have become increasingly accessible due to advances in computational abilities, have the potential to address these knowledge gaps, thus both deepening our chemical understanding and enabling practical molecular design. Such tools have already been applied in the chemical space of PASs, including studies focused on spectra prediction,43 performing brute-force high-throughput screenings for organic electronics,44,45 active discovery of organic semiconductors,46 and design of organic electronic materials with generative models.47 As a result, several databases have been constructed that include PASs, which focus on general chemical data,48,49 computational benchmark data,50 spectroscopic data for astrochemical studies,51–56 aromaticity,57 and organic electronic materials.58,59 However, most of these databases focus on extant molecules, or generate molecules that are biased towards certain functionalities, thus neglecting large swaths of chemical space that may contain promising new structural motifs. Furthermore, they contain too few data (fewer than 1000 entries), are not consistently curated, and/or include an unsystematic mixture of PASs from different subclasses. To overcome this problem, a large, systematically constructed, and well-curated database of PAS compounds is needed. To address the paucity of PAS data, our group conceptualized and initiated the first COMputational database of Polycyclic Aromatic Systems, the COMPAS Project. The COMPAS Project is designed to house several datasets, each comprising a carefully curated and methodical enumeration of the chemical space of a certain subclass of PASs, calculated at a uniform level of theory. The first installment, COMPAS-1,60 focuses on ground-state cata-condensed polybenzenoid hydrocarbons; the second installment, COMPAS-2,61 focuses on ground-state cata-condensed heterocyclic PASs.
COMPAS-1 and COMPAS-2 have already been used to provide the first examples of interpretable machine and deep-learning models for PASs62,63 and to demonstrate the first generative design of PASs with targeted properties.64 Both datasets, as well as all future installments, are freely available for use, according to the FAIR65 principles of data sharing. Herein, we report on the third installment, COMPAS-3, which expands the COMPAS database to peri-condensed PBHs (pc-PBHs) in the ground state. Similarly to the previous two installments, COMPAS-3 contains two computationally-generated datasets: (1) COMPAS-3D: 8844 peri-condensed PBHs comprising 1–10 rings, calculated with density functional theory (DFT) at the CAM-B3LYP-D3BJ/aug-cc-pVDZ//CAM-B3LYP-D3BJ/def2-SVP level of theory; (2) COMPAS-3x: 39 482 peri-condensed PBHs comprising 1–11 rings, calculated with xTB using GFN2-xTB.
This manuscript is divided into three main sections: (a) a description of the data generation workflow and the contents of each of the datasets; (b) a comparison between the two datasets and discussion of the differences between the two levels of computations; and (c) an analysis of the data, showcasing structure–property relationships that are revealed from the trends in the data.
Fig. 2 Flowchart of the data-generation process. (1) CaGe66 was used to generate unoptimized geometries of pc-PBHs containing up to 11 rings. (2) xTB was used to optimize all geometries. (3) The data were filtered to remove invalid and/or unwanted molecules. The geometries and properties of the remaining molecules comprise the COMPAS-3x dataset (39 482 molecules). (4) DFT was used to further optimize the pc-PBHs containing up to 10 rings. The geometries and properties of these 8844 molecules comprise the COMPAS-3D dataset.
We differentiate between three cases of open-shell character in the ground state (Fig. 3): (A) an odd number of hydrogens/carbons (e.g., phenalenyl radical, C13H9, is a three-ring pc-PBH with a single unpaired electron); (B) non-Kekuléan structures, i.e., PBHs for which no classical closed-shell valence structure can be drawn67,68 (e.g., triangulene, C22H12, is a non-Kekuléan six-ring pc-PBH with two unpaired electrons in the ground state); and (C) molecules that possess a closed-shell resonance structure, but have appreciable diradical character, which is a relatively common occurrence in pc-PBHs, due to their extended conjugation (e.g., zethrenes).68
Fig. 3 Representative examples of the three cases of (poly)radical/(poly)radicaloid molecules that were discarded from COMPAS-3.
The first case can be dealt with quite easily. pc-PBHs containing the same number of rings may or may not be isomers (i.e., they may contain differing numbers of carbon and hydrogen atoms, despite having the same number of rings). Hence, in contrast to cc-PBHs, for pc-PBHs various molecular formulae exist per family (“families” are separated according to and referred to by the number of rings in the isomers). Since all formulae containing an odd number of hydrogens/carbons describe obviously radical systems, these cases were easily identified and discarded prior to structure enumeration. The remaining molecular formulae and corresponding numbers of isomers for each family are detailed in Table 1.
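The parity check described above can be illustrated with a minimal sketch. The function names and the regular expression are assumptions introduced here for illustration and are not taken from the authors' workflow; only the rule itself (an odd number of hydrogens or carbons implies a radical) comes from the text.

```python
import re


def parse_ch_formula(formula: str) -> tuple[int, int]:
    """Return (n_carbon, n_hydrogen) for a hydrocarbon formula such as 'C16H10'."""
    match = re.fullmatch(r"C(\d+)H(\d+)", formula)
    if match is None:
        raise ValueError(f"not a CxHy formula: {formula!r}")
    return int(match.group(1)), int(match.group(2))


def is_obvious_radical(formula: str) -> bool:
    """True if the formula has an odd number of carbons or hydrogens."""
    n_carbon, n_hydrogen = parse_ch_formula(formula)
    return n_carbon % 2 == 1 or n_hydrogen % 2 == 1


# Phenalenyl (C13H9) is discarded; pyrene (C16H10) and C22H12 pass this first check.
candidates = ["C13H9", "C16H10", "C22H12"]
kept = [f for f in candidates if not is_obvious_radical(f)]
```

Note that passing this check does not guarantee a closed-shell molecule: C22H12 includes triangulene, which is only caught by the later filtering steps.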
| No. rings | Molecular formula | Initial no. isomers (CaGe) | Final no. isomers |
|---|---|---|---|
| 4 | C16H10 | 1 | 1 |
| 5 | C20H12 | 3 | 3 |
| 6 | C22H12 | 3 | 2 |
| | C24H14 | 14 | 13 |
| 7 | C24H12 | 1 | 1 |
| | C26H14 | 10 | 9 |
| | C28H16 | 67 | 58 |
| 8 | C28H14 | 9 | 8 |
| | C30H16 | 67 | 57 |
| | C32H18 | 340 | 264 |
| 9 | C30H14 | 4 | 3 |
| | C32H16 | 55 | 44 |
| | C34H18 | 398 | 308 |
| | C36H20 | 1710 | 1182 |
| 10 | C32H14 | 1 | 1 |
| | C34H16 | 42 | 32 |
| | C36H18 | 547 | 180 |
| | C38H20 | 2439 | 1594 |
| | C40H22 | 8561 | 5084 |
| 11 | C36H16 | 26 | 17 |
| | C38H18 | 333 | 216 |
| | C40H20 | 2874 | 1683 |
| | C42H22 | 14 598 | 7662 |
| | C44H24 | 42 621 | 21 060 |
| Total | | 74 724 | 39 482 |
We then used the chemical & abstract graph environment (CaGe) software66 to obtain the initial (unoptimized) xyz coordinates of the 74 724 structures corresponding to the chemical formulae in Table 1 (Fig. 2, step 1). We implemented subsequent filtering steps to identify and discard the non-Kekuléan structures and the molecules with open-shell character (vide infra). Table 1 details the initial (generated by CaGe) and final (following filtering) numbers of isomers predicted for each family and each chemical formula of pc-PBHs.
The 74 724 molecules enumerated by CaGe were optimized with the GFN2-xTB method,69 using xTB70 version 6.2. Harmonic vibrational frequencies were calculated after structure optimization to ensure true minima on the potential energy surface (i.e., Nimag = 0; Fig. 2, step 2). Following data filtering (vide infra), a total of 39 482 molecules were retained. For each of these, xTB optimizations and subsequent frequency calculations were also performed for the cationic and anionic forms. The geometries and properties of these 39 482 pc-PBHs containing up to 11 rings comprise the dataset denoted as COMPAS-3x (see Table 1).
The first case includes non-Kekuléan structures and molecules that have non-negligible open-shell character in the ground state, which we excluded by design. The second case includes molecules that, for technical reasons, did not cleanly converge to a PBH structure and needed to be removed to guarantee data reliability. For example, an optimized structure may contain sp3-hybridized carbons, whereas all carbon atoms in PBHs should be sp2-hybridized. Such cases can arise when two carbon atoms that are not supposed to share a bond are located very close to one another in the starting geometry. Consequently, a spurious bond may be generated between these two carbons during the optimization process.
To identify the different types of undesired molecules, we first generated the SMILES strings of all xTB-optimized structures using the xyz2mol71 script. Molecules were discarded in any of the three following cases: (a) if a SMILES string was not generated (an indication of an invalid chemical structure); (b) if it contained any of the characters ‘@’, ‘=’, or ‘C’ (an indication of an sp3-hybridized carbon); or (c) if it contained any of the characters ‘[’, ‘]’, ‘−’, or ‘+’ (an indication of radical structure, which SMILES often wrongly denotes with charge). Following this filtering step, 55 820 molecules remained (i.e., 74.7% of the initial dataset). The majority of the discarded molecules (16 133 out of 18 904 molecules, or 85.3%) contained ‘+’ and/or ‘−’ in their SMILES string, which implies non-Kekuléan structure. Only 14.7% of the discarded molecules were removed due to problems in the optimization process.
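The three string-based rules above can be sketched as a small classifier. This is an illustration under stated assumptions, not the authors' script: the function and label names are hypothetical, the minus sign is assumed to appear as the ASCII hyphen used in SMILES, and the rules are checked in the order (a), (b), (c), so a string matching several rules is reported under the first one.

```python
from typing import Optional

# Characters named in the text as indicators of each problem type.
SP3_MARKERS = ("@", "=", "C")           # uppercase 'C' is a non-aromatic carbon in SMILES
RADICAL_MARKERS = ("[", "]", "-", "+")  # ASCII '-' assumed for the minus sign


def classify_smiles(smiles: Optional[str]) -> str:
    """Classify a SMILES string as 'invalid', 'sp3', 'radical', or 'keep'."""
    if not smiles:
        return "invalid"  # case (a): no SMILES string could be generated
    if any(marker in smiles for marker in SP3_MARKERS):
        return "sp3"      # case (b)
    if any(marker in smiles for marker in RADICAL_MARKERS):
        return "radical"  # case (c)
    return "keep"


# Illustrative inputs (not taken from the dataset).
pyrene = "c1cc2ccc3cccc4ccc(c1)c2c34"   # fully aromatic, passes all checks
fluorene = "C1c2ccccc2-c2ccccc12"       # contains an sp3 CH2 carbon
labels = [classify_smiles(s) for s in (None, pyrene, fluorene)]
```

Because the checks are purely textual, they depend on the SMILES writer emitting aromatic (lowercase) atoms for valid PBHs, which is assumed here.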
Finally, we used the NFOD metric72 to remove any molecules with significant open-shell/diradical character. We note that, in a recent contribution, Lischka and coworkers demonstrated that the NFOD metric is a reliable alternative to the more demanding multi-reference calculations usually needed to determine open-shell character.73 We previously benchmarked methods for identification of diradical character and established a threshold of NFOD = 1.3 as the cutoff value (we refer the reader to the ESI of ref. 63). Thus, molecules with NFOD ≥ 1.3 were removed from the COMPAS-3 datasets, providing a final tally of 39 482 molecules. It is notable that, of the initial 74 724 pc-PBHs generated by CaGe, approximately 44% do not have a closed-shell ground state according to these criteria. We also highlight that other methods exist to identify open-shell character, such as the pioneering method of Lischka and coworkers from 2016.74
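The NFOD cutoff can be applied as a simple threshold filter. In the sketch below, only the cutoff value of 1.3 and the "greater than or equal to" rule come from the text; the function name, the molecule identifiers, and the NFOD values are hypothetical.

```python
NFOD_CUTOFF = 1.3  # threshold stated in the text


def split_by_nfod(nfod_by_id: dict[str, float], cutoff: float = NFOD_CUTOFF):
    """Split molecules into (kept, removed); NFOD >= cutoff means removed."""
    kept = {mol: v for mol, v in nfod_by_id.items() if v < cutoff}
    removed = {mol: v for mol, v in nfod_by_id.items() if v >= cutoff}
    return kept, removed


# Hypothetical NFOD values, for illustration only.
example = {"mol_a": 0.85, "mol_b": 1.30, "mol_c": 1.72}
kept, removed = split_by_nfod(example)
```

A molecule sitting exactly at the cutoff (mol_b here) is removed, matching the NFOD ≥ 1.3 criterion.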
The geometries of 8844 molecules were optimized with ORCA version 5.0.375,76 using the CAM-B3LYP77–81 functional with Grimme's D382 dispersion correction with Becke Johnson damping, in combination with the def2-SVP basis set.83,84 Single-point calculations were performed on the optimized geometries using the aug-cc-pVDZ85–87 basis set (in short: CAM-B3LYP-D3BJ/aug-cc-pVDZ//CAM-B3LYP-D3BJ/def2-SVP). These methods were selected following a literature search88 and a subsequent benchmarking procedure (see Section S2 of the ESI†). The resulting DFT-optimized geometries and properties form the dataset denoted as COMPAS-3D.
| Properties | COMPAS-3x | COMPAS-3D |
|---|---|---|
| HOMO | ✓ | ✓ |
| LUMO | ✓ | ✓ |
| HLG | ✓ | ✓ |
| SPE (neutral) | ✓ | ✓ |
| SPE (cation) | ✓ | ✓ |
| SPE (anion) | ✓ | ✓ |
| Erel (neutral) | ✓ | ✓ |
| ZPE (neutral) | ✓ | |
| ZPE (cation) | ✓ | |
| ZPE (anion) | ✓ | |
| aIP | ✓ | ✓ |
| aEA | ✓ | ✓ |
| Disp. corr. | ✓ | ✓ |
| Dipole moment | ✓ | ✓ |
| Corrected HOMO | ✓ | |
| Corrected LUMO | ✓ | |
| Corrected HLG | ✓ | |
| Corrected aIP | ✓ | |
| Corrected aEA | ✓ | |
| NFOD | ✓ | ✓ |
| y value | ✓ |
Table 2 lists the properties contained in the COMPAS-3x and COMPAS-3D datasets. The common properties are the energies of the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO), the HOMO–LUMO gap (HLG), the dispersion-corrected single point energy (SPE)—i.e., the energy of the optimized structure without zero-point corrections—for the neutral and charged species, the relative energy (Erel)—i.e., the difference in SPE between each molecule and its lowest-energy isomer—for the neutral species, the adiabatic ionization potential (aIP), the adiabatic electron affinity (aEA), the dispersion correction (disp. corr.), the dipole moment, and the NFOD. The aIP and aEA represent the SPE difference between the optimized neutral species and optimized positively and negatively charged species, respectively.
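The definitions of Erel, aIP, and aEA can be made concrete with a short sketch. The text gives these quantities only as SPE differences, so the sign conventions used here (aIP = SPE of cation minus SPE of neutral; aEA = SPE of neutral minus SPE of anion) are assumptions, as are the record layout, the function names, and the energy values.

```python
def relative_energies(records: list[dict]) -> dict[str, float]:
    """Erel per molecule: SPE minus the lowest SPE among isomers (same formula)."""
    lowest: dict[str, float] = {}
    for rec in records:
        formula, spe = rec["formula"], rec["spe_neutral"]
        lowest[formula] = min(lowest.get(formula, spe), spe)
    return {rec["name"]: rec["spe_neutral"] - lowest[rec["formula"]] for rec in records}


def adiabatic_ip(spe_neutral: float, spe_cation: float) -> float:
    """Assumed convention: energy of optimized cation minus optimized neutral."""
    return spe_cation - spe_neutral


def adiabatic_ea(spe_neutral: float, spe_anion: float) -> float:
    """Assumed convention: energy of optimized neutral minus optimized anion."""
    return spe_neutral - spe_anion


# Hypothetical energies (arbitrary units), for illustration only.
records = [
    {"name": "iso_1", "formula": "C20H12", "spe_neutral": -10.0},
    {"name": "iso_2", "formula": "C20H12", "spe_neutral": -9.5},
    {"name": "iso_3", "formula": "C22H12", "spe_neutral": -12.0},
]
erel = relative_energies(records)
```

Grouping by molecular formula rather than by ring count matters for pc-PBHs, since a single family contains several formulae.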
COMPAS-3x contains the zero-point energies (ZPEs) for all species (neutral and charged ±1) while COMPAS-3D does not (we did not perform frequency calculations at the DFT level). ZPE corrections have been shown to not be highly method-dependent,89 and thus can be used across methods, if desired.
For several of the properties, the xTB values were corrected to DFT-level, using the respective fitting regressions (see Fig. 6 and Table 3). These values are labeled as “Corrected” in the COMPAS-3x dataset. Additional information on the regressions is given in Section S3 of the ESI.† An in-depth comparison between the two methods is described in the following section.
| Properties | COMPAS-1 slope | COMPAS-1 intercept | COMPAS-3 slope | COMPAS-3 intercept |
|---|---|---|---|---|
| HOMO | 1.618 | 9.128 | 1.556 | 8.554 |
| LUMO | 1.256 | 8.482 | 1.286 | 8.740 |
| HLG | 1.424 | 2.519 | 1.422 | 2.527 |
| aIP | 1.262 | −7.441 | 1.442 | −9.578 |
| aEA | 1.059 | 5.509 | 1.216 | 6.425 |
| Erel | 1.490 | 0.077 | 1.513 | 0.037 |
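As an illustration of how such regressions could be used, the sketch below applies the COMPAS-3 coefficients from Table 3 to an xTB value. It assumes the fits have the form DFT ≈ slope × xTB + intercept with all quantities in eV; this direction is inferred from the description of the "Corrected" values and is not stated explicitly in the text. The function name and the input value are hypothetical.

```python
# (slope, intercept) pairs for COMPAS-3, copied from Table 3.
COMPAS3_FITS = {
    "HOMO": (1.556, 8.554),
    "LUMO": (1.286, 8.740),
    "HLG": (1.422, 2.527),
    "aIP": (1.442, -9.578),
    "aEA": (1.216, 6.425),
    "Erel": (1.513, 0.037),
}


def correct_to_dft(prop: str, xtb_value: float) -> float:
    """Map an xTB value to an estimated DFT-level value (assumed linear form)."""
    slope, intercept = COMPAS3_FITS[prop]
    return slope * xtb_value + intercept


# Hypothetical xTB HOMO-LUMO gap of 2.0 eV.
corrected_hlg = correct_to_dft("HLG", 2.0)
```

Because none of the slopes equals 1, the correction is not a constant shift: larger xTB values are moved further than smaller ones.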
To probe this behavior further, we plotted the Δz values from the two methods against one another (Fig. 4D–F) for the neutral, cationic, and anionic species. These plots reiterate the conclusion we reached on the basis of RMSD: the two methods show excellent agreement on the extent of non-planarity only for molecules with Δz > 2 Å; the agreement is substantially poorer for molecules that are less distorted (i.e., more planar). Specifically, for such molecules, whereas the xTB values are spread out over the range [0,2] Å, the majority of DFT values are close to 0 Å. In other words, DFT predicts almost completely planar geometries for these molecules, while xTB predicts distortion from planarity.
This raises the question: what are the two methods treating differently, to arrive at these different geometries? One possible source of discrepancy could be the dispersion correction: our DFT calculations included Grimme's D3 dispersion correction, while xTB uses the D4 correction by default. Nonetheless, this possibility was ruled out, as the two different corrections actually show an excellent agreement, especially at smaller Δz values (see Fig. S21 in the ESI†). In principle, polycyclic aromatic systems should strive for planarity as a consequence of the sp2 hybridization of the comprising carbons. Moreover, planarity ensures better orbital overlap and therefore increased electron delocalization and aromatic stabilization. Such systems distort from planarity only when cove, fjord, and helix motifs are involved. For such motifs, the steric hindrance between hydrogens in the curved area forces the carbon scaffold out of planarity, incurring torsional strain. The fact that xTB predicts non-planar geometries suggests that it estimates this steric hindrance to be more costly than both the energetic cost of torsional strain and the stabilization gain of planarization. Conversely, the fact that DFT predicts planar geometries suggests that it either estimates the cost of torsional strain to be greater than the cost of the hydrogen–hydrogen steric hindrance, or estimates the gain of aromatic stabilization to be greater than the cost of steric hindrance. It is worth noting that previous results from our group and others have indicated that such small deviations from planarity have only a minor effect on aromatic stabilization.90,91 Thus, we believe the balance between torsional strain and steric hindrance is the more influential effect. We discuss this issue further in the section on Erel.
Fig. 5 Violin plots of xTB-calculated (blue) properties vs. DFT-calculated (purple) properties: (A) HOMO; (B) LUMO; (C) HLG; (D) aIP; (E) aEA; (F) Erel. All values are reported in eV.
Despite these shifts, the KDE profiles of the xTB- and DFT-calculated properties (with the exclusion of Erel, which is discussed in further detail, vide infra) are very similar, as confirmed by the good linear correlations observed between the two computational methods (Fig. 6A–E). For comparison, these plots detail the correlations for both COMPAS-1 (blue) and COMPAS-3 (purple). We note, however, that the slopes of all linear regressions are not equal to 1 (see Table 3), meaning that the difference between the methods is not simply a constant offset. We also note that the individual fitting equations for the various properties are very similar for COMPAS-1 and COMPAS-3, with the exception of the aIP and the aEA. Additionally, for the latter two properties, the pc-PBHs show better agreement with the linear fits. We believe that the pc-PBHs show slightly better agreement because they tend to be more planar than the cc-PBHs (less opportunity to form helical motifs). Nevertheless, it is clear that for most properties, one equation per property is sufficient to “correct” xTB values to DFT ones for both the COMPAS-1 and COMPAS-3 datasets, allowing inexpensive generation of additional data in the future. We refer the reader to Section S5.2 of the ESI† for further discussion on the aIP and aEA calculations, including the relationship to non-planarity and additional analysis of the outliers seen in the aEA plot.
Based on our previous RMSD analysis, we can rule out that the differences in energies stem from differences in geometries (despite the disagreement around Δz = 1 Å for a small fraction of molecules, there is an overall excellent agreement between the xTB- and DFT-optimized geometries). Nevertheless, the special case of the close-to-planar molecules discussed above already hinted at the possible source of discrepancy between the methods.
One can interpret the difference in Erel as the sum of differences in aromatic stabilization and differences in strain between any given molecule and its lowest-energy isomer. Seen in this light, we may ask if the difference in Erel arises from (a) estimation of strain (steric and torsional), (b) estimation of aromatic stabilization, or (c) both?
In this regard, we note that we deliberately chose the CAM-B3LYP functional, which has been shown not to suffer from over-delocalization errors;92,93 such errors could lead to spurious results, including exaggerated planarity and over-estimation of aromatic stabilization. Nevertheless, to try to pinpoint the source of the discrepancy, we studied the relationship between the size of the molecule and the difference in relative energy, ΔErel = Erel(DFT) − Erel(xTB). We hypothesized that if the difference stems from the way aromatic stabilization energy is estimated, then increasing the number of rings/atoms should exacerbate the problem, because of the extension of the conjugated system. In contrast, large molecules do not necessarily incur strain (in particular, torsional/helical strain) simply because they are larger; it depends on their exact geometry. Our analysis showed that the effect of the number of rings is minimal, and the effect of the number of atoms is inconsequential (see Fig. S22 in the ESI†).
We next investigated whether the issue lies with the estimation of strain, by probing the relationship between ΔErel and Δz (the deviation from planarity, which corresponds to torsional strain). Fig. 7 presents the obtained correlation, which demonstrates that an increased deviation from planarity coincides with an increase in ΔErel. To highlight that the deviation from planarity is specifically due to the existence of helical motifs, we colored the individual data points according to the largest helicene present in the molecule ([n]Helicenes—where n represents the number of rings present in the helical structure). The obvious stratification of the colored data points shows this effect clearly.
Fig. 7 Scatter plot of ΔErel vs. Δz, colored by the longest [n]Helicene present in the molecule (0 indicates no helicene motifs). The red line shows the trendline of the data.
To summarize, although we cannot affirmatively identify the source of the discrepancy in Erel between xTB and DFT, our results suggest that the issue lies in the estimation of steric hindrance versus torsional strain. This rationalization is relevant both to the Erel and to the geometry discrepancies described above for close-to-planar molecules. It is interesting to note that the two methods, xTB and DFT, have different areas of agreement when it comes to energies and geometries. Whereas the geometric differences are greatest for molecules with small deviations from planarity, the energy differences are largest for molecules that have much more pronounced non-planarity. This once again highlights that obtaining the optimized geometry for close-to-planar molecules is a subtle balance of effects.
The most prevalent peri-island (53%) is the 4-ring island, i.e., pyrene, which is also the smallest Kekuléan pc-PBH. As the numbers of rings in the molecules grow, larger peri-islands can form (Fig. 8B, left). At the same time, because the total number of rings is limited, larger peri-islands also preclude the existence of multiple cata-moieties (Fig. 8B, right).
Considering the structural similarity between the COMPAS-1 and COMPAS-3 molecules, it is not surprising that the ranges of properties for the two datasets are similar, as seen in the violin plots in Fig. 9 (COMPAS-1 is shown in light blue and COMPAS-3 is shown in purple). Nevertheless, they are not identical. For example, Fig. 9A–C show that the distributions of the cc-PBHs are more heavily weighted towards lower HOMO values, higher LUMO values, and higher HLG values than the pc-PBHs. COMPAS-1 also shows broader distributions for both aIP and aEA (Fig. 9D and E), as well as a shift of the distribution peaks towards higher values in both cases. We note that, to facilitate the comparison, we recalculated the COMPAS-1D dataset at the same level as COMPAS-3D (in the original publication of COMPAS-1D we used B3LYP-D3BJ/def2-SVP;60 for comparison between the two levels of theory for COMPAS-1, see Section S4 in the ESI†).
Thus, it is apparent from these data that despite the general similarity between the cc-PBH and pc-PBH sub-classes, the inclusion of peri-condensed components does have an effect on the molecular properties. In the following sections, we investigate these effects.
We began by analyzing the relationship between molecular size and molecular properties. To avoid ambiguity, we opted to use the ring count as the measure of size. This means that several molecular stoichiometries are contained in the same “size” category. Also, under this classification, coronene is considered part of the 7-ring family (it contains 6 peripheral rings and 1 central ring), even though its molecular formula assigns it as a 6-ring isomer.
Fig. 10 presents boxplots of the HLG, separated and colored according to multiple different structural features.
Fig. 10 Boxplots of the DFT-calculated values of the HLG, colored by: (A) number of rings, (B) number of rings in the largest peri-island, (C) number of cata-moieties, (D) longest contained [n]Helicene, and (E) longest stretch of linearly annulated rings. Plot A presents the data from all molecules in families 5–10. Plots B–E present data from family 10 only. The number of data points within each box can be found in Tables S8–S12 in the ESI.†
Fig. 10A presents the effect of size on the range of HLG values, showing a trend whereby the distribution of values shifts to smaller gaps as the molecules grow larger. The differences between consecutive families become smaller as the size increases, and for the larger families (7- to 10-ring systems) the property ranges covered are highly overlapping. This is not unexpected; it is known that extending conjugation in fused polycyclic oligomers reduces HLGs in a 1/n manner (where n is the number of double bonds).94 To ensure that subsequent analyses were not tainted by this size dependency, the remaining plots B–E show only data for family 10 (i.e., 10-ring systems).
Increasing the size of the largest peri-island (Fig. 10B) demonstrates a similar size-dependency, whereby larger islands lead to smaller HLG values. However, in contrast to the previous trend, in this case all of the molecules are of the same size, thus this effect is clearly due to the size of the island itself, not of the overall molecule. Notably, all of the groups have a large degree of overlap, with the exception of the 4-ring systems (i.e., pyrene-based pc-PBHs), which tend to have a higher range of values than the other groups.
Conversely, increasing the number of cata-moieties appears to have a minimal effect on the HLG (Fig. 10C). Among the not strictly pc-PBHs, there is barely any differentiation. However, the strictly peri-condensed molecules (i.e., number of cata-moieties = 0) have noticeably smaller gap values. In other words, adding the first cata-moiety makes a significant change, but subsequent additions do not.
To further probe the effect of different cata-moieties, we differentiated between helical and linearly annulated cata-condensed components. In Fig. 10D we examine the effect of the longest helical stretch in the molecule. As mentioned above, the longer the contained [n]Helicene, the more distorted from planarity the molecule becomes. Hence, this analysis can also be viewed as an indirect measure of non-planarity in the molecules. We observe a slight trend, whereby elongating the helicene leads to an increase in the HLG. Once again, however, there is a large degree of overlap between the groups. The effect of the longest linear stretch, which we found to be dominant in cc-PBHs,60 is shown in Fig. 10E. We find that, for the pc-PBHs as well, elongating the linear stretch beyond 3 rings (i.e., a stretch of at least 4 rings) dramatically decreases the value of the HLG and substantially narrows the spread of possible HLG values. Of all features examined, this structural component also shows the best differentiation between groups, i.e., the least amount of overlap. Thus, it appears to be a dominant structural feature in pc-PBHs.
The main conclusions of our comparison between xTB and DFT are as follows: in general, the agreement between the methods is excellent for both optimized geometries and calculated properties, meaning that DFT-level accuracy can be reliably obtained from xTB calculations. However, the molecular properties, with the exception of Erel, cover vastly different ranges of values. xTB-Erel and DFT-Erel show an excellent linear correlation, but DFT-Erel is consistently greater. Furthermore, for the specific subset of close-to-planar molecules, we found that DFT flattens molecules that xTB predicts to have a deviation from planarity of approximately 1 Å. For both of these findings, our analysis suggests that the underlying cause of the discrepancy is linked to the different estimation of steric hindrance and torsional strain made by each of the methods. Specifically, DFT estimates the torsional strain to be more costly than the hydrogen–hydrogen steric hindrance; the opposite is true for xTB. We also emphasize that all of our observations are in line with what we previously showed for COMPAS-1. While this may appear trivial, it is not obvious that cata- and peri-condensed PBHs should show similar tendencies and trends, nor that the two chosen levels of theory should have similar correlations for them, given the complexity inherent in large conjugated systems.
The main conclusions of our structure–property analysis are as follows: for several of the structural motifs we examined, there are apparent trends for the HLG. Namely, the HLG decreases with an overall increase in molecule size, but it also decreases with an increase only in the size of the largest contained peri-island. The number of cata-moieties does not appear to have a marked effect, with the exception of going from strictly peri-condensed to not strictly peri-condensed. However, the type of cata-moiety does have an effect: elongation of helical motifs shows a slight tendency to increase the HLG, while elongation of the longest linear stretch shows a strong tendency to decrease the HLG.
Despite these trends, the individual groups overlap to a large extent and cannot be easily differentiated. The two exceptions are the pyrene-based pc-PBHs, which appear to have noticeably larger HLGs, and pc-PBHs containing linear stretches of four or more rings. In both of these cases, these structural motifs separate the molecules from the distributions of the rest of the data. Thus, our analysis has helped to pinpoint promising directions for further development of design principles. In the future, we plan to continue investigating these two effects, including their interplay, and how they can be used to tune the molecular properties of pc-PBHs.
To conclude, this work provides two new datasets that can assist in further data-driven investigations and inverse design of promising functional molecules. Moreover, the insights gained from our analysis deepen our understanding of these prevalent and important molecules, and can inform future rational design of PBH-based systems.
Footnote
† Electronic supplementary information (ESI) available: Details of general computational methods, templates for xTB and DFT calculations, benchmarking procedure for choosing the DFT level of theory, comparison of the COMPAS-1 dataset using the two levels of theory (this report versus the original publication), extended discussion of the outliers in the aIP and aEA plot, comparison of D3 and D4 dispersion corrections, and additional discussion of the relative energy and structure–property analyses. See DOI: https://doi.org/10.1039/d4cp01027b
This journal is © the Owner Societies 2024