Open Access Article
Stefan
Gugler
,
Jon Paul
Janet
and
Heather J.
Kulik
*
Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: hjkulik@mit.edu; Tel: +1 617 253 4584
First published on 4th July 2019
Despite being attractive targets for functional materials, the discovery of transition metal complexes with high-throughput computational screening is challenged by the amount of feasible coordination numbers, spin states, or oxidation states and the potentially large sizes of ligands. To overcome these limitations, we take inspiration from organic chemistry where full enumeration of neutral, closed-shell molecules under the constraint of size has enriched discovery efforts. We design monodentate and bidentate ligands from scratch for the construction of mononuclear, octahedral transition metal complexes with up to 13 heavy atoms (i.e., metal, C, N, O, P, or S). From >11
000 theoretical ligands, we develop a heuristic score for ranking a chemically feasible 2500 ligand subset, only 71 of which were previously included in common organic molecule databases. We characterize the top 20% of scored ligands with density functional theory (DFT) in an octahedral homoleptic ligand database (OHLDB). The OHLDB contains i) the geometry optimized structures of 1250 homoleptic octahedral complexes obtained from the enumerated pool of ligands and an open-shell transition metal (M(II)/M(III), M = Cr, Mn, Fe, or Co) and ii) the resulting high-spin/low-spin adiabatic electronic energy differences (ΔEH–L) obtained with hybrid DFT. Over the OHLDB, we observe structure–property (i.e., ΔEH–L) relationships different from those expected on the basis of ligand field arguments or from our prior data sets. Finally, we demonstrate how incorporating OHLDB data into artificial neural network (ANN) training improves ANN out-of-sample performance on much larger transition metal complexes.
Design, System, ApplicationWe develop a strategy for octahedral transition metal complex design by enumerating potential ligands from scratch. Such complexes are relevant as molecular sensors and switches that can change their ground state spin in response to light, heat, or other stimuli. They are also models of catalytic active sites. We show how our de novo enumeration both produces complexes with novel electronic structure and provides training data that improves the baseline performance of our machine learning models (here, artificial neural networks) in out-of-sample tests. |
To accelerate discovery in open-shell transition metal chemistry, we have developed4,17,27,28,56,57 ML models for predicting quantum mechanical properties (here, computed with DFT), including spin-splitting energies,17,27,56 redox or ionization potentials,28,56 metal–ligand bond lengths,17,28 frontier orbital energies,4 and reaction energetics.58 A key outstanding challenge for ML model improvement, especially in inorganic chemistry, is the generation of large data sets. While most organic chemistry ML models have been trained24,59–65 on large (>100k points) data sets19,66,67 of molecules consisting of up to 9 heavy (i.e., C, N, O, or F) atoms, the higher computational cost associated with the larger number of electrons and added complexity of open-shell wavefunctions have limited data set sizes for inorganic chemistry.28 Despite these limitations in data set size, ML models for inorganic chemistry on modest data sets are predictive to 1–3 kcal mol−1 as judged by test set errors.17,56 However, limited coverage of the wide range of chemical bonding in transition metal complexes means that ML model prediction error can rise rapidly (to ca. 6–10 kcal mol−1 (ref. 17 and 27)) when applied to complexes dissimilar from training data. Thus, an approach to efficiently increase data set diversity would benefit ML models in inorganic chemistry.
Here, we take inspiration from organic chemistry, where systematic enumeration of small, drug-like molecules68–73 paved the way for large data sets suitable for ML. However, the enumeration of possible inorganic complexes will necessarily differ from such prior organic enumerations, as the properties that make a molecule a good ligand for an inorganic complex (e.g., an unsaturated atom that can freely complex to a metal) are not the same as the characteristics that define closed-shell organic molecules. To construct such a set we note that, whether through ad hoc feature engineering17 or systematic feature selection,56 the most predictive and transferable feature sets for open-shell transition metal complex properties emphasize metal-local features. For spin-splitting energetics in particular, we found that the 24 most informative heuristic features obtained from feature selection28,56 were primarily (80%) comprised of properties from atoms within two bonds of the metal on the molecular graph. Thus, by selecting small ligands and systematically varying their properties, we expect to capture the most important variations needed to improve coverage for ML model training.
In this work, we enumerate de novo octahedral transition metal complex ligands, study properties of homoleptic complexes of these ligands with first-principles DFT simulation, and demonstrate the improvement of our inorganic ML models through the inclusion of this data. The rest of this work is as follows. In section 2, we describe our rules for enumerating monodentate and bidentate ligands composed of up to two and four heavy atoms, respectively. In section 3, we describe the Computational details of our simulation methodology. In section 4, we analyze the results of DFT geometry optimizations of homoleptic octahedral complexes built from these ligands and demonstrate how these data points improve ML model performance. Finally, in section 5, we provide our outlook and conclusions.
For M1 ligands, we assigned three scores, s, for the charge (scharge), sterics (i.e., presence of H atoms, ssterics), and accordance with the octet rule (soctet). We favored neutral and weakly negative ligands (scharge = 3 for −2 ≤ c ≤ 0) over positively charged ligands (scharge = 0 for c = 1), due to the fact that ligands will be complexed with positively charged metal centers (Fig. 2). We penalized h = 4 (ssterics = 0) passivation over all other choices (ssterics = 3) because large numbers of H atoms hinder the heavy atom from coordinating to the metal center (Fig. 2). We reduced the soctet of a valence, v, in proportion to its violation of the octet rule (Fig. 2):
| soctet = 4 − |8 − v| | (1) |
| stot = scharge + ssterics + soctet | (2) |
Common M1-type ligands in the spectrochemical series76 include OH−, H2O, and NH3, all of which were top scoring (i.e., stot = 10, Fig. 3 and ESI† Table S1). M1 ligands with stot = 8 that have been observed experimentally or invoked in the spectrochemical series include methylene77 (CH2) and elemental sulfur78 (Fig. 3). Still lower scores (i.e., stot = 7) arose primarily due to penalties for: i) steric repulsion, as in methane (CH4) or ii) charge, as in sulfonium (SH3+) and hydronium (OH3+) (Fig. 3). The lowest scoring ligand (stot = 3), OH42−, simultaneously violates steric and octet rules (Fig. 3). After filtering on score, we retained only the 29 M1 ligands with stot ≥ 8, all of which were neutral (9) or negatively charged (20, see ESI† Fig. S1).
In addition to eliminating ligands that produce odd numbers of electrons as in the M1 ligands, we also considered the expected bonding between the two heavy atoms in the ligand for filtering and scoring of the M2 ligands. To determine M2 candidate ligand bond order, we first assigned individual c1, c2 charges by choosing from all combinations that could produce the ctot value under constraint that the valence, v, of each ith atom was satisfied:
| vi = vei − ci − 2·li − hi | (3) |
| b = min (v1, v2) | (4) |
After assembling these M2 ligands, we discarded i) highly positively charged ligands (i.e., ctot > 1), ii) metal-coordinating heavy atoms (i.e., atom 1) with h > 3, and iii) ligands with b = 0 predicted between the two heavy atoms. We then scored the remaining 1171 ligands with scores adapted and augmented from the M1 case. In comparison to M1 charge scoring, the M2 ligand charge score was biased toward neutral ligands (scharge = 3 for ctot = 0) and weakly penalized intermediate, negatively charged ligands (scharge = 2 for ctot = −1 or −2, scharge = 1 for ctot = −3, see Fig. 2). Sterics of the M2 ligand were scored only for the metal-coordinating atom, but we penalized h = 3 or higher due to the presence of the second heavy atom (i.e., ssterics = 3 for h < 3, Fig. 2). The M2 soctet score was applied over atom 1 and 2 v values (Fig. 2):
| soctet = 5 − |v1 − v2| | (5) |
We introduced two bond-specific scores for the bond between the two heavy atoms in the M2 ligands: i) bond order, sbond and ii) charge polarization, spol. We already excluded b = 0 molecules, but in scoring we favored b = 1 (sbond = 3) weakly over b > 1 (sbond = 2) to avoid oversampling high bond-order, few-atom M2 ligands (Fig. 2). We disfavored unphysically ionic bonds in the present ligands by scoring highest the cases with low atom-wise charge assignment (spol = 3 for |c1| + |c2| ≤ 2), assigning intermediate scores for moderate polarization (spol = 1 for |c1| + |c2| = 3) and penalizing the highest formal charges (spol = 0 for |c1| + |c2| = 4, see Fig. 2). A total M2 ligand score was thus:
| stot = scharge + ssterics + soctet + sbond + spol | (6) |
The 55 highest scoring ligands (stot = 17) include the common metal-complexing ligands methylamine (NH2CH3),79 hydrogen peroxide (H2O2), and both hydroxylamine (NH2OH)80 and its experimentally-observed analogue, thiophosphinous acid81 (PH2SH, Fig. 3). Both reactive peroxide82 (O22−) and nitrosyl83 (NO−) (stot = 15) score highly as do stable small molecules such as N2 and HCN (stot = 16) and methanol (CH3OH) (O-coordinating stot = 17, C-coordinating stot = 14, see Fig. 3). Most M2-type ligands in the spectrochemical series84 have high scores (e.g., CN−, stot = 15 and CO, stot = 16), with NO+ (stot = 13) scoring the lowest due to its positive charge. Other stot = 13 ligands include sterically hindered ligands (e.g., NH3NH32−) or likely unstable, positively charged species, such as CCH+ (Fig. 3). The lowest score (stot = 3) was only assigned to 2 ligands when steric repulsion, octet rule violation, and unfavorable charges were all present in a single molecule (e.g., NH3C4−: soctet = 1, sbond = 2, all other scores are zero, see Fig. 3). From the scored subset of M2 ligands, we therefore retained the 494 M2 ligands with stot ≥ 14 for further characterization with DFT (sec. 4). Most of the retained ligands have single or double bonds between the two heavy atoms and are neutral or negatively charged (ESI† Fig. S2).
We next determined the b1–2 and b2–2′ bond orders simultaneously by iterating over allowed values between single and triple bonds:
![]() | (7) |
CH has b1–2 = 2 with c1 = −1 and c2 = 1. The resulting B4 ligand formed from two of these M2 ligands is NH2–CH
CH–NH2, which has the same net charge but neutral atom-wise charge assignments with b1–2 = b1′–2′ = 1 and b2–2′ = 2.
After assembling all theoretical B4 ligands, we discarded i) ligands with two consecutive double bonds or any cases with triple bonds that would be unable to form a 1–2–2′–1′ dihedral necessary to enable bidentate metal coordination, ii) strongly charged ligands with total charge (i.e., twice the charge of the original M2 ligand) higher than 4 or more negative than −8, iii) cases in which any bond orders were assigned to be zero, and iv) ligands with connecting atoms that had three or more passivating hydrogen atoms. These down-selection steps left a pool of 1356 B4 ligands suitable for further scoring with a five-component score similar to that applied to the M2 ligands. Both scharge and ssterics were unchanged from the M2 case but were evaluated, respectively, on the charge only of the original M2 fragment (i.e., half of the total B4 ligand charge) or a single relevant connecting atom (Fig. 2). The B4 soctet was also scored only for a single building block using eqn (5) but using the revised valency after charge redistribution (Fig. 2). After redistribution of charges, we scored a single 1–2 pair, penalizing strong polarization (spol = 0 for |c1| + |c2| > 1), favoring completely neutral atomic charges (spol = 2 for |c1| + |c2| = 0), and assigning an intermediate score (spol = 1) for slight polarization (i.e., |c1| + |c2| = 1, see Fig. 2). We also computed sbond only on the 1–2 bond and reduced it with respect to M2 scoring for higher bond orders (sbond = 3 for b = 1, sbond = 1 for b = 2, see Fig. 2).
The five B4 metrics were combined as in eqn (6) for a total score with a range of 2 to 16 across ligands retained for scoring (Fig. 1). Ethylenediamine (en, C2H8N2) is the only B4-type ligand in the spectrochemical series,84 and it has a maximum score (stot = 16) as does its phosphorus analogue, 1,2-ethanediyldiphosphine (C2H8P2, stot = 16, Fig. 3 and ESI† Table S4). Other common ligands that score highly include ethylene glycol (C2H6O2, stot = 16) and 1,2-ethanediimine (C2H4N2, stot = 14), which contains a bonding pattern analogous to that observed in the common bipyridine ligand (Fig. 3). Stable organic molecules that are sterically hindered and would make poor ligands, e.g., butane (C4H10, stot = 13), score more poorly than unstable but likely metal-coordinating molecules, e.g., tetraoxidane85 (H2O4, stot = 16, Fig. 3). Other molecules with intermediate scores (stot = 12) typically score lower due to a combination of steric hindrance, charge, and violations of the octet rule (e.g., S4, or HOC2OH, Fig. 3). The lowest scoring ligands are charged, have polarized bonds, and are octet violating (e.g., CCH3CH3C8−, stot = 2, Fig. 3). Following these holistic observations, we retained only the 47 B4 ligands with stot ≥ 14 for subsequent DFT calculations (ESI† Fig. S3).
325 theoretical M1, M2, and B4 ligands, we scored 2577 ligands and identified the top 20% (570 ligands) as good candidates for DFT characterization (Fig. 1 and see sec. 4). The other 8748 ligands were eliminated prior to scoring due to disqualifications ranging from unpaired electrons on the ligand, zero bond order between heavy atoms or unsuitable bond order for bidentate coordination, high net positive charge that would prevent coordinating a positively charged metal, or strong steric repulsion from a high number of hydrogen atoms on the metal-coordinating atom that would prevent metal coordination. All such excluded ligands are provided in the ESI† and can be interpreted as scoring zero in comparison to the retained ligands that all score higher. We next compared characteristics of these ligands to molecules in other curated data sets that satisfy M1, M2, or B4 ligand definitions. We focused on three representative databases: ChEMBL;86 the generated database of enumerated organic molecules with up to 9 heavy atoms, GDB-9;69 and a database of experimentally and computationally characterized diatomic molecules, DiRef.87 Across these databases, 71 of the 2577 ligands were observed in at least one database, 17 occurred in more than one database, and most are diatomic molecules found only in DiRef87 (ESI† Table S5). The majority (55) of database ligands are M2 type, and 46 of these M2-type molecules are present twice in the enumerated ligand set, distinguished only by the metal coordinating atom and corresponding to a single chemical species. Thus, the 71 ligands identified in existing databases (DBs) are 48 chemically distinct species (ESI† Table S5).
To simplify comparison of scores across the three ligand types, we obtained scaled scores, sscaled, between the minimum and maximum observed stot values within each ligand type:
![]() | (8) |
The scaled threshold for retention is lowest for M1 ligands (>71%) and highest for the B4 cases (>86%, see Fig. 4). Over all ligands, 75% (53 of 71 overall; 36 of 48 unique) of compounds found in DBs were above the relevant ligand cutoff (Fig. 4). The majority (13 of 18, 7 of 12 unique) of below-cutoff ligands are M2 type and correspond to positively charged diatomics from DiRef87 (e.g., NO+ and PS+ ESI† Table S5). Although such ligands are relevant for heteroleptic transition metal complexes, our focus on homoleptic octahedral complexes with high valent metals motivated penalizing positively charged ligands (see Fig. 2). Our restriction to octahedral complexes is motivated by their relevance in catalysis and functional materials, but study of heteroleptic or lower coordination number complexes in future work will motivate alternate scoring. The remaining database ligands below threshold typically exhibit steric hindrance (e.g., M1: CH4 or B4: C4H10, see Fig. 4). Conversely, known good inorganic ligands such as M1 ammonia, M2 methylamine, or B4 ethylenediamine are all among the top scoring database ligands (Fig. 4 and ESI† Table S5).
Overall, the enumeration recovered small molecules previously observed in other databases that are likely ligands in inorganic chemistry. Beyond validation of scoring heuristics, the large number of enumerated and retained ligands not present in other databases (i.e., around 90% or 517 of 570) suggests that our data set contains unique chemical bonding configurations. Thus, it will be useful to identify through first-principles DFT simulation the extent to which complexes could be formed from these ligands.
All complexes were generated with four metals (M = Cr, Mn, Fe, and Co) in M(II) and M(III) oxidation states. The differences between high (H) and low (L) spin states, ΔEH–L, was computed as the electronic energy difference between the two geometry-optimized states (i.e., the adiabatic energy difference). This choice is motivated by the fact that spin-state change is slower than other processes (e.g., optical excitations) for which a vertical energy evaluation may be more appropriate. The high-spin/low-spin definitions for the metals studied in this work are: quintet-singlet for d6 Co(III)/Fe(II), sextet-doublet for d5 Fe(III)/Mn(II), quintet-singlet for d4 Mn(III)/Cr(II), and quartet-doublet for both d3 Cr(III) and d7 Co(II). Although thermodynamic and solvent corrections are known to be important in making direct comparison with experimental spin state ordering,97 the two corrections typically have compensating effects,96,97 and we therefore focused on relationships between ligand identity and DFT adiabatic, electronic ΔEH–L energies. All open-shell calculations (i.e., non-singlet spin states) were carried out with level shifting98 using spin-up and spin-down level shifts of 1.0 and 0.1 Ha, respectively. Geometry optimizations used the L-BFGS algorithm in translation rotation internal coordinates (TRIC)99 as implemented in TeraChem to the default tolerances of 4.5 × 10−4 Hartree/Bohr for the maximum gradient and 1 × 10−6 Hartree for the change in energy between steps.
To streamline and improve the quality of data ingested during high-throughput simulation3 of transition metal complexes, we recently introduced4 automated checks of geometry and properties of the wavefunction. The geometric checks, as outlined in ref. 4, focus on preservation of metal–ligand bond lengths in the coordination sphere, ligand detachment, and ligand distortion. In practice, all simulations run with mAD4 are run in 24 hour increments, with geometry checks being carried out at each resubmission as well as on the final optimized structure. From the 6336 initial geometry optimizations, 22% (1387) of all geometry optimizations completed successfully, a somewhat reduced success rate with respect to the range reported in our prior work on transition metal complexes.17,56,57 The majority of unsuccessful calculations corresponded to those that failed geometry checks initially (214 or 3%), during resubmission (3816 or 60%), or on the fully optimized structure (919 or 15%). Such ligand detachment or strong asymmetry in metal–ligand bond lengths can be attributed to Jahn–Teller distortion, which in extreme cases would lead to unstable transition metal complexes. Of all excluded calculations, 1537 exhibit strong bond asymmetry and 1037 exhibit ligand detachment, although many of these calculations had at least one other failure mode as well. Some cases showed bond rearrangement within the ligand, which could lead to an alternative feasible complex, but we judged success here as only cases where the original connectivity in the ligands was preserved.
Over the 1387 converged complex results, completion rates are roughly evenly distributed over the M(II) (716) and M(III) (671) oxidation states as well as metals (Cr: 348, Mn: 341, Fe: 361, and Co: 337, see ESI† Fig. S4). Some bias is observed for successful convergence of low spin states (792 singlets or doublets) vs. their high spin counterparts (595 quartets, quintets, or sextets), likely due to the weaker bonding in high-spin complexes. Separating convergence by ligand reveals that of the 396 ligands we initially selected, only 185 converged successfully in at least one metal, oxidation state, or spin state (Fig. 5). The full ranges of retained scores are observed for successful ligands of all types, but the average score among the 185 ligand set is slightly higher than in the original 396 ligand set: M1 9.3 vs. 8.9 average score, M2 15.8 vs. 15.4 average score, and B4 15.7 vs. 15.4 average score. Because the applied geometry check penalized strong metal–ligand distortions or ligand detachment, 97% (1347) of optimized structures have metal–ligand bond asymmetries (i.e., the difference between maximum and minimum metal–ligand bond lengths) below 0.4 Å and 80% (1137) have differences below 0.2 Å (structures provided in the ESI†).
Dividing further by charge of the individual ligand, we observe that none of the 45 positively charged ligands of the M2 type in our original set led to a productive geometry optimization, further justifying our penalties on positively charged ligands during initial scoring (sec. 2). The highest success rate (37% or 77 of 210) is observed for neutral (M1, M2, or B4) ligands, but approximately 20% of the negatively charged ligands (29 of 141) also produced at least one stable complex. Of the 1387 converged geometries, we also discarded 134 for 〈S2〉 values that deviated from the anticipated value by more than 1 μB, as in prior work, to ensure limited symmetry breaking and localization of the spin on the metal. Finally, because ΔEH–L evaluation requires two successful geometry optimizations (the high-spin and low-spin states) of a given octahedral complex, a total of 343 new ΔEH–L evaluations were obtained for 106 unique ligands (see ESI† for all energies and structures).
In addition to being the most abundant ligands among the set selected for DFT characterization, N-coordinating ligands had the highest overall success rate, followed by P-coordinating ligands (Fig. 5 and ESI† Fig. S4). Generally, the second row analogues (P or S) of first row (N or O) complexes had lower overall success rates, but the final ligand set remained relatively balanced over all coordinating-atom types (Fig. 5). Because the majority of the selected ligand set is of the M2 type, this is also the greatest share of the successful ligands, but M2 ligands do not have the highest success rate (Fig. 5). For both M2 and B4 ligands, around 20% of ligands had at least one successful geometry optimization but no spin splitting energy pair, likely due to the greater flexibility of these ligands that increases the probability that at least one spin state fails to pass geometry checks. Within M1 ligand types, only 7 of the 19 retained ligands converged with spin-splitting energies, and most were complexes that are well known or that we had previously incorporated into ML data sets4,17,56,57 (e.g., NH3, PH3, H2S, and H2O). One exception was a Co(III) complex of the ammonia analogue, CH3− (Fig. 5). More diversity is observed in the successful B4 ligands, despite the relatively small size of the retained B4 ligand set, with the phosphorus analogue of the bipyridine core converging for multiple metal centers (Fig. 5). Finally, a large number of stable M2 ligand chemistries are observed for which spin-splitting energies were obtained, including a wide array of neutral (e.g., SHNH2) and negatively charged complexes (Fig. 5).
From the successfully converged complexes that make up our curated octahedral ligand database (OHLDB), we next quantified the extent to which these systematically enumerated complexes reflected chemistry divergent from the 1901 ΔEH–L values we had previously obtained for artificial neural network (ANN) model training.4,56 To compare diversity in the chemical structures, we featurized each new complex with the revised autocorrelation (RAC-155) representation56 (ESI† Text S1). The RAC-155 representation consists of products and differences of heuristic properties on the molecular graph and has shown good performance4,28,56,58,96 for predicting inorganic chemistry properties, including ΔEH–L. Although OHLDB complexes primarily lie within the convex hull of the first two principal components (PCs) in the RAC-155 representation, the overall Euclidean norm distance in feature space averaged over the ten nearest neighbors in existing data is quite large (>20) for a number of the complexes (Fig. 6). The complexes indeed fall outside the convex hull of the pre-existing data but do so especially at higher PCs (i.e., 7–8), where the first eight PCs generally contain the vast majority of the variance (89%, Fig. 6 and ESI† Fig. S5).
An alternative measure of data diversity is in property space, which we assessed first by determining if a previously trained RAC-155/ANN model112 could have predicted the ΔEH–L values exhibited by the OHLDB complexes (Fig. 6 and ESI† Table S6). Overall, although a large number of complexes were well predicted, significant (e.g., >60 kcal mol−1) over- and underestimations of ΔEH–L are indicative of limited prior knowledge by the ANN (mean absolute error, MAE = 14.3 kcal mol−1) of the chemistry of the OHLDB complexes (Fig. 6 and ESI† Table S6). Indeed, high error points are both chemically distinct and exhibit unexpected spin-state ordering, such as an Fe(II)(HNO)6 complex (ΔEH–L ANN: −17.1, DFT: 50.1 kcal mol−1), which contains an NO motif adjacent to the metal that had been absent from prior training complexes and is erroneously predicted by the ANN to be weak field in nature (Fig. 6). Similarly, no phosphorus-coordinating metal complexes and few sulfur-containing ligands had been in training data, leading to large errors for an Fe(II) complex with bidentate PH2SSPH2 ligands (ΔEH–L ANN: −27.8, DFT: 15.2 kcal mol−1, Fig. 6). Although phosphorus ligands are known to be low-spin directing, their absence from our training data means that accurate ANN predictions on such complexes cannot be expected. Finally, in some cases, the coordinating atom may be present in training data, but the chemistry is still unusual, as is the case for a strongly high-spin favoring Mn(III)(CH2CH3−)6 complex (Fig. 6). Although the ANN correctly predicts this complex to be high spin, it cannot predict the strong high-spin stabilization observed in the DFT calculation (ΔEH–L ANN: −11.8, DFT: −72.0 kcal mol−1) for this saturated, negatively charged carbon ligand that is distinct from other C-coordinating ligands (e.g., CO) in our prior training data sets.
Indeed, across a broad range of metals, oxidation states, and ligand coordinating atoms, the range of OHLDB ΔEH–L values exceeds that seen in our prior data sets (Fig. 7 and ESI† Fig. S6 and S7). Expected trends are observed, such as carbon- and phosphorus-coordinating complexes generally corresponding to low-spin-directing, strong field ligands, especially for Mn(II), Fe(II/III), or Co(II/III) complexes (Fig. 7 and ESI† Fig. S7). Although N-coordinating ligands generally form high-spin complexes, especially with Cr(II/III) or Mn(II/III) metals, notable exceptions are observed including low-spin Cr(II)(NSH)6 (ΔEH–L = 23.1 kcal mol−1) and Mn(II)(NNH2)6 (ΔEH–L = 20.6 kcal mol−1) complexes (Fig. 7). Given the dearth of low-spin Cr database complexes, the OHLDB therefore can be expected to enhance ML model predictions of ΔEH–L (Fig. 7).
![]() | ||
| Fig. 7 Boxplots of M(II) (M = Cr, Mn, Fe, or Co) ΔEH–L (in kcal mol−1) for ligands grouped by metal and by ligand-coordinating-atom (C in gray, N in blue, O in red, P in orange, and S in yellow, as shown in inset legend). Each box indicates the median by a horizontal line, the interquartile range (IQR), and whiskers indicate 1.5× the IQR. Here, DB (white boxplot) corresponds to range of ΔEH–L values from prior work in ref. 4 and 56 for the relevant metal, regardless of coordinating atom. | ||
Unsaturated carbon-coordinating ligands (e.g., CCH2 in Mn(II)(CCH2)6 ΔEH–L = 72.6 kcal mol−1) are known to be low-spin directing ligands but most were absent from our earlier data set, as were more unusual low-spin-directing ligands such as CHOH or CHNH2 (Mn(II) ΔEH–L = 47 to 62 kcal mol−1, Fig. 7). The sulfur-coordinated Fe(II) complexes in our new data set span from low-spin (e.g., monodentate SHOH: ΔEH–L = 11.0 kcal mol−1 or bidentate SC2H2S: ΔEH–L = 10.3 kcal mol−1) to high-spin (e.g., SC: ΔEH–L = −41.1 kcal mol−1), corresponding to a range that we had not observed in prior Fe(II) data (Fig. 7). Since sulfur is considered a soft element, low-spin sulfur complexes are somewhat surprising. Examination of the OHLDB confirms that the bidentate SC2H2S ligand also forms low-spin complexes with Fe(III) and Co(II/III) but forms high-spin complexes with Cr(II) and Mn(III) (see ESI†). Saturating the sulfur (i.e., SHC2H2SH) and the carbon backbone (i.e., SHC2H4SH) instead yields the expected, uniformly high-spin complexes regardless of metal center and oxidation state (see ESI†). Although oxygen-coordinating ligands are known to be weak-field, high-spin directing in nature, diverse ligand chemistry in the OHLDB yields, in addition to those previously observed, unexpectedly high-spin complexes e.g., Co(II)(OHP2H2OH)3 (ΔEH–L = −22.8 kcal mol−1, Fig. 7). This bidentate ligand and the isoelectronic OHS2OH ligand produce among the most high-spin-favoring Cr(II)/O-coordinating complexes (ΔEH–Lca. −49 to −52 kcal mol−1, see ESI†).
The OHLDB also enables examination of how isovalent and isoelectronic variations in ligands alter spin-state ordering (Fig. 8). Here, we focus on the widely-studied CO ligand and related isovalent and isoelectronic species, including those in the spectrochemical series76 (e.g., CN−) and other common molecules (e.g., HCN and N2, Fig. 8). Given ligand definitions, ΔEH–L can in principle be obtained for either orientation of asymmetric M2 ligands, but only 14 of these ligands in practice yielded at least one ΔEH–L value for any metal or oxidation state (Fig. 8). High-spin Fe(II) or Fe(III) complexes are formed from either homonuclear ligands (e.g., N2 and P2) or cases where the weaker-field element coordinates the metal (e.g., SC, OC), despite being isovalent or isoelectronic with low-spin directing CO (Fig. 8). These effects are not additive by element, where the NP ligand is low-spin directing, in spite of the high spin preferences of N2 and P2 (Fig. 8). Although CO is often invoked as one of the strongest field ligands (ΔEH–L = 20–30 kcal mol−1 for Fe(II) or Fe(III) complexes), five ligands form even more low-spin-favoring complexes, including those where O is replaced by anionic (e.g., CN−, CP−) or less electron withdrawing species (e.g., CCH−, CNH, or CS, Fig. 8). The trends observed for iron complexes generally hold for other metals, with Co(III) complexes exhibiting more low-spin bias for the same ligands, Mn(III) complexes remaining uniformly high spin, and all other metals and oxidation states generally residing within these two bounds (Fig. 8). Thus, a very wide range of ΔEH–L values (ca. −60 to +80 kcal mol−1) can be obtained simply by adjusting the charge and elemental identities in M2-type ligands isoelectronic or isovalent to a common ligand.
Finally, we considered the extent to which OHLDB data could be used to improve ML model predictions on large, diverse complexes112 by improving the chemical coverage of the ML model training data. We recently curated112 a 116 complex out-of-sample test set from the Cambridge Structural Database (CSD)113 for testing ΔEH–L predictions with a RAC-155/ANN model. Because the CSD complexes were chosen to be distinct from the 1901 complexes used in the training of the ANN, the CSD set ΔEH–L MAE of 8.6 kcal mol−1 was much poorer than set-aside test set errors (ca. 1–3 kcal mol−1 (ref. 17 and 56)) or uncertainty-controlled, out-of-sample prediction errors (ca. 4.5 kcal mol (ref. 27)). Notably, very high ΔEH–L prediction errors, either due to over or underestimation, were observed on the order of 20–50 kcal mol−1 (Fig. 9). Incorporating OHLDB data and retraining the RAC-155/ANN eliminated many of these highest error points and reduced CSD set MAE to 6.7 kcal mol−1 (Fig. 9, ESI† Text S1, Table S6, and Fig. S8 and S9). Despite the fact that most of the CSD complexes are much larger in size, significant improvements are observed for complexes that had metal-adjacent coordination environments present in the OHLDB but absent in our prior data, such as coordination by NO species (CSD ID: CEYSAA, Fig. 9). In most cases model performance improved, but for select complexes model performance remained the same or worsened slightly in a manner that is not dependent on the metal center (CSD ID: COBWEX, Fig. 9 and ESI† Fig. S9 and S10). Given that most of the CSD curated set112 is multidentate in nature, whereas the OHLDB is weighted toward monodentate ligands, further improvement could likely be achieved through continued systematic enumeration of a greater number of ligands of higher denticity.
![]() | ||
| Fig. 9 Swarm plot of RAC-155/ANN signed errors (in kcal mol−1) on an out-of-sample 116 CSD structure data set112 (original, left) and after retraining with OHLDB data (retrained, right). The single most improved (CSD ID: CEYSAA) and worsened (CSD ID: COBWEX) points are shown in green and red insets, respectively, and have data points colored in the same manner. | ||
000 theoretical monodentate or bidentate ligands comprised of C, N, O, P, or S heavy atoms, we identified a 2500-ligand subset for scoring. Based on analysis of ligand feasibility by score, we identified cutoffs and retained a high-scoring, 570-ligand subset as most promising for subsequent calculations. Only a small number (71 of over 2500) of our ligands were in prior databases, and most (75%) of those ligands remained within our high-scoring cutoff.
We next characterized with DFT all of the feasible mononuclear, homoleptic octahedral transition metal complexes formed from combinations of the 396 ligands in complex with a choice of eight metal/oxidation state combinations in each of two spin states. Over the calculations that comprise the OHLDB, we obtained and analyzed nearly 350 spin-splitting energies. We observed unexpected combinations of metal/coordinating atom and spin-state ordering, including those that extended the ranges sampled in our prior databases of octahedral transition metal complexes. We showed how these complexes reflected chemical compositions previously absent from our machine learning (i.e., artificial neural network) models for predicting spin splitting. After enriching machine learning models with OHLDB data, we showed improved machine learning model prediction performance on an out-of-sample test set consisting of transition metal complexes much larger in size. We anticipate that the OHLDB will be a good testbed both for the application of high-scaling, correlated wavefunction theory methods and for representation development for machine learning in inorganic chemistry both in spin-splitting energy predictions and beyond.
Footnote |
| † Electronic supplementary information (ESI) available: SMILES strings and components of scores for M1, M2, and B4 ligands; details of M1, M2, and B4 ligand scoring along with properties of retained ligands; comparison of ligand scores to properties of ChEMBL, DiRef, and GDB-9 molecules; principal component analysis of OHLDB data and comparison to prior calculations; success rate of DFT calculations; spin-splitting properties of new complexes, including M(III) boxplot results; analysis of improvement in out-of-sample CSD ANN performance with retrained ANN (PDF). Geometry optimized structures and properties of OHLDB complexes; geometry optimized structures and properties of 1901 training complexes, geometry optimized structures and properties of 116 CSD test case complexes; ANN predictions on OHLDB structures; ANN predictions on CSD structures before and after retraining; ANN models, ANN models, weights and scaling information before and after retraining (ZIP). See DOI: 10.1039/c9me00069k |
| This journal is © The Royal Society of Chemistry 2020 |