Open Access Article
Ilia Kevlishvili a, Roland G. St. Michel ab, Aaron G. Garrison a, Jacob W. Toney a, Husain Adamji a, Haojun Jia ac, Yuriy Román-Leshkov ac and Heather J. Kulik *ac
aDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. E-mail: hjkulik@mit.edu
bDepartment of Materials Science and Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
cDepartment of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
First published on 26th June 2024
The breadth of transition metal chemical space covered by databases such as the Cambridge Structural Database and the derived computational database tmQM is not conducive to application-specific modeling and the development of structure–property relationships. Here, we employ both supervised and unsupervised natural language processing (NLP) techniques to link experimentally synthesized compounds in the tmQM database to their respective applications. Leveraging NLP models, we curate four distinct datasets: tmCAT for catalysis, tmPHOTO for photophysical activity, tmBIO for biological relevance, and tmSCO for magnetism. Analyzing the chemical substructures within each dataset reveals common chemical motifs in each of the designated applications. We then use these common chemical structures to augment our initial datasets for each application, yielding a total of 21 631 compounds in tmCAT, 4599 in tmPHOTO, 2782 in tmBIO, and 983 in tmSCO. These datasets are expected to accelerate the more targeted computational screening and development of refined structure–property relationships with machine learning.
Prior efforts have been made to curate datasets of TMCs,13–16 their constituent ligands,17–23 and relevant reactions.4,24–27 Many of these datasets are based on entries from the Cambridge Structural Database (CSD),28 a digital repository of experimental crystal structures, including molecular crystals of thousands of TMCs. However, challenges exist with current datasets, which primarily fall into one of two classes. Datasets in the first class are exceptionally large but contain properties of limited relevance to applications-oriented molecular discovery.10,13,24,29–36 Datasets in the second class are highly focused, containing relevant information very specific to local regions of chemical space, and as such are not easily generalized to new chemical applications.14,19–23,27,37,38 The transition metal quantum mechanics (tmQM) dataset is an example of a large, nonspecific dataset of interest to transition metal chemistry, containing 86 665 mononuclear TMCs.16 These structures were extracted from the CSD and subjected to additional filtering by retaining only structures with at least one C and H atom, and only those which contain allowed non-metal elements (i.e., B, Si, N, P, As, O, S, Se, F, Cl, Br, and I). Furthermore, oxidation states were assumed for metals to ensure closed-shell character where possible, and only structures with a net charge between −1 and +1 were retained. Their geometries were optimized at the semiempirical extended tight binding (xTB) level of theory, and DFT single-point energy calculations were performed on the resulting structures. While the tmQM is a valuable dataset in computational chemistry workflows for investigating TMCs,17,39,40 a key limitation is the absence of a mapping between molecular structures and the relevant areas of chemistry. This hinders further investigation into structures that are particularly promising for applications in catalysis, photochemistry, or other fields of interest. In contrast, datasets such as the ligand knowledge base (LKB) curated in pioneering work by Fey et al.19–22 and kraken later developed by Gensch et al.23 are examples of detailed, applications-focused datasets with limited transferability. Both datasets primarily consist of organophosphorus ligands, include relevant physicochemical descriptors useful in building quantitative structure–property relationships, and are based on commercial and virtual libraries.19–23,41 However, these do not generalize well to other areas of chemistry beyond organophosphorus ligands, exemplifying a limitation of such datasets and a need for large datasets linked to targeted chemical applications.
The curation of a chemically targeted and synthetically accessible TMC dataset relies on systematically reviewing literature on the TMCs that are contained within existing databases. The broad scope of TMC literature, however, would make manual processing arduous, prompting the use of natural language processing (NLP) techniques42,43 for efficient analysis. NLP has been utilized extensively in the extraction of material properties and material synthesis parameters from the literature.37,44–51 More recently, large language models (LLMs) coupled with prompt engineering have gained increasing popularity in automating scientific text mining for chemical information due to their more user-friendly nature.52–56 A crucial aspect of text mining for classifying text based on chemical domain involves topic modeling,57 which is the identification of underlying themes in large sets of scientific text. For tasks of this nature, prompt engineering typically requires a priori definition of the possible latent topics. Nevertheless, LLMs can still be leveraged to obtain contextualized embeddings of the text that capture semantic information58,59 and subsequently cluster text based on semantic similarity.60,61 Here, each cluster corresponds to a latent topic, as facilitated by algorithms like bidirectional encoder representations from transformers for topic modeling (BERTopic).62 Simpler topic modeling approaches, such as latent Dirichlet allocation (LDA),63 which utilizes bag-of-words and statistical patterns of co-occurring words to infer latent topics, can also be employed to cluster manuscripts in a corpus. While these unsupervised NLP methods have been leveraged for summarizing research trends in chemistry,64 with an emphasis on biochemical and medicinal research,65–68 as well as for the classification of large biomolecular datasets,69 they have yet to be extended to the space of transition metal chemistry or to the development of application-specific TMC datasets.
To construct chemically targeted TMC datasets, we conduct text mining on manuscripts associated with synthesizable TMCs from the tmQM database, focusing only on their titles and abstracts, and leverage both simple NLP tools and transformer models to process the text. Using topic modeling, we segment the structures in the tmQM database based on distinct chemistry applications. Through this process, we introduce four new TMC datasets: tmCAT containing catalytically-relevant TMCs, tmPHOTO with photoactive TMCs, tmSCO comprising TMCs with magnetic properties, and tmBIO containing biologically-relevant TMCs. Additionally, we perform substructure analysis to compare trends in metal-local structures among tmQM TMCs and the four curated datasets, and we subsequently enrich each chemistry-specific dataset by adding tmQM TMCs that could potentially be suitable for the given application.
Abstracts were preprocessed using the NLTK v.3.8.1 package.74 The text was cleaned by lowercasing and removing punctuation, URLs, and numbers. The cleaned text was then tokenized using a regular-expression tokenizer, RegexpTokenizer, implemented in the NLTK package. Standard English stop words were then removed from the tokenized text. For unsupervised clustering, an additional set of stop words was introduced to avoid clustering based on chemical language (see Section 3.3). The filtered text was stemmed with the Snowball Stemmer and lemmatized with the WordNet Lemmatizer, both implemented in the NLTK package.
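The cleaning and filtering steps described above can be sketched with the Python standard library alone. This is a simplified stand-in, not the NLTK pipeline itself: the stop-word list is a short placeholder, the `preprocess` helper and its `extra_stop_words` argument are hypothetical names, and stemming and lemmatization are omitted.

```python
import re

# Placeholder stop words; the workflow described above relies on NLTK's standard
# English list plus a custom chemistry-specific list (not reproduced here).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "was", "for", "with", "by"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TOKEN_RE = re.compile(r"[a-z]+")  # runs of letters only, so digits and punctuation drop out


def preprocess(text, extra_stop_words=()):
    """Lowercase, strip URLs, tokenize on letter runs, and remove stop words."""
    text = URL_RE.sub(" ", text.lower())
    stop = STOP_WORDS | set(extra_stop_words)
    return [tok for tok in TOKEN_RE.findall(text) if tok not in stop]


tokens = preprocess(
    "The Pd(II) complex catalyzed 3 Suzuki couplings; see https://example.org/si.",
    extra_stop_words={"pd"},  # illustrative metal-specific stop word
)
print(tokens)
```

Passing metal symbols through `extra_stop_words` mimics, in a very reduced form, the additional stop words used to keep clusters from forming around specific metals or materials.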
While BERT-based models can, in principle, be applied without text preprocessing, in practice this leads to clustering by transition metal or material (ESI Table S2†). To prevent this, we introduced stop words before text processing to remove the dependence on specific materials or metals; the full list of stop words is provided on Zenodo.79
An 80:20 train/test split was used to evaluate the model performance on a set-aside test set. Hyperparameter optimization did not significantly affect the model performance, and the full grid search cross-validation summary can be accessed through the Zenodo repository.79 Default random forest hyperparameters were therefore used for training without hyperparameter optimization, with the exception of the minimum number of samples required to split a node, which was set to 10, and the number of trees, which was set to 1000 (ESI Table S3†). The minimum sample split was increased from the default to avoid overfitting, and the number of trees in the ensemble was increased to improve model performance. Dimensionality reduction of high-dimensional feature vectors for unsupervised clustering (5-dimensional vector) and visualization (2-dimensional vector) was carried out using uniform manifold approximation and projection for dimension reduction (UMAP) with the UMAP package v0.5.5.77 We use the 5-dimensional vector to identify dense clusters, and we interpret topic assignments through cluster-level TF-IDF vectors. Clustering on the reduced 5-dimensional vector was carried out using HDBSCAN,80 a hierarchical density-based clustering algorithm, using the HDBSCAN v0.8.33 package. The topic assignment of dense clusters was achieved using the BERTopic package v0.16.0 with a modified class-based TF-IDF vector (c-TF-IDF).62 Clustering using LDA was carried out with the implementation in the scikit-learn package.
All machine learning models, scripts, Jupyter notebooks, datasets, and their associated structures are provided on Zenodo.79 Geometry assignment was carried out using functionality implemented in the molSimplify geometry_changes branch, and the script is included in the Zenodo repository.79
We used the 86 665 unique CSD refcodes in tmQM as a starting point, and we obtained the manuscripts associated with each CSD refcode using the ArticleDownloader package.72 Through this procedure, we curated a corpus consisting of 28 394 unique manuscripts, accounting for 50 968 crystals in the dataset. To utilize natural language processing models for identifying catalysis manuscripts, we focused on manuscript abstracts because they are information-dense texts that tend to avoid the discussion of broader topics and are therefore better suited than, for example, introductions for identifying whether a manuscript will discuss catalysis. Furthermore, abstracts are publicly available, typically on the article’s DOI-accessible HTML webpage, which enables retrieval of abstracts that could not be obtained using the ArticleDownloader package.72 If abstracts could not be extracted automatically from the manuscript (here, this occurred for 4682 of 28 394 manuscripts), we used titles instead.
To train a sentiment analysis model that could identify whether an abstract is associated with catalysis, we first identified the subset of manuscripts related to catalysis based solely on whether the manuscript titles contained the keyword “catal” but did not contain false-positive keywords (e.g., “uncatal”, “acid catal” or “base catal”, ESI Table S3†). These steps produced a 4585-manuscript subset where titles were confidently labeled as addressing catalysis (ESI Table S3†). We then confidently identified non-catalytic manuscripts by excluding manuscripts with catalytic and associated keywords (i.e., “catal”, “turnover”, and “polymer”) that occurred at least once in either the title or abstract (ESI Table S4†). This non-catalytic set consists of 20 557 manuscripts, the majority of manuscripts not identified as catalytic in our initial step, leaving only 3252 unlabeled manuscripts. To validate the label assignment, we randomly sampled 50 catalysis and 50 non-catalysis manuscripts and checked them manually. We find that 48 of the positive and 49 of the negative labels were correctly assigned using our approach.
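The two-stage keyword labeling described above can be illustrated with a short function. This is a sketch under stated assumptions: only the example keywords quoted in the text are included (the full lists are in the ESI), plain substring matching on lowercased text is assumed, and `rule_label` is a hypothetical name.

```python
POSITIVE_KEY = "catal"
# Only the false-positive examples quoted in the text; the full list is in the ESI.
FALSE_POSITIVE_KEYS = ("uncatal", "acid catal", "base catal")
# Any of these in the title or abstract blocks a confident non-catalysis label.
EXCLUSION_KEYS = ("catal", "turnover", "polymer")


def rule_label(title, abstract=""):
    """Return 'catalysis', 'non-catalysis', or None when the manuscript stays unlabeled."""
    t = title.lower()
    if POSITIVE_KEY in t and not any(k in t for k in FALSE_POSITIVE_KEYS):
        return "catalysis"
    full = t + " " + abstract.lower()
    if not any(k in full for k in EXCLUSION_KEYS):
        return "non-catalysis"
    return None


print(rule_label("Palladium-catalyzed cross-coupling of aryl halides"))
print(rule_label("Spin crossover in an iron(II) complex", "Magnetic data are reported."))
print(rule_label("Synthesis of nickel complexes", "High turnover numbers were observed."))
```

The third case shows why some manuscripts remain unlabeled: the title gives no positive hit, but the abstract contains an associated keyword, so neither label can be assigned with confidence. The counts quoted above are consistent with this three-way partition (28 394 − 4585 − 20 557 = 3252).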
Despite our efforts to label manuscripts confidently, simple pattern matching leads to a small number of incorrect label assignments. Our approach to avoid false positive labels by only looking at patterns in the title is expected to miss true positive hits, motivating a more robust approach for identifying catalyst-focused manuscripts. We next pursued a more systematic approach by developing a classifier model using natural language processing. Prior to training the classifier, we first created a balanced dataset by subsampling the non-catalytic manuscripts to 4585 non-catalysis manuscripts, matching the 4585 catalysis manuscripts. Among the set of 9170 manuscripts, 1312 manuscripts (364 catalysis, 948 non-catalysis) did not have a defined abstract, and the title was instead used for training the classifier model. We then separated this set of 9170 manuscripts into training and test sets using a stratified random split of 80:20. Using the abstract text of each of these papers, we preprocessed the text by elimination of uppercase letters, removal of punctuation, lemmatization, stemming, and removal of stop words to make the text suitable for natural language processing tasks (see Computational details). We first featurized the papers using a term frequency-inverse document frequency (TF-IDF) vectorizer82 fit on the training set (see Computational details). Subsequently, a random forest classifier model was trained on the reduced TF-IDF feature vectors to predict whether an abstract is related to catalysis, achieving a high accuracy of 0.97 and strong separation between the two classes, with an ROC-AUC of 0.98 (Fig. 1).
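The balancing and splitting steps can be sketched as follows. This is an illustration using only the standard library rather than the tooling used in the study; `balanced_stratified_split`, the integer ids, and the seed are all placeholders.

```python
import random


def balanced_stratified_split(pos_ids, neg_ids, test_frac=0.2, seed=0):
    """Subsample negatives to match the positives, then split each class by test_frac."""
    rng = random.Random(seed)
    neg_sample = rng.sample(list(neg_ids), len(pos_ids))
    train, test = [], []
    for label, ids in ((1, list(pos_ids)), (0, neg_sample)):
        rng.shuffle(ids)
        n_test = round(len(ids) * test_frac)
        test += [(i, label) for i in ids[:n_test]]
        train += [(i, label) for i in ids[n_test:]]
    return train, test


# Placeholder ids sized like the labeled sets described above.
pos = range(4585)
neg = range(100_000, 100_000 + 20_557)
train, test = balanced_stratified_split(pos, neg)
print(len(train), len(test))
```

Splitting each class separately keeps the 1:1 class ratio in both partitions, which is what a stratified split guarantees.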
Despite the strong overall performance, we next analyzed the specific cases where the model failed, to ensure that it was a suitable tool for confidently assigning a catalysis focus to manuscripts. Out of 1834 unique abstracts in the test set, 35 were incorrectly labeled. Only eight of these abstracts were given a false positive label, and these were manually inspected to prevent dataset contamination. Among these eight abstracts, five had labels that were inaccurately assigned by the rule-based method (i.e., missing the “catal” and additional keywords in the abstract/title) and thus should have been labeled as catalysis manuscripts. Additionally, one entry lacked an abstract, and its title had been used for training instead, likely contributing to the failure of the model to correctly predict the non-catalytic label for the manuscript. Furthermore, 27 abstracts were given false negative labels. Among these, 14 had labels that were inaccurately assigned by the rule-based method. This analysis reveals that only a negligible fraction of manuscripts were wrongly labeled as catalysis-focused by the classifier model, and that the model can detect true positives that the rule-based approach missed. However, a small but non-zero number of catalysis manuscripts might be missed by the model. A detailed analysis of all false labels is provided in the Zenodo repository.79 Unsurprisingly, the feature importance analysis of TF-IDF vectors showed that the most crucial features are keywords directly related to catalysis. However, several other significant word features were also identified, including activity, polymerization, hydrogenation, coupling, selectivity, Suzuki, and enantioselective, among others (Fig. 2).
To test the effectiveness of these additional tokens related to catalysis, we developed a separate random forest model that was trained on a TF-IDF feature vector that excluded the direct catalysis keywords (ESI Table S5†). This second random forest model still achieved 89% accuracy, demonstrating that other relevant keywords effectively identify catalysis-related manuscripts (ESI Table S6 and Fig. S1†).
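To make the featurization concrete, the sketch below computes TF-IDF vectors while dropping a set of excluded tokens, as in the keyword-ablation test above. It is a standard-library illustration, not the vectorizer used in the study: the smoothed idf and L2 normalization follow scikit-learn's default form, and the excluded token and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs, excluded=()):
    """docs: list of token lists. Returns (vocabulary, L2-normalized TF-IDF vectors)."""
    excluded = set(excluded)
    docs = [[t for t in d if t not in excluded] for d in docs]
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per token
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


docs = [
    ["catalyst", "suzuki", "coupling"],
    ["magnetic", "hysteresis"],
    ["catalyst", "hydrogenation"],
]
vocab, X = tfidf(docs, excluded={"catalyst"})  # hypothetical direct keyword
print(vocab)
```

With the direct keyword removed, the remaining tokens (here "suzuki", "coupling", "hydrogenation") are the only signal a downstream classifier can use, which is the situation probed by the second random forest model.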
Given the promising performance of the classifier, we utilized the random forest model trained on the full TF-IDF feature vector to identify additional catalysis papers from the superset of all unlabeled manuscripts. This unlabeled set comprised any manuscript from the clean corpus not included in the original training/test set, which includes 3252 manuscripts that were not labeled as either catalysis or not-catalysis, the excluded non-catalytic manuscripts absent from the subsampled set, and manuscript titles/abstracts mined from HTML source if they could not be obtained through the ArticleDownloader package. In total, this set comprised 20 449 manuscripts associated with 30 345 unique CSD refcodes. By applying the random forest classifier to this unlabeled set, we identified 6208 additional manuscripts in the corpus as associated with catalysis (ESI Fig. S2 and S3†). With this added set, we identify the final tmCAT dataset, which consists of 10 793 unique manuscripts and the structures of 19 250 associated TMCs. While NLP will identify abstracts that are associated with catalysis, it is not necessarily the case that all crystal structures associated with these papers are relevant for catalysis (i.e., they could correspond to non-catalytic cations or catalyst precursors), and further analysis of the chemical composition of the dataset is merited (see Section 3.4).
Next, we analyzed the differences and similarities between the tmCAT and the rest of the tmQM dataset (i.e., the non-catalytic portion) in terms of descriptors that had been computed during the curation of the tmQM dataset.81 We first focus on electronic descriptors, such as molecular orbital energetics90 and metal charges that are commonly employed in the screening of transition metal catalysts.90,91 The relevant descriptors available in the original tmQM set81 include the HOMO and LUMO energies, the HOMO–LUMO gap, and the transition metal center partial charge. These properties were computed using the TPSSh meta-GGA hybrid functional with empirical D3(BJ) dispersion correction and a def2-SVP basis set based on GFN2-xTB optimized geometries. Interestingly, the distribution of these descriptors shows no major differences between the catalytic (tmCAT) subset and the non-catalytic subset of the tmQM dataset (ESI Fig. S4–S7†). These observations suggest that when considering a diverse set of complexes, with wide-ranging transition metals, oxidation states, coordination environments, and ligands, descriptors based on frontier orbitals or metal partial charges may be insufficient for inferring reactivity. These observations are consistent with past work showing that frontier orbital energies alone struggle to generalize to catalyst activities across multiple metals and oxidation states.92
We next considered geometric descriptors that evaluate the steric environment defined by ancillary ligands93 as another commonly employed class of descriptors for catalyst screening. We would expect an active catalyst (i.e., not a precatalyst) to feature an open site that can lead to the association of a reactant to the active site. However, active catalysts with open metal sites are usually not energetically stable intermediates and tend to have a sacrificial ligand or a solvent coordinated to the open site. To compare tmCAT structures to the non-catalytic subset of tmQM, we computed the percent buried volume of the metal for all complexes in the tmQM dataset. Analysis of the distribution of this parameter shows no significant difference in buried volume between the tmCAT and the rest of the tmQM dataset (ESI Fig. S8†). We anticipate this lack of distinction is attributable to the fact that deposited crystal structures are likely precatalysts that need to undergo an activation process to form an active catalyst, meaning that geometric descriptors on CSD structures are unlikely to be useful for identifying catalysis-capable complexes.
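For readers unfamiliar with the percent buried volume, the sketch below estimates it on a regular grid: it is the percentage of a sphere centered on the metal that is occupied by ligand atoms. The parameters are assumptions, since the text does not state them: a 3.5 Å sphere radius (a common convention), a 0.25 Å grid spacing, and atomic radii supplied by the caller. The function name is hypothetical, and this is not the implementation used for the analysis above.

```python
import math


def percent_buried_volume(metal, atoms, sphere_radius=3.5, step=0.25):
    """Grid estimate of the percent buried volume around a metal center.

    metal: (x, y, z) of the metal in angstroms.
    atoms: iterable of ((x, y, z), radius) for ligand atoms (metal excluded).
    """
    atoms = list(atoms)
    inside = buried = 0
    n = math.ceil(sphere_radius / step)
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            for k in range(-n, n + 1):
                p = (metal[0] + i * step, metal[1] + j * step, metal[2] + k * step)
                if math.dist(p, metal) > sphere_radius:
                    continue  # grid point lies outside the probe sphere
                inside += 1
                if any(math.dist(p, center) <= r for center, r in atoms):
                    buried += 1
    return 100.0 * buried / inside


origin = (0.0, 0.0, 0.0)
print(percent_buried_volume(origin, []))                  # bare metal
print(percent_buried_volume(origin, [(origin, 10.0)]))    # fully enclosed
```

A finer grid gives a smoother estimate at higher cost; production tools typically also scale the atomic radii and may exclude hydrogen atoms, choices that are left to the caller here.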
Beyond steric metrics, one might anticipate other differences in the metal-local coordination that might distinguish the tmCAT and non-catalysis tmQM subsets. We hypothesized that even though pre-catalyst complexes should be heavily featured in tmCAT, some noticeable differences could still be observed between catalysis and non-catalysis datasets when comparing coordination geometries because some geometries are less probable for precatalysts (e.g., those with six monodentate ligands or three bidentate ligands). We assigned metal coordination geometries by examining the geometric deviations from possible ideal transition metal geometries and assigning a geometric class with the lowest deviation (see Computational details). When a haptic ligand is encountered (e.g., an alkene bound via its π bond to a metal center), a single occupancy was assigned at the geometric centroid of the haptic ligand. The most noticeable difference between tmCAT and the non-catalysis tmQM subset is due to the significant reduction in the number of octahedral complexes in tmCAT accompanied by a significant enhancement in the frequency of square planar complexes (Fig. 4). Despite a lack of difference between tmCAT and the non-catalytic tmQM in terms of steric descriptors, this enhancement of square planar over octahedral structures is consistent with our expectation of enhancing coordinatively unsaturated complexes in the tmCAT dataset as well as the fact that more octahedral structures are likely to be less compatible with catalysis due to higher-denticity ligands. Furthermore, the relative frequencies of other coordinatively unsaturated geometries, such as square pyramidal and trigonal planar complexes are also enhanced in the tmCAT dataset.
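The idea of assigning the geometric class with the lowest deviation from an ideal geometry can be illustrated for four-coordinate centers. The sketch below is a deliberately simplified stand-in, not the molSimplify implementation: it compares sorted ligand–metal–ligand angles against two ideal angle sets, and the names `IDEAL_ANGLES` and `classify` are hypothetical. A haptic ligand would enter as a single donor point placed at its centroid.

```python
import math
from itertools import combinations

# Ideal sorted ligand-metal-ligand angles (degrees) for two four-coordinate classes.
IDEAL_ANGLES = {
    "square planar": [90.0] * 4 + [180.0] * 2,
    "tetrahedral": [109.47] * 6,
}


def angle(metal, a, b):
    """Angle a-metal-b in degrees."""
    va = [x - m for x, m in zip(a, metal)]
    vb = [x - m for x, m in zip(b, metal)]
    cos = sum(x * y for x, y in zip(va, vb)) / (math.hypot(*va) * math.hypot(*vb))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def classify(metal, donors):
    """Return the ideal geometry whose sorted angles have the lowest RMSD to the observed ones."""
    observed = sorted(angle(metal, a, b) for a, b in combinations(donors, 2))

    def rmsd(ideal):
        return math.sqrt(sum((o - i) ** 2 for o, i in zip(observed, ideal)) / len(ideal))

    candidates = {g: ang for g, ang in IDEAL_ANGLES.items() if len(ang) == len(observed)}
    return min(candidates, key=lambda g: rmsd(candidates[g]))


metal = (0.0, 0.0, 0.0)
planar = [(1, 0, 0.1), (-1, 0, 0.1), (0, 1, -0.1), (0, -1, -0.1)]  # slightly puckered
tetra = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
print(classify(metal, planar), classify(metal, tetra))
```

Extending the dictionary with further ideal angle sets (e.g., octahedral or square pyramidal) would cover other coordination numbers; only the two four-coordinate classes are included here to keep the example short.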
We utilized the BERTopic model,62 a method that clusters bodies of text based on their semantic embedding and then assigns topics using a modified, class-based TF-IDF (c-TF-IDF) vector or other interpreter (see Computational details). Importantly, we identified several subtopics that could be associated with non-catalytic applications, including biological activity, photoactivity, magnetism, self-assembly, and X-ray crystal structure characterization (Fig. 5 and ESI Fig. S9†). Additionally, many of the topics uncovered by this analysis are complementary to our earlier labeling of catalyst papers, as other uncovered subtopics include several catalysis-related areas with more specific applications including polymerization, hydrogenation, chiral catalysis, and cross-coupling (Fig. 6 and ESI Fig. S10†). Even though the number of subtopic clusters and the cluster composition differ across models (i.e., because the topic models depend on the random seed), all major subtopic clusters we identified are conserved. Using this additional information, we now introduce three additional datasets consisting of complexes associated with manuscripts consistently categorized across all five models as follows: (i) tmPHOTO, which consists of photoactivity-associated complexes, (ii) tmSCO, which consists of compounds exhibiting properties relevant to studies of magnetism, and (iii) tmBIO, which consists of complexes with biologically relevant activities. We identified each of these datasets by inspecting the most significant tokens associated with each of the clusters. For example, the tmPHOTO set is derived from the cluster that consists of manuscript abstracts that discuss phosphorescence, emission, and quantum yield, indicating that photophysical activity is discussed throughout these manuscripts.
Similarly, we identified that tmSCO manuscript abstracts are associated with tokens that are related to spin-crossover, magnetic properties and hysteresis, all keywords that are related to changes in the spin state of a TMC. Likewise, we determined that manuscript abstracts associated with the tmBIO cluster discuss properties such as cytotoxicity, cancer, cell, and apoptosis, which are all related to biological activity relevant to pharmaceutical applications.
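The way top tokens are read off a cluster can be illustrated with a small class-based TF-IDF calculation. The sketch below follows the published c-TF-IDF form, in which the weight of term t in cluster c is tf(t, c) · log(1 + A / f(t)), with A the average number of words per cluster and f(t) the frequency of t across all clusters. Normalizing the term frequency by cluster size is an assumption of this sketch, and the toy clusters and function names are hypothetical.

```python
import math
from collections import Counter


def c_tf_idf(cluster_tokens):
    """cluster_tokens: dict mapping cluster id -> tokens pooled over its documents."""
    tf = {c: Counter(toks) for c, toks in cluster_tokens.items()}
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    avg_words = sum(total.values()) / len(cluster_tokens)  # A in the formula
    scores = {}
    for c, counts in tf.items():
        size = sum(counts.values())
        scores[c] = {
            t: (n / size) * math.log(1 + avg_words / total[t]) for t, n in counts.items()
        }
    return scores


def top_terms(scores, cluster, k=3):
    """Highest-scoring tokens for one cluster (ties broken alphabetically)."""
    ranked = sorted(scores[cluster].items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:k]]


clusters = {
    "photo": ["emission", "phosphorescence", "quantum", "yield", "complex", "emission"],
    "bio": ["cytotoxicity", "cancer", "cell", "apoptosis", "complex", "cell"],
}
scores = c_tf_idf(clusters)
print(top_terms(scores, "photo", 1), top_terms(scores, "bio", 1))
```

Tokens shared across clusters, such as "complex" in this toy example, receive lower weights than tokens concentrated in a single cluster, which is why the top-ranked tokens are useful as topic labels.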
Because semantic embeddings encode the meaning of text, we expect subtopics with greater similarity to lie closer to each other in the reduced-dimensional space. To analyze the performance of the unsupervised learning approach and visualize how different topics relate to each other, we employed UMAP for further dimensionality reduction on the SBERT embedding, reducing the embeddings to two dimensions better suited for visualization while retaining the global structure of the data for distance comparison. We find that catalysis sub-topics are predominantly clustered closely to each other, suggesting that the wording used throughout catalysis-associated abstracts is highly similar. Furthermore, catalysis topics that are more closely related to each other, such as polymerization and metathesis, or hydrogenation and hydroboration/boration, each of which is related to olefin functionalization, are more closely clustered (Fig. 7). As expected, the other, non-catalysis topics are more distant in the UMAP-reduced space (Fig. 7). The biologically active cluster, arguably the most different from other applications due to its relevance in biological and pharmaceutical applications, stands out as the most distinct cluster. Manuscripts related to photoactivity and magnetism are clustered comparatively close to each other, which can be expected because both topics are associated with the transition to an excited state via external stimulus (Fig. 7). Alternative dimensionality reduction techniques, such as t-distributed stochastic neighbor embedding (t-SNE),95 lead to similar conclusions, although we avoided using t-SNE because it is less effective at preserving the global data structure (ESI Fig. S11†).
Fig. 7 UMAP dimensionality reduction of SBERT embedding vectors colored by different cluster topics for different catalysis applications (top) and general applications (bottom).
To support the findings from the BERTopic model, we employed an additional topic modeling approach, latent Dirichlet allocation (LDA).78 LDA is a Bayesian method that iteratively assigns two probabilities: one indicating that a given token belongs to a topic and the other indicating that a topic belongs to a document. This LDA analysis produces semantically similar clusters to BERTopic (ESI Fig. S12†). Using the LDA approach, several catalysis clusters are identified, including polymerization, chiral catalysis, cross-coupling catalysis, olefin functionalization catalysis, and mechanism-focused catalysis. Some non-catalysis topics are also conserved, such as photoactivity, biological activity, magnetism, and X-ray crystal structure characterization. Feature reduction of the token count vector, obtained using UMAP reduction with Hellinger distance, leads to a similar mapping, demonstrating the close relationship between the identified clusters and the shorter distances between catalysis-related clusters (ESI Fig. S13†).
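The Hellinger distance mentioned above compares two count vectors as probability distributions, which makes it a natural metric for token-count data. A minimal sketch of the standard definition is given below; it is not taken from the UMAP code base, and the function name is a placeholder.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two non-negative count vectors of equal length.

    Each vector is first normalized to sum to 1; the result lies in [0, 1].
    """
    sp, sq = sum(p), sum(q)
    return math.sqrt(
        0.5 * sum((math.sqrt(a / sp) - math.sqrt(b / sq)) ** 2 for a, b in zip(p, q))
    )


print(hellinger([1, 2, 3], [2, 4, 6]))  # same token proportions
print(hellinger([1, 0], [0, 1]))        # no shared tokens
```

Because the vectors are normalized before comparison, two abstracts with the same token proportions but different lengths are treated as identical, which is usually the desired behavior for bag-of-words counts.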
Even with the expanded subsets identified by unsupervised clustering using BERTopic or using the LDA analysis, a significant portion of the tmQM complexes are either unlabeled or assigned to difficult-to-interpret clusters. Furthermore, these datasets curated using simple natural language processing methods are not fully context-aware and do not account for more detailed information present in the manuscript, such as discussion of failed experimental attempts, meaning that they may contain “negative” examples. That is, these subsets could contain complexes that were either used as counterexamples, found to be ineffective for a given activity, or represent a precursor structure that was crystallized before the in situ assembly of chemically relevant species. Accordingly, we carried out analysis of the structures present in each dataset to both support the composition of these datasets and enhance them with chemically relevant species (see Section 3.4). Such species might not have been originally used for a given application and would therefore be missed by NLP approaches, but due to chemical similarity, they could have a complementary application.
We first analyzed common structural motifs in the tmCAT dataset. Breaking down the dataset into structural motifs at d = 2 shows that while most complexes are unique when analyzing metal-local character, several structural motifs appear relatively frequently in the tmCAT dataset (ESI Fig. S14†). Overall, 19 250 complexes in the tmCAT dataset are represented by 10 094 unique d = 2 (i.e., metal-local) substructures. In particular, palladium dichlorides bound to either two phosphine ligands or to bidentate nitrogen-coordinating ligands are very common in the tmCAT dataset, appearing 131 and 89 times, respectively (Fig. 8). These metal-local motifs are likely derived from complexes that have been widely studied for their ability to catalyze cross-coupling reactions.99 Similarly, linear gold chlorides bound to NHC or phosphine ligands are common in the tmCAT dataset, appearing 112 and 100 times, respectively (Fig. 8). Linear gold complexes are known to catalyze various π-functionalization and annulation reactions, among others.100 The high abundance of these gold and palladium motifs is in line with their occurrence in the tmQM superset, where the popularity of these complexes has led to their widespread examination in many contexts. On the other hand, several frequently occurring motifs were identified that are almost exclusively studied for catalysis (Fig. 8). These include nickel catalysts with four mixed N, C, O, and P coordinating ligands, i.e., where each ligand type coordinates the metal once; iron dichloride catalysts with tridentate nitrogen coordinating ligands; and ruthenium dichloride catalysts with an NHC ligand and carbene ligand with a chelating oxygen group101 (i.e., likely derived from the second-generation Hoveyda–Grubbs catalyst102). Surprisingly, all these catalysts have been primarily studied for polymerization: Ni complexes are utilized for ethylene copolymerization with carbon monoxide,103 Fe complexes are used as catalysts for linear homo-polymerization of ethylene for the synthesis of high-density polyethylene,104 and Hoveyda–Grubbs catalysts are utilized for ring-opening metathesis polymerization (ROMP).105 This highlights how some motifs, despite their limited range of application, have been the focus of a great deal of study as a result of their industrial relevance.
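Counting recurring metal-local motifs requires a representation that is identical for complexes sharing the same local environment. The sketch below builds a simple order-independent signature from atoms within two bonds of the metal, which is one possible reading of d = 2 and an assumption of this example. It ignores bond orders, hybridization, and chelate connectivity, so it is much coarser than the substructure analysis described above; all names are hypothetical.

```python
from collections import Counter


def adjacency(edges):
    """Build a symmetric adjacency list from a list of bonded atom-index pairs."""
    bonds = {}
    for a, b in edges:
        bonds.setdefault(a, []).append(b)
        bonds.setdefault(b, []).append(a)
    return bonds


def motif_signature(metal, bonds, elements):
    """Order-independent string describing the depth-2 neighborhood of `metal`."""
    first_shell = []
    for a in bonds[metal]:
        second = sorted(elements[b] for b in bonds[a] if b != metal)
        first_shell.append(f"{elements[a]}({','.join(second)})")
    return f"{elements[metal]}[{';'.join(sorted(first_shell))}]"


# Two toy PdCl2(PR3)2-like graphs with different atom orderings.
elements_a = dict(enumerate(["Pd", "Cl", "Cl", "P", "P", "C", "C", "C", "C", "C", "C"]))
bonds_a = adjacency([(0, 1), (0, 2), (0, 3), (0, 4), (3, 5), (3, 6), (3, 7),
                     (4, 8), (4, 9), (4, 10)])
elements_b = dict(enumerate(["C", "C", "C", "P", "Cl", "Pd", "P", "Cl", "C", "C", "C"]))
bonds_b = adjacency([(5, 3), (5, 4), (5, 6), (5, 7), (3, 0), (3, 1), (3, 2),
                     (6, 8), (6, 9), (6, 10)])

counts = Counter([motif_signature(0, bonds_a, elements_a),
                  motif_signature(5, bonds_b, elements_b)])
print(counts.most_common(1))
```

Tallying such signatures over a dataset gives both the number of unique motifs and the frequency of each, which is the kind of summary reported above for tmCAT.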
Next, we expanded our substructure analysis to the photochemistry-relevant tmPHOTO subset. Analysis of metal identity in tmPHOTO reveals that iridium, platinum, and copper complexes are significantly amplified in this dataset (ESI Fig. S15†). Iridium106 and platinum107 complexes have been common targets for photophysical applications due to spin–orbit coupling that allows intersystem-crossing, which leads to high quantum yields. On the other hand, copper complexes108 have been explored as an earth-abundant alternative to more rare and expensive iridium and platinum complexes. Substructure mapping shows that 3043 complexes in the tmPHOTO dataset are represented by 1150 unique d = 2 structural motifs. Several commonly recurring substructures can be observed in the tmPHOTO set (ESI Fig. S16†). These include iridium complexes with a coordination number of six with two bidentate C⁁N coordinating ligands and two oxygen-coordinating ligands (i.e., structural analogs of Ir(ppy)2(acac)), which is a substructure that appears in tmPHOTO 101 times (ESI Fig. S17†). Platinum complexes with a coordination number of four, a bidentate C⁁N type ligand, and two oxygen-coordinating ligands are other commonly recurring structural motifs (i.e., structural analogs of Pt(ppy)(acac)), with this substructure appearing 89 times in tmPHOTO. Another similarly commonly recurring motif is a 4-coordinate copper complex with mixed nitrogen and phosphorus coordinating atoms, including a bidentate nitrogen ligand, which appears 89 times in the tmPHOTO dataset (ESI Fig. S17†). Examples of structural motifs that are almost exclusively studied for photophysical properties include Ir complexes with two bidentate C⁁N type ligands and a bidentate N⁁N type ligand, as well as platinum complexes with one bidentate C NHC-type ligand and two oxygen-coordinating ligands (ESI Fig. S17†).
Moving on to the spin-crossover-relevant subset, we note that TMCs that exhibit switchable magnetic behavior necessarily differ from those in the other subsets. Here, iron, manganese, nickel, and cobalt complexes occur with higher relative frequency in the tmSCO set than in the tmQM superset (ESI Fig. S15†). These metals are all first-row (3d) transition metals that tend to have relatively low d-orbital splitting energy (Δ) and, depending on the oxidation state, they are expected to have multiple accessible spin states. Substructure mapping shows that 834 complexes in the tmSCO dataset are represented by 534 unique d = 2 structural motifs. A few d = 2 structural motifs are representative of this dataset through multiple recurrences (ESI Fig. S18†). These recurring motifs include manganese complexes with four nitrogen-coordinating ligands, including a bidentate sp3 hybridized ligand and two oxygen-coordinating ligands, which appear 39 times in tmSCO and only six additional times (i.e., 45 in total) across the entire tmQM dataset. This highlights how these complexes are nearly exclusively targeted for magnetic properties. Another common structural motif includes an iron center with four sp2 hybridized nitrogen-coordinating ligands and two nitrogen-coordinating ligands that could be either isocyanides or cyanates (ESI Fig. S19†). These complexes appear in the tmSCO set 27 times and in the tmQM superset a total of 44 times. Interestingly, a relatively common structural motif includes iron bound to a tetradentate nitrogen-coordinating ligand with two additional isocyanide/cyanate ligands, which appears in the tmQM dataset 9 times, all of which are in the tmSCO subset, suggesting that these motifs have only been studied for applications related to magnetism (ESI Fig. S19†).
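The degree to which a motif is specific to an application can be expressed as the fraction of its tmQM occurrences that fall in the subset. The short sketch below computes this fraction for the three tmSCO motifs using only the counts quoted above; the motif labels and the helper `subset_fraction` are shorthand introduced here for illustration and are not identifiers from the datasets.

```python
# (occurrences in tmSCO, occurrences in the full tmQM set), as quoted in the text.
# The keys are hypothetical shorthand labels for the three motifs discussed.
motif_counts = {
    "Mn, four N donors and two O donors": (39, 45),
    "Fe, four sp2 N donors and two N-bound ligands": (27, 44),
    "Fe, tetradentate N ligand and two N-bound ligands": (9, 9),
}


def subset_fraction(n_subset, n_superset):
    """Fraction of a motif's superset occurrences that lie in the subset."""
    if n_superset <= 0 or not 0 <= n_subset <= n_superset:
        raise ValueError("counts must satisfy 0 <= n_subset <= n_superset, n_superset > 0")
    return n_subset / n_superset


fractions = {
    label: subset_fraction(n_sco, n_tmqm)
    for label, (n_sco, n_tmqm) in motif_counts.items()
}

for label, frac in fractions.items():
    print(f"{label}: {frac:.0%} of tmQM occurrences are in tmSCO")
```

With these counts, roughly 87%, 61%, and 100% of the respective tmQM occurrences lie in tmSCO, which matches the qualitative ordering described in the text.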
We next analyzed the substructures in the biological activity subset, which we expect to be the most diverse due to the broad nature of this set. Ruthenium and platinum are the most heavily represented metals in the tmBIO dataset (ESI Fig. S15†). Overall, substructure mapping shows that 1808 complexes in the tmBIO dataset are represented by 974 unique d = 2 structural motifs. A high interest in platinum for biological applications can be attributed to cisplatin,109 i.e., cis-diamminedichloroplatinum(II), the first inorganic small molecule approved as a pharmaceutical anti-cancer drug. Cisplatin-resistant cancers110 have led to the search for alternate platinum complexes as anti-cancer drugs,111 and these complexes are represented in the tmBIO dataset. Furthermore, the high toxicity of cisplatin has led to the search for alternate inorganic and organometallic complexes that could be used as anti-cancer medications. In particular, the high promise of ruthenium arene 1,3,5-triaza-7-phosphaadamantane (RAPTA)112 compounds has led to a significant effort in screening ruthenium-based piano stool complexes as potential anti-cancer drugs.113 These efforts are consistent with the makeup of the tmBIO dataset. In fact, the three most commonly recurring structural motifs feature ruthenium arene complexes with one chloride and two additional ancillary ligands (ESI Fig. S20 and S21†). These complexes, cumulatively, appear 82 times in the tmBIO dataset and 208 times in the tmQM dataset. Furthermore, a motif that is closely related to cisplatin, with platinum, two chloride ligands, a single ammonia, and an organic N-coordinating ligand, appears in the tmBIO set 17 times and has been exclusively studied for biological applications (ESI Fig. S21†).
Finally, we analyzed whether any structural motifs appear frequently across different datasets, which would indicate that they are frequently studied for multiple applications. To achieve this, we first created a subset of each dataset consisting of the commonly recurring motifs in that dataset (i.e., five or more recurrences for tmCAT and three or more recurrences for the other datasets). We then analyzed overlaps among the datasets. These motifs are mostly exclusive to a given set, with more than 82% of metal-centered motifs appearing in only one of the four sets (Fig. 9). However, a single metal-centered substructure appears at least five times across all datasets. This motif consists of a nickel metal center with mixed N⁁N,O,O coordinating atoms and is reminiscent of salen complexes, despite the inability of radius 2 subgraphs to capture the entirety of tetradentate salen ligands (Fig. 9, left inset). Salen complexes are common enzyme mimics often used in asymmetric catalysis,114 but they could also be targets as photoactive complexes, owing to the rigid nature of the ligand, and as magnetic complexes with nickel as the metal center. In addition, the square planar coordination environment of Ni salen complexes makes them potential targets for DNA intercalation. Furthermore, several structural motifs were identified that are shared among the tmCAT, tmPHOTO, and tmBIO sets but not tmSCO. A noteworthy example is an iridium motif with two bidentate C⁁N-type ligands and a single bidentate N⁁N-type ligand (i.e., Ir(ppy)2(bpy) analogs). These complexes are commonly used as triplet sensitizers, which can be applied for photocatalysis115 to access excited states and for biological applications to target reactive singlet oxygen formation116 (Fig. 9, right inset).
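The overlap analysis described above can be sketched in a few lines: filter each dataset to its recurring motifs using the stated thresholds, record which datasets each motif recurs in, and compute the fraction of motifs found in only one set. This is a minimal illustration assuming each complex has already been reduced to a hashable motif key; the motif strings and the toy data are hypothetical and do not reproduce the 82% value reported in the text.

```python
from collections import Counter

# Recurrence thresholds stated in the text.
THRESHOLDS = {"tmCAT": 5, "tmPHOTO": 3, "tmBIO": 3, "tmSCO": 3}


def recurring_motifs(motif_per_complex, min_count):
    """Motifs that occur at least `min_count` times in one dataset."""
    counts = Counter(motif_per_complex)
    return {motif for motif, n in counts.items() if n >= min_count}


def membership(recurring):
    """Map each motif to the sorted tuple of datasets in which it recurs."""
    seen = {}
    for name, motifs in recurring.items():
        for motif in motifs:
            seen.setdefault(motif, []).append(name)
    return {motif: tuple(sorted(names)) for motif, names in seen.items()}


def exclusive_fraction(member):
    """Fraction of recurring motifs that belong to exactly one dataset."""
    if not member:
        raise ValueError("no recurring motifs to analyze")
    return sum(len(names) == 1 for names in member.values()) / len(member)


# Hypothetical toy data: one motif key per complex in each dataset.
datasets = {
    "tmCAT": ["PdCl2P2"] * 6 + ["NiN2O2"] * 5 + ["IrC2N4"] * 5 + ["rare"] * 2,
    "tmPHOTO": ["IrC2N4"] * 4 + ["NiN2O2"] * 3 + ["CuN2P2"] * 3,
    "tmBIO": ["RuCl-arene"] * 3 + ["NiN2O2"] * 3 + ["IrC2N4"] * 3,
    "tmSCO": ["FeN6"] * 4 + ["NiN2O2"] * 3,
}

recurring = {name: recurring_motifs(m, THRESHOLDS[name]) for name, m in datasets.items()}
member = membership(recurring)
shared_by_all = [m for m, names in member.items() if len(names) == len(datasets)]
frac_exclusive = exclusive_fraction(member)
```

In the toy data, one motif recurs in all four sets and one is shared by three, mirroring the two cases highlighted in the insets of Fig. 9.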
Analysis of common application-specific structural motifs reveals that, for a given motif, there are complexes within the tmQM dataset that contain the motif, but the associated manuscripts do not indicate the complex has been assessed for that specific application. Therefore, we supplemented each dataset using structural mapping to identify chemically similar structures to add to our data subsets. To complete this augmentation, we note that for most of these application-specific datasets, multidentate ligands can play an important role, such as in defining the steric environment and introducing added stability for catalysis or inhibiting thermal relaxation pathways for photoactive compounds. This is consistent with several common multidentate motifs (i.e., five-membered rings) when considering d = 2 substructures. However, d = 2 substructures cannot capture bidentate ligands that form six-membered metallacycles. Therefore, to avoid introduction of less relevant complexes in each dataset, we identify matching metal-centered substructures in the tmQM dataset with d = 3. Unsurprisingly, increasing the radius of metal-centered substructures leads to an increase in the number of recurring substructures, with, e.g., 13 696 unique d = 3 motifs (vs. 10 094 for d = 2) in the tmCAT dataset (ESI Fig. S22†). To only introduce additional structures with high relevance to a given application, only motifs with high recurrence (i.e., 5 or more for tmCAT, 3 or more for the other subsets) were supplemented. Using structural mapping, we augmented the tmCAT, tmPHOTO, tmSCO, and tmBIO datasets with 2381, 1556, 149, and 974 additional chemically relevant complexes, respectively. The final, application-specific datasets we curated can be accessed on Zenodo.79
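The augmentation step can be summarized as follows: count the d = 3 motifs within an application subset, keep those that meet the recurrence threshold, and add every remaining tmQM complex that carries one of the retained motifs. The sketch below illustrates this logic under the assumption that each complex has already been mapped to a hashable d = 3 motif key; the function `augment`, the complex identifiers, and the motif strings are hypothetical.

```python
from collections import Counter


def augment(subset_ids, motif_of, min_count):
    """Return IDs outside `subset_ids` whose motif recurs in the subset.

    `motif_of` maps every complex ID in the superset to its motif key;
    a motif is retained if it occurs at least `min_count` times in the subset.
    """
    subset_ids = set(subset_ids)
    counts = Counter(motif_of[cid] for cid in subset_ids)
    recurring = {motif for motif, n in counts.items() if n >= min_count}
    return sorted(
        cid
        for cid, motif in motif_of.items()
        if cid not in subset_ids and motif in recurring
    )


# Hypothetical toy superset: complex ID -> d = 3 motif key.
motif_of = {
    "A1": "IrC2N2O2", "A2": "IrC2N2O2", "A3": "IrC2N2O2", "A4": "IrC2N2O2",
    "B1": "CuN2P2", "B2": "CuN2P2", "B3": "CuN2P2",
    "C1": "FeN6",
}
# Complexes assigned to the application by the text-based curation.
text_curated = {"A1", "A2", "A3", "B1", "B2"}

# Threshold of 3, as used for the subsets other than tmCAT.
added = augment(text_curated, motif_of, min_count=3)
```

Here only the complex sharing the motif that recurs three times in the subset is added; the motif seen twice falls below the threshold, so its remaining superset member is left out. The reported additions are also consistent with the totals given in the abstract: 19 250 + 2381 = 21 631 (tmCAT), 3043 + 1556 = 4599 (tmPHOTO), 1808 + 974 = 2782 (tmBIO), and 834 + 149 = 983 (tmSCO).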
665 TMCs, to their respective applications. Using the manuscript abstracts, we first trained a classifier model to identify manuscripts that are related to catalysis with an accuracy of 0.97. Using this model, we curated a dataset of catalysis-related transition metal complexes, called tmCAT, which initially consisted of 19 250 unique complexes. Analysis of common electronic and geometric descriptors revealed that commonly used descriptors fail to distinguish between catalytic and non-catalytic sets. However, the analysis of coordination geometry of catalytically relevant complexes showed that geometries with open metal sites were significantly enhanced in the tmCAT set.
Using topic modeling, an unsupervised clustering method often used in natural language processing, we further curated three additional initial datasets: tmPHOTO, a dataset consisting of 3043 unique complexes with photophysical relevance; tmBIO, a dataset consisting of 1808 unique complexes with biological relevance; and tmSCO, a dataset consisting of 834 unique complexes with relevance to magnetism. Analyzing the chemical substructures within each dataset identified complexes that are frequently targeted for their designated applications, such as bidentate N⁁N palladium dichlorides for catalysis, iridium complexes with two C⁁N ligands or platinum complexes with one C⁁N ligand for photophysics, and platinum dichlorides or ruthenium piano-stool complexes for biological applications. By mapping these substructures to their applications, we identified previously synthesized complexes that had strong chemical similarity to those already identified for each application. We used these additional complexes to supplement the textually curated datasets, leading to 2381 additional tmCAT complexes, 1556 additional tmPHOTO complexes, 974 additional tmBIO complexes, and 149 additional tmSCO complexes in the final datasets.
The curated tmCAT, tmPHOTO, tmBIO, and tmSCO datasets are expected to enable more focused high-throughput computational screening and development of predictive machine learning models while still allowing for exploration across diverse chemical spaces. The language models employed in this study also have the potential for broader application, such as to curate subsets of other classes of materials, including metal–organic frameworks.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d4fd00087k
This journal is © The Royal Society of Chemistry 2025