Open Access Article
Ilia Kevlishvili* and Devmin Dorabawila
Department of Chemistry and Biochemistry, Baylor University, Waco, Texas, USA. E-mail: ilia_kevlishvili@baylor.edu; Tel: +1-254-710-4272
First published on 27th May 2026
Canonical string representations have transformed organic cheminformatics, yet transition-metal complexes (TMCs) lack an equivalent that captures coordination geometry, stereochemistry, and donor topology. We introduce Trans-pair Relations EXpression (T-REX), a canonical line notation encoding geometry, topology, and metal-centered chirality (@/@@, Δ/Λ) via trans-pair maps. Applied to 63,375 DFT-optimized structures from the tmQMg dataset, T-REX identifies five distinct isomer classes (coordination, enantiomeric, linkage, hemilabile, and geometric) and reveals that fewer than 1.2% of complexes capable of stereoisomerism are resolved as such in crystallographic data. Combinatorial enumeration expands these parent structures into 149,228 unique topological variants; modular ligand substitution generates millions of additional candidates. Across one bond-only baseline and four geometry-aware architectures, encoding the T-REX coordination map consistently improves prediction of HOMO, LUMO, gap, and dipole moment. Dipole moment shows the largest gains (R2 = 0.845 vs. 0.715 for the baseline), and three architecturally distinct models with a direct coordination-sphere readout achieve equivalent performance, confirming that T-REX topology, not architecture choice, drives the improvement. Geometry-aware models reach equivalent accuracy with roughly four times less training data, positioning T-REX as both an interoperable data format and an ML-ready representation for transition-metal chemistry.
Extending this success from organics to transition-metal complexes (TMCs) is nontrivial.12 Metal complexes span multiple coordination numbers, each with distinct, chemistry-relevant geometries; they carry additional electronic descriptors (oxidation state, multiplicity); and they exhibit a larger number of coordination isomers (e.g., an octahedral complex with 6 unique ligands has 30 stereoisomers, 15 enantiomeric pairs). Haptic and polydentate ligation further complicate topology. At the same time, the surge of string-focused advances (e.g., SELFIES10,13 for robust small-molecule generation; BigSMILES14–16 for stochastic polymers) underscores a broad community desire for representations that can carry richer chemical domains, motivating an inorganic-aware string that captures geometry, stereochemistry, and donor identity without sacrificing ML-readiness.
Recent efforts point in this direction but leave key gaps. Rasmussen et al. introduced an RDKit-parsable SMILES workflow for TMCs that converts 3D structures into SMILES,17 improving interoperability but not canonically resolving geometry/symmetry or electronic-state labeling at the string level. In parallel, the automated coordination complex conformer generator, MetalloGen,18 proposed m-SMILES: an input dialect that encodes the metal, per-ligand strings, explicit coordination sites, and a geometry tag to drive 3D conformer generation, powerful for building, but dependent on non-canonical site numbering. Meanwhile, descriptor families like RACs19 have been widely used to predict TMC properties.19–25 Ligand-derived features have commonly been used as a representation, but fail to generalize across different TMCs.26–30 Geometry/quantum-aware features deliver accuracy but require 3D coordinates31,32 or QM features,33 often from DFT, limiting their scale. Furthermore, string representations have been actively used in LLM-driven optimization34 and structure generation.35–44 These threads collectively motivate the need for a canonical string-level solution, one where a specific chemical species maps to exactly one deterministic string, ensuring database integrity and preventing duplicate bias in machine learning models. Canonical string-level resolution is critical because it makes the representation a chemically meaningful key: identical complexes collapse to one string, while distinct coordination geometries, stereoisomers, linkage modes, oxidation states, and spin states remain separable. This prevents duplicate bias during dataset merging and avoids conflating isomeric species in enumeration or ML workflows.
In this work, we (i) introduce Trans-pair Relations EXpression (T-REX), a canonical line notation for monometallic CN ≤ 7 complexes that encodes geometry, coordination topology, and metal-centered chirality (@/@@, Δ/Λ) via trans-pair maps and stereochemical flags; (ii) develop a structure-to-string extraction pipeline that converts over 63,000 literature structures into canonical T-REX strings and classifies their isomer relationships across five distinct categories; (iii) show how these strings enable systematic enumeration of coordination isomers and enantiomers, as well as ligand-substitution neighborhoods, yielding hundreds of thousands of topological variants and millions of chemically plausible complexes; and (iv) demonstrate across five neural network architectures that T-REX-derived coordination topology consistently improves property predictions, with the largest gains on shape-sensitive properties like dipole moment, and that a direct coordination-sphere readout provides a roughly four-fold improvement in data efficiency over bond-only baselines. In contrast to prior TMC string dialects that prioritize structure generation or RDKit interoperability, T-REX is designed from the outset to be canonical, geometry-aware, and ML-ready at the string level.
T-REX is a modular, line-based notation composed of separable blocks, each delimited by a vertical bar | to make parsing trivial. The header encodes the central transition metal, its oxidation state, and an optional spin multiplicity (interpreted as multiplicity = 1 if omitted). Next, the ligand block lists every coordinated ligand's identity. The third block is the map, which records which coordinating atoms (catoms) are trans to one another (pairs) and which donors have no trans partner (singles). Two optional blocks may follow: a geometry flag (to make the intended idealized CN geometry explicit when helpful) and a central-chirality flag (to disambiguate metal-centered or Δ/Λ stereochemistry). In short, the representation consists of a header, a ligand block, and a map, followed by the two optional flags.
The header begins with the element symbol and encloses the electronic state in curly braces, giving either the oxidation state alone or the oxidation state together with an explicit multiplicity M. The oxidation state is mandatory, whereas the multiplicity defaults to 1 (a closed-shell species) when M is absent. This compact header keeps the electronic specification orthogonal to topology, so downstream tools can read or ignore it without touching connectivity.
The ligand list is introduced by L = and wrapped in square brackets. Each ligand is preceded by a payload tag that declares how to interpret its string, enabling modular growth of the string representation. For example, the SMILES payload supports RDKit-centric workflows and structure generation; a future extension to an alternative string payload would offer a natural representation for generative models; and a future semantic payload could maximize human readability. The current software relies on the SMILES payload and separates ligands by commas. Donor atoms (“catoms”) are indexed relative to each individual ligand string (1-based) and referenced in the map. The ligand payload itself can be changed while updating coordinating atom indices in the map, without breaking the T-REX string.
The map captures the local metal topology using only trans pairs and singles, which is sufficient to disambiguate the vast majority of coordination geometries and all cis relationships for CN ≤ 6 (Fig. 1). Simultaneously, the map reveals ligand denticity (multiple catoms from the same ligand) and hapticity (groups of catoms treated as a single coordination site). A trans pair is written as two ligand-index/catom-index references, where the ligand indices are taken from the ligand list order and the catom indices a, b from each ligand's string, all 1-based. For example, a pair entry can state that ligand 1 atom 11 is trans to ligand 2 atom 4. Singles (donors without a trans partner) are written as separate entries. Haptic donors are grouped, e.g., an entry can indicate that ligand 1's atoms 2 and 3 act as a single haptic site (η2) and are trans to ligand 2 atom 1.
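The bookkeeping carried by the map can be illustrated with a small in-memory sketch. It deliberately avoids the T-REX string syntax itself; the names `Site` and `TransMap` are hypothetical and are not taken from the T-REX software, and the example entries simply reuse the index values quoted above.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Site:
    """One coordination site (hypothetical helper, not part of T-REX)."""
    ligand: int    # 1-based index into the ligand list
    catoms: tuple  # 1-based donor-atom indices within that ligand's string


@dataclass
class TransMap:
    """Trans pairs plus singles, mirroring the content of the map block."""
    pairs: list = field(default_factory=list)    # list of (Site, Site)
    singles: list = field(default_factory=list)  # sites with no trans partner

    def sites(self):
        return [s for pair in self.pairs for s in pair] + list(self.singles)

    def coordination_number(self):
        # A haptic group counts as a single coordination site.
        return len(self.sites())

    def denticity(self):
        # Number of coordination sites contributed by each ligand.
        return dict(Counter(s.ligand for s in self.sites()))

    def hapticity(self):
        # Only sites that group more than one catom are reported.
        return {(s.ligand, s.catoms): len(s.catoms)
                for s in self.sites() if len(s.catoms) > 1}


# Illustrative map: ligand 1 atom 11 trans to ligand 2 atom 4, and the
# haptic group (atoms 2, 3) of ligand 1 trans to ligand 2 atom 1.
example = TransMap(pairs=[
    (Site(1, (11,)), Site(2, (4,))),
    (Site(1, (2, 3)), Site(2, (1,))),
])
print(example.coordination_number(), example.denticity(), example.hapticity())
```

Keeping haptic groups as a single site is what lets coordination number, denticity, and hapticity all be read off the same list of pairs and singles.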
Although the pair/single map alone typically fixes geometry for CN ≤ 6, an explicit geometry flag (e.g., G:O, G:TP) can be appended to guard against rare edge cases (in particular, when clarifying intended CN = 6 geometry families and avoiding accidental conflation of octahedral geometry with less common alternatives like trigonal prismatic geometry). A central chirality flag further locks in metal-center stereochemistry when two enantiomeric assignments share the same pair pattern. T-REX distinguishes two mechanistically distinct types of metal-centered chirality. Point-central chirality (@/@@) is computed via the determinant method: four sites are selected according to geometry-specific rules, and the sign of their scalar triple product assigns handedness. Achirality is detected before computation through geometry-specific checks. For instance, equivalent trans partners within a pair or equivalent pair sets in octahedral complexes preclude chirality. In contrast, equivalent sites that do not share such relationships do not (full conditions for each geometry are given in SI, Text S5). Helical chirality (Δ/Λ) arises in octahedral complexes bearing multidentate ligands: tris-bidentate, cis-bis-bidentate, and fac–fac bis-tridentate, where point-central chirality is absent, but a propeller-like twist exists. The chirality flag is assigned during structure-to-string conversion and is preserved through canonicalization. We check the point-central chirality first, then fall back to helical chirality. While enantiomers exhibit identical scalar properties in achiral environments, resolving them at the string level is essential for applications in asymmetric catalysis and biological recognition, and ensures that each physically distinct species maps to a unique T-REX string (Fig. 2).
Fig. 2 Examples of chiral molecules represented in T-REX. Atoms used to compute point-central chirality are labeled.
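The determinant test behind point-central chirality can be sketched as follows. This is a minimal illustration only: the geometry-specific rules for choosing the four sites (SI, Text S5) are not reproduced, and which sign corresponds to @ or @@ is not specified here, so the hypothetical function `triple_product_sign` returns only the sign.

```python
def _sub(u, v):
    return tuple(ui - vi for ui, vi in zip(u, v))


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def triple_product_sign(p0, p1, p2, p3, tol=1e-8):
    """Sign of (p1 - p0) . ((p2 - p0) x (p3 - p0)) for four selected sites.

    Returns +1 or -1 for the two handedness classes and 0 when the four
    points are coplanar within `tol` (no handedness can be assigned).
    """
    a, b, c = _sub(p1, p0), _sub(p2, p0), _sub(p3, p0)
    value = _dot(a, _cross(b, c))
    if abs(value) < tol:
        return 0
    return 1 if value > 0 else -1


# A reference arrangement and its mirror image (z -> -z) give opposite signs.
original = triple_product_sign((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1))
mirrored = triple_product_sign((0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, -1))
print(original, mirrored)
```

Because reflection negates the scalar triple product, the two enantiomers necessarily receive opposite signs, which is the property the chirality flag relies on.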
Importantly, T-REX strings do not need to be extracted from 3D structures. The modular block design allows direct construction from chemical intent where either a user or generative algorithm specifies the metal, oxidation state, ligand set, and desired coordination map, enabling bottom-up dataset construction for hypothetical complexes that have never been synthesized or computationally optimized.
T-REX is designed for CN ≤ 7 where trans-pair semantics cleanly characterize geometry and coordination isomerism; nothing in the syntax forbids higher CN, but complete conformer disambiguation may require additional relations beyond trans pair enumeration. For example, to avoid the collapse of pentagonal bipyramidal geometry (CN = 7) stereoisomers, equatorial ligands (singles in T-REX) are listed in counterclockwise direction, when looking down from the first ligand in the axial position. T-REX also intentionally discretizes continuous coordination environments into idealized topology classes. Therefore, complexes with the same donor set, geometry label, and trans-pair map but different bond lengths, angular distortions, or Jahn–Teller elongation/compression map to the same T-REX topology unless the distortion changes the assigned coordination geometry or trans-pair relationships. This makes T-REX appropriate for canonical topology, isomer enumeration, dataset curation, and geometry-aware ML, but not a replacement for 3D coordinates when quantitative distortion amplitudes are required. Future work will focus on addressing these limitations and expanding grammar for the multinuclear MUL-T-REX, which generalizes the same block structure while preserving the canonical, edit-friendly design.
Extending T-REX to multinuclear complexes introduces several algorithmic challenges beyond simply allowing multiple metal headers. A multinuclear representation must preserve each metal's local coordination topology while also encoding intermetal relationships, including metal–metal bonds and bridging ligands that participate in more than one coordination sphere. This makes canonicalization more complex because both ligand order and metal-center order can permute, and equivalent metal centers must be recognized without changing the local maps. A natural MUL-T-REX extension would therefore treat a cluster as a graph of local T-REX-like metal environments connected by shared ligand sites and explicit metal–metal edges. This preserves the same canonical, edit-friendly philosophy while adding the additional bookkeeping required for clusters, bioinorganic cofactors, and multinuclear catalysts.
(1) Ligand Sorting: ligands are first canonicalized individually (using RDKit standard canonicalization for SMILES payloads) and coordinating atom indices are remapped to the updated SMILES string. They are then sorted within the Ligand Block based on a priority rule set of decreasing denticity, hapticity, coordinating atom atomic number, ligand molecular weight and increasing ligand hash, in that order.
(2) Map Minimization: once ligand order is fixed, the topology map is sorted to minimize the numerical indices of the trans-pairs and singles (lexicographic sorting). For polydentate ligands with internal symmetry, canonicalization ensures that equivalent donor permutations map to a single T-REX string.
For example, in a square planar Pd(Cl)2(NH3)2 complex, the T-REX algorithm ensures that the chloride ligands (higher atomic number) are always listed before amines, and the map is ordered such that cis and trans isomers yield deterministic, non-overlapping strings. This guarantees that T-REX is invariant to atom indexing in the source file.
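The two canonicalization steps can be sketched in a few lines for monodentate ligands. This is an illustrative stand-in rather than the T-REX implementation: the `Ligand` record and `canonicalize` function are hypothetical, the payload string is used in place of the ligand hash, equivalent ligands are resolved by brute-force permutation, and internal donor symmetry of polydentate ligands is not handled.

```python
from dataclasses import dataclass
from itertools import permutations, product


@dataclass(frozen=True)
class Ligand:
    payload: str       # canonical ligand string (assumed already canonical)
    denticity: int
    hapticity: int
    donor_z: int       # atomic number of the coordinating atom
    mol_weight: float


def ligand_key(lig):
    # Decreasing denticity, hapticity, donor Z, and weight; increasing "hash".
    return (-lig.denticity, -lig.hapticity, -lig.donor_z,
            -lig.mol_weight, lig.payload)


def canonicalize(ligands, pairs, singles=()):
    """pairs: [((lig, catom), (lig, catom)), ...]; singles: [(lig, catom), ...]."""
    # Step 1: ligand sorting, then relabel the 1-based ligand indices.
    order = sorted(range(len(ligands)), key=lambda i: ligand_key(ligands[i]))
    sorted_ligs = [ligands[i] for i in order]
    relabel = {old + 1: new + 1 for new, old in enumerate(order)}
    pairs = [tuple((relabel[l], a) for l, a in pair) for pair in pairs]
    singles = [(relabel[l], a) for l, a in singles]

    # Step 2: map minimization over every swap of indistinguishable ligands.
    groups = {}
    for pos, lig in enumerate(sorted_ligs, start=1):
        groups.setdefault(ligand_key(lig), []).append(pos)
    best = None
    for perms in product(*(permutations(g) for g in groups.values())):
        swap = {}
        for group, perm in zip(groups.values(), perms):
            swap.update(zip(group, perm))
        cand = (
            sorted(tuple(sorted((swap[l], a) for l, a in pair)) for pair in pairs),
            sorted((swap[l], a) for l, a in singles),
        )
        if best is None or cand < best:
            best = cand
    return sorted_ligs, best[0], best[1]


cl = Ligand("[Cl-]", 1, 1, 17, 35.45)
nh3 = Ligand("N", 1, 1, 7, 17.03)
# The same cis isomer written with two different input orderings.
cis_a = canonicalize([nh3, cl, nh3, cl], [((1, 1), (2, 1)), ((3, 1), (4, 1))])
cis_b = canonicalize([cl, nh3, cl, nh3], [((1, 1), (4, 1)), ((2, 1), (3, 1))])
trans = canonicalize([cl, cl, nh3, nh3], [((1, 1), (2, 1)), ((3, 1), (4, 1))])
print(cis_a[1], trans[1])
```

Both cis inputs collapse to the same map, while the trans isomer keeps a distinct one, which is the invariance to source-file indexing described above. The brute-force search is only practical for small coordination spheres.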
547 starting structures, the conversion pipeline successfully parsed 72,733 (97.5%), with rejections arising from charge-assignment parser failures (1,661) and processing timeouts (153). Parsed structures were then subjected to a geometry-agreement filter in which the coordination geometry inferred independently from the T-REX trans-pair/singleton pattern was compared against the molSimplify RMSD-based classification; only complexes where both methods agreed were retained, yielding 66,525 structures. Given that tmQMg reports properties for closed-shell singlet electronic structures, the same electronic-state assignment was used for the T-REX dataset and ML benchmarks. Although T-REX can encode spin multiplicity explicitly, the present benchmarks do not evaluate spin-state ordering, spin-crossover energetics, or open-shell alternatives. A multi-step cleanup further refined this set: 94 entries with erroneous explicit-hydrogen placement were corrected, element counts in each T-REX string were audited against the source XYZ file to detect ligation-state misassignments, and 57 structures were rescued by identifying cases where the extended Hückel charge-assignment workflow had misinterpreted perchlorate ligands as oxo species. After removing the remaining 2,007 atom-count mismatches and applying canonical deduplication (1,143 duplicates), the final dataset comprised 63,375 unique T-REX strings spanning all major coordination geometries, establishing T-REX as a robust format for large-scale curation of inorganic and organometallic data. Additionally, the same pipeline was applied to four previously published functional datasets derived from tmQMg45 (tmCAT, tmPHOTO, tmBIO, and tmSCO), yielding 18,855, 4,061, 2,542, and 8,209 unique complexes, respectively.
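The reported filter counts can be checked against each other with simple bookkeeping. The sketch below uses only the numbers quoted above; the starting total is an inference from the parsed and rejected counts, not a value stated in this passage.

```python
# Counts reported for the conversion pipeline.
parsed = 72_733
rejected = {"charge_assignment_failure": 1_661, "timeout": 153}

# Inferred, not stated: parsed plus rejected structures.
implied_start = parsed + sum(rejected.values())

after_geometry_filter = 66_525
atom_count_mismatches = 2_007
canonical_duplicates = 1_143

# Corrections (94 hydrogen fixes) and rescues (57 perchlorates) change
# entries in place, so they do not alter the running count.
final = after_geometry_filter - atom_count_mismatches - canonical_duplicates

print(implied_start, final)
```

The subtraction reproduces the final dataset size of 63,375, and the inferred starting total ends in the same three digits as the count of starting structures given above.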
The curated tmQMg library spans 52 transition metals across oxidation states from −3 to +7 (SI Fig. S8), with Pd (7,194), Pt (5,832), Ru (5,501), Ni (5,211), and Zn (5,058) as the most represented. Because T-REX jointly encodes metal identity, oxidation state, and coordination geometry, the dataset enables direct quantification of metal–geometry coupling. Several metals exhibit strong geometric preferences. For example, Pd is 94% square planar and Au is 87% linear, while others display increased diversity. Ru splits nearly evenly between tetrahedral (51%) and octahedral (40%), and Zn populates multiple distinct geometry families (SI Fig. S9). These distributions reflect the complexity of transition metal complexes that T-REX captures at scale.
375 unique tmQMg complexes, revealing five distinct classes of isomer relationships among structures sharing the same metal and ligand set. Coordination isomers, i.e., complexes differing only in their trans-pair map, were the most prevalent, with 254 sets (516 structures, including sets of up to three resolved diastereomers). Enantiomeric pairs resolved by the chirality flag accounted for 92 sets (184 structures), while linkage isomers, in which the same ligand coordinates through a different donor atom, comprised 52 sets (104 structures). Hemilabile isomers, in which a multidentate ligand partially dissociates to change its effective denticity, accounted for 11 sets (22 structures). It is noteworthy that while several hemilabile ligands have been identified in the CSD in past work,46–48 a very limited number are characterized as hemilabile complexes in the same coordination environment. Geometric isomers, where the same composition adopts an entirely different coordination geometry, appeared as 8 sets (16 structures). Two additional sets flagged as identical (4 structures) are pentagonal bipyramidal complexes where the current canonicalization is surjective but not injective, confirming that the representation is otherwise bijective across all supported geometry families. Representative examples of enantiomeric pairs and coordination isomers are shown in Fig. 2 and 3, respectively, and other examples are shown in SI Fig. S10–S12.
Notably, of the 63,375 unique complexes, 29,491 (46.5%) are theoretically capable of coordination isomerism or enantiomerism, yet only 338 unique sets contained resolved coordination isomers or enantiomeric pairs; 8 of these are triplets in which a coordination isomer pair is accompanied by a resolved enantiomer within one of the diastereomeric forms (e.g., Co(III)(en)2(N3)2, Fig. 3). This scarcity systematically obscures the geometric and stereochemical contrast necessary for ML models to learn isomer-dependent properties for transition metal complexes. By applying a combinatorial enumeration algorithm that permutes trans-pair and singleton assignments while preserving multidentate constraints, and assigns both enantiomeric forms where chirality is present, we expanded the 63,375 parent structures into 149,228 unique canonical T-REX strings, capturing the topological diversity that crystallographic databases leave unresolved. Separately, the 12,370 complexes identified as chiral during the conversion were reflected to generate their mirror-image enantiomers, yielding 12,370 paired structures with explicit chirality labels (@/@@, Δ/Λ) and DFT-quality geometries. This enantiomer library49 is provided as a standalone dataset for applications in asymmetric catalysis and bioinorganic design (T-REX-ent).
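The core of such an enumeration, permuting trans-pair assignments and collapsing equivalent results, can be illustrated for the simplest case of an octahedral complex with six monodentate donors. The functions below are a hypothetical sketch: multidentate constraints, singles, and chirality assignment are omitted, and donors are identified only by a label.

```python
def pairings(items):
    """Yield every way to split an even-length list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


def octahedral_trans_maps(donors):
    """Unique trans-pair maps for six monodentate donor labels."""
    if len(donors) != 6:
        raise ValueError("expected exactly six donor labels")
    seen = set()
    for pairing in pairings(list(range(6))):
        # Sorting labels within and across pairs collapses maps that
        # differ only by swapping indistinguishable donors.
        key = tuple(sorted(tuple(sorted((donors[i], donors[j])))
                           for i, j in pairing))
        seen.add(key)
    return sorted(seen)


for formula in ("ABCDEF", "AABBCC", "AAABBB", "AAAABB"):
    print(formula, len(octahedral_trans_maps(formula)))
```

Six distinct donors give 15 trans-pair maps, and doubling each for its enantiomer recovers the 30 stereoisomers (15 enantiomeric pairs) quoted in the introduction; repeated donors reduce the count, for example to the familiar fac/mer and cis/trans pairs.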
000 unique ligands into just 185 relaxed or 722 strict classes, providing a structured, data-driven basis for combinatorial design (SI Table S1).
To demonstrate this utility for large-scale discovery, we focused on metal hydride complexes within tmCAT, expanding a small parent set into millions of candidates. Starting from just 658 structures containing a metal–H bond, strict substitution generated approximately 717,000 unique canonical strings, while the relaxed approach yielded over 2.3 million unique strings. This massive expansion, spanning several orders of magnitude, confirms that T-REX can rapidly populate the “near” and “far” neighborhoods of synthetically plausible complexes using only valid ligand components (SI Fig. S13).
Furthermore, the integration of SMARTS logic allows for the imposition of specific chemical rules during generation, as demonstrated on cisplatin analogs in the tmBIO dataset. By restricting substitutions to maintain a cis-N motif on a subset of 60 parents, we generated nearly 20,000 strict and 25,000 relaxed variants that effectively bridge the chemical space between distinct clusters in the parent dataset (Fig. 4). We extended this workflow to the full tmSCO dataset, where 819 spin-crossover complexes were expanded into a library of ∼160,000 unique T-REX strings, illustrating the method's generalizability across diverse inorganic domains (SI Fig. S14).
These results demonstrate that ligand classification and substitution enumeration via T-REX can drive the generative design of massive combinatorial datasets. Furthermore, the integration of SMARTS logic allows for the imposition of desirable chemical rules during generation. We defined two different ligand classification approaches, with “strict” classification envisioned as a more appropriate tool for local optimization and “relaxed” classification more appropriate for scaffold hopping and discovery. We envision that this modularity will enable efficient genetic algorithm (GA) optimization strategies, where high-level T-REX information defines “genes” for metal topology and electronic structure, while ligands serve as modular “subgenes” for local optimization.
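Class-based substitution amounts to a Cartesian product over the members of each ligand's class, optionally filtered by a chemical rule. The sketch below is schematic: the ligand names, the two classes, and the `keep` predicate (a plain Python function standing in for a SMARTS rule) are all hypothetical, and a real workflow would also update catom indices in the map and canonicalize each variant to remove symmetry duplicates.

```python
from itertools import product


def substitution_neighborhood(parent, ligand_class, class_members,
                              keep=lambda ligands: True):
    """Yield ligand tuples in which each ligand is replaced by a class member.

    parent        -- tuple of ligand names, in ligand-list order
    ligand_class  -- maps a ligand name to its class label
    class_members -- maps a class label to the ligands in that class
    keep          -- optional rule that a variant must satisfy
    """
    options = [class_members[ligand_class[lig]] for lig in parent]
    for combo in product(*options):
        if keep(combo):
            yield combo


# Hypothetical classes for a cisplatin-like parent.
ligand_class = {"Cl": "halide", "NH3": "amine"}
class_members = {
    "halide": ["F", "Cl", "Br", "I"],
    "amine": ["NH3", "NH2Me"],
}
parent = ("Cl", "Cl", "NH3", "NH3")

all_variants = list(substitution_neighborhood(parent, ligand_class, class_members))
# Example rule: both halide positions must carry the same ligand.
same_halide = list(substitution_neighborhood(
    parent, ligand_class, class_members, keep=lambda ligs: ligs[0] == ligs[1]))
print(len(all_variants), len(same_halide))
```

Because the product grows multiplicatively with class size, coarser ("relaxed") classes expand a parent set far faster than finer ("strict") ones, consistent with the relative library sizes reported above.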
700 complexes in the training set and 6337 complexes in each of the validation and test sets. For each architecture, we performed hyperparameter optimization using Optuna and report ensemble predictions averaged over five independently seeded runs.
We contrasted a bond-only baseline, which ablates hyperedge message passing, but retains the full molecular graph, against four geometry-aware architectures that differ in how they process and route T-REX-derived coordination topology. The baseline Message Passing Neural Network (MPNN) utilizes the GINE convolutional architecture,51 with node features including one-hot element encodings, RDKit-derived properties, Pauling electronegativity, and chirality tags, while edge features consist of standard bond types. This model captures bond topology but is effectively blind to stereochemical relationships like cis/trans isomerism. The four geometry-aware architectures augment this bond graph with hyperedges52–55 constructed directly from the T-REX trans-pair map, where each hyperedge connects two coordination sites (A and B) through the metal center (M), labeled as cis or trans with a discrete ideal-angle class. They differ along two architectural axes. First is the hyperedge processing mechanism which includes attention-based GRU pooling56 (HyperMPNN and LF-GNN), DeepSets-style permutation-invariant aggregation57 (DeepSets), and absorption into virtual graph nodes58 processed by standard message passing (Virtual Node). Second is the readout pathway, where HyperMPNN routes hyperedge information exclusively through atom features before pooling, whereas LF-GNN, DeepSets, and Virtual Node architectures maintain a direct hyperedge-to-head channel that concatenates atom-level and coordination-level pooled representations (SI, Text S7).
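How hyperedges follow from the trans-pair map can be sketched for geometries in which every non-trans pair of sites is cis at 90° (octahedral and square planar). The function name and tuple layout are hypothetical, and other geometries would need additional ideal-angle classes (for example, 120° in a trigonal bipyramid) that this sketch does not cover.

```python
from itertools import combinations


def coordination_hyperedges(sites, trans_pairs, cis_angle=90.0):
    """Build (A, M, B, relation, ideal_angle) tuples for every site pair.

    sites       -- hashable site identifiers around the metal "M"
    trans_pairs -- iterable of 2-tuples of sites that are mutually trans
    Assumes every non-trans pair is cis with the same ideal angle.
    """
    trans = {frozenset(pair) for pair in trans_pairs}
    edges = []
    for a, b in combinations(sites, 2):
        if frozenset((a, b)) in trans:
            edges.append((a, "M", b, "trans", 180.0))
        else:
            edges.append((a, "M", b, "cis", cis_angle))
    return edges


# Octahedral example: six sites, three trans pairs.
octahedral = coordination_hyperedges(
    sites=[1, 2, 3, 4, 5, 6],
    trans_pairs=[(1, 2), (3, 4), (5, 6)],
)
n_trans = sum(1 for e in octahedral if e[3] == "trans")
n_cis = sum(1 for e in octahedral if e[3] == "cis")
print(len(octahedral), n_trans, n_cis)
```

An octahedron therefore contributes 15 two-site relations (3 trans, 12 cis), none of which require 3D coordinates, only the map.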
For frontier orbital energies, all geometry-aware architectures proved highly effective and tightly clustered. HOMO prediction yielded R2 values of 0.980 (LF-GNN), 0.979 (DeepSets), 0.978 (Virtual Node), and 0.977 (HyperMPNN), compared to 0.968 for the bond-only MPNN (SI Fig. S15–S24). LUMO predictions followed a similar pattern, with geometry-aware models spanning R2 0.972–0.978 versus 0.966 for the baseline (SI Fig. S25–S34). The HOMO–LUMO gap, while still largely dictated by ligand-field strength, showed a wider spread: R2 values ranged from 0.898 to 0.903 for the direct-readout architectures, 0.884 for HyperMPNN, and 0.868 for the baseline (Table 1, SI Fig. S35–S44).
| Model | Dipole (MAE/R2) | HL gap (MAE/R2) | HOMO (MAE/R2) | LUMO (MAE/R2) |
|---|---|---|---|---|
| MPNN | 1.272/0.715 | 0.212/0.868 | 0.160/0.968 | 0.160/0.966 |
| HyperMPNN | 1.097/0.813 | 0.191/0.884 | 0.133/0.977 | 0.142/0.972 |
| LF-GNN | 0.968/0.845 | 0.177/0.898 | 0.125/0.980 | 0.127/0.978 |
| DeepSets | 0.979/0.843 | 0.173/0.902 | 0.131/0.979 | 0.132/0.976 |
| Virtual node | 1.002/0.837 | 0.173/0.903 | 0.132/0.978 | 0.130/0.977 |
However, a stark performance hierarchy emerged when predicting the dipole moment, a vector property intrinsically sensitive to the spatial arrangement of ligands around the metal center. The bond-only MPNN, which treats coordination isomers as identical, achieved R2 = 0.715 (MAE = 1.27 D). HyperMPNN, which encodes T-REX-derived cis/trans hyperedges but routes their information through atom features, improved substantially to R2 = 0.813 (MAE = 1.10 D). The three architectures with a direct coordination-sphere readout channel performed best and were effectively interchangeable: LF-GNN (R2 = 0.845, MAE = 0.97 D), DeepSets (R2 = 0.843, MAE = 0.98 D), and virtual node (R2 = 0.837, MAE = 1.00 D) (Fig. 5, SI S45–S54 and Table 1). The consistency across three fundamentally different hyperedge processors indicates that the T-REX topology itself, rather than the specific neural architecture, is the primary factor improving the performance of shape-sensitive properties.
Fig. 5 Parity plots of predicted vs. calculated dipole moment on the set aside test set using the LF-GNN (top), HyperMPNN (middle), and MPNN (bottom) architectures.
Notably, the performance hierarchy reveals that two architectural choices matter independently: encoding coordination topology at all (MPNN to HyperMPNN, ΔR2 ≈ 0.10 on dipole), and providing that topology a direct path to the prediction head (HyperMPNN to LF-GNN, ΔR2 ≈ 0.03 on dipole). The benefit of a dedicated coordination-sphere readout channel parallels the established advantage of separating metal-centered features19 (mc-RAC) from full-complex descriptors, here realized as learnable two-body representations derived from the T-REX trans-pair map. Crucially, all five architectures operate on identical atom and bond featurization. The only variable is whether and how the model accesses the coordination map encoded in the T-REX string, confirming that trans-pair encoding at the string level is sufficient to recover the geometry dependence of strongly shape-sensitive properties without explicit 3D coordinates.59–61
As a standard cheminformatics comparison, we also trained ECFP4/random-forest models from T-REX-derived RDKit molecular graphs with and without T-REX-derived virtual trans bonds. For HOMO–LUMO gap prediction, the bond-only ECFP4/RF model gave R2 = 0.644, close to previously reported RDKit-SMILES fingerprint baselines,17 while adding virtual trans bonds gave R2 = 0.625. For dipole moment, the same comparison improved from R2 ≈ 0.51 without trans bonds to R2 = 0.582 with trans bonds. These results show that T-REX-derived trans-pair topology can benefit even classical fingerprints for geometry-sensitive properties, while also confirming that the bond-only MPNN is a stronger learned graph baseline than ECFP/RF.
To assess data efficiency, we trained ensembles of three models at 10%, 25%, 50%, 75%, and 90% of the training data and evaluated on the full test set for all four properties (Fig. 6, SI S55–S57). The magnitude and onset of the geometry-aware advantage scaled directly with the shape-sensitivity of the target property. For dipole moment, the three direct-readout architectures at 25% of the training data (∼12,700 complexes) already exceeded the bond-only MPNN trained on the full dataset (R2 ≈ 0.73 vs. 0.72), representing a roughly four-fold reduction in labeled data needed to reach equivalent accuracy. For the HOMO–LUMO gap, the geometry-aware advantage was consistent but diminished, with clear separation emerging by 25% of training data. For HOMO and LUMO energies, all architectures converged rapidly and showed minimal separation across data fractions, consistent with the weaker geometry dependence of frontier orbital energies, though geometry-aware models maintained a small but consistent edge at full training set size across all properties.
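The subset size behind the four-fold figure can be reproduced from the reported split. The sketch assumes that the training, validation, and test sets together partition the 63,375-complex dataset; that assumption is ours and is used only to recover the training-set size.

```python
total = 63_375          # unique T-REX strings in the curated dataset
val = test = 6_337      # reported validation and test set sizes

# Assumption: the three sets partition the dataset.
train = total - val - test

fraction = 0.25
subset = round(fraction * train)
reduction = 1 / fraction

print(train, subset, reduction)
```

A quarter of the inferred training set is about 12,700 complexes, matching the value quoted above, and reaching baseline accuracy at that fraction corresponds to a four-fold reduction in labeled data.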
Finally, we tested isomer-resolved prediction using a stricter isomer-holdout split in which all 516 coordination-isomer structures identified in tmQMg were assigned to the test set, while all remaining complexes were split into train and validation sets. This split prevents memorization of family-specific isomer offsets and instead tests whether models can generalize topology-property relationships across chemically distinct isomer families. For dipole prediction, the bond-only MPNN performed poorly on this diagnostic, giving R2 = −0.139 and MAE = 3.03 D, whereas LF-GNN achieved R2 = 0.871 and MAE = 1.16 D (Fig. 7). Pairwise Δ-dipole prediction within isomer families showed an even larger separation: MPNN gave R2 = 0.111 with the majority of pairwise contrasts compressed toward Δ = 0, whereas LF-GNN gave R2 = 0.918 and recovered the direction of large isomer effects (Fig. 7). The MPNN does not collapse every contrast because the ablated graph still includes a metal-centered chirality/achirality feature, which provides symmetry information. For HOMO–LUMO gap, both models retained strong absolute performance on the isomer-holdout set, but within-family Δ-gap prediction was more subtle; LF-GNN improved pairwise Δ-gap R2 from 0.090 to 0.237 and Spearman correlation from 0.175 to 0.559 (SI Fig. S58). These results show that T-REX-derived coordination topology is most critical for strongly geometry-sensitive targets such as dipole moment, while more subtle isomer-dependent orbital-energy shifts likely require larger, deliberately completed isomer-resolved training datasets.
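The pairwise Δ metric can be made concrete with a short sketch. The family labels and dipole values below are invented for illustration, the helper names are hypothetical, and the ordering of isomers within each pair (which fixes the sign of Δ) is an arbitrary choice here.

```python
from itertools import combinations


def pairwise_deltas(values_by_family):
    """Differences v_j - v_i for every isomer pair within each family."""
    deltas = []
    for family in sorted(values_by_family):
        vals = values_by_family[family]
        deltas.extend(vals[j] - vals[i]
                      for i, j in combinations(range(len(vals)), 2))
    return deltas


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Invented dipoles (D) for one isomer pair and one isomer triplet.
truth = {"fam1": [0.0, 6.0], "fam2": [2.0, 5.0, 3.0]}
# A topology-blind model assigns one value to all isomers of a family.
blind = {"fam1": [3.0, 3.0], "fam2": [3.3, 3.3, 3.3]}

true_deltas = pairwise_deltas(truth)
r2_blind = r2_score(true_deltas, pairwise_deltas(blind))
r2_perfect = r2_score(true_deltas, pairwise_deltas(truth))
print(r2_blind, r2_perfect)
```

A model that cannot distinguish isomers predicts Δ = 0 for every pair and scores a negative R2 on this metric even if its family-level averages are reasonable, which is why the within-family contrast is a sharper diagnostic than absolute accuracy.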
We have introduced T-REX, a canonical line notation that encodes transition-metal complexes as modular strings combining metal identity, electronic state, ligand payloads, a trans-pair map, and a metal-centered chirality flag that together uniquely specify coordination topology and stereochemistry for monometallic CN ≤ 7 species. An extraction pipeline converts over 63,000 literature structures from the tmQMg dataset into canonical strings, and systematic isomer classification reveals five distinct classes of structural relationships including coordination isomers, enantiomers, linkage isomers, hemilabile isomers, and geometric isomers, while confirming that crystallographic databases dramatically underrepresent the full space of accessible topological variants. By treating T-REX strings as both compact keys and manipulable objects, we enumerate 149,228 unique coordination isomers and enantiomers, construct large libraries of chemically plausible complexes via ligand-class substitutions, and generate a dedicated enantiomer dataset (T-REX-ent) from the 12,370 chiral complexes identified during conversion. Interfacing T-REX with RDKit enables information-enriched graphs and hypergraphs: across five neural network architectures, we show that encoding T-REX-derived coordination topology consistently improves predictions of calculated properties, with the largest gains on dipole moment (R2 = 0.845 vs. 0.715 for bond-only baselines), and that this advantage persists at reduced training set sizes, reflecting a roughly four-fold improvement in data efficiency. Together, these results position T-REX as both an interoperable data format and an ML-ready representation for transition-metal chemistry, providing a foundation for more systematic dataset curation, geometry-aware learning, and generative design across catalysis, materials, and bioinorganic discovery.
Canonicalization added modest overhead during dataset processing. On the 63,375-complex tmQMg-derived dataset, end-to-end parsing and full RDKit-aware canonicalization required a median of 1.56 ms per complex, corresponding to 101 s total runtime, while a lightweight canonicalization without RDKit required a median of 0.077 ms per complex and 5.0 s total runtime. Timing was calculated on a MacBook Air with an M4 processor using a single core.
Supplementary information (SI): example T-REX strings for linear, bent, trigonal planar, seesaw, square pyramidal, trigonal bipyramidal, and piano-stool complexes; the distribution of oxidation states for the 20 most common metals; the distribution of coordination geometries and coordination numbers for the 20 most frequently occurring metals in the tmQMg dataset; examples of linkage, hemilabile, and geometric isomers; ligand classification and generative datasets; t-SNE visualizations of the chemical space covered by metal hydrides and by the tmSCO expansion; for each target property (HOMO energy, LUMO energy, HOMO–LUMO gap, and dipole moment) and each architecture (MPNN, HyperMPNN, LF-GNN, DeepSets LF-GNN, and Virtual Node LF-GNN), parity plots of the test set across 5 independent seeds and the ensemble average, together with the error distribution of the test set across 5 independent seeds; learning curves for HOMO–LUMO gap, HOMO energy, and LUMO energy prediction; isomer-holdout evaluation of T-REX-derived coordination topology; general software model; string parsing; canonicalization algorithm; 3D structure to T-REX translation; chirality detection; isomer classification; coordination isomer enumeration; ligand substitution and generative expansion; graph neural networks. See DOI: https://doi.org/10.1039/d6dd00129g.
This journal is © The Royal Society of Chemistry 2026