Open Access Article
Ankur K. Gupta†
*,
Caitlin V. Hetherington†‡
and
Wibe A. de Jong
*
Applied Mathematics and Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA. E-mail: ankur@lbl.gov; wadejong@lbl.gov
First published on 26th November 2025
The separation of rare-earth metals, vital for numerous advanced technologies, is hampered by their similar chemical properties, making ligand discovery a significant challenge. Traditional experimental and quantum chemistry approaches for identifying effective ligands are often resource-intensive. We introduce a machine learning protocol based on an equivariant neural network, Allegro, for the rapid and accurate prediction of binding energies in rare-earth complexes. Key to this work is our newly curated dataset of rare-earth metal complexes—made publicly available to foster further research—systematically generated using the Architector program. This dataset distinctively features functionalized derivatives of proven rare-earth-chelating scaffolds, hydroxypyridinone (HOPO), catecholamide (CAM), and their thio-analogues, selected for their established efficacy in binding these elements. Trained on this valuable resource, our Allegro models demonstrate excellent performance, particularly when trained to directly predict DFT-level binding energies, yielding highly accurate results that closely correlate with theoretical calculations on a diverse test set. Furthermore, this strategy exhibited strong out-of-sample generalization, accurately predicting binding energies for an isomeric HOPO-derivative ligand not seen during training. By substantially reducing computational demands, this machine learning framework, alongside the provided dataset, represent powerful tools to accelerate the high-throughput screening and rational design of novel ligands for efficient rare-earth metal separation.
In the same context, selective precipitation16,17 has emerged as a promising strategy for separating REEs. This approach involves precipitating individual REEs from aqueous mixtures by binding them with organic ligands (or complexing agents), followed by isolation through filtration. The success of this method hinges on discovering novel ligands that bind to REEs with high specificity and selectivity. While hydroxypyridinone (HOPO) based ligands have demonstrated serendipitous success, the vast chemical space remains largely unexplored, offering significant potential for the discovery of more efficient ligands that could revolutionize rare-earth separation processes.18–22
To address the need for efficacious ligands for REE extraction and separation, the exploration of the vast chemical space via a high-throughput screening approach is essential. A key factor in determining a ligand's efficacy in REE separation is its binding affinity to a given REE, which can be deduced experimentally through titration techniques and subsequently quantified using the resultant equilibrium constants and IC50 values.23 However, the synthesis and experimental evaluation of thousands of novel ligands for REE separation are time-consuming, resource-intensive, and cost-prohibitive, rendering wet-lab-based high-throughput screening impractical. Calculated metal–ligand binding energies are thermodynamically related to equilibrium constants; for a given metal and ligand series, a more negative binding energy implies a favorable exchange toward that ligand and thus greater extraction efficacy.23 Our study focuses on the computational prediction of binding energy, serving to rank and prioritize the most promising ligand candidates for subsequent, targeted experimental validation through techniques such as spectrofluorimetric titration. Thus, quantum chemistry-based calculations offer a viable alternative for high-throughput screening of unexplored ligands through a virtual platform. This approach allows precise control over molecular structures and parameters, such as charge, oxidation state, and ligand properties, irrespective of their complexity or synthetic accessibility. However, for accurate computation of rare-earth complex (REC) properties, the choice of theoretical method is crucial due to their complex electronic structure arising from the involvement of f-electrons. This complexity results in varied spin multiplicities, polarization effects,24 and multi-reference character across the REE series. While computationally intensive correlated wave function theory (e.g., CCSD(T)) and multi-reference methods (e.g., CASSCF) offer higher accuracy, density functional theory (DFT) with a reasonably sized basis set provides a practical balance between cost and accuracy for computing reliable binding energies and other properties for RECs. Nevertheless, applying DFT to potentially hundreds of thousands of RECs remains computationally challenging, particularly since RECs can exhibit noisy and prolonged self-consistent field (SCF) convergence, further exacerbating computational time and resource demands. Static correlation could also be incorporated during dataset generation for RECs using correction methods over DFT, such as DFT + U or static correlation correction (SCC),25,26 though, while relatively cost-effective, additional accuracy challenges could arise.
In response to the computational challenges posed by quantum chemistry calculations, machine learning techniques offer a promising avenue for rapid and cost-effective predictions of REC properties. However, the success of any machine learning method fundamentally depends on the quality and breadth of the dataset on which it is trained. Despite the potential of machine learning, the field is currently hampered by a significant lack of datasets that are specifically curated for REC structures and properties, as most existing datasets mainly focus on organic or main group elements.27 To bridge this gap, we propose and develop a rigorous protocol to generate datasets targeting RECs, particularly those involving ligands known to be effective in REE separation. Additionally, we demonstrate various machine learning strategies, utilizing state-of-the-art equivariant neural networks, to predict REC binding energies, paving the way for accelerated property prediction and ultimately, the discovery of novel ligands for efficient REE separation through the application of molecular inverse design algorithms.
![]() | ||
| Fig. 1 Molecular geometries of HOPO, thio-HOPO, CAM, and thio-CAM ligands (grey: C, red: O, blue: N, yellow: S, white: H). Functionalization was carried out at the C3 position. | ||
The data curation protocol comprised of five steps, as illustrated in Fig. 2. The first step involved functionalizing HOPO- and CAM-based ligands (Fig. 1) with diverse substituents at position C3, i.e. diametrically opposite the metal coordinating atom that is adjacent to the side chain (Fig. 2, step 1). This diversification of the dataset enabled the investigation of the influence of various chemical and electronic environments on REE binding affinities. The selected substituents (Fig. 3) were all uncharged and encompassed a range of electron-donating groups (e.g., alcohols, amines) and electron-withdrawing groups (e.g., carboxylic acids, esters), allowing for the exploration of ligand polarity effects on REE binding affinity. Furthermore, to examine how different heteroatoms modulate metal–ligand binding strength, we included substituents containing nitrogen (e.g., amino), oxygen (e.g., hydroxyl), sulfur (e.g., thiol), and phosphorus (e.g., phosphonate). Substituent size and complexity were systematically varied, ranging from simple methyl groups to larger, more complex moieties like 2-methylbenzoic acid, to generate a diverse array of ligand sizes and shapes. Halogen substituents, such as chloride and fluoride, were also incorporated to investigate their impact on REC acidity. Notably, due to its higher electronegativity, fluoride increases the complex's acidity (lower pKa) compared to chloride through inductive electron withdrawal. Furthermore, the dataset focused exclusively on mononuclear RECs, with the central metal atom being the sole heavy metal present in the complexes.
![]() | ||
| Fig. 3 Geometries of substituents that are attached to the ligands (grey: C, red: O, blue: N, yellow: S, white: H, pale green: F, bright green: Cl). | ||
In the second step, 3D geometries of RECs were generated using Architector. Each complex consisted of the previously discussed ligands (Fig. 1) attached to a REE metal center, and identical ligands were used for each individual REC. This process utilized the SMILES strings of the ligands and Architector-specific parameters as input (Fig. 2, step 2). The Architector setup involved determining the optimal metal center symmetry and 3D ligand geometry to generate a diverse set of REC conformers. The following settings were employed: GFN2-xTB32 (version 6.6.0) for structure optimization and energy-based ranking of the generated structures, and a coordination number of eight for all complexes, consistent with common coordination environments observed for RECs.33 Up to ten different metal-centered symmetries were explored during the generation and optimization of multiple conformers. To enhance conformational sampling and ensure identification of the most stable structures, the Conformer-Rotamer Ensemble Sampling Tool (CREST, version 2.12)34 integrated into Architector was employed in conjunction with GFN2-xTB. CREST initiated a further search for lower-energy conformations, starting from the lowest-energy REC conformer obtained in the previous step. This approach facilitated a more thorough exploration of the conformational space. The resulting lowest-energy conformer from the CREST protocol was then utilized in subsequent steps, as detailed below.
To optimize computational efficiency and minimize processing time, only lanthanum (La) complexes were initially constructed during the REC structure generation step. La serves as an excellent representative for the entire series of REEs due to their shared chemical properties. The selection of La as the starting point for generating RECs was motivated by two primary factors. Firstly, La possesses the largest atomic radius among REEs, facilitating greater adaptability when subsequently replacing La with other REEs in each REC structure. This allowed for structural modifications without significantly altering metal–ligand coordination behavior. Secondly, La's natural and only stable oxidation state of +3 is universally accessible across all REEs, ensuring consistency in electronic configuration throughout the REE series.
To simulate a realistic aqueous environment while ensuring computational feasibility, we employed a hybrid solvation strategy. In addition to an implicit model for the bulk solvent, we explicitly included water molecules in the primary coordination sphere to model the crucial competition for metal binding. To capture the diversity of potential coordination environments, we generated four distinct types of REC geometries by varying the ratio of bidentate ligands to explicit water molecules, all while maintaining a coordination number of eight (Fig. 4). This approach not only enhanced the physical realism of our models but also increased the overall diversity and size of the dataset. The first geometry type consisted of four bidentate ligands attached to the metal center (Fig. 4a). The remaining three types of complexes incorporated water molecules, reflecting their potential to bind to the metal in aqueous media. The second geometry type featured three ligands and two water molecules (Fig. 4b), while the third type comprised two ligands and four water molecules (Fig. 4c). The final geometry type included one ligand and six water molecules coordinated to the metal center (Fig. 4d). This comprehensive set of geometries allowed for a thorough exploration of various ligand–water combinations in the coordination sphere, providing a more realistic representation of RECs in aqueous environments.
Following the CREST search, the lowest-energy lanthanum (La)-ligand complexes were used as templates for further exploration. La was systematically substituted with 15 other REEs (Fig. 2, step 3): yttrium (Y), cerium (Ce), praseodymium (Pr), neodymium (Nd), promethium (Pm), samarium (Sm), europium (Eu), gadolinium (Gd), terbium (Tb), dysprosium (Dy), holmium (Ho), erbium (Er), thulium (Tm), ytterbium (Yb), and lutetium (Lu). This substitution strategy expanded the dataset, essential for developing robust machine learning models. Throughout this process, the ligand identity and molecular charge remained unchanged; only the spin multiplicity was adjusted to reflect the properties of the substituted REEs. Since the ligands are closed-shell, the spin multiplicity of each complex was assigned based on the corresponding isolated REE ion in its +3 oxidation state.
All generated structures were optimized using the GFN2-xTB32 method, employing the analytical linearized Poisson–Boltzmann (ALPB) implicit water solvation model35 to better simulate experimental conditions (Fig. 2, step 4). This computationally efficient semiempirical method was chosen to balance geometric accuracy with the high-throughput demands of generating several thousand optimized complex structures.27 To ensure high-quality data for the machine learning model, a quality control procedure was also implemented. During GFN2-xTB geometry optimizations, some RECs underwent changes in atom connectivity, potentially leading to incorrect ligand valencies. To identify and exclude such erroneous structures, we converted the ligands into their hashed International Chemical Identifiers (InChIKeys).36 As InChIKeys function as unique molecular identifiers, a complex was retained only if the InChIKeys of its ligands matched those of the corresponding free ligands; otherwise, it was excluded. Only a small number of complexes were filtered out during this process, yielding a total number of 5356 complexes.
Despite being robust for the structure optimization of large transition-metal complexes,37 xTB has limited applicability to REEs in which f-electrons play a crucial role in determining their properties, due to its semi-empirical and highly parameterized nature, leading to reduced accuracy. A notable limitation of the current GFNn-xTB methods is their lack of spin-dependent energy expressions, which prevents proper differentiation between high-spin and low-spin states. Therefore, to achieve higher fidelity in energy calculations, we computed single-point energies and gradients using density functional theory (DFT) at the B3LYP-D438,39 level of theory. The def2-SVPD40 basis set was used for the ligands, and the def2-TZVP basis set was employed for the metals, utilizing the ORCA41,42 program (version 5.0.4). Effective Core Potentials43 (ECPs) were applied to account for the core electrons of the metals. The RIJCOSX approximation was employed to efficiently compute Coulomb integrals and perform numerical integration for Hartree–Fock exchange. Additionally, the SMD44 implicit solvation model was used to simulate aqueous conditions more accurately.
To quantify the affinity of a ligand to the metal in RECs, we calculated the metal–ligand binding energies using the absolute energies obtained from DFT. Theoretically computed binding affinities are thermodynamically related to equilibrium constants (ΔG = −RT
ln(K)). Here binding energy is evaluated with matched fragment stoichiometry, charge, and spin (and consistent protonation states), providing a robust proxy for the free energy when comparing ligand efficacy for metal extraction. Thus, the binding energy (Ebinding) for an REC can be defined as
![]() | (1) |
An equivariant neural network is a model that respects the physical symmetries of a molecule by operating directly on 3D atomic coordinates. Its internal representations are designed to transform consistently with the molecule's rotation or translation, ensuring that predicted scalar properties (like energy) remain unchanged (invariant), while predicted vector properties (e.g. forces, dipoles) transform in the exact same way as the molecule (equivariant). This approach has been shown to significantly improve data efficiency and accuracy for property prediction. The inherent E(3) equivariance of the Allegro framework is key to its exceptional performance, allowing it to achieve high accuracy with less training data—a crucial benefit when using datasets derived from computationally expensive, high-fidelity quantum chemical calculations. We also note that while architectures like Allegro are often used as interatomic potentials, their design as general-purpose equivariant networks makes them highly adept for direct property prediction tasks,49 as demonstrated in this work.
For model development, a standard data partitioning strategy of 80
:
10
:
10 was implemented, randomly allocating samples for the training, validation, and test sets, respectively. All Allegro models were configured with a 6 Å radial cutoff. This cutoff was chosen to effectively capture the relevant atomic interactions within the metal complexes, encompassing both the primary coordination sphere interactions and pertinent secondary non-covalent effects. The training process utilized the mean absolute error (MAE) in energy as the loss function. Parameter optimization was performed using the Adam algorithm, coupled with a ReduceLROnPlateau learning rate scheduler that initiated at a rate of 0.01. Training for each model proceeded until the learning rate adaptively decreased to 10−5, signaling convergence. Comprehensive hyperparameter specifications for the Allegro models tested are detailed in Table S3 of the SI.
We evaluated two distinct modeling strategies using the Allegro architecture (detailed in Section 2.2) to predict the binding energies of the RECs. The first approach, hereafter referred to as Strategy 1, involved training the ML model on the absolute energies of the complexes computed via DFT. The binding energies were then derived from the model's predicted absolute energies using eqn (1), requiring separate DFT calculations for the energies of the isolated metal ions and ligands performed at the same level of theory. The second approach, termed Strategy 2, trained the ML model to predict the binding energies directly, using pre-computed binding energies as target labels, thus eliminating the need for post-processing calculations.
Fig. 6 presents parity plots comparing the DFT-calculated (ground truth) binding energies with the Allegro-predicted values for both strategies on the test set. While Strategy 1 (predicting absolute energies) yielded a strong correlation with a coefficient of determination (r2) of 0.91 and a mean absolute error (MAE) of 11.3 kcal mol−1 (Fig. 6a), Strategy 2 (predicting binding energies directly) proved superior. The direct prediction method employed in Strategy 2 achieved both a higher correlation (r2 = 0.96) and a significantly lower MAE of 6.1 kcal mol−1 (Fig. 6b), indicating it provides more accurate binding energy predictions for RECs. Results obtained with a simpler Allegro architecture (employing only 2 tensor features), are presented in Fig. S1 of the SI. This simpler model variant yields lower performance for Strategy 1 (absolute energy prediction, r2 = 0.85), while Strategy 2 (direct binding energy prediction) maintains comparable accuracy (r2 = 0.96).
Overall, the model employing 4 tensor features and trained using Strategy 2 is well-suited for its primary purpose of high-throughput virtual screening, a context where the ability to correctly rank potential ligands is the most critical metric. Its high correlation with DFT reference data (r2 = 0.96) demonstrates a strong capacity to reliably distinguish between strong and weak binders, enabling the rapid down-selection of promising candidates from vast chemical libraries for more rigorous and costly evaluation.
In this study, we employed the semi-empirical GFN2-xTB method as the low-cost baseline and DFT (B3LYP-D4) as the high-cost target level. Within the Δ-ML framework, the energy at the target DFT level (EML-DFT) is estimated by adding an ML-predicted correction (ΔML) to the baseline GFN2-xTB energy (EGFN2-xTB),
| EML-DFT = EGFN2-xTB + ΔML. | (2) |
The correction term, ΔML, represents the learned difference between the high-fidelity DFT energy (EDFT) and the low-fidelity GFN2-xTB energy,
| ΔML = EDFT − EGFN2-xTB. | (3) |
We applied this Δ-ML concept to our two primary modeling strategies:
• Strategy 1 (Δ-ML on absolute energies): the Allegro ML model was trained to predict the difference in absolute energies between DFT and GFN2-xTB (ΔML from eqn (3)). The predicted ΔML for the test set complexes was added to their GFN2-xTB energies (calculated separately) according to eqn (2) to estimate the absolute DFT energies. Subsequently, binding energies were calculated using eqn (1), requiring the corresponding isolated metal and ligand energies at the DFT level.
• Strategy 2 (Δ-ML on binding energies): this approach offers a more direct route to the target property. The Allegro model was trained to learn the difference between the binding energies calculated at the DFT level and the GFN2-xTB level (ΔEbinding = Ebinding,DFT − Ebinding,GFN2-xTB). The predicted ΔEbinding for the test set complexes was then added to their GFN2-xTB binding energies (calculated separately) to estimate the binding energy at the DFT level.
Fig. 7 displays the parity plots comparing the Δ-ML predicted binding energies against the ground truth DFT values for the test set. Applying Δ-ML to the first strategy (absolute energies, Fig. 7a) resulted in an r2 of 0.92 and an MAE of 9.9 kcal mol−1. This represents a modest improvement in the metrics compared to the direct approach for Strategy 1 (r2 = 0.91, MAE = 11.3 kcal mol−1, Fig. 6a). For the second strategy (direct binding energy prediction, Fig. 7b), the Δ-ML approach yielded an r2 of 0.95 and an MAE of 6.6 kcal mol−1. Compared to the direct prediction of binding energies without the delta correction (Strategy 2: r2 = 0.96, MAE = 6.1 kcal mol−1, Fig. 6b), the Δ-ML approach in this case resulted in a slightly lower correlation and a slightly higher MAE, indicating a comparable but not superior performance for REC binding energy predictions. Fig. S2 in the SI details the performance of a simplified Allegro model variant with 2 tensor features, where Strategy 2 (direct binding energy prediction) sustains a similar level of accuracy (r2 = 0.95), while Strategy 1 (absolute energy prediction) experiences a notable decrease in performance (r2 = 0.83).
O) groups are interchanged relative to HOPO. Under our binding conditions, the N-hydroxy is deprotonated (–N–O−), and coordination remains O,O-bidentate via the deprotonated N-hydroxy oxygen and the carbonyl oxygen (see Fig. 8 for structure).53 The structural alteration in HDEV creates a chemical environment distinct from the ligands (HOPO, CAM, and their thio-analogues from Fig. 1) used in the model's training set, making it well-suited for this validation. Additionally, HDEV has been identified experimentally for its selective binding to REEs, establishing its relevance as a candidate for synthetically feasible rare-earth separation.53 For this out-of-sample evaluation, 325 RECs without any coordinated water molecules, featuring HDEV—functionalized at the C3 position with the same range of substituents shown in Fig. 3—were generated using the high-throughput protocol detailed in Section 2.1 and illustrated in Fig. 2. These HDEV complexes formed a dedicated out-of-sample test set, and binding energy predictions for these complexes were made using the Allegro models previously trained on 80% of the original dataset (which excluded any HDEV complexes).
![]() | ||
| Fig. 8 Optimized geometry of HDEV (grey: C, red: O, blue: N, white: H). Functionalization was carried out at the C3 position. | ||
The predictive performance of the different modeling strategies on this HDEV test set is illustrated by the parity plots in Fig. 9, which compare Allegro-predicted binding energies against the ground truth DFT values. When employing Strategy 2 (direct prediction of binding energies), the model demonstrated excellent correlation for the HDEV complexes. As shown in Fig. 9a, this approach yielded an r2 of 0.92 and an MAE of 9.4 kcal mol−1. These results indicate that directly training on precomputed binding energies enables robust predictions for RECs with ligands not encountered during training. In stark contrast, Strategy 1 (predicting absolute energies first) proved less effective for generalizing to new, unseen RECs. This approach yielded a significantly lower r2 value of 0.10 (refer to Fig. S3a in the SI) and exhibited a large deviation from the ideal y = x correlation. This shift from ground truth binding energies may stem from inconsistencies between the ML-predicted absolute energies of the RECs and the DFT-calculated absolute energies of the free ligands used to compute the final binding energies.
We also tested the Δ-ML approach where the model learns the difference in binding energies (ΔEbinding) between DFT and GFN2-xTB (Δ-ML Strategy 2). The results for this strategy, depicted in Fig. 9b, show an r2 of 0.87 and an MAE of 19.7 kcal mol−1. While still showing a reasonable correlation, this Δ-ML strategy was less accurate for the HDEV test set compared to the ML model trained directly on binding energies (Fig. 9a). Furthermore, Δ-ML Strategy 1, which predicts absolute energy differences, showed improved correlation over the original Strategy 1 with an r2 of 0.34 (Fig. S3b in the SI) but remained insufficiently accurate for practical applications. Overall, the high accuracy achieved with the HDEV out-of-sample test set, particularly by the Strategy 2 model trained directly on precomputed binding energies, underscores the model's generalizability and its promising potential for reliable high-throughput screening of novel ligand candidates.
Key findings indicate that machine learning models, even when trained on moderately sized datasets, can achieve strong correlations with high-fidelity DFT calculations. Specifically, we demonstrated that training the ML model to predict binding energies directly is more effective and accurate (r2 = 0.96, MAE = 6.1 kcal mol−1 on the initial test set) than predicting absolute energies first. While the Δ-ML approach, learning the correction between GFN2-xTB and DFT, showed promise by improving upon the absolute energy prediction strategy, it did not surpass the performance of directly predicting DFT-level binding energies for our primary test sets. Crucially, the model predicting binding energies directly also exhibited encouraging out-of-sample generalization when tested on complexes with the isomeric HDEV ligand (r2 = 0.92, MAE = 9.4 kcal mol−1), highlighting its potential for broader applicability in screening novel candidates.
Despite these promising results, certain limitations warrant acknowledgment. While Architector aids in generating diverse structures, the vastness of chemical space means our current dataset, though substantial, represents only a fraction of potential ligand motifs and substituent combinations. Future investigations should aim to broaden the scope and robustness of these predictive models. Extending the dataset with a wider array of ligand backbones, diverse functional groups, and potentially different metal oxidation states or molecular charges would be beneficial. A particularly exciting avenue for future work lies in the development of self-supervised foundation models.56–59 By training on large-scale datasets of REC geometries—which can be generated more readily using tools like Architector without the immediate need for computationally expensive DFT labels—these models could learn fundamental representations of metal–ligand interactions, potentially enhancing transferability and applicability to a wider range of systems and tasks with minimal fine-tuning. Additionally, while this study has focused on the prediction of static binding energies, we envision this work as a foundational step toward the development of a full interatomic potential (IP). Such a model, trained on forces from intermediate geometries, would enable dynamic simulations and geometry optimizations. The high accuracy achieved here validates the use of equivariant models for this chemical space and provides a clear path for prioritizing promising candidates for which the significant computational investment required to develop a full IP is warranted.
In summary, this work demonstrates the significant potential of equivariant neural networks to accelerate the computational screening of ligands for REE extraction. The developed models and insights pave the way for integration into high-throughput virtual screening workflows and, ultimately, toward the inverse design of novel, highly selective ligands, thereby contributing to more sustainable and efficient REE separation technologies.
Supplementary information (SI): tables of ligand charges and rare-earth element spin states (Tables S1 and S2); table of hyperparameters used in Allegro models (Table S3); parity plots for Allegro models (features = 2) for both direct binding energy and Δ-ML prediction approaches (Fig. S1 and S2); results for Strategy 1 (absolute energy prediction) applied to the HDEV out-of-sample complexes (Fig. S3); schematic of a graph neural network for chemistry (Fig. S4). Scripts and instructions for model training and evaluation can be found at: https://github.com/cvhetherington/Rare-Earth-Net. See DOI: https://doi.org/10.1039/d5dd00286a.
Footnotes |
| † Ankur K. Gupta (AKG) and Caitlin V. Hetherington (CVH) contributed equally to this work. |
| ‡ Present address: Institute for Advanced Computational Science and Department of Chemistry, Stony Brook University, Stony Brook, New York 11794, USA. |
| This journal is © The Royal Society of Chemistry 2026 |