Open Access Article
Hyun Kil Shin *ab and Youngho Sihn c
aPrediction Model Research Center, Korea Institute of Toxicology, Daejeon 34114, Republic of Korea. E-mail: hyunkil.shin@kitox.re.kr
bHuman and Environmental Toxicology, University of Science and Technology, Daejeon, 34113, Republic of Korea
cKorea Atomic Energy Research Institute, Daejeon, Republic of Korea
First published on 19th May 2025
A quantitative structure–activity relationship (QSAR) model for predicting the stability constant of uranium coordination complexes to accelerate the discovery of novel uranium adsorbents was developed and evaluated. Effective uranium adsorbents are crucial for mitigating environmental and health risks associated with uranium wastewater, an unavoidable byproduct of nuclear fuel production and power generation, as well as for sequestering uranium from seawater. QSAR modeling addresses the limitations of quantum mechanics calculations and offers a time- and cost-efficient computational approach for exploring vast chemical spaces. The QSAR model was built using a dataset of 108 uranium complexes, incorporating features such as physicochemical properties, coordination numbers of ligands, molecular charge, and the number of water molecules. Catboost regressor achieved an R2 of 0.75 on the external test set after hyperparameter optimization. Applicability domain analysis was conducted to evaluate model predictive performance. The QSAR model predicts stability constants from the molecular composition alone and is a valuable tool for the efficient design of safer and more sustainable uranium adsorption materials, potentially improving uranium collection processes.
000 years.1 This immense resource, however, is distributed at a very low concentration of about 3.3 ppb,2 making its extraction technically challenging. The retrieval of uranium is also associated with safety concerns since the traditional methods of uranium extraction, primarily terrestrial mining, pose significant environmental and health risks. Uranium mining and milling generate substantial radioactive waste, leading to contamination of water, soil, and air, with long-term consequences for ecosystems and human health. Thus, effective treatment methods are needed for the safe and sustainable extraction and use of uranium.3 Uranium wastewater containing uranyl ions poses direct environmental and health hazards. With uranium wastewater being an unavoidable byproduct of nuclear fuel production and power generation, proper waste management is imperative.
While adsorption is an effective method for sequestering uranium from wastewater,4 developing adsorbent materials that can efficiently capture the dilute concentrations of uranium present in seawater is a more difficult challenge. These materials must exhibit high selectivity for uranium over other metal ions, be resistant to biofouling, and maintain their performance over multiple adsorption–desorption cycles.5 The extraction of uranium from seawater has been explored for decades, but most adsorbents have proven ineffective, with the exception of polymeric adsorbents6 and amidoxime-based materials.7 Amidoxime-based polymers have emerged as the leading uranium adsorbents owing to their strong affinity for the uranyl ion (UO₂²⁺). The amidoxime functional group, which forms stable complexes with uranium, is central to the ability of these adsorbents to selectively adsorb uranium from seawater. However, large-scale implementation of seawater uranium extraction remains limited by the high costs associated with low extraction efficiency. Therefore, new adsorbents that can efficiently extract uranium from seawater must continue to be explored and developed.
The adsorption performance of uranium adsorbents can be assessed through the stability constant, which indicates the strength of the interaction between adsorbent material and uranium to form complexes.8 The stability constant is represented as follows:
![]() | (1) |
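For reference, the overall stability constant is conventionally defined through the complexation equilibrium sketched below. This is the textbook form, given under the assumption that it corresponds to the expression intended here; M, L, and n are generic symbols (metal ion, ligand, number of ligands) introduced for illustration, and charges are omitted.

```latex
\mathrm{M} + n\,\mathrm{L} \rightleftharpoons \mathrm{ML}_n,
\qquad
\beta_n = \frac{[\mathrm{ML}_n]}{[\mathrm{M}]\,[\mathrm{L}]^{n}}
```

Under this definition, a larger β (usually reported as log β) corresponds to a more stable complex.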
The quantitative structure–activity relationship (QSAR) model is an ML model whose input is a representation of the molecular structure and whose output is the activity of the input molecule (i.e., an experimentally measured property). In a QSAR model, features calculated from molecular structures (also called descriptors) are used to predict how the activity of a molecule varies with its structure. To the best of our knowledge, prior to this work, only one QSAR model for β prediction has been reported. Zahariev et al. developed a QSAR model based on graph neural networks and traditional ML models for predicting β of metal–ligand complexes, with the aim of designing new selective ligands for a target metal ion.11 The predictive accuracy of a QSAR model depends heavily on its applicability domain (AD).12 A prediction is reliable when the input molecule has structural features similar to those of the training data. If the target class of molecules makes up only a small portion of the training set, the structural patterns of those molecules are not well learned, and the model is unlikely to produce reliable predictions for them. Thus, for novel uranium adsorbent development, a model focused on the chemical space of uranium complexes can provide more reliable predictions than a general model.
In this study, we developed a QSAR model to predict β for uranium complexes. In total, 108 uranium complexes were collected together with their stability constants. The descriptors used in the model are physicochemical properties, coordination numbers for each ligand atom type, the charge number, and the number of water molecules due to hydroxylation. The molecular formula of each uranium complex was used to calculate four physicochemical properties (water solubility, boiling point, melting point, and pyrolysis point) using a neural network model for inorganic compounds. Catboost achieved the best prediction performance on the external test set (R2: 0.75). These evaluation results indicate that the model built in this study can support the discovery of novel uranium adsorbents.
β) was collected from the OECD-NEA thermochemical database and from research articles, together with the structural information of the uranium coordination complexes (ESI†). During data collection, we compared the log β values reported in different research articles and selected values with no significant discrepancies. However, data obtained from the OECD-NEA thermochemical database, a widely accepted and rigorously evaluated dataset, were used even when only a single value was available. The log β values in our dataset are therefore considered representative, reducing concerns over variability. The molecular formula and ligand atoms were used to represent the molecular structure of the uranium coordination complexes. The dataset contained 108 data points and was divided into training and test sets at a ratio of 8:2; thus, the training set comprised 86 data points and the test set comprised 22 data points.
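The 8:2 split arithmetic can be illustrated with a minimal sketch. The function name `split_indices`, the seed, and the choice to round the test size up are assumptions made for this illustration; the source states only the ratio and the resulting set sizes.

```python
import math
import random


def split_indices(n_samples, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) for a random split of range(n_samples)."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)
    # Assumption: the test size is rounded up, so 108 * 0.2 = 21.6 becomes 22.
    n_test = math.ceil(n_samples * test_fraction)
    test_idx = sorted(indices[:n_test])
    train_idx = sorted(indices[n_test:])
    return train_idx, test_idx


train_idx, test_idx = split_indices(108)
print(len(train_idx), len(test_idx))  # 86 22
```

Rounding the test size up reproduces the 86/22 partition reported for 108 data points.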
S), melting point (mp), boiling point (bp), and pyrolysis point (pp). These four physicochemical properties were predicted from the molecular formula of the uranium coordination complexes using a neural network model.13 The physicochemical property prediction models used here were developed for inorganic compounds, in contrast to most existing models, which focus on organic molecules. These models use the electron configuration of inorganic molecules, derived from their composition, to calculate the four physicochemical properties. In the collected dataset, the coordination complexes contain four different ligand atoms: N, O, F, and Cl. Thus, the number of each ligand atom was used as a feature. The prepared features are available in the ESI tables (training set: Table S1; external test set: Table S2†).
The model was further validated through a y-randomization test18 to check whether the model's performance was achieved by chance. In this test, the model was trained on a randomized (shuffled) endpoint, and its performance was compared with that obtained on the original endpoint. The Z-score was calculated as
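A y-randomization check of this kind can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: the Z-score is taken in its common form (original score minus the mean of the shuffled-endpoint scores, divided by their standard deviation), the score is an in-sample R2, and a one-descriptor least-squares line (`fit_line`, hypothetical) stands in for the actual regressor. The data are synthetic.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares line y = a + b*x; a stand-in for the real regressor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return lambda xi: a + b * xi


def r2(y_true, y_pred):
    """Coefficient of determination."""
    my = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - my) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def y_randomization_z(x, y, n_rounds=100, seed=0):
    """Return (original score, Z-score against shuffled-endpoint scores)."""
    rng = random.Random(seed)
    model = fit_line(x, y)
    original = r2(y, [model(xi) for xi in x])
    shuffled_scores = []
    for _ in range(n_rounds):
        y_shuffled = y[:]
        rng.shuffle(y_shuffled)  # break the descriptor-endpoint link
        m = fit_line(x, y_shuffled)
        shuffled_scores.append(r2(y_shuffled, [m(xi) for xi in x]))
    mean_s = statistics.fmean(shuffled_scores)
    std_s = statistics.stdev(shuffled_scores)
    return original, (original - mean_s) / std_s


# Synthetic data, for illustration only.
data_rng = random.Random(1)
x = [float(i) for i in range(30)]
y = [2.0 * xi + data_rng.gauss(0.0, 3.0) for xi in x]
original_r2, z = y_randomization_z(x, y)
```

A large positive Z-score indicates that the original performance is unlikely to arise from a chance correlation.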
| $h_i = x_i (X^{\mathrm{T}} X)^{-1} x_i^{\mathrm{T}}$ | (2) |
![]() | (3) |
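The leverage in eqn (2) can be computed as in the sketch below, written in plain Python so that it is self-contained. The helper names are hypothetical, the toy descriptor matrix is made up, and the warning threshold h* = 3(p + 1)/n is the form commonly used in Williams plots; it is assumed here rather than taken from the source. Whether an intercept column is included in X is likewise left open.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(u * v for u, v in zip(row, col)) for col in bt] for row in a]


def inverse(m):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; leverage is undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverages(X_train, X_query):
    """h_i = x_i (X^T X)^-1 x_i^T for each row x_i of X_query."""
    xtx_inv = inverse(matmul(transpose(X_train), X_train))
    result = []
    for x in X_query:
        tmp = [sum(xj * xtx_inv[j][k] for j, xj in enumerate(x))
               for k in range(len(x))]
        result.append(sum(t * xk for t, xk in zip(tmp, x)))
    return result


def warning_leverage(n_samples, n_features):
    # Assumed threshold h* = 3(p + 1)/n, not confirmed by the source.
    return 3.0 * (n_features + 1) / n_samples


# Toy descriptor matrix (4 samples, 2 features), for illustration only.
X_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
h_train = leverages(X_train, X_train)
h_far = leverages(X_train, [[10.0, -5.0]])[0]
```

A query point far from the training descriptors, such as the last one above, receives a leverage well above the threshold and would be flagged as outside the AD.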
S, mp, bp, and pp) were calculated based on the composition of the molecules. The wide range of each feature showed that the molecular compositions were also diverse. The molecular formulas of the molecules are available in Table S1 (training set) and Table S2 (test set).† The training/test split is an important issue, since the model's performance may not be evaluated correctly if the data distributions of the training and test sets differ. The data were randomly split multiple times to test whether random sampling causes a discrepancy in data distribution between the training and test sets; however, the test set was never biased. Therefore, the training and test data were prepared with a random split. Chemical space analysis showed that the training and test sets were well diversified, which indicates that the model performance evaluated on the external test set can be trusted within the AD of the model (Fig. 2).
The QSAR model prediction reliability was evaluated via AD analysis. This involves checking the chemical space, which represents the structural diversity of the compounds in each dataset. Normally, the chemical space is visualized based on the molecular weight and the water/octanol partition coefficient (log P) for organic molecules; however, the log P model does not apply to inorganic molecules. In this study, four physicochemical properties were calculated from the molecular formulas of the uranium coordination complexes; therefore, the chemical space of the data set was compared based on the molecular weight and the four physicochemical properties. The chemical spaces of the training set, test set, and candidate materials were compared as shown in Fig. 2. The training and test sets show a similar distribution in the chemical space. Also, the chemical space of the candidate materials was similar to that of the training and test data sets; therefore, we can conclude that the model trained and validated with the dataset can be reliably used to predict the β of the candidate materials.
| Abb. | Model | Q2, mean (bootstrapping^a) | Q2, std^b (bootstrapping^a) |
|---|---|---|---|
| Catboost | Catboost regressor | 0.70 | 0.10 |
| XGBoost | Extreme gradient boosting | 0.62 | 0.17 |
| RF | Random forest | 0.61 | 0.16 |
| kNN | K-Nearest neighbors regressor | 0.56 | 0.19 |
| LR | Linear regressor | 0.56 | 0.17 |
| SVR | Support vector regressor | 0.23 | 0.10 |

^a 200 times sampling was applied in the internal validation. ^b Standard deviation.
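The bootstrapped Q2 statistics in the table above can be illustrated with a short sketch. Several points are assumptions, since the source gives only the number of sampling rounds: each round resamples the data with replacement, fits on the resampled set, and scores on the samples left out of that round; Q2 is computed relative to the mean of the resampled training endpoints; and a one-descriptor least-squares line (`fit_line`, hypothetical) replaces the actual regressors. The data are synthetic.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares line y = a + b*x; a stand-in for the real regressor."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return lambda xi: a + b * xi


def q2(y_true, y_pred, y_train_mean):
    """Predictive squared correlation, referenced to the training mean."""
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_train_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def bootstrap_q2(x, y, n_rounds=200, seed=0):
    """Return (mean, standard deviation) of Q2 over bootstrap rounds."""
    rng = random.Random(seed)
    n = len(y)
    scores = []
    for _ in range(n_rounds):
        in_bag = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        held_out = sorted(set(range(n)) - set(in_bag))
        if len(held_out) < 2:
            continue  # too few left-out samples to score this round
        x_tr = [x[i] for i in in_bag]
        y_tr = [y[i] for i in in_bag]
        predict = fit_line(x_tr, y_tr)
        y_out = [y[i] for i in held_out]
        y_hat = [predict(x[i]) for i in held_out]
        scores.append(q2(y_out, y_hat, statistics.fmean(y_tr)))
    return statistics.fmean(scores), statistics.stdev(scores)


# Synthetic data, for illustration only.
data_rng = random.Random(1)
x = [float(i) for i in range(40)]
y = [1.5 * xi + data_rng.gauss(0.0, 2.0) for xi in x]
mean_q2, std_q2 = bootstrap_q2(x, y)
```

Reporting both the mean and the standard deviation, as the table does, shows how sensitive each model is to the particular samples it is trained on.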
| Catboost | R2 | RMSE | NRMSE | Endpoint range |
|---|---|---|---|---|
| Train | 0.99 | 0.04 | 0.05% | 86.4 |
| External validation | 0.75 | 10.28 | 12.31% | 83.5 |
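The NRMSE column can be checked with a short calculation. The sketch assumes that NRMSE is the RMSE divided by the endpoint range, expressed as a percentage; this is consistent with the tabulated values, but the source does not state the definition explicitly. The function name is hypothetical.

```python
def nrmse_percent(rmse, endpoint_range):
    """RMSE normalized by the endpoint range, in percent (assumed definition)."""
    return 100.0 * rmse / endpoint_range


external_nrmse = nrmse_percent(10.28, 83.5)
train_nrmse = nrmse_percent(0.04, 86.4)
print(round(external_nrmse, 2))  # 12.31
print(round(train_nrmse, 2))     # 0.05
```

Normalizing by the range puts the external RMSE of 10.28 log β units into perspective: it is roughly one eighth of the span of the endpoint values.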
Feature importance was analyzed by developing models with the composition features alone and with the physicochemical properties alone. When Catboost was evaluated with the five physicochemical properties (MW, log S, bp, mp, and pp) alone, the model achieved a Q2 of only 0.60 in bootstrapping (200 sampling rounds). With the six composition features alone (i.e., ligand_N, ligand_O, ligand_F, ligand_Cl, charge, and H2O), the model achieved a Q2 of 0.68. The composition features have low variance; therefore, the composition features alone are clearly limited in representing the structural diversity of uranium coordination complexes. According to this experiment, both the composition features and the physicochemical properties were needed to represent the molecular structure of the uranium coordination complexes; thus, the model performed best when all the features were used together (Table 3).
| Catboost | Q2, mean (bootstrapping^a) | Q2, std^b (bootstrapping^a) |
|---|---|---|
| PhysChem^c alone | 0.60 | 0.12 |
| Composition^d alone | 0.68 | 0.11 |
| PhysChem & composition | 0.70 | 0.10 |

^a 200 times sampling of bootstrapping. ^b Standard deviation. ^c PhysChem: physicochemical properties (molecular weight, water solubility, melting point, boiling point, pyrolysis point). ^d Composition: coordination number of ligand (N, O, F, Cl), charge, and the number of water molecules.
The charge state of the uranium complex significantly impacts β. Higher charges typically result in stronger electrostatic interactions between the uranium ion and the ligands, leading to more stable complexes. This is consistent with the observed feature importance, where the inclusion of molecular charge as a feature notably improved the model performance. The coordination number, which reflects the number of ligand atoms bonded to the central uranium ion, directly influences the stability of the complex. Ligands with a high coordination number can donate more electrons to the uranium ion, stabilizing the complex through stronger bonding interactions. The sensitivity of the model to changes in coordination number underscores its critical role in determining complex stability. Additionally, the number of water molecules in the system is significant, as hydration can either stabilize or destabilize the complex depending on specific interactions with the central metal ion and surrounding ligands. These composition features, when incorporated into the model, allow for a more comprehensive representation of the physicochemical factors governing complex stability, thus explaining the significant improvement in model performance.
AD analysis was applied to the candidate materials. Even though the candidate materials were within the chemical space of the training and test sets, there is always uncertainty in the predicted values, since the model had not previously been exposed to the candidate materials. Therefore, in addition to the leverage value, we compared the ranges of the 11 features of the candidate data with those of the training set: even if a molecule was within the AD according to its leverage value, it was marked as an outlier if it was out of range in even a single feature. After leverage analysis, 9 molecules were found within the AD. Two of these had features outside the descriptor ranges of the training set: the coordination number of N and water solubility. As a result, 7 molecules were considered reliably predicted by the model (Table 4). The detailed AD analysis can be found in Table S3.† The model can make reliable predictions as long as the query molecule has feature values within the range of each feature in the training set. Therefore, the model should not be used to make predictions for any molecule whose feature values fall outside these ranges. Moreover, the predicted log β should not exceed the range of log β in the training set. Table 5 shows the maximum and minimum values of log β and of each feature.
| AD analysis | In-domain | Out-of-domain |
|---|---|---|
| Leverage | 9 | 7 |
| Feature range | 11 | 5 |
| Reliable prediction | 7 | |
| Values | Max. | Min. |
|---|---|---|
| log β | 54 | −32.4 |
| Ligand N | 3 | 0 |
| Ligand O | 6 | 0 |
| Ligand F | 4 | 0 |
| Ligand Cl | 2 | 0 |
| Charge | 2 | −6 |
| H2O | 12 | 0 |
| MW | 1199.157 | 287.034 |
| log S | 0.112395 | −4.21574 |
| MP | 1875.558 | 176.3154 |
| BP | 1887.21 | 391.5719 |
| PP | 1619.382 | 222.5845 |
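The feature-range check described in the text can be expressed directly from the tabulated limits. In the sketch below, the dictionary keys, function names, and candidate values are hypothetical; only the minimum and maximum values come from the table above. The leverage and its warning threshold are passed in as plain numbers, since their computation is a separate step.

```python
# (min, max) of each feature in the training set, from the table above.
TRAINING_RANGES = {
    "ligand_N": (0, 3),
    "ligand_O": (0, 6),
    "ligand_F": (0, 4),
    "ligand_Cl": (0, 2),
    "charge": (-6, 2),
    "H2O": (0, 12),
    "MW": (287.034, 1199.157),
    "logS": (-4.21574, 0.112395),
    "mp": (176.3154, 1875.558),
    "bp": (391.5719, 1887.21),
    "pp": (222.5845, 1619.382),
}
LOG_BETA_RANGE = (-32.4, 54.0)


def out_of_range_features(features):
    """Names of features whose values fall outside the training range."""
    return [name for name, (low, high) in TRAINING_RANGES.items()
            if not low <= features[name] <= high]


def is_reliable(features, leverage, warning_leverage, predicted_log_beta):
    """In-domain by leverage, by every feature range, and by log beta range."""
    low, high = LOG_BETA_RANGE
    return (leverage <= warning_leverage
            and not out_of_range_features(features)
            and low <= predicted_log_beta <= high)


# Hypothetical candidate with one ligand N atom more than any training complex.
candidate = {
    "ligand_N": 4, "ligand_O": 2, "ligand_F": 0, "ligand_Cl": 0,
    "charge": 0, "H2O": 2, "MW": 650.0, "logS": -1.5,
    "mp": 600.0, "bp": 900.0, "pp": 700.0,
}
flagged = out_of_range_features(candidate)
print(flagged)  # ['ligand_N']
```

A candidate is treated as reliably predicted only when all three conditions hold, which mirrors the rule that a single out-of-range feature is enough to mark a molecule as an outlier.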
The identification of seven candidate materials within the model's AD suggests that ML-driven QSAR models can be a powerful tool for guiding experimental efforts in uranium adsorbent discovery. The ability to screen potential adsorbents computationally reduces the need for labor-intensive experimental screening, making the discovery process more efficient. However, experimental validation of these candidate materials is necessary to confirm their real-world adsorption efficiency and stability under marine conditions.
Footnote
† Electronic supplementary information (ESI) available. See DOI: https://doi.org/10.1039/d5ra02220g
This journal is © The Royal Society of Chemistry 2025