Open Access Article
Durbek Usmanov (a, b), Priyanka Yadav (c), Gerardo M. Casanola-Martin (a), Akshat S. Mallya (c), Seyedehelham Shirvanihosseini (a), Allison Hubel (c, d) and Bakhtiyor Rasulev* (a, b)
(a) Department of Coatings and Polymeric Materials, North Dakota State University, Fargo, ND, USA. E-mail: bakhtiyor.rasulev@ndsu.edu
(b) Materials and Nanotechnology (MNT) Program, North Dakota State University, Fargo, ND, USA
(c) Department of Mechanical Engineering, University of Minnesota, Minneapolis, MN, USA
(d) Department of Biomedical Engineering, University of Minnesota, Minneapolis, MN, USA
First published on 10th April 2026
Natural deep eutectic solvents (NADES) are a promising class of sustainable and environmentally safe solvents with highly tunable physicochemical properties, including the glass transition temperature (Tg), which is critical for their functional performance in applications such as ice control. Here, we present an interpretable machine learning (ML) framework to predict Tg from the molecular structure of NADES components, integrating descriptor-based feature engineering, unsupervised clustering, and ensemble regression. The components and their mixing ratios were used to generate multicomponent descriptors that describe each NADES for ML modeling; these mixture descriptors were calculated from the individual descriptors of the chemically diverse components. A Random Forest (RF) model was developed to predict the Tg values of NADES and achieved R2 values in the range 0.87–0.93 for both training and test sets. Shapley Additive exPlanations (SHAP) analysis identified key features, highlighting the contributions of 3D geometry, atomic mass distribution and electronic effects. Our results demonstrate that ML combined with mixture descriptors and interpretable modeling enables accurate and chemically meaningful prediction of Tg, facilitating the rational design of NADES for applications in green chemistry and sustainable materials science.
Green foundation
1. This work advances green chemistry by enabling data-driven design of natural deep eutectic solvents (NADES), biodegradable and low-toxicity alternatives to conventional organic solvents. We develop an interpretable machine-learning framework that predicts the glass transition temperature (Tg) directly from mixture-aware molecular descriptors, enabling virtual prescreening of formulations prior to synthesis.
2. By reducing trial-and-error experimentation and unnecessary differential scanning calorimetry measurements, the approach lowers material consumption, solvent waste, and energy input associated with formulation development. This strategy shifts solvent discovery toward predictive, sustainability-driven design.
3. Importantly, the framework provides chemically interpretable guidance that links hydrogen-bond network topology and molecular flexibility to thermal behavior, thereby supporting the rational selection of sustainable compositions. Future work will expand compositional diversity and integrate multi-objective optimization (e.g., viscosity, toxicity, water tolerance) to further strengthen environmentally responsible solvent design.
One of the crucial thermophysical properties that significantly influences the performance of materials, particularly in coatings, encapsulation, and flexible electronics, is the glass transition temperature (Tg).21–23 Tg defines the temperature range over which a material transforms from a rigid, glassy state to a soft, rubbery phase.24 This transition is essential for determining thermal stability, mechanical integrity, and usability of materials under varying environmental conditions. For NADES-based systems, controlling the Tg is vital in ensuring functionality in temperature-sensitive applications; however, systematic studies on their Tg behavior remain limited due to the complex, multicomponent nature of their hydrogen-bonded networks.21–23 In cryopreservation applications, higher cryoprotectant Tg values have been linked to lower critical cooling and warming rates for vitrification,25 reduced likelihood of damaging ice crystal formation,26 and the reduction of thermal stresses in vitrified samples, leading to lower chances of catastrophic mechanical failure.27 While Tg is commonly measured using experimental techniques such as differential scanning calorimetry (DSC), thermomechanical analysis (TMA), and dynamic mechanical analysis (DMA),28 computational modelling and machine learning are increasingly being adopted to predict Tg from molecular descriptors.29,30
These data-driven approaches offer valuable insights into the relationship between molecular structure and thermophysical behavior, accelerating the design of NADES and other next-generation materials for advanced technological applications. However, despite the growing interest in NADES, comprehensive studies on their glass transition behavior remain limited, primarily because of the complexity of their multicomponent hydrogen-bonded networks and the variability in their physicochemical properties.8,31 In recent years, accurately estimating the glass transition temperature of materials, particularly in confined geometries, has become increasingly important due to Tg's strong dependence on both composition and environmental factors.32 However, the experimental determination of Tg for newly developed material systems remains a significant challenge, as such measurements are often labor-intensive, time-consuming, and costly.23 Fortunately, the availability of experimental Tg datasets has opened the door to benchmarking computational approaches that offer a more efficient and scalable means of predicting Tg.
Beyond property prediction, machine learning is increasingly being used to optimize sustainable chemical processes, including biomass fractionation workflows.33,34 These advances underscore the broader potential of data-driven approaches in green chemistry, where predictive modeling can reduce experimental burden, accelerate optimization, and improve decision-making in complex multivariable systems. In this context, machine learning (ML)-based quantitative structure–property relationship (QSPR) models have emerged as efficient tools for solving complex challenges in chemistry, biology and materials science.35–44 These models directly link molecular structures to their physicochemical properties, offering a faster and more cost-effective alternative to traditional trial-and-error experimentation. Scientists have successfully applied QSPR models to a wide range of chemical systems, including the prediction of Tg, significantly advancing the rational design of new materials.30,45–51 Although many studies have predicted Tg values for various materials, researchers have yet to develop a dedicated dataset or model for Tg prediction in NADES. These advancements underscore the increasing effectiveness of ML-based approaches in capturing complex structure–property relationships within NADES systems. However, a review of the current literature reveals that most studies have predominantly focused on the melting point,52,53 density,54,55 and viscosity of NADES,56,57 whereas only limited attention has been given to Tg. This gap highlights the urgent need for the development of dedicated computational and predictive models for Tg, specifically designed to address the unique structural and physicochemical complexities of NADES.
Considering these limitations, in this study, we developed an ML-based QSPR model for predicting the glass transition temperature of NADES. We integrated a combination of computational techniques to improve predictive accuracy and model interpretability. The Random Forest algorithm was selected as the primary modeling tool due to its robustness in capturing complex, nonlinear relationships across multivariate descriptor sets. The dataset, which comprises Tg values ranging from −122.87 to −52.54 °C, was analyzed using a combination of ML and dimensionality reduction methods. Specifically, Uniform Manifold Approximation and Projection (UMAP) was employed to explore the chemical space of NADES, revealing meaningful structural clusters in the visualized chemical space. This unsupervised clustering guided the splitting of the dataset into training and test sets, thereby ensuring a structurally diverse and representative evaluation framework. To interpret the model predictions and assess the influence of individual molecular features on Tg, a SHAP approach was applied. This enabled a quantitative understanding of the most significant descriptors influencing the thermal behavior of NADES. This integrated approach not only enables accurate prediction of Tg but also provides valuable insights into the molecular determinants of glass transition behavior in eutectic solvent systems. Moreover, because today's materials science increasingly prioritizes environmentally friendly and sustainable alternatives, the developed model offers a powerful tool for the rational design of novel NADES formulations with tailored thermophysical properties for green and environmentally safe applications.
This approach produced an initial set of 5666 descriptors for each NADES system, spanning a broad spectrum of physicochemical, topological, geometrical, and electronic properties. These descriptors covered the 0D, 1D, 2D, and 3D dimensional categories and were drawn from various descriptor classes, such as constitutional indices, topological and connectivity metrics, 2D autocorrelations, geometrical parameters, atom-centered fragments, and quantum-chemical features.
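As a rough illustration of how per-component descriptors and mixing ratios can be combined into a single mixture descriptor vector, the sketch below uses a mole-fraction-weighted average. The weighting rule, the function name `mixture_descriptors`, the component names, and all numbers are assumptions made for illustration; they are not taken from the study.

```python
from typing import Dict, List


def mixture_descriptors(
    component_descriptors: Dict[str, List[float]],
    molar_ratio: Dict[str, float],
) -> List[float]:
    """Mole-fraction-weighted average of per-component descriptor vectors.

    Assumed combination rule for illustration only.
    """
    total = sum(molar_ratio.values())
    if total <= 0:
        raise ValueError("molar ratios must sum to a positive number")

    n_desc = len(next(iter(component_descriptors.values())))
    mixed = [0.0] * n_desc
    for name, ratio in molar_ratio.items():
        vector = component_descriptors[name]
        if len(vector) != n_desc:
            raise ValueError(f"descriptor length mismatch for {name!r}")
        mole_fraction = ratio / total
        for k, value in enumerate(vector):
            mixed[k] += mole_fraction * value
    return mixed


# Hypothetical two-descriptor vectors for two hypothetical components.
toy_descriptors = {
    "comp_A": [10.0, 2.0],
    "comp_B": [2.0, 4.0],
}
# A 1:3 molar ratio gives mole fractions 0.25 and 0.75.
mixed_vector = mixture_descriptors(toy_descriptors, {"comp_A": 1.0, "comp_B": 3.0})
print(mixed_vector)
```

Because the weights are mole fractions, the same function handles binary, ternary, and water-containing formulations without modification.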
To minimize redundancy and eliminate non-informative variables, descriptors exhibiting zero variance or near-constant values across the dataset were removed. Following this filtering process, a final set of 4316 descriptors was retained for further machine learning modeling. Importantly, this descriptor set was not used directly in its entirety for interpretation or final decision-making. Instead, feature relevance was further refined in a data-driven manner using Random Forest-based learning and SHAP analysis, which effectively reduced the dimensionality to a compact subset of influential descriptors governing Tg prediction. A detailed summary of the initial descriptor set is provided in the SI (Table S2).
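The filtering step can be sketched as a simple variance threshold. The helper name `drop_near_constant`, the cutoff `min_variance = 1e-8`, and the toy columns are hypothetical; the threshold actually used is not stated in the text.

```python
from statistics import pvariance
from typing import Dict, List


def drop_near_constant(
    columns: Dict[str, List[float]],
    min_variance: float = 1e-8,  # hypothetical cutoff
) -> Dict[str, List[float]]:
    """Keep only descriptor columns whose population variance exceeds the cutoff."""
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > min_variance
    }


# Toy descriptor table (columns keyed by descriptor name); values are invented.
toy_columns = {
    "desc_constant": [1.0, 1.0, 1.0, 1.0],
    "desc_near_constant": [0.5, 0.5, 0.5, 0.5000001],
    "desc_informative": [0.1, 0.9, 0.4, 0.7],
}
kept = drop_near_constant(toy_columns)
print(sorted(kept))
```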
These curated descriptors formed the input features for all subsequent predictive modeling, providing a high-dimensional and chemically meaningful representation of the NADES space for accurate prediction of the glass transition temperature. Fig. 1 illustrates the general workflow.
Fig. 1 Overview of the research workflow: dataset preparation workflow (left) and machine learning workflow (right).
$$p_{ij} = e^{-d(x_i, x_j)/\sigma_i}$$

$$q_{ij} = \left(1 + a\,(y_i - y_j)^{2}\right)^{-b}$$
Unlike other methods, UMAP minimizes a cross-entropy loss function defined as:
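For reference, the fuzzy-set cross-entropy minimized by UMAP is usually written as follows, in terms of the high-dimensional similarities p_ij and the low-dimensional similarities q_ij; the symbol C for the loss is a naming assumption.

```latex
C = \sum_{i \neq j} \left[ p_{ij} \log\frac{p_{ij}}{q_{ij}}
  + \left(1 - p_{ij}\right) \log\frac{1 - p_{ij}}{1 - q_{ij}} \right]
```

The first term pulls together points that are similar in the original space, while the second pushes apart points that are dissimilar.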
This formulation allows UMAP to preserve both local and global data structures, enabling it to capture meaningful relationships in complex high-dimensional datasets.
The two principal hyperparameters in UMAP are the number of neighbors and the minimum distance. The number of neighbors defines the size of the local neighborhood used for manifold approximation – lower values enhance the preservation of local structures, whereas higher values emphasize global relationships. The minimum distance controls how tightly data points are packed in the low-dimensional space: smaller values promote tight clustering, while larger values preserve more of the global data topology.
To assess the clustering structure in the UMAP-reduced space, k-means clustering was applied. The optimal number of clusters was determined using internal validation metrics, namely the Davies–Bouldin Index (DBI) and the Silhouette Score (SS).
The SS s(i) for each sample is defined as:
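In its standard form, with a(i) the mean distance from sample i to the other members of its own cluster and b(i) the mean distance from sample i to the members of the nearest other cluster (symbol names assumed here), the score reads:

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1
```

Values close to 1 indicate compact, well-separated clusters; values near 0 indicate overlapping clusters.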
The DBI is defined as:
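The standard definition for k clusters is given below, where S_i is the mean distance of the points in cluster i to its centroid c_i and d(c_i, c_j) is the distance between two centroids; these symbols are assumed notation.

```latex
\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i}
  \frac{S_i + S_j}{d(c_i, c_j)}
```

Lower DBI values correspond to more compact and better-separated clusters.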
After selecting k = 6 as the optimal number of clusters, final cluster assignments were computed using k-means, and centroid coordinates were extracted from the UMAP embedding space. Prior to dimensionality reduction, all numerical descriptors were standardized using StandardScaler (zero mean, unit variance). UMAP was then applied using the umap-learn implementation with n_neighbors = 20, min_dist = 0.1, n_components = 2, metric = ‘euclidean’, and random_state = 42 to ensure reproducibility. UMAP projections and cluster validation were implemented using the UMAP-learn and Scikit-learn libraries in Python. These results confirm that the UMAP transformation not only preserves meaningful chemical relationships in the descriptor space but also facilitates robust unsupervised clustering of NADES formulations.
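The cluster-count selection described above can be sketched with Scikit-learn as follows. To keep the example self-contained, a synthetic 2D point cloud stands in for the UMAP embedding; in the described workflow the points would instead come from umap-learn applied to the standardized descriptors with the settings quoted in the text. The helper name `score_cluster_counts`, the candidate range of k, and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(42)

# Synthetic stand-in for a 2D UMAP embedding: three well-separated blobs.
blob_centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
embedding = np.vstack(
    [center + 0.3 * rng.standard_normal((30, 2)) for center in blob_centers]
)


def score_cluster_counts(points, k_values, seed=42):
    """Return {k: (silhouette, davies_bouldin)} for k-means with each k."""
    results = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(points)
        results[k] = (
            silhouette_score(points, labels),
            davies_bouldin_score(points, labels),
        )
    return results


scores = score_cluster_counts(embedding, range(2, 7))
# Higher silhouette is better; lower Davies-Bouldin is better.
best_k = max(scores, key=lambda k: scores[k][0])
print(best_k, scores[best_k])
```

On real data the two metrics do not always agree, so inspecting both across the candidate range is a reasonable precaution before fixing k.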
The RF regression model was implemented in Python using the Scikit-learn library (RandomForestRegressor). Hyperparameters were optimized using grid search with 5-fold cross-validation (GridSearchCV, scoring = R2) and a fixed random seed (random_state = 42) to ensure reproducibility and to limit overfitting. The hyperparameter search space included the number of trees (n_estimators = 100–400), maximum tree depth (max_depth = 4–10), minimum number of samples required to split an internal node (min_samples_split = 2–10), and minimum number of samples per leaf (min_samples_leaf = 1–4). Cross-validation was conducted exclusively within the training set, following the UMAP-based structure-aware train–test splitting strategy. The held-out test set was not used during model selection and was reserved solely for final performance evaluation, thereby minimizing information leakage and overfitting.
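A minimal sketch of this tuning setup is shown below. The training data are synthetic, and the specific grid values are hypothetical picks inside the reported ranges, since the exact grid points are not listed in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(42)

# Synthetic training data standing in for the descriptor matrix and Tg values.
X_train = rng.normal(size=(60, 5))
y_train = 2.0 * X_train[:, 0] - X_train[:, 1] + 0.1 * rng.normal(size=60)

# Hypothetical grid points chosen within the ranges reported in the text.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [4, 6],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 4],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring="r2",  # R2 as the selection metric
    cv=5,          # 5-fold cross-validation on the training set only
)
search.fit(X_train, y_train)

best_model = search.best_estimator_  # refit on the full training set
print(search.best_params_, round(search.best_score_, 3))
```

The held-out test set never enters `search.fit`, which mirrors the separation between model selection and final evaluation described above.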
To evaluate the performance of the model, the following metrics were calculated on the training, test, and external validation datasets:
Mean absolute error:64
Coefficient of determination (R2):
Root Mean Squared Error (RMSE):
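With y_i the experimental Tg, \hat{y}_i the predicted Tg, \bar{y} the mean experimental value and n the number of samples (notation assumed here), these three metrics take their standard forms:

```latex
\begin{aligned}
\mathrm{MAE}  &= \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \\
R^2           &= 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
                          {\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\end{aligned}
```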
These statistical indicators were selected to ensure a comprehensive evaluation of the model's accuracy, error distribution, and generalization capacity. A high R2, alongside low MAE and RMSE values, indicates the reliable predictive performance and robustness of the trained RF model.
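The three metrics are straightforward to compute directly. The sketch below uses only the standard library; the function name `regression_metrics` and the Tg values are hypothetical and serve only to show the arithmetic.

```python
from math import sqrt
from typing import Sequence, Tuple


def regression_metrics(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (MAE, R2, RMSE) for paired observed and predicted values."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")

    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    ss_res = sum(r * r for r in residuals)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all observed values are equal")

    r2 = 1.0 - ss_res / ss_tot
    rmse = sqrt(ss_res / n)
    return mae, r2, rmse


# Hypothetical Tg values in degrees Celsius (not from the study's dataset).
observed = [-100.0, -90.0, -80.0, -70.0]
predicted = [-98.0, -92.0, -79.0, -71.0]
mae, r2, rmse = regression_metrics(observed, predicted)
print(mae, r2, rmse)
```

For these toy numbers the residuals are −2, 2, −1 and 1, so MAE = 1.5, R2 = 1 − 10/500 = 0.98 and RMSE = √2.5.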
Prior to SHAP analysis, standard descriptor preprocessing steps were applied, including the removal of constant descriptors and highly correlated variables, to ensure numerical stability and reduce redundancy in the descriptor space. After initial descriptor preprocessing, SHAP-based feature importance analysis was used to identify the most informative variables. The final predictive models were constructed using the top 35 descriptors ranked by SHAP importance, thereby substantially reducing the dimensionality of the descriptor space relative to the number of samples. This dimensionality reduction step mitigates the risk of overfitting and improves model interpretability.
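Ranking descriptors by SHAP importance typically means sorting them by their mean absolute SHAP value across samples and keeping the top entries. The sketch below assumes the per-sample SHAP values have already been computed; the function name, descriptor names, and values are hypothetical.

```python
from typing import Dict, List


def rank_by_mean_abs_shap(shap_rows: List[Dict[str, float]]) -> List[str]:
    """Sort descriptor names by mean absolute SHAP value, largest first."""
    if not shap_rows:
        raise ValueError("at least one sample is required")
    names = list(shap_rows[0])
    mean_abs = {
        name: sum(abs(row[name]) for row in shap_rows) / len(shap_rows)
        for name in names
    }
    return sorted(names, key=lambda name: mean_abs[name], reverse=True)


# One dict per sample: descriptor name -> SHAP value (invented numbers).
toy_shap = [
    {"desc_A": 0.50, "desc_B": -0.10, "desc_C": 0.02},
    {"desc_A": -0.40, "desc_B": 0.30, "desc_C": -0.01},
    {"desc_A": 0.60, "desc_B": -0.20, "desc_C": 0.03},
]
ranking = rank_by_mean_abs_shap(toy_shap)
top_2 = ranking[:2]  # the study keeps the top 35; 2 is used here for brevity
print(top_2)
```

Using the absolute value matters: a descriptor whose contributions alternate in sign can still be highly influential even if its signed mean is close to zero.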
It should be noted that SHAP provides an additive decomposition of model predictions and does not explicitly reconstruct the full nonlinear functional form learned by the underlying machine learning model. Consequently, SHAP-based interpretations should not be viewed as mechanistic or causal descriptions of molecular interactions.
Nevertheless, SHAP remains a powerful interpretability tool for identifying dominant structure–property trends, directional effects, and threshold-like behaviours embedded within nonlinear models. In this study, SHAP analysis is used to reveal consistent, chemically plausible relationships between molecular descriptors and Tg, rather than to infer explicit nonlinear interaction mechanisms. The observed SHAP dependence patterns (e.g., saturation and sigmoidal responses) reflect the nonlinear decision boundaries learned by the model, providing physically meaningful insight into the factors governing vitrification in NADES.
SHAP analysis allowed for a detailed decomposition of Tg predictions into additive feature attributions, revealing both global feature importance and local instance-specific effects. Visualization tools, including beeswarm and summary plots, were used to identify the most influential descriptors. Among the top contributors were 3D-MoRSE and GETAWAY descriptors (e.g., Mor08p, Mor26p, R6m+) and topological autocorrelations (e.g., MATS4i), highlighting the importance of molecular geometry, polarizability, and electronic distribution in modulating the thermal behavior of NADES systems.
This framework not only improved the interpretability of our model but also provided mechanistic insights into structure–property relationships within the NADES chemical space, thereby supporting rational design and formulation strategies.
After careful data curation, the dataset of NADES was used to develop predictive models, which were trained and evaluated using a “mixture-based split” strategy, ensuring that structurally related compounds were not included in both the training and test sets simultaneously. This approach provides a more realistic assessment of the ML model's generalizability, especially for prospective applications involving prescreening of novel NADES formulations. The overall workflow of the study is shown in Fig. 1.
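A split of this kind can be sketched as a group-aware partition in which every sample sharing a group label (for example, the same component combination or cluster) is assigned to the same side. The function name `group_split`, the test fraction, and the sample counts per mixture are assumptions made for illustration.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def group_split(
    groups: Sequence[Hashable], test_fraction: float = 0.2, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Assign whole groups to train or test so that no group spans both sets."""
    members = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)

    keys = sorted(members, key=str)      # deterministic order before shuffling
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(groups)
    train_idx: List[int] = []
    test_idx: List[int] = []
    for key in keys:
        # Fill the test set group by group until it reaches the target size.
        if len(test_idx) < target:
            test_idx.extend(members[key])
        else:
            train_idx.extend(members[key])
    return sorted(train_idx), sorted(test_idx)


# Group labels per sample; the number of samples per mixture is hypothetical.
mixture_ids = (
    ["Bet:Ure:Wat"] * 3
    + ["Bet:Sor:Wat"] * 3
    + ["Bet:Sor:Ure:Wat"] * 2
    + ["Bet:Lys:Wat"] * 2
)
train_idx, test_idx = group_split(mixture_ids, test_fraction=0.2)
print(train_idx, test_idx)
```

Because whole groups are moved at once, the realized test fraction can deviate from the requested one when groups are large.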
…:EG 1:66) to −53.64 °C (Gly:Glu:Sorb:Water 1:1:1:3), lower than those observed for the water-diluted samples. These findings agree with previous reports showing that NADES remain in the liquid state across a broad low-temperature range.66,67
Water addition was found to disrupt the supramolecular organization of NADES, increase molecular mobility, and lower Tg.66,68 Tg decreased with increasing dilution up to 50% but remained nearly constant at higher dilutions.18 The reduction in Tg upon hydration reflects the well-known plasticizing role of water.5 Consistent with the trend described by the Flory–Fox equation, Tg was observed to increase with molecular weight. NADES formulated with low-molecular-weight ethylene glycol exhibited the lowest Tg, whereas those with high-molecular-weight sorbitol exhibited higher Tg values.69
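For reference, the Flory–Fox relation is commonly written for polymers as below, where T_{g,\infty} is the limiting Tg at very high molecular weight, K is an empirical material constant and M_n is the number-average molecular weight. Applying it to small-molecule NADES components is an analogy, and the symbols are standard notation rather than quantities fitted in this study.

```latex
T_g = T_{g,\infty} - \frac{K}{M_n}
```

Under this relation Tg rises monotonically with molecular weight and levels off toward T_{g,\infty}.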
It is worth noting that water content can significantly influence the Tg value. To explore this, the effect of water concentration on Tg within structurally similar NADES systems was also investigated, and a comparative violin plot was generated (Fig. 3), highlighting various combinations of the same core components with differing water content. As the plot shows, the systems Bet:Ure:Wat, Bet:Sor:Wat, Bet:Sor:Ure:Wat, and Bet:Lys:Wat exhibit minimal shifts and smooth transitions in Tg values as water concentration increases. Water concentration was explicitly reported for each formulation (Table S2) and treated as part of the mixture composition during descriptor generation for the machine learning model. The consistently narrow violin shapes across these systems indicate that the incorporation of water had only a moderate impact on their thermal behavior. This stability likely reflects robust intermolecular interactions among the primary components (e.g., betaine, sorbitol, urea, lysine), which are only marginally affected by hydration. Conversely, other NADES systems demonstrated broader or more irregular Tg distributions, with pronounced changes in response to variations in water content. This suggests that their physicochemical properties are more sensitive to hydration, possibly due to weaker or more dynamic hydrogen bonding networks. These findings emphasize that specific NADES formulations preserve structural and thermal coherence even under changing moisture conditions, making them attractive candidates for applications requiring thermophysical stability. This analysis underscores the chemical and functional diversity within the NADES formulations, highlighting the need to employ advanced machine learning techniques to capture the complex, nonlinear relationships between molecular structure and physicochemical properties. The applied methodology enhances the interpretability and robustness of the predictive framework, enabling more accurate and generalizable predictions of the glass transition temperatures across a structurally diverse range of NADES systems.
To further enhance the predictive performance of the RF model, we conducted an extensive hyperparameter optimization using a grid search approach. Key parameters, including the number of trees [n_estimators], maximum tree depth [max_depth], minimum number of samples required to split an internal node [min_samples_split], and minimum number of samples needed for a leaf node [min_samples_leaf], were systematically varied across a defined range. GridSearchCV with 5-fold cross-validation was employed to evaluate each parameter combination, with R2 as the primary scoring metric. The optimal configuration consisted of 180 trees [n_estimators = 180], a maximum depth of 6, min_samples_split = 2, and min_samples_leaf = 1. The hyperparameter-tuned RF model preserved high predictive accuracy while improving computational efficiency and model stability, ensuring that predictions of Tg remain both interpretable and generalizable across a chemically diverse NADES space. Although RF provides built-in feature importance measures such as mean decrease in Gini impurity86,87 or permutation importance,88 these approaches are limited by their global scope, bias toward high-cardinality or continuous variables, and instability in correlated descriptor spaces, which are common in cheminformatics. To overcome these shortcomings, we employed the SHAP approach,89 a framework that ensures mathematical consistency, delivers both global rankings and local, sample-specific contributions, and distributes importance more equitably across correlated descriptors. By directly linking descriptor effects to predicted Tg values at multiple levels, SHAP offers a more reliable and chemically meaningful interpretation of structure–property relationships.
SHAP was therefore used to complement RF feature importance, providing unbiased insights that enhance mechanistic interpretability and strengthen confidence in the model's predictive performance across the chemically diverse NADES space.
Before engaging in a more in-depth interpretation of SHAP-based feature attributions, it is essential to recognize that several factors inherently bound the interpretability of any ML model. These include the model's architectural constraints (e.g., ensemble tree-based structure in RF), the nature of the input representation (in this case, applied molecular mixture-descriptors), and the specific chemical space spanned by the training dataset. Therefore, the SHAP values discussed herein should be understood as explanations tied to the Tg predictions within the scope of the NADES systems represented in this study. Importantly, the limited contribution or apparent absence of specific physicochemical descriptors traditionally considered relevant for glass transition phenomena does not imply that these features lack significance across broader or alternative molecular contexts. Instead, their marginal influence in our model may reflect dataset-specific patterns or descriptor redundancy within the confined chemical space.
With this framework in mind, a SHAP beeswarm plot (Fig. 4a) was generated to visualize the relative impact of the top 20 selected molecular descriptors on the predicted glass transition temperature of NADES. Among these, descriptors such as Mor08p, R6m+, MATS4i, and Mor26p emerged as the most influential features, each displaying a distinctive effect on model output. In this context, positive SHAP values correspond to an increase in the predicted Tg value, whereas negative SHAP values lower it. The feature importance ranking (Fig. 4c) corroborates these findings, indicating that a small subset of descriptors accounts for the majority of the model's predictive ability. Further, the individualized SHAP analyses (Fig. 4b and d) provide a detailed understanding of how specific molecular descriptors contribute to Tg predictions in individual NADES systems. For instance, Mor08p, a 3D-MoRSE descriptor weighted by polarizability, exhibits a strong positive contribution to Tg, indicating that spatially extended and polarizable molecular structures tend to elevate glass transition temperatures. In contrast, R6m+, a GETAWAY descriptor capturing mass distribution at topological lag 6, contributes negatively, suggesting that mass clustering at intermediate distances may reduce Tg. These trends are substantiated by SHAP dependence and force plots, which reveal both the directionality and magnitude of feature effects. A representative SHAP force plot (Fig. 4d) demonstrates the local interpretability of a single prediction: Mor08p contributed a significant positive shift (+0.572) toward the predicted Tg value, whereas R6m+ had a minor negative effect (−0.025). Summed over all descriptors, the feature contributions reduced the prediction from the model's base value (2.26) to the final output (2.23).
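SHAP attributions are additive: the base value plus the sum of all feature contributions equals the model output for that sample. The short calculation below applies this identity to the four numbers quoted for the force plot. It assumes those rounded values are exact, and the variable names are illustrative.

```python
# Values quoted in the text for the representative force plot.
base_value = 2.26
final_output = 2.23
named_contributions = {"Mor08p": 0.572, "R6m+": -0.025}

# Additivity: final_output = base_value + sum of all SHAP contributions.
total_shift = final_output - base_value
named_shift = sum(named_contributions.values())

# Whatever the two named descriptors do not explain must come from the others.
remaining_shift = total_shift - named_shift

print(f"total shift:        {total_shift:+.3f}")
print(f"named descriptors:  {named_shift:+.3f}")
print(f"all other features: {remaining_shift:+.3f}")
```

Under these assumptions the descriptors not named in the text jointly contribute about −0.577, which offsets the large positive push from Mor08p and explains why the final output sits slightly below the base value.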
This visualization reinforces the utility of SHAP in uncovering feature-specific impacts within the Random Forest framework, offering mechanistic insights into the structural underpinnings of Tg behavior in NADES. Additionally, the presented figure illustrates a SHAP decision plot, which visualizes the contribution of individual molecular descriptors to the predicted Tg values. The x-axis represents the model's output (predicted values), while the y-axis lists the most influential descriptors, ranked by their average SHAP importance. Each line on the plot corresponds to a single observation in the dataset, tracing how cumulative SHAP values (feature attributions) evolve from the model's base value to its final output for that instance. The color gradient encodes the normalized value of the respective descriptor: blue indicates lower values, whereas pink denotes higher values.
Although the RF algorithm is inherently robust to multicollinearity and does not require strict linear independence among input features, pairwise correlations remain essential for interpreting the model's behavior. To this end, a Pearson correlation matrix was constructed for the top 15 descriptors ranked by SHAP importance (Fig. 4e). This visualization enables an assessment of feature redundancy and helps to identify potential duplication of encoded physicochemical information. For instance, descriptor pairs such as MATS4i-HATS2e and WHALES60_Rem-WHALES80_Rem exhibited strong cross-correlation (|r| > 0.85), indicating that they encode similar structural characteristics. In contrast, features like Mor08p and R6m+ displayed only a moderate negative correlation (r ≈ −0.46), indicating that the most informative variables carry largely distinct structural information. Overall, the analysis confirms the presence of both partially redundant and chemically unique descriptors, ensuring broad and diverse coverage of the NADES chemical space. This reinforces the interpretability of the model and highlights the value of integrating SHAP-based feature importance with correlation-based filtering in molecular modeling workflows.
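A correlation screen of this kind can be sketched in a few lines. The example below computes Pearson coefficients for every descriptor pair and flags those above the |r| > 0.85 threshold mentioned in the text; the function names and the toy descriptor columns are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / sqrt(var_x * var_y)


def correlated_pairs(
    columns: Dict[str, List[float]], threshold: float = 0.85
) -> List[Tuple[str, str, float]]:
    """Return descriptor pairs whose absolute Pearson r exceeds the threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson_r(columns[name_a], columns[name_b])
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Invented descriptor columns: desc_B is almost a linear function of desc_A.
toy_columns = {
    "desc_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "desc_B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "desc_C": [3.0, 1.0, 4.0, 1.0, 5.0],
}
redundant = correlated_pairs(toy_columns)
print(redundant)
```

Only the desc_A/desc_B pair is flagged here; the other two pairs have |r| of roughly 0.35 and would be treated as carrying distinct information.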
These results highlight the model's strong generalization ability and stable predictive accuracy across diverse datasets (Fig. 5). Moreover, the combination of rigorous regularization through optimized tree depth and node splitting, along with descriptor-level interpretability via SHAP, not only mitigates overfitting but also provides mechanistic insights, ensuring that complex nonlinear relationships within the descriptor space are captured in a transparent and chemically meaningful manner. To further validate the model's robustness and feature interpretability, we examined the distribution of the most influential descriptors across the training and test datasets. As shown in Fig. 4a and b, descriptors such as Mor08p, R6m+, MATS4i, and Mor26p exhibit consistent distributions between training and test sets, indicating good representativeness and minimal sampling bias. Specifically, Mor08p follows a near-normal distribution centered around 0.4 in both datasets, while R6m+ displays a positively skewed distribution, reflecting its sparsity in the chemical space. Similar trends are observed for MATS4i and Mor26p, where the descriptors’ values are well-aligned across the splits.
Although descriptors such as Mor08p and R6m+ do not explicitly encode hydrogen bonding, they capture structural features that indirectly influence it. Mor08p reflects the three-dimensional distribution of polarizability and is consistent with differences in molecular packing and supramolecular rigidity, whereas R6m+ describes medium-range mass distribution and can be associated with structural compactness and segmental mobility. Together, these effects are relevant to the stability of hydrogen-bond networks and the vitrification behavior of NADES.
Thus, the discussed visualizations together link structural diversity to thermal behavior, thereby enhancing the mechanistic interpretability of the model outputs. These distributional consistencies bolster the generalization ability of the optimized RF model and ensure that SHAP-derived feature contributions are not merely the result of sampling bias. Moreover, the consistency between feature importance and descriptor availability across datasets enhances the interpretability and chemical relevance of the model, supporting its use in virtual screening and rational NADES design.
Clusters associated with higher Tg values are predominantly composed of sugar-rich NADES systems, such as sucrose–glucose–fructose–water (1:1:1:11) and glucose–fructose–water (1:1:8). These systems feature dense, multidirectional hydrogen-bonding networks formed by polyhydroxylated carbohydrates, leading to enhanced structural rigidity and reduced segmental mobility. The presence of multiple hydroxyl groups per molecule promotes extensive intermolecular interactions, which stabilizes the amorphous phase and shifts Tg toward higher values. In these compositions, water acts as a plasticizer at low concentrations but becomes structurally integrated into the hydrogen-bonded network at higher ratios, contributing to network stabilization rather than disruption.
In contrast, clusters characterized by lower Tg values are enriched in amino acid- and choline-based NADES, such as proline–ethylene glycol (1:3.3) and choline chloride–glycerol (1:2). These systems exhibit more flexible hydrogen-bonding motifs, dominated by fewer donor–acceptor sites and increased conformational freedom of aliphatic chains. Ethylene glycol and glycerol introduce mobility through rotational flexibility, while zwitterionic or ionic components such as proline and choline chloride generate localized interactions that are less effective in producing rigid, three-dimensional networks. As a result, these formulations display lower resistance to molecular rearrangement and reduced glass transition temperatures.
Intermediate clusters, including mixed systems such as glycerol–glucose–sorbitol–water (1:1:1:3), exhibit Tg values between the two extremes. These compositions combine rigid carbohydrate backbones with flexible polyol or amino-alcohol components, yielding partially constrained hydrogen-bonded networks. The coexistence of rigid and flexible molecular motifs produces heterogeneous interaction landscapes, reflected in broader Tg distributions and cluster overlap. This behavior highlights the non-additive nature of Tg in multicomponent eutectic systems, where collective interactions dominate over individual component properties.
Importantly, the cluster-resolved analysis demonstrates that Tg in NADES is governed primarily by component chemistry and formulation strategy, rather than by isolated molecular descriptors. The machine-learning descriptors serve as an effective abstraction of these chemical features—capturing size, polarity, and interaction density—but the underlying physical origin of Tg variation remains rooted in hydrogen-bond topology, molecular flexibility, and water-mediated plasticization effects.
From a green chemistry perspective, these findings provide chemically intuitive design guidelines for tailoring Tg in NADES. Carbohydrate-rich, highly hydrogen-bonded systems are suitable for applications requiring elevated Tg and enhanced thermal stability, such as cryopreservation or solid-state encapsulation. Conversely, polyol- and amino acid-based NADES offer lower Tg and greater flexibility, making them attractive for low-temperature processing, coatings, and solvent applications. By linking machine-learning insights to concrete compositional motifs, this study advances the rational, sustainability-driven design of NADES with application-specific thermophysical properties.
In practical formulation design, Tg represents only one of several parameters governing the performance of NADES systems. Other properties, such as viscosity, toxicity, conductivity, cost, and solubilization capacity, may also strongly influence the suitability of a formulation for specific applications. Although the present study focuses on Tg prediction, the proposed descriptor-based machine learning framework is readily extendable to multi-property modelling. By combining Tg prediction with existing ML models for the viscosity, density, or surface tension of deep eutectic solvents, multi-objective screening strategies could be implemented to support rational NADES design. Such approaches would enable the identification of formulations that simultaneously satisfy multiple performance constraints, thereby accelerating the development of sustainable functional materials.
On the other hand, Tg must often be balanced against viscosity and water tolerance. Descriptor-level interpretation further indicates that molecular geometry, polarizability, and mass distribution simultaneously influence Tg and molecular mobility, providing guidance for navigating trade-offs between thermal stability, flow behavior, and hydration sensitivity. These insights support the development of multi-objective formulation strategies rather than single-property optimization.
From a process perspective, virtual Tg screening significantly reduces reliance on low-throughput DSC measurements by narrowing experimental validation to high-confidence candidates. As a result, the framework accelerates NADES formulation workflows while reducing material consumption and experimental cost.
If the required components are experimentally accessible, these formulations may be directly prioritized for preparation and validation. Otherwise, structurally related components, such as alternative sugars, polyols, amino acids, or quaternary ammonium derivatives, can be used to construct a virtual library of candidate systems across feasible molar ratios and water contents. For each candidate formulation, mixture descriptors are calculated using the same stoichiometry-weighted protocol applied in this study, and the trained Random Forest model is then used to predict Tg.
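As a rough illustration of this prescreening loop, the sketch below enumerates molar ratios for a fixed component list and computes mole-fraction-weighted mixture descriptors. It assumes that the stoichiometry-weighted protocol reduces to a linear mole-fraction average, which may differ from the exact weighting used in this study; the two descriptors shown (molar mass and heavy-atom count) and all function names are chosen only for illustration.

```python
from itertools import product

# Per-component descriptors used only for illustration:
# molar mass (g/mol) and number of non-hydrogen atoms.
COMPONENT_DESCRIPTORS = {
    "glucose": {"molar_mass": 180.16, "heavy_atoms": 12},
    "glycerol": {"molar_mass": 92.09, "heavy_atoms": 6},
    "water": {"molar_mass": 18.02, "heavy_atoms": 1},
}


def mole_fractions(ratio):
    """Convert a molar ratio such as {"glucose": 1, "water": 2} to mole fractions."""
    total = sum(ratio.values())
    if total <= 0:
        raise ValueError("molar ratio must sum to a positive number")
    return {name: amount / total for name, amount in ratio.items()}


def mixture_descriptors(ratio, table=COMPONENT_DESCRIPTORS):
    """Mole-fraction-weighted average of each component descriptor (assumed form)."""
    fractions = mole_fractions(ratio)
    names = sorted({d for component in ratio for d in table[component]})
    return {
        d: sum(fractions[c] * table[c][d] for c in ratio)
        for d in names
    }


def virtual_library(components, ratio_grid):
    """Yield every molar-ratio combination on the grid for the given components."""
    for amounts in product(ratio_grid, repeat=len(components)):
        yield dict(zip(components, amounts))


example = mixture_descriptors({"glucose": 1, "glycerol": 1, "water": 2})
library = list(virtual_library(["glucose", "water"], [1, 2, 3]))
```

Each descriptor dictionary produced this way would be assembled into a feature vector, in the same column order as the training data, before being passed to the trained model.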
To ensure reliable interpretation, predicted candidates should also be evaluated within the applicability domain of the model, meaning that descriptor values remain within or close to the range represented in the training dataset. In this way, the model functions as a prescreening tool that reduces the number of experimental trials and accelerates the rational identification of NADES formulations with targeted low-temperature properties.
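One simple way to express "within or close to the range represented in the training dataset" is a per-descriptor bounding-box check, sketched below. This is an assumption made for illustration and is not necessarily the applicability-domain method used in this study; the `tolerance` parameter (the fraction of each descriptor's training span allowed beyond its limits) is likewise hypothetical.

```python
def descriptor_ranges(training_rows):
    """Return {descriptor: (min, max)} over a list of descriptor dictionaries."""
    names = training_rows[0].keys()
    return {
        n: (min(r[n] for r in training_rows), max(r[n] for r in training_rows))
        for n in names
    }


def in_domain(candidate, ranges, tolerance=0.10):
    """True if every descriptor lies inside its training range widened by tolerance."""
    for name, (low, high) in ranges.items():
        margin = tolerance * (high - low)
        if not (low - margin <= candidate[name] <= high + margin):
            return False
    return True


# Toy training descriptors (illustrative values only).
training = [
    {"molar_mass": 60.0, "heavy_atoms": 4.0},
    {"molar_mass": 100.0, "heavy_atoms": 7.0},
]
ranges = descriptor_ranges(training)
```

Candidates that fail the check are not necessarily mispredicted, but their predicted Tg is an extrapolation and should be given lower priority or validated experimentally first.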
Second, the descriptor-based representation provides an effective abstraction of component chemistry but does not explicitly encode specific intermolecular interactions or dynamic hydrogen-bond rearrangements. Consequently, the model captures statistically learned structure–property relationships rather than explicit mechanistic pathways governing glass transition behavior. While SHAP analysis improves interpretability, it does not establish causal links between individual descriptors and Tg.
Regarding transferability, the applicability of the model is expected to be highest for NADES composed of chemically related components and comparable formulation strategies, particularly polyol-, carbohydrate-, and amino acid-based systems. The applicability domain analysis further supports this view by identifying regions of descriptor space in which predictions are reliable. Extension of the model to fundamentally different eutectic systems or to non-natural DES formulations would require retraining or recalibration using representative experimental data.
Nevertheless, the overall modeling strategy, combining structure-aware data splitting, ensemble learning, and interpretability analysis, is transferable and can be readily adapted to other thermophysical properties or solvent classes. With the continued expansion of high-quality experimental datasets, the predictive scope and generalizability of such data-driven frameworks are expected to improve, supporting the rational and sustainable design of next-generation eutectic materials.
Although the present model was evaluated using a structure-aware held-out test set, prospective validation on newly prepared NADES formulations remains an important next step. Such validation would further establish the practical utility of the framework for formulation design and targeted screening.
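The idea behind a structure-aware held-out set can be sketched as a group-aware split, in which all formulations built from the same components are assigned to the same side of the split so that the test set probes unseen component combinations rather than unseen ratios. The grouping rule used here (identical component set) is an assumption for illustration; the splitting procedure actually used in this study may group formulations differently, for example by cluster membership.

```python
import random


def component_key(record):
    """Group label: the sorted tuple of component names in a formulation."""
    return tuple(sorted(record["ratio"]))


def group_split(records, group_key=component_key, test_fraction=0.2, seed=0):
    """Split records so that no group appears in both the training and test sets."""
    groups = sorted({group_key(r) for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [r for r in records if group_key(r) not in test_groups]
    test = [r for r in records if group_key(r) in test_groups]
    return train, test


# Toy records: five component pairs, each at two molar ratios (no Tg values implied).
pairs = [("glucose", "water"), ("glycerol", "water"), ("proline", "glycerol"),
         ("sorbitol", "water"), ("fructose", "glycerol")]
records = [{"ratio": {a: 1, b: n}} for a, b in pairs for n in (1, 2)]
train, test = group_split(records)
```

A purely random split would typically place different ratios of the same component pair on both sides, which tends to give an optimistic estimate of performance on new chemistries.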
Among the machine-learning algorithms tested, the RF model achieved the best balance between accuracy and generalization, with R² = 0.93 (training) and R² = 0.87 (test), whereas the more advanced boosting models overfitted more strongly. SHAP analysis identified a small subset of descriptors dominating Tg prediction, with the most influential being related to 3D molecular geometry, polarizability, atomic mass distribution, and electronic effects, directly linking Tg to molecular mobility and hydrogen-bond network rigidity.
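For reference, the coefficient of determination and the train–test gap used to judge overfitting can be computed as in the minimal sketch below. The function names are illustrative, and the only numbers taken from this work are the two reported R² values.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def generalization_gap(r2_train, r2_test):
    """Difference between training and test R2; larger gaps suggest overfitting."""
    return r2_train - r2_test


# Gap for the reported Random Forest performance.
rf_gap = generalization_gap(0.93, 0.87)
```

For the reported values the gap is 0.06; a model with a similar or higher training R² but a markedly lower test R² would show a larger gap, which is the sense in which the boosting models overfit more.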
Overall, this study demonstrates that Tg of NADES can be predicted with high accuracy using mixture-aware descriptors and interpretable machine learning. The proposed framework enables rapid prescreening and rational formulation of NADES with targeted thermal properties, reducing experimental effort and supporting scalable design of green solvents for cryopreservation, anti-icing, and materials applications.
This journal is © The Royal Society of Chemistry 2026