This Open Access Article is licensed under a Creative Commons Attribution 3.0 Unported Licence

An improved machine learning strategy using structural features to predict the glass transition temperature of oxide glasses

Satwinder Singh Danewalia * and Kulvir Singh
Department of Physics and Materials Science, Thapar Institute of Engineering and Technology, Patiala, India. E-mail: satwinder.singh@thapar.edu

Received 23rd July 2025, Accepted 23rd October 2025

First published on 24th October 2025


Abstract

We present a physics-informed machine learning approach to predict the glass transition temperature (Tg) of sodium borosilicate glasses. Four models (random forest, extreme gradient boosting, support vector machines, and K-nearest neighbors) were trained using both compositional and structural features derived from statistical mechanics. Incorporating these structural descriptors significantly improved model performance. This is evident from the reduction in mean absolute error (14.85 K → 13.76 K) and root mean square error (21.78 K → 19.12 K) and the increase in R2 (0.88 → 0.91) measured on the test set for the random forest model. Similar performance improvements were seen for the other models as well. Building on this, we propose a three-step predictive strategy that enhances generalization across compositions and accurately predicts the Tg of unseen compositions, achieving a mean absolute error of approximately 8 K and an R2 value of around 0.98. Our method demonstrates improved accuracy when benchmarked against GlassNet, which represents the current state-of-the-art in property prediction for glasses. These results highlight the importance of considering structural information in improving the prediction capabilities of machine learning models for composition-specific small datasets. This approach can assist in the rapid screening and design of glass materials, reducing the reliance on time-consuming experiments and guiding future research toward targeted property optimization.


1 Introduction

Glasses have numerous applications in modern life, in fields such as medicine, engineering, and science.1,2 Synthesizing glasses usually involves significant time, labor, chemicals, and energy consumption, contributing to a considerable carbon footprint. Furthermore, the glasses must be characterized and tested to determine their suitability for real-life applications. Glass transition temperature (Tg) is one of the important characteristic temperatures of glasses. It is the temperature interval in which a glass loses its brittleness on heating. Glasses behave as rigid and brittle solids below Tg, whereas they exhibit viscous liquid-like behavior above Tg. At a fundamental level, knowing Tg provides insights into the relationship between the composition, structure, and physical properties of glasses. Tg is closely related to glass forming ability, which is crucial to developing novel glass compositions for various applications.3 Tg is also of great importance from the industry perspective. It dictates the temperature range in which glasses can be safely processed and used in various applications such as fiber drawing, molding, and shaping. It helps to decide the annealing temperature to relieve internal stresses and prevent glass cracking.4 The change in thermal expansion at Tg is an important consideration when designing glass sealants in solid oxide fuel cells, microelectronic devices, and other systems where thermal stresses can be problematic during operation.5,6

Experimentally, the Tg of glasses is measured via thermal characterization techniques such as differential thermal analysis (DTA), differential scanning calorimetry (DSC), and dilatometry. Alternatively, classical computational methods such as molecular dynamics and density functional theory (DFT) can help in predicting glass properties.7 These computational methods help in understanding the atomic-scale mechanisms of glasses; however, they have limitations. Limited system size, unrealistic cooling rates, dependency on the choice of interatomic potentials, and high computational cost are major disadvantages of these theoretical methods.8 Machine learning (ML) has shown promising results in the property prediction of various materials.9–12 By reducing costs, saving time, and enabling the exploration of unconventional compositions, ML can reduce the carbon footprint and accelerate materials design.13–16 ML methods can handle large datasets while capturing complex, nonlinear relationships between the composition and material properties. Tools like SHapley Additive exPlanations (SHAP) and partial dependence plots (PDP) can further be used to visualize and interpret the outputs of these models.17

Previous studies have attempted Tg prediction of glasses using a range of approaches. O'Donnell et al. employed a linear fitting approach to predict Tg of oxide glasses, though their study was limited to fewer than 100 bioactive glasses.18 Cassar et al. used artificial neural networks for Tg prediction of multicomponent oxide glasses.19 The model was trained on more than 55 000 glasses containing up to 45 chemical elements. For high Tg glasses, the uncertainty in predictions was found to be high compared to low Tg glasses. This model was trained purely on compositional data. Alcobaca et al. developed ML models using a dataset of 43 240 oxide glasses,20 but they considered only compositional features as input without additional feature engineering. Similarly, Ravinder et al. used deep learning to model glass properties on a dataset of 100 000 glasses,21 again relying solely on compositional features. In a closely related study, Bishnoi et al. applied Gaussian processes to predict a range of glass properties using a large dataset, also emphasizing compositional inputs.22 Zhang et al. developed a Tg prediction model with more than 15 features, although their focus was primarily on Fe-based metallic glasses.23 In 2023, Cassar developed GlassNet, which is a multitask deep neural network model trained on more than 218 000 different glass compositions using 98 features.24 This model is capable of predicting 85 properties of glasses, including oxides, chalcogenides, halides and others. Many researchers have also reported Tg prediction of polymers using various ML methods.25–27

ML models applied on large datasets with too many compositional features may give overall good performance metrics; however, their performance may be poor in specific composition domains.19,28 On the other hand, when focusing on specific glass systems, preprocessing often reduces the dataset to a very small size,29 which makes the model training a challenging task.11 Furthermore, the properties of the glasses cannot be fully explained based on their composition alone; many glass properties depend on the glass's local structure, the interaction of ingredients, thermal history, testing conditions, etc.30–32 Thus, a research gap remains in exploring features beyond composition for ML studies, particularly in composition-specific domains where datasets are small. This gap is addressed in the present work using physics-informed models that can integrate domain knowledge, aligning predictions with established theories and published literature.33,34

In the present work, widely employed ML models have been used to predict the Tg of sodium borosilicate glasses. The structural features were obtained using principles of statistical mechanics. The effect of the distribution of the structural units on the predicted Tg was determined. The current work aims to improve the performance of ML models for Tg prediction of glasses in specific composition domains with the help of statistical mechanical calculations. It is expected to provide insights into the inter-ingredient interactions that affect the Tg of sodium borosilicate glasses. The results would help accelerate glass design with minimal experimental effort. This cost-effective approach would help reduce carbon footprints and mitigate environment-related problems.

2 Methodology

2.1 Data source

The dataset used in the present work was extracted from the SciGlass database (v2.0.1), which contains composition and property data of around 420 000 glass compositions, including 268 000 oxide glasses and melts, 18 500 halide glasses, and 38 500 chalcogenide glasses.35 For the present work, data were fetched using the GlassPy (v0.5.3) Python module.24 The data were accessed on 22nd May 2025.

2.2 Data preprocessing

2.2.1 Feature extraction. Data for glasses containing SiO2, B2O3, and Na2O as the key ingredients were extracted, with Tg as the target property. Microsoft Excel was used to filter and keep only the required columns and rows. Glasses containing any other elements were excluded to ensure accuracy and relevance. It was ensured that the selected data contained no missing values. Additionally, it was confirmed that the mole fractions of all ingredients for each sample sum to unity. Mole fractions were later converted to mole percentages (by multiplying by 100), as required for the statistical mechanical calculations (discussed later). The Tg values reported by different research groups for the same composition were found to be inconsistent. Only unique compositions were retained by replacing multiple Tg values with their median. These data cleaning steps, along with the requirements of the statistical calculations discussed in the next subsection, greatly reduced the size of the dataset. Such a reduction in dataset size is common when dealing with specific composition–property data.29 The final dataset contained 500 data points.
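The cleaning steps described above can be sketched in a few lines of pandas. This is only an illustrative sketch: the original filtering was done in Microsoft Excel, and the column names (`SiO2`, `B2O3`, `Na2O`, `Tg`), the tolerance, and the helper name `clean_ternary` are assumptions introduced here, not taken from the published code.

```python
import pandas as pd

def clean_ternary(df, oxides=("SiO2", "B2O3", "Na2O"), target="Tg", tol=1e-6):
    """Keep SiO2-B2O3-Na2O glasses only and collapse duplicate compositions.

    Assumes the oxide columns hold mole fractions (hypothetical column names).
    """
    oxides = list(oxides)
    # No missing values in composition or target.
    df = df.dropna(subset=oxides + [target])
    # The three mole fractions must sum to unity; this also rejects
    # glasses that contain any additional component.
    df = df[(df[oxides].sum(axis=1) - 1.0).abs() < tol]
    # All three components must be non-zero (needed later for the
    # structural calculation).
    df = df[(df[oxides] > 0).all(axis=1)]
    # One row per unique composition; several reported Tg values -> median.
    df = df.groupby(oxides, as_index=False)[target].median()
    # Mole fractions -> mol%.
    df[oxides] = df[oxides] * 100.0
    return df

raw = pd.DataFrame({
    "SiO2": [0.60, 0.60, 0.70, 0.50, 0.80],
    "B2O3": [0.20, 0.20, 0.10, 0.30, 0.20],
    "Na2O": [0.20, 0.20, 0.20, 0.10, 0.00],
    "Tg":   [800.0, 810.0, 820.0, 750.0, 900.0],  # made-up values in K
})
clean = clean_ternary(raw)
print(clean)
```

In this toy input, the first two rows are the same composition with two different Tg reports, the fourth row does not sum to unity (it contains another component), and the fifth has no Na2O, so two rows survive.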
2.2.2 Feature engineering. Feature engineering is an important step to make ML models more effective using domain knowledge. For the present work, the distribution of different structural units corresponding to SiO2 and B2O3 was calculated using the StatMechGlass python package.8 This distribution of structural units arises due to modifier oxides, such as Na2O, interacting with network formers, such as SiO2 and B2O3. The StatMechGlass package uses a statistical mechanical framework to calculate the distribution of structural units. It considers both the entropic (Si/B ratio) and the enthalpy contribution (energy barrier) to model the interaction of Na2O with SiO2 and B2O3. The smg.smg_structure(glass_comp, Tg) function from the StatMechGlass framework was employed in the present work to calculate the percentage of various structural units in borosilicate glasses. The instructions for installing StatMechGlass and other packages can be found in the readme.md file available in the link provided in the “Data availability” section. The details of using this package, its mathematical foundation and effectiveness in predicting glass structure are given elsewhere.8,33 Basic processes governing structural units are discussed in subsection 3.2. It was found that the StatMechGlass module requires the values of all three glass components to be non-zero to calculate the structural distribution of the glasses with given compositions. So, all those rows where any of SiO2, B2O3 and Na2O was equal to zero were removed, which further reduced the dataset size. This clean dataset was used as the input for the StatMechGlass module. The function smg.smg_structure(glass_comp, Tg) returns the percentage of various silicate units as S0, S1, S2, S3 and S4 while borate units as B0, B1, B2, B3, B4. These features were then appended to the dataset and used as input descriptors for model training. 
The column headers for silicate units were renamed using the more familiar, standard notation of glass science, i.e., Q4, Q3, Q2, Q1 and Q0. In glass science, Q4, Q3, Q2, Q1 and Q0 represent SiO4 tetrahedra with 4, 3, 2, 1, and 0 bridging oxygen (BO) atoms, respectively. Similarly, structural units in the borate network were denoted as B4, B3, B2, B1, and B0.

2.3 ML models

Four standard ML models, support vector machines (SVM), K-nearest neighbors (KNN), extreme gradient boosting (XGB), and random forest (RF), were used as starting points and then tuned to optimize their performance. SVM uses kernels to map input data into a high-dimensional space and tries to fit a hyperplane that minimizes prediction errors while ensuring generalization.36 KNN, a simple and interpretable instance-based algorithm, predicts values by averaging the target values of the K nearest neighbors, although it may struggle with high-dimensional or noisy data.37 The choice of K and the distance metric (e.g., Euclidean distance) are important parameters that affect the model's prediction performance. XGB, a tree-based gradient-boosting algorithm, sequentially improves weak decision trees to minimize residual errors.38 It employs regularization techniques and efficiently handles missing values, making it robust and scalable. RF is another tree-based ML technique that builds multiple decision trees using random subsets of data and features.39 It reduces overfitting, improves accuracy and is effective for regression tasks with continuous data.11 Data processing and analysis were performed using Python. Its Integrated Development and Learning Environment (IDLE) was used to edit and run the code. ML modeling was done using the Scikit-learn package. Other major libraries used in the present work include pandas, numpy, xgboost, seaborn, matplotlib and SHAP.

All the mentioned ML models were tested on three feature sets: (a) set 1 – compositional features only, (b) set 2 – structural features only (silicate and borate units), and (c) set 3 – both compositional and structural features together. The dataset with each set of features was divided in a ratio of 80:20 for training and testing purposes. To make the study more robust, 5-fold cross-validation was employed. In this method, the training data are further divided into five parts. Four parts are used for training; the remaining part is reserved for validation. Cross-validated performance metrics are the average of the performance metrics over the five iterations.40 The optimized model selected this way is finally run on the hold-out test set to assess its generalization. The root mean squared error (RMSE), mean absolute error (MAE), and R2 values were used to evaluate the models' performance. The models' hyperparameters were tuned by grid search. In the preliminary trials, a broad range of hyperparameters was tried. However, to strike a balance between computational time and model performance, the most influential hyperparameters were selected for the final grid search (Table 1). The SHAP algorithm was used to interpret and visualize the outputs of the ML models. For validation purposes, a subset of 20 samples was selected from the full dataset using quantile-based binning to ensure a diverse representation of glass compositions across the full range of Tg. One composition from each Tg bin was randomly chosen to span the entire distribution. The remaining data formed the training and testing sets as discussed above. Such stratified sampling helps evaluate model generalization across different Tg regimes. We compared our results against GlassNet, which represents the current state-of-the-art in property prediction for glasses. The codes and datasets used for this study are available at the link given in the Data availability section.
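The quantile-based selection of the validation subset, followed by the 80:20 split of the remaining data, might look roughly like the sketch below. The data are synthetic, and the function name `split_with_validation`, the column names and the random seeds are hypothetical choices made for illustration; the published code may organize this differently.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_with_validation(df, target="Tg", n_bins=20, seed=0):
    """Hold out one random sample per Tg quantile bin, then split the rest 80:20."""
    # Equal-population bins over the Tg distribution.
    bins = pd.qcut(df[target], q=n_bins, labels=False, duplicates="drop")
    # One randomly chosen row from each bin forms the validation set.
    val = df.groupby(bins, group_keys=False).sample(n=1, random_state=seed)
    rest = df.drop(val.index)
    train, test = train_test_split(rest, test_size=0.2, random_state=42)
    return train, test, val, bins

# Synthetic stand-in for the 500-point dataset (made-up numbers).
rng = np.random.default_rng(1)
comp = rng.dirichlet([5.0, 2.0, 2.0], size=500) * 100.0
data = pd.DataFrame(comp, columns=["SiO2", "B2O3", "Na2O"])
data["Tg"] = rng.uniform(500.0, 900.0, size=500)

train, test, val, bins = split_with_validation(data)
print(len(train), len(test), len(val))
```

Because each validation sample comes from a different quantile bin, the held-out set spans the whole Tg range instead of clustering around the most common values.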

Table 1 Performance metrics computed on the test set and best hyperparameters for various models across different feature sets
Feature set | Model | MAE (K) | R2 | RMSE (K) | Best parameters
Set 1 (composition only) | RF | 14.85 | 0.88 | 21.78 | {'bootstrap': True, 'max_depth': 10, 'n_estimators': 100, 'random_state': 42}
Set 1 (composition only) | XGB | 15.75 | 0.87 | 22.24 | {'learning_rate': 0.1, 'n_estimators': 50}
Set 1 (composition only) | SVM | 22.37 | 0.78 | 28.94 | {'C': 10, 'kernel': 'rbf'}
Set 1 (composition only) | KNN | 17.82 | 0.84 | 24.98 | {'n_neighbors': 7, 'weights': 'distance'}
Set 2 (structural units only) | RF | 13.76 | 0.91 | 19.12 | {'bootstrap': True, 'max_depth': 10, 'n_estimators': 200, 'random_state': 42}
Set 2 (structural units only) | XGB | 14.60 | 0.89 | 20.67 | {'learning_rate': 0.2, 'n_estimators': 200}
Set 2 (structural units only) | SVM | 20.61 | 0.82 | 26.48 | {'C': 0.1, 'kernel': 'linear'}
Set 2 (structural units only) | KNN | 16.31 | 0.87 | 22.67 | {'n_neighbors': 7, 'weights': 'distance'}
Set 3 (all features) | RF | 13.38 | 0.91 | 18.76 | {'bootstrap': True, 'max_depth': 10, 'n_estimators': 200, 'random_state': 42}
Set 3 (all features) | XGB | 15.01 | 0.88 | 21.56 | {'learning_rate': 0.2, 'n_estimators': 200}
Set 3 (all features) | SVM | 19.80 | 0.84 | 25.35 | {'C': 1, 'kernel': 'linear'}
Set 3 (all features) | KNN | 16.42 | 0.87 | 22.53 | {'n_neighbors': 7, 'weights': 'distance'}
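For readers who want to reproduce the kind of search summarized in Table 1, a minimal scikit-learn sketch of a 5-fold cross-validated grid search for the RF model is given here. The data are synthetic, the grid values are deliberately small, and the scoring choice (`neg_mean_absolute_error`) is an assumption; the text does not state which score was optimized or the full grids that were searched.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data: mol% of SiO2, B2O3, Na2O and a made-up Tg (K).
rng = np.random.default_rng(0)
X = rng.dirichlet([5.0, 2.0, 2.0], size=300) * 100.0
y = 600.0 + 3.0 * X[:, 0] - 0.05 * (X[:, 2] - 20.0) ** 2 + rng.normal(0.0, 10.0, 300)

# 80:20 split into training and hold-out test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Illustrative grid only; 5-fold cross-validation on the training part.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, 10]}
search = GridSearchCV(
    RandomForestRegressor(bootstrap=True, random_state=42),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",  # assumed selection criterion
)
search.fit(X_train, y_train)

# The refitted best model is evaluated once on the hold-out test set.
y_pred = search.best_estimator_.predict(X_test)
mae = mean_absolute_error(y_test, y_pred)
rmse = float(np.sqrt(mean_squared_error(y_test, y_pred)))
r2 = r2_score(y_test, y_pred)
print(search.best_params_, round(mae, 2), round(rmse, 2), round(r2, 2))
```

The same pattern applies to the other three models by swapping the estimator and its parameter grid.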


3 Results and discussion

3.1 Data distribution

The distribution of compositional variables is represented by histograms, as shown in Fig. 1(a–c). The distribution of SiO2 is slightly skewed, with values concentrated toward the higher range (60–75 mol%), indicating the predominance of silica-rich compositions in the current dataset. In contrast, B2O3 values are primarily concentrated in the lower range (10–40 mol%) but spread widely, extending up to 95 mol%. Na2O is also concentrated at lower concentrations, with a few compositions exceeding 50 mol%. This is intuitive, as Na2O is a network modifier and too much modifier at the expense of the glass former leads to low glass forming ability. Tg values range from approximately 500 to 900 K, as shown in Fig. 1(d).
Fig. 1 Histograms showing the distribution of (a) SiO2 (b) B2O3 (c) Na2O and (d) Tg (K) values in the used dataset.

The group of taller bars towards relatively high Tg represents silica-rich compositions, while a group of shorter bars toward lower Tg represents borate-rich compositions. To elucidate this, Fig. 2(a) shows the ternary graphs representing the distribution of Tg of glasses according to their compositions.


Fig. 2 Ternary graphs showing the distribution of (a) Tg and (b–k) various structural units in SiO2–B2O3–Na2O glasses.

Each dot in this graph represents a sample from the dataset. Red and orange dots represent compositions with Tg > 760 K, which arise from glasses containing higher concentrations of SiO2. On the other hand, light and dark blue dots represent glasses with relatively lower Tg, which can be seen for the borate-rich glasses and the soda (Na2O)-rich compositions due to the modifier nature of Na2O.

3.2 Structural evolution

Bodker et al., in their research, have shown the potential of statistical mechanical calculations to predict the structural evolution of ternary alkali borosilicate glasses with good accuracy.41 Leveraging the potential of these calculations, the distribution of structural units within glass compositions in our dataset was calculated. The StatMechGlass package considers the following reaction mechanisms for the structural evolution in silica network of the alkali-borosilicate glasses:8
 
$$2\,\mathrm{Q}^{n} + \mathrm{M_2O} \rightarrow 2\,\mathrm{Q}^{n-1} \qquad (1)$$
Here, M2O represents the alkali oxide (Na2O in the present case), and Qn is a silicate tetrahedron with n bridging oxygens (n = 0, 1, 2, 3, 4). Similarly, borate structural units can be denoted as Bn units. Boron can exist in both 3-fold and 4-fold coordination in glasses.42 Reaction mechanisms governing the conversion of one type of borate structural unit into another are given as:

$$2\,\mathrm{B}^{3} + \mathrm{M_2O} \rightarrow 2\,\mathrm{B}^{4} \qquad (2)$$

$$2\,\mathrm{B}^{3} + \mathrm{M_2O} \rightarrow 2\,\mathrm{B}^{2} \qquad (3)$$

The relative dominance of these reaction mechanisms is influenced by the modifier concentration.43 Fig. 2(b–k) represents the calculated structural units of glasses as a function of their compositions. Values <5% are represented by ovals in a light gray color to improve clarity and focus on the more relevant data points only. Q4 is found to be the most abundant structural unit in glasses with moderate to high SiO2 content (50% and more). Q3 units are the second most widely occurring structural units in silica networks. Q2 and Q1 are present in fewer samples, at higher Na2O content, while Q0 units are quite rare in glasses of the present dataset. These structural units could have been present in greater quantity at higher Na2O concentrations in binary alkali silicate glasses. However, in borosilicate glasses, part of the Na2O is consumed to modify the borate network as well. Hence, the tendency to form SiO4 tetrahedra with three and four non-bridging oxygens (NBOs) is reduced. Similarly, B0 units in the present glass dataset are rare, existing only at higher B2O3 and low Na2O content. This may also be due to fewer samples in this composition domain. B2 units are high in low borate-containing glasses with moderate to high Na2O content. Glasses with low B2O3 content (<30%) exhibit the coexistence of B4, B3 and B2 units with a minor number of B2 and B1 units. Glasses containing more than 30% B2O3 exhibit both B2 and B1 units. The variation in the number of structural units of each kind with respect to Na2O content is due to the competition of Na2O interaction with both the borate and the silicate network.
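As a point of reference for eq. (1), the bookkeeping for a binary alkali silicate glass can be worked out by hand: each M2O unit converts two Qn units into Qn−1, so the mean number of non-bridging oxygens per Si is 2[M2O]/[SiO2]. The sketch below applies this in the simplest idealization, where conversion proceeds strictly stepwise and only two adjacent Q species coexist. This is a textbook simplification and not the StatMechGlass calculation: it ignores the borate network as well as the entropic and enthalpic weighting described above, and the function name is hypothetical.

```python
import math

def binary_q_distribution(na2o, sio2):
    """Idealized stepwise Q-speciation of a binary Na2O-SiO2 glass.

    Inputs are molar amounts (any consistent unit). Returns fractions of Si
    in Q0..Q4, assuming each Na2O converts two Q(n) into Q(n-1), eq. (1),
    and that at most two adjacent species coexist.
    """
    nbo_per_si = 2.0 * na2o / sio2          # non-bridging oxygens per Si
    if nbo_per_si > 4.0:
        raise ValueError("more modifier than the silicate network can accept")
    mean_n = 4.0 - nbo_per_si               # mean bridging oxygens per Si
    lower = int(math.floor(mean_n))
    frac_upper = mean_n - lower             # share of the higher-n species
    dist = {f"Q{n}": 0.0 for n in range(5)}
    dist[f"Q{lower}"] = 1.0 - frac_upper
    if frac_upper > 0.0:
        dist[f"Q{lower + 1}"] = frac_upper
    return dist

# 20 Na2O : 80 SiO2 -> 20 Na2O convert 40 of the 80 Q4 units into Q3.
example = binary_q_distribution(20.0, 80.0)
print(example)
```

In a borosilicate glass, part of the Na2O is taken up by the borate network through eqs. (2) and (3), so the silicate network is less depolymerized than this binary estimate suggests.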

3.3 ML for Tg prediction

3.3.1 Using only compositional features as input (set 1). Performance metrics computed on the test set and the best hyperparameters for the ML models are given in Table 1. A good-performing model is characterized by lower MAE and RMSE and higher R2 values. Tree-based models (RF and XGB) performed better than KNN and SVM. For this feature set, RF predicts Tg closest to the actual values (lowest MAE) and explains the largest share of the variance in Tg (highest R2) among the four models. SVM performed worst in terms of both MAE and R2 across all the feature sets.

Fig. 3 shows the SHAP summary (beeswarm) plots for the RF and XGB models trained on compositional features. The features are stacked (top to bottom) in order of decreasing contribution towards the prediction of Tg. Na2O is observed to have the highest contribution towards Tg in both models. In a SHAP summary plot, blue dots represent lower feature values, and red dots represent higher ones. If higher values of a feature raise the predicted value, its red dots are concentrated on the positive SHAP value side and its blue dots on the negative side; if higher values lower the predicted value, the pattern is reversed, with red dots on the negative side and blue dots on the positive side.


Fig. 3 SHAP summary plots for compositional features from the (a) RF and (b) XGB model.

The SHAP summary plot for Na2O shows a mix of red and blue dots spread across the x-axis, indicating its nonlinear contribution to the predicted Tg. This aligns with the relatively low performance of ML models when using only compositional features. Both models agree on the contribution of the constituent oxides and consistently indicate that the predicted Tg increases with SiO2 content, with a few exceptions. A deeper understanding of how features influence the predicted Tg can be gained from partial dependence plots (PDPs), as shown in Fig. 4.


Fig. 4 Partial dependence plots for compositional features (a) SiO2 (b) Na2O and (c) B2O3 in RF and XGB models.

Both RF and XGB models exhibit similar overall trends for compositional features, though variations exist in local regions of the curves. Tg remained largely unaffected up to ∼30 mol% of SiO2, after which it showed a sharp increase, continuing up to ∼50 mol%, before nearly saturating at higher concentrations (Fig. 4(a)). This aligns with the SHAP analysis, which indicated that SiO2 generally contributes positively to Tg prediction. Below ∼30 mol% SiO2, the glass compositions are correspondingly enriched in either B2O3 or Na2O. In the former case, Tg is low as borate glasses exhibit lower Tg compared to silicate glasses.1 On the other hand, if compositions have high Na2O content, the silicate network is fragmented into clusters, again leading to low Tg. But once sufficient SiO2 is present, a continuous network of Si–O–Si bonds forms, leading to a sharp increase in network rigidity and hence Tg. Beyond ∼50 mol%, the network is already well-connected, so the effect of further SiO2 additions gradually saturates.

Tg reaches a maximum at around 20 mol% of Na2O, beyond which it decreases (Fig. 4(b)). This supports the SHAP summary plot, where Na2O exhibited a nonlinear influence on Tg, with positive and negative contributions spread across the range of SHAP values. It also aligns with the nonlinear variation in the experimentally determined Tg of borosilicate glasses containing alkali metal oxides.44 From a structural viewpoint, Na2O initially increases Tg by stabilizing tetrahedral BO4 units and enhancing cross-linking between borate and silicate species. However, at higher concentrations, excess Na2O starts breaking Si–O–Si linkages and generating more NBOs, which reduces network connectivity and lowers Tg.

The dependence of Tg on B2O3 is also nonlinear: it initially increases up to ∼15 mol%, then decreases up to ∼40 mol%, and has minimal effect on Tg at higher concentrations (Fig. 4(c)). As observed in the SHAP analysis, this nonlinear role of B2O3 in Tg prediction is further supported by its complex behavior in PDP plots. At any given B2O3 content, Tg depends on the relative fractions of SiO2 and Na2O in the remaining composition. If the remaining composition is SiO2-rich, higher Tg is expected and vice versa. However, various probable structural arrangements (BO3, BO4) at different concentrations of Na2O add more complexity. The high non-linearity in Tg with respect to B2O3 suggests that compositional features alone are insufficient to fully interpret the Tg variations. Thus, it is worthwhile to consider the distribution of various silicate and borate structural units in order to interpret these variations as discussed in the next subsection.

Overall, PDPs confirm the nonlinear influences captured by SHAP analysis and also pinpoint composition ranges where sharp transitions in Tg occur, while necessitating further analysis by expanding the input feature space.
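The idea behind a one-dimensional partial dependence curve can be made concrete with a short sketch: one feature is forced to each value on a grid for every sample, and the model predictions are averaged. The data, the model settings and the helper name `partial_dependence_1d` are illustrative assumptions; scikit-learn also provides a ready-made routine in `sklearn.inspection`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def partial_dependence_1d(model, X, feature, grid):
    """Average model prediction when column `feature` is set to each grid value."""
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value           # overwrite one feature for all samples
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

# Toy data: the target rises with feature 0 and barely depends on feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
y = 2.0 * X[:, 0] + 0.1 * X[:, 1]
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

grid = np.linspace(0.1, 0.9, 5)
pd_x0 = partial_dependence_1d(model, X, feature=0, grid=grid)
print(np.round(pd_x0, 2))
```

One caveat worth keeping in mind for compositional inputs: the three oxide contents sum to 100 mol%, so varying one of them while holding the others fixed produces some compositions that cannot physically exist, and the curves are best read as trends of the model rather than of real glasses.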

3.3.2 Including structural features as input (set 2 and set 3). Interestingly, using structural features as input variables leads to better Tg prediction by the models. All models showed improvement in MAE and R2 with the inclusion of structural features as input; with structural features alone (set 2), the MAE of every model decreased by more than 7% relative to set 1 (Table 1). Beeswarm plots for the RF and XGB models for set 2 (structural features) are given in Fig. 5. Q4 has the highest and clearest impact on Tg prediction according to the RF model. The smooth transition from blue to red as SHAP values shift from negative to positive suggests that a higher fraction of Q4 units increases the Tg. Although Q0 appears at the top of the list for the XGB model, from domain knowledge, it is known that these are the least abundant structural units for most of the glasses in the present dataset. Q0 units exist only at very high alkali oxide content in glasses.1 Considering this fact, Q4 is effectively the most important feature with a clear impact on Tg, similar to that in the RF model.
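The relative MAE improvement quoted above can be checked directly from the test-set values in Table 1. The short calculation below compares set 1 (composition only) with set 2 (structural units only); the dictionary names are just illustrative.

```python
# Test-set MAE values (K) taken from Table 1.
mae_set1 = {"RF": 14.85, "XGB": 15.75, "SVM": 22.37, "KNN": 17.82}
mae_set2 = {"RF": 13.76, "XGB": 14.60, "SVM": 20.61, "KNN": 16.31}

# Relative reduction in MAE, in percent, when moving from set 1 to set 2.
improvement = {
    model: 100.0 * (mae_set1[model] - mae_set2[model]) / mae_set1[model]
    for model in mae_set1
}
for model, value in improvement.items():
    print(f"{model}: {value:.1f}% lower MAE")
```

Every model gains more than 7% in this comparison, with KNN benefiting the most in relative terms.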
Fig. 5 SHAP summary plots for structural features from (a) RF and (b) XGB model.

B1 is another feature that clearly impacts predicted Tg values in both models. It contributes to lowering Tg, as evidenced by red dots on the negative SHAP value side and blue dots on the positive side. B2 and Q3 units in both models show a nonlinear trend indicated by mixed red and blue dots on the summary plots. B0, B4, Q1 and Q2 are the bottom four features with the least importance in both models. The RF model incorporates contributions from both silicate and borate structural units, as evidenced by a balanced distribution of both types of structural units among its top five features. In contrast, the XGB model assigns higher importance to silicate units (the top three are silicate units) than to the borate structural units.

Fig. 6 presents PDP plots for structural features, offering clearer insights into their influence on Tg. The Q4 units consistently increased Tg across the entire range for both models, reinforcing the SHAP analysis, where Q4 had the strongest positive contribution to Tg. This trend is expected, as a higher fraction of Q4 units indicates greater connectivity and stronger bonding in the glass network, leading to a higher Tg. The contribution of Q3 towards Tg is largely neutral according to the RF model. The XGB model exhibits a sharp decrease in Tg with Q3 up to ∼10% after which further changes in Q3 have minimal impact. Q2 and Q1 units show only a minor effect on Tg, influencing predictions only at their low values, beyond which Tg remains primarily unchanged. From the borate network, B1 units exhibit a negative influence on Tg, consistent across both models in PDP analysis, supporting their trend in the SHAP plots. B2 and B3 units initially increase Tg, but their effect either saturates or reverses at higher values. B0 units, on the other hand, have a negligible impact on Tg except at low values, where they tend to increase Tg. Better performance metrics of the ML models with structural features than with compositional features indicate that these models may be applied to glasses with any composition, provided their structural unit distribution is calculable. It must be stressed here that the same amount of different alkali oxides does not result in the same distribution of structural units in different glass systems.45 Depending on the characteristics of alkali oxides, their interaction with different glass formers would differ. 
The distribution of structural units in the present work is calculated considering the enthalpy barriers by Na2O towards its interaction with silicate and borate network.8 This approach allowed us to capture the non-linear and system-specific evolution of structural units, thereby improving the reliability of model predictions across diverse glass compositions. Thus, the improved performance of structure-based models emphasizes the importance of including structural features in property prediction for composition-specific small datasets.


Fig. 6 Partial dependence plots for structural features: (a–e) Qn units and (f–j) Bn units.

3.4 Three-step prediction strategy

Based on the improved prediction results using structural features, a three-step prediction framework was designed as shown in Fig. 7. The steps involved are given below:
Fig. 7 Three-step workflow for improved Tg predictions.

(i) Apply ML model trained on compositional data (ML1) for prediction of initial Tg from the compositions. Name it Tg1.

(ii) Use StatMechGlass package to calculate distribution of structural units using composition and Tg1.

(iii) Use ML model trained on compositional and structural data (ML2) to predict final Tg.

As RF has given the best Tg predictions using compositions alone as well as including structural features, it has been used for predictions at both step (i) and step (iii) as given above. This strategy was applied to the validation set of 20 compositions that were not part of training and testing of the models. The performance metrics of the model using this strategy on unseen data are given in Table 2.
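The three steps listed above can be sketched end to end as follows. To keep the example self-contained, the StatMechGlass call is replaced by `toy_structure`, a hypothetical stand-in with no physical meaning that only mimics the (composition, Tg) → structural features interface. The data are synthetic, and the assumption that ML2 is trained on structural features computed from the known Tg values of the training glasses is ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def toy_structure(comp, tg):
    """HYPOTHETICAL stand-in for the StatMechGlass structure calculation.

    Returns two made-up 'structural' percentages; not a physical model.
    """
    sio2, b2o3, na2o = comp
    modified = min(1.0, 2.0 * na2o / (sio2 + b2o3))
    tilt = 1.0 / (1.0 + np.exp(-(tg - 700.0) / 100.0))
    return np.array([100.0 * (1.0 - modified * tilt),
                     100.0 * modified * (1.0 - tilt)])

def structural_features(X_comp, tg_values):
    """Apply the structure calculation row by row."""
    return np.array([toy_structure(c, t) for c, t in zip(X_comp, tg_values)])

# Synthetic compositions (mol% SiO2, B2O3, Na2O) and made-up Tg values (K).
rng = np.random.default_rng(0)
X = rng.dirichlet([5.0, 2.0, 2.0], size=300) * 100.0
y = 600.0 + 3.0 * X[:, 0] - 0.05 * (X[:, 2] - 20.0) ** 2 + rng.normal(0.0, 10.0, 300)
X_known, y_known, X_new = X[:280], y[:280], X[280:]

# ML1: composition -> Tg.
ml1 = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_known, y_known)
# ML2: composition + structural features -> Tg (structure from known Tg; assumed).
S_known = structural_features(X_known, y_known)
ml2 = RandomForestRegressor(n_estimators=100, random_state=42).fit(
    np.hstack([X_known, S_known]), y_known
)

tg1 = ml1.predict(X_new)                            # step (i): initial estimate Tg1
S_new = structural_features(X_new, tg1)             # step (ii): structure from Tg1
tg_final = ml2.predict(np.hstack([X_new, S_new]))   # step (iii): final Tg
print(np.round(tg_final[:5], 1))
```

The point of the design is that an unseen composition has no measured Tg to feed into the structural calculation, so the composition-only model supplies a first estimate that unlocks the structural features for the second model.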

Table 2 Performance comparison of GlassNet, regular RF model and three-step ML strategy on validation set
Model | Trained on | Validation MAE (K) | Validation R2 | Validation RMSE (K)
GlassNet | Compositional and physicochemical data | 15.40 | 0.93 | 20.06
Regular ML (RF) | Compositional data | 10.59 | 0.96 | 15.59
Three-step ML | Structural data | 9.15 | 0.97 | 12.84
Three-step ML | Both compositional and structural data | 8.32 | 0.98 | 11.69


We compared our model with the state-of-the-art GlassNet model and a traditional ML method (RF on compositional features only). The inclusion of structural features improved the predictive performance of the RF model. Our three-step strategy reduced errors, achieving better MAE, RMSE and R2 compared to GlassNet for the studied composition system. This clear improvement in the performance of our model demonstrates that our framework provides superior Tg prediction accuracy for sodium borosilicate glasses. However, it must be noted here that GlassNet is not exclusively trained on this dataset and performance metrics are subject to variation in other composition domains. The importance of our results lies in the fact that the validation set contains diverse ranges of compositions (SiO2 (min 10 mol%, max 82 mol%); B2O3 (min 5.2 mol%, max 45 mol%); Na2O (min 4.5 mol%, max 70 mol%)) as well as Tg (min 526 K, max 884 K).

Although Tg1 is a predicted value and therefore introduces uncertainty into the calculated structural unit fractions, the gain achieved in the final Tg prediction by using structural features together with compositional features outweighs the noise introduced by Tg1 and leads to more accurate Tg prediction. It is worth mentioning that decision-tree-based algorithms such as RF and XGB partition the input space based on the training data. Consequently, while these models perform well within the domain of the training data, their ability to extrapolate beyond this range is limited. Therefore, predictions outside the training domain should be interpreted with caution. As statistical mechanics calculations can be extended to derive the structural features of more complex glass systems, it would be worthwhile to examine the influence of structural features on other properties and other glass systems by implementing the proposed three-step ML prediction strategy.
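One simple way to act on this caution is to flag any query composition that lies outside the per-feature range of the training data before trusting a tree-based prediction. The sketch below assumes such a range check; the function names and the three training compositions are hypothetical and are not taken from the dataset used in this work.

```python
from typing import List, Sequence, Tuple


def feature_bounds(rows: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return the (min, max) of each feature column in the training data."""
    if not rows:
        raise ValueError("training data must not be empty")
    return [(min(col), max(col)) for col in zip(*rows)]


def outside_domain(
    sample: Sequence[float], bounds: Sequence[Tuple[float, float]]
) -> List[int]:
    """Return indices of features lying outside the training range."""
    return [
        i
        for i, (value, (low, high)) in enumerate(zip(sample, bounds))
        if value < low or value > high
    ]


# Hypothetical training compositions as (SiO2, B2O3, Na2O) in mol%.
training = [
    [60.0, 20.0, 20.0],
    [40.0, 30.0, 30.0],
    [70.0, 20.0, 10.0],
]
bounds = feature_bounds(training)

inside = outside_domain([50.0, 25.0, 25.0], bounds)   # within all ranges
flagged = outside_domain([10.0, 20.0, 70.0], bounds)  # SiO2 and Na2O out of range
```

A per-feature range check is only a necessary condition: a composition can lie within every individual range and still fall in a sparsely sampled region of the ternary diagram.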

4 Conclusion

Statistical mechanical calculations transformed the composition–property database into a structure–property database for predicting the Tg of ternary sodium borosilicate glasses. Structural features dictate the Tg of glasses more strongly than compositional features, and including them improves the prediction capability of the ML models. The Q4 and B1 structural units in borosilicate glasses influence Tg more than the other structural units. RF performed better than KNN, SVM and XGB for Tg prediction across all the feature sets. The three-step prediction strategy worked well even on unseen data. Our results showed improved performance compared to the state-of-the-art GlassNet model for predicting Tg specifically for sodium borosilicate glasses. Thus, it is worthwhile to consider structural features to improve the predictive performance of ML models for composition-specific small datasets. The proposed workflow may be generalized to predict other properties of sodium borosilicate glasses and may also be extended to other glass systems.

Author contributions

SSD – conceptualization, data curation, methodology, validation, visualization, and writing – original draft. KS – conceptualization, methodology, and writing – review & editing.

Conflicts of interest

There are no conflicts to declare.

Data availability

The codes and datasets supporting this study are available at Zenodo: https://doi.org/10.5281/zenodo.17077656.

Acknowledgements

The authors are thankful to Thapar Institute of Engineering and Technology, Patiala (India) and “TIET-UQ Centre of Excellence in Data Science and AI” for the financial support under the project ID DSAI2025-EE-1018. The authors are also thankful to Dr Xin Yu (The University of Queensland, Australia) for valuable discussions.


This journal is © The Royal Society of Chemistry 2025